author | ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> | 2019-08-07 17:21:02 +0000 |
---|---|---|
committer | ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> | 2019-08-07 17:21:02 +0000 |
commit | c277d094e4d01ae9afad8bdd4d7537033a695a4f (patch) | |
tree | be95494defe9a921ff3a45c3ffe166b44bc6b1b2 /doc/html/pcre2partial.html | |
parent | c54035e9187b182c5de4cd73c425a2360b9f5878 (diff) | |
download | pcre2-c277d094e4d01ae9afad8bdd4d7537033a695a4f.tar.gz |
Partial match documentation rewritten.
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@1156 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'doc/html/pcre2partial.html')
-rw-r--r-- | doc/html/pcre2partial.html | 531 |
1 files changed, 238 insertions, 293 deletions
diff --git a/doc/html/pcre2partial.html b/doc/html/pcre2partial.html index a2faa76..e0f37ea 100644 --- a/doc/html/pcre2partial.html +++ b/doc/html/pcre2partial.html @@ -14,85 +14,123 @@ please consult the man page, in case the conversion went wrong. <br> <ul> <li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE2</a> -<li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre2_match()</a> -<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_dfa_match()</a> -<li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a> -<li><a name="TOC5" href="#SEC5">EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST</a> +<li><a name="TOC2" href="#SEC2">REQUIREMENTS FOR A PARTIAL MATCH</a> +<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_match()</a> +<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre2_match()</a> +<li><a name="TOC5" href="#SEC5">PARTIAL MATCHING USING pcre2_dfa_match()</a> <li><a name="TOC6" href="#SEC6">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a> -<li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre2_match()</a> -<li><a name="TOC8" href="#SEC8">ISSUES WITH MULTI-SEGMENT MATCHING</a> -<li><a name="TOC9" href="#SEC9">AUTHOR</a> -<li><a name="TOC10" href="#SEC10">REVISION</a> +<li><a name="TOC7" href="#SEC7">AUTHOR</a> +<li><a name="TOC8" href="#SEC8">REVISION</a> </ul> <br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE2</a><br> <P> -In normal use of PCRE2, if the subject string that is passed to a matching -function matches as far as it goes, but is too short to match the entire -pattern, PCRE2_ERROR_NOMATCH is returned. There are circumstances where it -might be helpful to distinguish this case from other cases in which there is no -match. +In normal use of PCRE2, if there is a match up to the end of a subject string, +but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH +is returned, just like any other failing match. 
There are circumstances where +it might be helpful to distinguish this "partial match" case. </P> <P> -Consider, for example, an application where a human is required to type in data -for a field with specific formatting requirements. An example might be a date -in the form <i>ddmmmyy</i>, defined by this pattern: -<pre> - ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$ -</pre> -If the application sees the user's keystrokes one by one, and can check that -what has been typed so far is potentially valid, it is able to raise an error -as soon as a mistake is made, by beeping and not reflecting the character that -has been typed, for example. This immediate feedback is likely to be a better -user interface than a check that is delayed until the entire string has been -entered. Partial matching can also be useful when the subject string is very -long and is not all available at once, as discussed below. +One example is an application where the subject string is very long, and not +all available at once. The requirement here is to be able to do the matching +segment by segment, but special action is needed when a matched substring spans +the boundary between two segments. +</P> +<P> +Another example is checking a user input string as it is typed, to ensure that +it conforms to a required format. Invalid characters can be immediately +diagnosed and rejected, giving instant feedback. </P> <P> -PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and -PCRE2_PARTIAL_HARD options, which can be set when calling a matching function. -The difference between the two options is whether or not a partial match is -preferred to an alternative complete match, though the details differ between -the two types of matching function. If both options are set, PCRE2_PARTIAL_HARD -takes precedence. +Partial matching is a PCRE2-specific feature; it is not Perl-compatible. 
It is +requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT +options when calling a matching function. The difference between the two +options is whether or not a partial match is preferred to an alternative +complete match, though the details differ between the two types of matching +function. If both options are set, PCRE2_PARTIAL_HARD takes precedence. </P> <P> -If you want to use partial matching with just-in-time optimized code, you must -call <b>pcre2_jit_compile()</b> with one or both of these options: +If you want to use partial matching with just-in-time optimized code, as well +as setting a partial match option for the matching function, you must also call +<b>pcre2_jit_compile()</b> with one or both of these options: <pre> - PCRE2_JIT_PARTIAL_SOFT PCRE2_JIT_PARTIAL_HARD + PCRE2_JIT_PARTIAL_SOFT </pre> PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial -matches on the same pattern. If the appropriate JIT mode has not been compiled, -interpretive matching code is used. +matches on the same pattern. Separate code is compiled for each mode. If the +appropriate JIT mode has not been compiled, interpretive matching code is used. </P> <P> Setting a partial matching option disables two of PCRE2's standard -optimizations. PCRE2 remembers the last literal code unit in a pattern, and -abandons matching immediately if it is not present in the subject string. This -optimization cannot be used for a subject string that might match only -partially. PCRE2 also knows the minimum length of a matching string, and does +optimization hints. PCRE2 remembers the last literal code unit in a pattern, +and abandons matching immediately if it is not present in the subject string. +This optimization cannot be used for a subject string that might match only +partially. PCRE2 also remembers a minimum length of a matching string, and does not bother to run the matching function on shorter strings. 
This optimization is also disabled for partial matching. </P> -<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br> +<br><a name="SEC2" href="#TOC1">REQUIREMENTS FOR A PARTIAL MATCH</a><br> +<P> +A possible partial match occurs during matching when the end of the subject +string is reached successfully, but either more characters are needed to +complete the match, or the addition of more characters might change what is +matched. +</P> +<P> +Example 1: if the pattern is /abc/ and the subject is "ab", more characters are +definitely needed to complete a match. In this case both hard and soft matching +options yield a partial match. +</P> +<P> +Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match +can be found, but the addition of more characters might change what is +matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match; +PCRE2_PARTIAL_SOFT returns the complete match. +</P> +<P> +On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next +pattern item is \z, \Z, \b, \B, or $ there is always a partial match. +Otherwise, for both options, the next pattern item must be one that inspects a +character, and at least one of the following must be true: +</P> +<P> +(1) At least one character has already been inspected. An inspected character +need not form part of the final matched string; lookbehind assertions and the +\K escape sequence provide ways of inspecting characters before the start of a +matched string. +</P> <P> -A partial match occurs during a call to <b>pcre2_match()</b> when the end of the -subject string is reached successfully, but matching cannot continue because -more characters are needed, and in addition, either at least one character in -the subject has been inspected or the pattern contains a lookbehind, or (when -PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. 
An -inspected character need not form part of the final matched string; lookbehind -assertions and the \K escape sequence provide ways of inspecting characters -before the start of a matched string. +(2) The pattern contains one or more lookbehind assertions. This condition +exists in case there is a lookbehind that inspects characters before the start +of the match. </P> <P> -The three additional requirements define the cases where adding more characters -to the existing subject may complete the same match that would occur if they -had all been present in the first place. Without these conditions there would -be a partial match of an empty string at the end of the subject for all -unanchored patterns (and also for anchored patterns if the subject itself is -empty). +(3) There is a special case when the whole pattern can match an empty string. +When the starting point is at the end of the subject, the empty string match is +a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above +conditions is true, it is returned. However, because adding more characters +might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match, +which in this case means "there is going to be a match at this point, but until +some more characters are added, we do not know if it will be an empty string or +something longer". +</P> +<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br> +<P> +When a partial matching option is set, the result of calling +<b>pcre2_match()</b> can be one of the following: +</P> +<P> +<b>A successful match</b> +A complete match has been found, starting and ending within this subject. +</P> +<P> +<b>PCRE2_ERROR_NOMATCH</b> +No match can start anywhere in this subject. +</P> +<P> +<b>PCRE2_ERROR_PARTIAL</b> +Adding more characters may result in a complete match that uses one or more +characters from the end of this subject. 
</P> <P> When a partial match is returned, the first two elements in the ovector point @@ -110,26 +148,6 @@ these characters are needed for a subsequent re-match with additional characters. </P> <P> -What happens when a partial match is identified depends on which of the two -partial matching options is set. -</P> -<br><b> -PCRE2_PARTIAL_SOFT WITH pcre2_match() -</b><br> -<P> -If PCRE2_PARTIAL_SOFT is set when <b>pcre2_match()</b> identifies a partial -match, the partial match is remembered, but matching continues as normal, and -other alternatives in the pattern are tried. If no complete match can be found, -PCRE2_ERROR_PARTIAL is returned instead of PCRE2_ERROR_NOMATCH. -</P> -<P> -This option is "soft" because it prefers a complete match over a partial match. -All the various matching items in a pattern behave as if the subject string is -potentially complete. For example, \z, \Z, and $ match at the end of the -subject, as normal, and for \b and \B the end of the subject is treated as a -non-alphanumeric. -</P> -<P> If there is more than one partial match, the first one that was found provides the data that is returned. Consider this pattern: <pre> @@ -138,26 +156,34 @@ the data that is returned. Consider this pattern: If this is matched against the subject string "abc123dog", both alternatives fail to match, but the end of the subject is reached during matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying -"123dog" as the first partial match that was found. (In this example, there are -two partial matches, because "dog" on its own partially matches the second -alternative.) +"123dog" as the first partial match. (In this example, there are two partial +matches, because "dog" on its own partially matches the second alternative.) 
</P> <br><b> -PCRE2_PARTIAL_HARD WITH pcre2_match() +How a partial match is processed by pcre2_match() </b><br> <P> -If PCRE2_PARTIAL_HARD is set for <b>pcre2_match()</b>, PCRE2_ERROR_PARTIAL is -returned as soon as a partial match is found, without continuing to search for -possible complete matches. This option is "hard" because it prefers an earlier -partial match over a later complete match. For this reason, the assumption is -made that the end of the supplied subject string may not be the true end of the -available data, and so, if \z, \Z, \b, \B, or $ are encountered at the end -of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any -characters have been inspected. +What happens when a partial match is identified depends on which of the two +partial matching options is set. +</P> +<P> +If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a +partial match is found, without continuing to search for possible complete +matches. This option is "hard" because it prefers an earlier partial match over +a later complete match. For this reason, the assumption is made that the end of +the supplied subject string is not the true end of the available data, which is +why \z, \Z, \b, \B, and $ always give a partial match. +</P> +<P> +If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching +continues as normal, and other alternatives in the pattern are tried. If no +complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of +PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match +over a partial match. All the various matching items in a pattern behave as if +the subject string is potentially complete; \z, \Z, and $ match at the end of +the subject, as normal, and for \b and \B the end of the subject is treated +as a non-alphanumeric. 
</P> -<br><b> -Comparing hard and soft partial matching -</b><br> <P> The difference between the two partial matching options can be illustrated by a pattern such as: @@ -182,154 +208,85 @@ to follow this explanation by thinking of the two patterns like this: The second pattern will never match "dogsbody", because it will always find the shorter match first. </P> -<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br> -<P> -The DFA functions move along the subject string character by character, without -backtracking, searching for all possible matches simultaneously. If the end of -the subject is reached before the end of the pattern, there is the possibility -of a partial match, again provided that at least one character has been -inspected. -</P> -<P> -When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there -have been no complete matches. Otherwise, the complete matches are returned. -However, if PCRE2_PARTIAL_HARD is set, a partial match takes precedence over -any complete matches. The portion of the string that was matched when the -longest partial match was found is set as the first matching string. -</P> -<P> -Because the DFA functions always search for all possible matches, and there is -no difference between greedy and ungreedy repetition, their behaviour is -different from the standard functions when PCRE2_PARTIAL_HARD is set. Consider -the string "dog" matched against the ungreedy pattern shown above: -<pre> - /dog(sbody)??/ -</pre> -Whereas the standard function stops as soon as it finds the complete match for -"dog", the DFA function also finds the partial match for "dogsbody", and so -returns that when PCRE2_PARTIAL_HARD is set. 
-</P> -<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br> +<br><b> +Example of partial matching using pcre2test +</b><br> <P> -If a pattern ends with one of sequences \b or \B, which test for word -boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive -results. Consider this pattern: -<pre> - /\bcat\b/ -</pre> -This matches "cat", provided there is a word boundary at either end. If the -subject string is "the cat", the comparison of the final "t" with a following -character cannot take place, so a partial match is found. However, normal -matching carries on, and \b matches at the end of the subject when the last -character is a letter, so a complete match is found. The result, therefore, is -<i>not</i> PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case does yield -PCRE2_ERROR_PARTIAL, because then the partial match takes precedence. -</P> -<br><a name="SEC5" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST</a><br> -<P> -If the <b>partial_soft</b> (or <b>ps</b>) modifier is present on a -<b>pcre2test</b> data line, the PCRE2_PARTIAL_SOFT option is used for the match. -Here is a run of <b>pcre2test</b> that uses the date example quoted above: +The <b>pcre2test</b> data modifiers <b>partial_hard</b> (or <b>ph</b>) and +<b>partial_soft</b> (or <b>ps</b>) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, +respectively, when calling <b>pcre2_match()</b>. Here is a run of +<b>pcre2test</b> using a pattern that matches the whole subject in the form of a +date: <pre> re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ - data> 25jun04\=ps - 0: 25jun04 - 1: jun - data> 25dec3\=ps + data> 25dec3\=ph Partial match: 23dec3 - data> 3ju\=ps + data> 3ju\=ph Partial match: 3ju - data> 3juj\=ps - No match - data> j\=ps + data> 3juj\=ph No match </pre> -The first data string is matched completely, so <b>pcre2test</b> shows the -matched substrings. 
The remaining four strings do not match the complete -pattern, but the first two are partial matches. Similar output is obtained -if DFA matching is used. -</P> -<P> -If the <b>partial_hard</b> (or <b>ph</b>) modifier is present on a -<b>pcre2test</b> data line, the PCRE2_PARTIAL_HARD option is set for the match. -</P> -<br><a name="SEC6" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a><br> -<P> -When a partial match has been found using a DFA matching function, it is -possible to continue the match by providing additional subject data and calling -the function again with the same compiled regular expression, this time setting -the PCRE2_DFA_RESTART option. You must pass the same working space as before, -because this is where details of the previous partial match are stored. Here is -an example using <b>pcre2test</b>: +This example gives the same results for both hard and soft partial matching +options. Here is an example where there is a difference: <pre> re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ - data> 23ja\=dfa,ps - Partial match: 23ja - data> n05\=dfa,dfa_restart - 0: n05 + data> 25jun04\=ps + 0: 25jun04 + 1: jun + data> 25jun04\=ph + Partial match: 25jun04 </pre> -The first call has "23ja" as the subject, and requests partial matching; the -second call has "n05" as the subject for the continued (restarted) match. -Notice that when the match is complete, only the last part is shown; PCRE2 does -not retain the previously partially-matched string. It is up to the calling -program to do that if it needs to. -</P> -<P> -That means that, for an unanchored pattern, if a continued match fails, it is -not possible to try again at a new starting point. All this facility is capable -of doing is continuing with the previous match attempt. In the previous -example, if the second set of data is "ug23" the result is no match, even -though there would be a match for "aug23" if the entire string were given at -once. 
Depending on the application, this may or may not be what you want. -The only way to allow for starting again at the next character is to retain the -matched part of the subject and try a new complete match. +With PCRE2_PARTIAL_SOFT, the subject is matched completely. For +PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so +there is only a partial match. </P> +<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br> <P> -You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with -PCRE2_DFA_RESTART to continue partial matching over multiple segments. This -facility can be used to pass very long subject strings to the DFA matching -functions. +PCRE was not originally designed with multi-segment matching in mind. However, +over time, features (including partial matching) that make multi-segment +matching possible have been added. The string is searched segment by segment by +calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same +results that would happen if the entire string was available for searching. </P> -<br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br> <P> -Unlike the DFA function, it is not possible to restart the previous match with -a new segment of data when using <b>pcre2_match()</b>. Instead, new data must be -added to the previous subject string, and the entire match re-run, starting -from the point where the partial match occurred. Earlier data can be discarded. +Special logic must be implemented to handle a matched substring that spans a +segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a +partial match at the end of a segment whenever there is the possibility of +changing the match by adding more characters. The PCRE2_NOTBOL option should +also be set for all but the first segment. 
</P> <P> -It is best to use PCRE2_PARTIAL_HARD in this situation, because it does not -treat the end of a segment as the end of the subject when matching \z, \Z, -\b, \B, and $. Consider an unanchored pattern that matches dates: +When a partial match occurs, the next segment must be added to the current +subject and the match re-run, using the <i>startoffset</i> argument of +<b>pcre2_match()</b> to begin at the point where the partial match started. +Multi-segment matching is usually used to search for substrings in the middle +of very long sequences, so the patterns are normally not anchored. For example: <pre> re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ - data> The date is 23ja\=ph + data> ...the date is 23ja\=ph Partial match: 23ja + data> ...the date is 23jan19 and on that day...\=offset=15 + 0: 23jan19 + 1: jan </pre> -At this stage, an application could discard the text preceding "23ja", add on -text from the next segment, and call the matching function again. Unlike the -DFA matching function, the entire matching string must always be available, -and the complete matching process occurs for each call, so more memory and more -processing time is needed. +Note the use of the <b>offset</b> modifier to start the new match where the +partial match was found. </P> -<br><a name="SEC8" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br> <P> -Certain types of pattern may give problems with multi-segment matching, -whichever matching function is used. +In this simple example, the next segment was just added to the one in which the +partial match was found. However, if there are memory constraints, it may be +necessary to discard text that precedes the partial match before adding the +next segment. In cases such as the above, where the pattern does not contain +any lookbehinds, it is sufficient to retain only the partially matched +substring. 
However, if a pattern contains a lookbehind assertion, characters +that precede the start of the partial match may have been inspected during the +matching process. </P> <P> -1. If the pattern contains a test for the beginning of a line, you need to pass -the PCRE2_NOTBOL option when the subject string for any call does start at the -beginning of a line. There is also a PCRE2_NOTEOL option, but in practice when -doing multi-segment matching you should be using PCRE2_PARTIAL_HARD, which -includes the effect of PCRE2_NOTEOL. -</P> -<P> -2. If a pattern contains a lookbehind assertion, characters that precede the -start of the partial match may have been inspected during the matching process. -When using <b>pcre2_match()</b>, sufficient characters must be retained for the -next match attempt. You can ensure that enough characters are retained by doing -the following: +The only lookbehind information that is available is the length of the longest +lookbehind in a pattern. This may not, of course, be at the start of the +pattern, but retaining that many characters before the partial match is +sufficient, if not always strictly necessary. The way to do this is as follows: </P> <P> Before doing any matching, find the length of the longest lookbehind in the @@ -339,13 +296,8 @@ partial match, moving back from the ovector[0] offset in the subject by the number of characters given for the maximum lookbehind gets you to the earliest character that must be retained. In a non-UTF or a 32-bit situation, moving back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters -while moving back through the code units. -</P> -<P> -Characters before the point you have now reached can be discarded, and after -the next segment has been added to what is retained, you should run the next -match with the <b>startoffset</b> argument set so that the match begins at the -same point as before. +while moving back through the code units. 
Characters before the point you have +now reached can be discarded. </P> <P> For example, if the pattern "(?<=123)abc" is partially matched against the @@ -353,62 +305,67 @@ string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum lookbehind count is 3, so all characters before offset 2 can be discarded. The value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b> displays a partial match, it indicates the lookbehind characters with '<' -characters if the "allusedtext" modifier is set: +characters if the <b>allusedtext</b> modifier is set: <pre> re> "(?<=123)abc" data> xx123ab\=ph,allusedtext Partial match: 123ab <<< </pre> -However, the "allusedtext" modifier is not available for JIT matching, because -JIT matching does not maintain the first and last consulted characters. -</P> -<P> -3. Matching a subject string that is split into multiple segments may not -always produce exactly the same result as matching over one single long string -when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word -Boundaries" above describes an issue that arises if the pattern ends with \b -or \B. Another kind of difference may occur when there are multiple matching -possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given -only when there are no completed matches. This means that as soon as the -shortest match has been found, continuation to a new subject segment is no -longer possible. Consider this <b>pcre2test</b> example: +Note that the \fPallusedtext\fP modifier is not available for JIT matching, +because JIT matching does not maintain the first and last consulted characters. +</P> +<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br> +<P> +The DFA function moves along the subject string character by character, without +backtracking, searching for all possible matches simultaneously. 
If the end of +the subject is reached before the end of the pattern, there is the possibility +of a partial match. +</P> +<P> +When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there +have been no complete matches. Otherwise, the complete matches are returned. +If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any +complete matches. The portion of the string that was matched when the longest +partial match was found is set as the first matching string. +</P> +<P> +Because the DFA function always searches for all possible matches, and there is +no difference between greedy and ungreedy repetition, its behaviour is +different from the <b>pcre2_match()</b>. Consider the string "dog" matched +against this ungreedy pattern: <pre> - re> /dog(sbody)?/ - data> dogsb\=ps - 0: dog - data> do\=ps,dfa - Partial match: do - data> gsb\=ps,dfa,dfa_restart - 0: g - data> dogsbody\=dfa - 0: dogsbody - 1: dog + /dog(sbody)??/ </pre> -The first data line passes the string "dogsb" to a standard matching function, -setting the PCRE2_PARTIAL_SOFT option. Although the string is a partial match -for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, because the shorter -string "dog" is a complete match. Similarly, when the subject is presented to -a DFA matching function in several parts ("do" and "gsb" being the first two) -the match stops when "dog" has been found, and it is not possible to continue. -On the other hand, if "dogsbody" is presented as a single string, a DFA -matching function finds both matches. -</P> -<P> -Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching -multi-segment data. The example above then behaves differently: +Whereas the standard function stops as soon as it finds the complete match for +"dog", the DFA function also finds the partial match for "dogsbody", and so +returns that when PCRE2_PARTIAL_HARD is set. 
+</P> +<br><a name="SEC6" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a><br> +<P> +When a partial match has been found using the DFA matching function, it is +possible to continue the match by providing additional subject data and calling +the function again with the same compiled regular expression, this time setting +the PCRE2_DFA_RESTART option. You must pass the same working space as before, +because this is where details of the previous partial match are stored. You can +set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART +to continue partial matching over multiple segments. Here is an example using +<b>pcre2test</b>: <pre> - re> /dog(sbody)?/ - data> dogsb\=ph - Partial match: dogsb - data> do\=ps,dfa - Partial match: do - data> gsb\=ph,dfa,dfa_restart - Partial match: gsb + re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ + data> 23ja\=dfa,ps + Partial match: 23ja + data> n05\=dfa,dfa_restart + 0: n05 </pre> -4. Patterns that contain alternatives at the top level which do not all start -with the same pattern item may not work as expected when PCRE2_DFA_RESTART is -used. For example, consider this pattern: +The first call has "23ja" as the subject, and requests partial matching; the +second call has "n05" as the subject for the continued (restarted) match. +Notice that when the match is complete, only the last part is shown; PCRE2 does +not retain the previously partially-matched string. It is up to the calling +program to do that if it needs to. This means that, for an unanchored pattern, +if a continued match fails, it is not possible to try again at a new starting +point. All this facility is capable of doing is continuing with the previous +match attempt. For example, consider this pattern: <pre> 1234|3789 </pre> @@ -417,30 +374,18 @@ alternative is found at offset 3. 
There is no partial match for the second alternative, because such a match does not start at the same point in the subject string. Attempting to continue with the string "7890" does not yield a match because only those alternatives that match at one point in the subject -are remembered. The problem arises because the start of the second alternative -matches within the first alternative. There is no problem with anchored -patterns or patterns such as: -<pre> - 1234|ABCD -</pre> -where no string can be a partial match for both alternatives. This is not a -problem if a standard matching function is used, because the entire match has -to be rerun each time: -<pre> - re> /1234|3789/ - data> ABC123\=ph - Partial match: 123 - data> 1237890 - 0: 3789 -</pre> -Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running -the entire match can also be used with the DFA matching function. Another -possibility is to work with two buffers. If a partial match at offset <i>n</i> -in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is used on -the second buffer, you can then try a new match starting at offset <i>n+1</i> in -the first buffer. +are remembered. Depending on the application, this may or may not be what you +want. +</P> +<P> +If you do want to allow for starting again at the next character, one way of +doing it is to retain the matched part of the segment and try a new complete +match, as described for <b>pcre2_match()</b> above. Another possibility is to +work with two buffers. If a partial match at offset <i>n</i> in the first buffer +is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, +you can then try a new match starting at offset <i>n+1</i> in the first buffer. </P> -<br><a name="SEC9" href="#TOC1">AUTHOR</a><br> +<br><a name="SEC7" href="#TOC1">AUTHOR</a><br> <P> Philip Hazel <br> @@ -449,9 +394,9 @@ University Computing Service Cambridge, England. 
<br> </P> -<br><a name="SEC10" href="#TOC1">REVISION</a><br> +<br><a name="SEC8" href="#TOC1">REVISION</a><br> <P> -Last updated: 22 July 2019 +Last updated: 07 August 2019 <br> Copyright © 1997-2019 University of Cambridge. <br> |
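
Illustrative note on the multi-segment text in the diff above: when a partial match is returned, the documentation says to move back from ovector[0] by the maximum lookbehind, counting characters rather than code units in UTF-8. The following is a minimal sketch of that step only, assuming the subject is valid UTF-8 and that the maximum lookbehind count has already been obtained from the compiled pattern. The function name `retain_from_utf8` and its parameters are hypothetical and are not part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a PCRE2 function).
 *
 * subject        - start of the current subject buffer (assumed valid UTF-8)
 * match_start    - code unit offset where the partial match starts,
 *                  that is, ovector[0]
 * max_lookbehind - longest lookbehind in the pattern, in characters
 *
 * Returns the offset of the earliest code unit that must be retained.
 * Everything before the returned offset can be discarded.
 */
size_t retain_from_utf8(const unsigned char *subject, size_t match_start,
                        size_t max_lookbehind)
{
    assert(subject != NULL);

    size_t pos = match_start;
    while (max_lookbehind > 0 && pos > 0) {
        /* Step back one code unit, then skip over any UTF-8
           continuation bytes (bit pattern 10xxxxxx) so that pos
           ends up on the first byte of the character. */
        pos--;
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80) {
            pos--;
        }
        max_lookbehind--;
    }
    return pos;
}
```

For the documentation's own example, "xx123ab" with a partial match at offset 5 and a maximum lookbehind of 3, this returns 2. After discarding the first two code units, the partial match starts at offset 3 in the retained text, which agrees with the startoffset value the text gives for the next match.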