diff options
author | ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> | 2014-10-14 16:23:57 +0000 |
---|---|---|
committer | ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> | 2014-10-14 16:23:57 +0000 |
commit | 48a3f40008b68ce52439689fcd56a937543a9e1d (patch) | |
tree | efcc53992b1762c495ecf9bda25648c6715d15fc /doc/pcre2partial.3 | |
parent | 5394b7a2fdd59cecbb5ff09de5421f2167d23970 (diff) | |
download | pcre2-48a3f40008b68ce52439689fcd56a937543a9e1d.tar.gz |
Partial documentation and partial code tweaks.
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@110 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'doc/pcre2partial.3')
-rw-r--r-- | doc/pcre2partial.3 | 433 |
1 files changed, 433 insertions, 0 deletions
diff --git a/doc/pcre2partial.3 b/doc/pcre2partial.3 new file mode 100644 index 0000000..faad43c --- /dev/null +++ b/doc/pcre2partial.3 @@ -0,0 +1,433 @@ +.TH PCRE2PARTIAL 3 "14 October 2014" "PCRE2 10.00" +.SH NAME +PCRE2 - Perl-compatible regular expressions +.SH "PARTIAL MATCHING IN PCRE2" +.rs +.sp +In normal use of PCRE2, if the subject string that is passed to a matching +function matches as far as it goes, but is too short to match the entire +pattern, PCRE2_ERROR_NOMATCH is returned. There are circumstances where it +might be helpful to distinguish this case from other cases in which there is no +match. +.P +Consider, for example, an application where a human is required to type in data +for a field with specific formatting requirements. An example might be a date +in the form \fIddmmmyy\fP, defined by this pattern: +.sp + ^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$ +.sp +If the application sees the user's keystrokes one by one, and can check that +what has been typed so far is potentially valid, it is able to raise an error +as soon as a mistake is made, by beeping and not reflecting the character that +has been typed, for example. This immediate feedback is likely to be a better +user interface than a check that is delayed until the entire string has been +entered. Partial matching can also be useful when the subject string is very +long and is not all available at once. +.P +PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and +PCRE2_PARTIAL_HARD options, which can be set when calling a matching function. +The difference between the two options is whether or not a partial match is +preferred to an alternative complete match, though the details differ between +the two types of matching function. If both options are set, PCRE2_PARTIAL_HARD +takes precedence. +.P +If you want to use partial matching with just-in-time optimized code, you must +call \fBpcre2_jit_compile()\fP with one or both of these options: +.sp + PCRE2_JIT_PARTIAL_SOFT + PCRE2_JIT_PARTIAL_HARD +.sp +PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial +matches on the same pattern. If the appropriate JIT mode has not been compiled, +interpretive matching code is used. +.P +Setting a partial matching option disables two of PCRE2's standard +optimizations. PCRE2 remembers the last literal code unit in a pattern, and +abandons matching immediately if it is not present in the subject string. This +optimization cannot be used for a subject string that might match only +partially. PCRE2 also knows the minimum length of a matching string, and does +not bother to run the matching function on shorter strings. This optimization +is also disabled for partial matching. +. +. +.SH "PARTIAL MATCHING USING pcre2_match()" +.rs +.sp +A partial match occurs during a call to \fBpcre2_match()\fP when the end of the +subject string is reached successfully, but matching cannot continue because +more characters are needed. However, at least one character in the subject must +have been inspected. This character need not form part of the final matched +string; lookbehind assertions and the \eK escape sequence provide ways of +inspecting characters before the start of a matched string. The requirement for +inspecting at least one character exists because an empty string can always be +matched; without such a restriction there would always be a partial match of an +empty string at the end of the subject. +.P +When a partial match is returned, the first two elements in the ovector point +to the portion of the subject that was matched. The appearance of \eK in the +pattern has no effect for a partial match. Consider this pattern: +.sp + /abc\eK123/ +.sp +If it is matched against "456abc123xyz" the result is a complete match, and the +ovector defines the matched string as "123", because \eK resets the "start of +match" point. However, if a partial match is requested and the subject string +is "456abc12", a partial match is found for the string "abc12", because all +these characters are needed for a subsequent re-match with additional +characters. +.P +What happens when a partial match is identified depends on which of the two +partial matching options are set. +. +. +.SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()" +.rs +.sp +If PCRE2_PARTIAL_SOFT is set when \fBpcre2_match()\fP identifies a partial +match, the partial match is remembered, but matching continues as normal, and +other alternatives in the pattern are tried. If no complete match can be found, +PCRE2_ERROR_PARTIAL is returned instead of PCRE2_ERROR_NOMATCH. +.P +This option is "soft" because it prefers a complete match over a partial match. +All the various matching items in a pattern behave as if the subject string is +potentially complete. For example, \ez, \eZ, and $ match at the end of the +subject, as normal, and for \eb and \eB the end of the subject is treated as a +non-alphanumeric. +.P +If there is more than one partial match, the first one that was found provides +the data that is returned. Consider this pattern: +.sp + /123\ew+X|dogY/ +.sp +If this is matched against the subject string "abc123dog", both +alternatives fail to match, but the end of the subject is reached during +matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, +identifying "123dog" as the first partial match that was found. (In this +example, there are two partial matches, because "dog" on its own partially +matches the second alternative.) +. +. +.SS "PCRE2_PARTIAL_HARD WITH pcre2_match()" +.rs +.sp +If PCRE2_PARTIAL_HARD is set for \fBpcre2_match()\fP, PCRE2_ERROR_PARTIAL is +returned as soon as a partial match is found, without continuing to search for +possible complete matches. This option is "hard" because it prefers an earlier +partial match over a later complete match. For this reason, the assumption is +made that the end of the supplied subject string may not be the true end of the +available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end +of the subject, the result is PCRE2_ERROR_PARTIAL, provided that at least one +character in the subject has been inspected. +. +. +.SS "Comparing hard and soft partial matching" +.rs +.sp +The difference between the two partial matching options can be illustrated by a +pattern such as: +.sp + /dog(sbody)?/ +.sp +This matches either "dog" or "dogsbody", greedily (that is, it prefers the +longer string if possible). If it is matched against the string "dog" with +PCRE2_PARTIAL_SOFT, it yields a complete match for "dog". However, if +PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PARTIAL. On the other +hand, if the pattern is made ungreedy the result is different: +.sp + /dog(sbody)??/ +.sp +In this case the result is always a complete match because that is found first, +and matching never continues after finding a complete match. It might be easier +to follow this explanation by thinking of the two patterns like this: +.sp + /dog(sbody)?/ is the same as /dogsbody|dog/ + /dog(sbody)??/ is the same as /dog|dogsbody/ +.sp +The second pattern will never match "dogsbody", because it will always find the +shorter match first. +. +. +.SH "PARTIAL MATCHING USING pcre2_dfa_match()" +.rs +.sp +The DFA functions move along the subject string character by character, without +backtracking, searching for all possible matches simultaneously. If the end of +the subject is reached before the end of the pattern, there is the possibility +of a partial match, again provided that at least one character has been +inspected. +.P +When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there +have been no complete matches. Otherwise, the complete matches are returned. +However, if PCRE2_PARTIAL_HARD is set, a partial match takes precedence over +any complete matches. The portion of the string that was matched when the +longest partial match was found is set as the first matching string. +.P +Because the DFA functions always search for all possible matches, and there is +no difference between greedy and ungreedy repetition, their behaviour is +different from the standard functions when PCRE2_PARTIAL_HARD is set. Consider +the string "dog" matched against the ungreedy pattern shown above: +.sp + /dog(sbody)??/ +.sp +Whereas the standard functions stop as soon as they find the complete match for +"dog", the DFA functions also find the partial match for "dogsbody", and so +return that when PCRE2_PARTIAL_HARD is set. +. +. +.SH "PARTIAL MATCHING AND WORD BOUNDARIES" +.rs +.sp +If a pattern ends with one of sequences \eb or \eB, which test for word +boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive +results. Consider this pattern: +.sp + /\ebcat\eb/ +.sp +This matches "cat", provided there is a word boundary at either end. If the +subject string is "the cat", the comparison of the final "t" with a following +character cannot take place, so a partial match is found. However, normal +matching carries on, and \eb matches at the end of the subject when the last +character is a letter, so a complete match is found. The result, therefore, is +\fInot\fP PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case does yield +PCRE2_ERROR_PARTIAL, because then the partial match takes precedence. +. +. +.SH "EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST" +.rs +.sp +If the \fBpartial_soft\fP (or \fBps\fP) modifier is present on a +\fBpcre2test\fP data line, the PCRE2_PARTIAL_SOFT option is used for the match. +Here is a run of \fBpcre2test\fP that uses the date example quoted above: +.sp + re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/ + data> 25jun04\e=ps + 0: 25jun04 + 1: jun + data> 25dec3\e=ps + Partial match: 23dec3 + data> 3ju\e=ps + Partial match: 3ju + data> 3juj\e=ps + No match + data> j\e=ps + No match +.sp +The first data string is matched completely, so \fBpcre2test\fP shows the +matched substrings. The remaining four strings do not match the complete +pattern, but the first two are partial matches. Similar output is obtained +if DFA matching is used. +.P +If the \fBpartial_hard\fP (or \fBph\fP) modifier is present on a +\fBpcre2test\fP data line, the PCRE2_PARTIAL_HARD option is set for the match. +. +. +.SH "MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()" +.rs +.sp +When a partial match has been found using a DFA matching function, it is +possible to continue the match by providing additional subject data and calling +the function again with the same compiled regular expression, this time setting +the PCRE2_DFA_RESTART option. You must pass the same working space as before, +because this is where details of the previous partial match are stored. Here is +an example using \fBpcre2test\fP: +.sp + re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/ + data> 23ja\e=dfa,ps + Partial match: 23ja + data> n05\e=dfa,dfa_restart + 0: n05 +.sp +The first call has "23ja" as the subject, and requests partial matching; the +second call has "n05" as the subject for the continued (restarted) match. +Notice that when the match is complete, only the last part is shown; PCRE2 does +not retain the previously partially-matched string. It is up to the calling +program to do that if it needs to. +.P +That means that, for an unanchored pattern, if a continued match fails, it is +not possible to try again at a new starting point. All this facility is capable +of doing is continuing with the previous match attempt. In the previous +example, if the second set of data is "ug23" the result is no match, even +though there would be a match for "aug23" if the entire string were given at +once. Depending on the application, this may or may not be what you want. +The only way to allow for starting again at the next character is to retain the +matched part of the subject and try a new complete match. +.P +You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with +PCRE2_DFA_RESTART to continue partial matching over multiple segments. This +facility can be used to pass very long subject strings to the DFA matching +functions. +. +. +.SH "MULTI-SEGMENT MATCHING WITH pcre2_match()" +.rs +.sp +Unlike the DFA function, it is not possible to restart the previous match with +a new segment of data when using \fBpcre2_match()\fP. Instead, new data must be +added to the previous subject string, and the entire match re-run, starting +from the point where the partial match occurred. Earlier data can be discarded. +.P +It is best to use PCRE2_PARTIAL_HARD in this situation, because it does not +treat the end of a segment as the end of the subject when matching \ez, \eZ, +\eb, \eB, and $. Consider an unanchored pattern that matches dates: +.sp + re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/ + data> The date is 23ja\e=ph + Partial match: 23ja +.sp +At this stage, an application could discard the text preceding "23ja", add on +text from the next segment, and call the matching function again. Unlike the +DFA matching function, the entire matching string must always be available, +and the complete matching process occurs for each call, so more memory and more +processing time is needed. +. +. +.SH "ISSUES WITH MULTI-SEGMENT MATCHING" +.rs +.sp +Certain types of pattern may give problems with multi-segment matching, +whichever matching function is used. +.P +1. If the pattern contains a test for the beginning of a line, you need to pass +the PCRE2_NOTBOL option when the subject string for any call does start at the +beginning of a line. There is also a PCRE2_NOTEOL option, but in practice when +doing multi-segment matching you should be using PCRE2_PARTIAL_HARD, which +includes the effect of PCRE2_NOTEOL. +.P +2. If a pattern contains a lookbehind assertion, characters that precede the +start of the partial match may have been inspected during the matching process. +When using \fBpcre2_match()\fP, sufficient characters must be retained for the +next match attempt. You can ensure that enough characters are retained by doing +the following: +.P +Before doing any matching, find the length of the longest lookbehind in the +pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND +option. Note that the resulting count is in characters, not code units. After a +partial match, moving back from the ovector[0] offset in the subject by the +number of characters given for the maximum lookbehind gets you to the earliest +character that must be retained. In a non-UTF or a 32-bit situation, moving +back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters +while moving back through the code units. +.P +Characters before the point you have now reached can be discarded, and after +the next segment has been added to what is retained, you should run the next +match with the \fBstartoffset\fP argument set so that the match begins at the +same point as before. +.P +For example, if the pattern "(?<=123)abc" is partially matched against the +string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum +lookbehind count is 3, so all characters before offset 2 can be discarded. The +value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP +displays a partial match, it indicates the lookbehind characters with '<' +characters: +.sp + re> "(?<=123)abc" + data> xx123ab\e=ph + Partial match: 123ab + <<< +.P +3. Because a partial match must always contain at least one character, what +might be considered a partial match of an empty string actually gives a "no +match" result. For example: +.sp + re> /c(?<=abc)x/ + data> ab\e=ps + No match +.sp +If the next segment begins "cx", a match should be found, but this will only +happen if characters from the previous segment are retained. For this reason, a +"no match" result should be interpreted as "partial match of an empty string" +when the pattern contains lookbehinds. +.P +4. Matching a subject string that is split into multiple segments may not +always produce exactly the same result as matching over one single long string, +especially when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and +Word Boundaries" above describes an issue that arises if the pattern ends with +\eb or \eB. Another kind of difference may occur when there are multiple +matching possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result +is given only when there are no completed matches. This means that as soon as +the shortest match has been found, continuation to a new subject segment is no +longer possible. Consider this \fBpcre2test\fP example: +.sp + re> /dog(sbody)?/ + data> dogsb\e=ps + 0: dog + data> do\e=ps,dfa + Partial match: do + data> gsb\e=ps,dfa,dfa_restart + 0: g + data> dogsbody\e=dfa + 0: dogsbody + 1: dog +.sp +The first data line passes the string "dogsb" to a standard matching function, +setting the PCRE2_PARTIAL_SOFT option. Although the string is a partial match +for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, because the shorter +string "dog" is a complete match. Similarly, when the subject is presented to +a DFA matching function in several parts ("do" and "gsb" being the first two) +the match stops when "dog" has been found, and it is not possible to continue. +On the other hand, if "dogsbody" is presented as a single string, a DFA +matching function finds both matches. +.P +Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching +multi-segment data. The example above then behaves differently: +.sp + re> /dog(sbody)?/ + data> dogsb\e=ph + Partial match: dogsb + data> do\e=ps,dfa + Partial match: do + data> gsb\e=ph,dfa,dfa_restart + Partial match: gsb +.sp +5. Patterns that contain alternatives at the top level which do not all start +with the same pattern item may not work as expected when PCRE2_DFA_RESTART is +used. For example, consider this pattern: +.sp + 1234|3789 +.sp +If the first part of the subject is "ABC123", a partial match of the first +alternative is found at offset 3. There is no partial match for the second +alternative, because such a match does not start at the same point in the +subject string. Attempting to continue with the string "7890" does not yield a +match because only those alternatives that match at one point in the subject +are remembered. The problem arises because the start of the second alternative +matches within the first alternative. There is no problem with anchored +patterns or patterns such as: +.sp + 1234|ABCD +.sp +where no string can be a partial match for both alternatives. This is not a +problem if a standard matching function is used, because the entire match has +to be rerun each time: +.sp + re> /1234|3789/ + data> ABC123\e=ph + Partial match: 123 + data> 1237890 + 0: 3789 +.sp +Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running +the entire match can also be used with the DFA matching function. Another +possibility is to work with two buffers. If a partial match at offset \fIn\fP +in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is used on +the second buffer, you can then try a new match starting at offset \fIn+1\fP in +the first buffer. +. +. +.SH AUTHOR +.rs +.sp +.nf +Philip Hazel +University Computing Service +Cambridge CB2 3QH, England. +.fi +. +. +.SH REVISION +.rs +.sp +.nf +Last updated: 14 October 2014 +Copyright (c) 1997-2014 University of Cambridge. +.fi |