author | ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> | 2019-09-04 18:14:54 +0000 |
---|---|---|
committer | ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> | 2019-09-04 18:14:54 +0000 |
commit | d291ead6117132882f9f34207818b87dd64de66f (patch) | |
tree | 8a8453983730199fb9655aa015f28554a35da5de /doc/html/pcre2partial.html | |
parent | 57d867b34f1dceb380fb4b5a54cc7744ff4968e7 (diff) | |
download | pcre2-d291ead6117132882f9f34207818b87dd64de66f.tar.gz |
Back off failed attempt to handle nested lookbehinds for estimating how much of
a partial match to retain for multi-segment matching. Document the current
difficulty if the whole first segment cannot be retained.
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@1163 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'doc/html/pcre2partial.html')
-rw-r--r-- | doc/html/pcre2partial.html | 115 |
1 files changed, 59 insertions, 56 deletions
diff --git a/doc/html/pcre2partial.html b/doc/html/pcre2partial.html
index e0f37ea..1a88c1d 100644
--- a/doc/html/pcre2partial.html
+++ b/doc/html/pcre2partial.html
@@ -49,7 +49,7 @@ complete match, though the details differ between the two types of matching
 function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
 </P>
 <P>
-If you want to use partial matching with just-in-time optimized code, as well
+If you want to use partial matching with just-in-time optimized code, as well
 as setting a partial match option for the matching function, you must also call
 <b>pcre2_jit_compile()</b> with one or both of these options:
 <pre>
@@ -101,7 +101,7 @@ matched string.
 </P>
 <P>
 (2) The pattern contains one or more lookbehind assertions. This condition
-exists in case there is a lookbehind that inspects characters before the start
+exists in case there is a lookbehind that inspects characters before the start
 of the match.
 </P>
 <P>
@@ -171,7 +171,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
 partial match is found, without continuing to search for possible complete
 matches. This option is "hard" because it prefers an earlier partial match over
 a later complete match. For this reason, the assumption is made that the end of
-the supplied subject string is not the true end of the available data, which is
+the supplied subject string is not the true end of the available data, which is
 why \z, \Z, \b, \B, and $ always give a partial match.
 </P>
 <P>
@@ -226,7 +226,7 @@ date:
 data> 3juj\=ph
 No match
 </pre>
-This example gives the same results for both hard and soft partial matching
+This example gives the same results for both hard and soft partial matching
 options. Here is an example where there is a difference:
 <pre>
 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -234,7 +234,7 @@ options. Here is an example where there is a difference:
 0: 25jun04
 1: jun
 data> 25jun04\=ph
- Partial match: 25jun04
+ Partial match: 25jun04
 </pre>
 With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
 PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
@@ -244,9 +244,12 @@ there is only a partial match.
 <P>
 PCRE was not originally designed with multi-segment matching in mind. However,
 over time, features (including partial matching) that make multi-segment
-matching possible have been added. The string is searched segment by segment by
-calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same
-results that would happen if the entire string was available for searching.
+matching possible have been added. A very long string can be searched segment
+by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
+the same results that would happen if the entire string was available for
+searching all the time. Normally, the strings that are being sought are much
+shorter than each individual segment, and are in the middle of very long
+strings, so the pattern is normally not anchored.
 </P>
 <P>
 Special logic must be implemented to handle a matched substring that spans a
@@ -256,11 +259,10 @@ changing the match by adding more characters. The PCRE2_NOTBOL option should
 also be set for all but the first segment.
 </P>
 <P>
-When a partial match occurs, the next segment must be added to the current
-subject and the match re-run, using the <i>startoffset</i> argument of
+When a partial match occurs, the next segment must be added to the current
+subject and the match re-run, using the <i>startoffset</i> argument of
 <b>pcre2_match()</b> to begin at the point where the partial match started.
-Multi-segment matching is usually used to search for substrings in the middle
-of very long sequences, so the patterns are normally not anchored. For example:
+For example:
 <pre>
 re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
 data> ...the date is 23ja\=ph
@@ -269,51 +271,52 @@ of very long sequences, so the patterns are normally not anchored. For example:
 0: 23jan19
 1: jan
 </pre>
-Note the use of the <b>offset</b> modifier to start the new match where the
-partial match was found.
-</P>
-<P>
-In this simple example, the next segment was just added to the one in which the
-partial match was found. However, if there are memory constraints, it may be
-necessary to discard text that precedes the partial match before adding the
-next segment. In cases such as the above, where the pattern does not contain
-any lookbehinds, it is sufficient to retain only the partially matched
-substring. However, if a pattern contains a lookbehind assertion, characters
+Note the use of the <b>offset</b> modifier to start the new match where the
+partial match was found. In this example, the next segment was added to the one
+in which the partial match was found. This is the most straightforward
+approach, typically using a memory buffer that is twice the size of each
+segment. After a partial match, the first half of the buffer is discarded, the
+second half is moved to the start of the buffer, and a new segment is added
+before repeating the match as in the example above. After a no match, the
+entire buffer can be discarded.
+</P>
+<P>
+If there are memory constraints, you may want to discard text that precedes a
+partial match before adding the next segment. Unfortunately, this is not at
+present straightforward. In cases such as the above, where the pattern does not
+contain any lookbehinds, it is sufficient to retain only the partially matched
+substring. However, if the pattern contains a lookbehind assertion, characters
 that precede the start of the partial match may have been inspected during the
-matching process.
-</P>
-<P>
-The only lookbehind information that is available is the length of the longest
-lookbehind in a pattern. This may not, of course, be at the start of the
-pattern, but retaining that many characters before the partial match is
-sufficient, if not always strictly necessary. The way to do this is as follows:
-</P>
-<P>
-Before doing any matching, find the length of the longest lookbehind in the
-pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
-option. Note that the resulting count is in characters, not code units. After a
-partial match, moving back from the ovector[0] offset in the subject by the
-number of characters given for the maximum lookbehind gets you to the earliest
-character that must be retained. In a non-UTF or a 32-bit situation, moving
-back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
-while moving back through the code units. Characters before the point you have
-now reached can be discarded.
-</P>
-<P>
-For example, if the pattern "(?<=123)abc" is partially matched against the
-string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
-lookbehind count is 3, so all characters before offset 2 can be discarded. The
-value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
-displays a partial match, it indicates the lookbehind characters with '<'
-characters if the <b>allusedtext</b> modifier is set:
+matching process. When <b>pcre2test</b> displays a partial match, it indicates
+these characters with '<' if the <b>allusedtext</b> modifier is set:
 <pre>
 re> "(?<=123)abc"
 data> xx123ab\=ph,allusedtext
 Partial match: 123ab
 <<<
 </pre>
-Note that the \fPallusedtext\fP modifier is not available for JIT matching,
-because JIT matching does not maintain the first and last consulted characters.
+However, the \fPallusedtext\fP modifier is not available for JIT matching,
+because JIT matching does not record the first (or last) consulted characters.
+For this reason, this information is not available via the API. It is therefore
+not possible in general to obtain the exact number of characters that must be
+retained in order to get the right match result. If you cannot retain the
+entire segment, you must find some heuristic way of choosing.
+</P>
+<P>
+If you know the approximate length of the matching substrings, you can use that
+to decide how much text to retain. The only lookbehind information that is
+currently available via the API is the length of the longest individual
+lookbehind in a pattern, but this can be misleading if there are nested
+lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
+PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
+units) that any individual lookbehind moves back when it is processed. A
+pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
+inspects two characters before its starting point.
+</P>
+<P>
+In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
+UTF-8 or UTF-16 you have to count characters while moving back through the code
+units.
 </P>
 <br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
 <P>
@@ -379,11 +382,11 @@ want.
 </P>
 <P>
 If you do want to allow for starting again at the next character, one way of
-doing it is to retain the matched part of the segment and try a new complete
-match, as described for <b>pcre2_match()</b> above. Another possibility is to
-work with two buffers. If a partial match at offset <i>n</i> in the first buffer
-is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
-you can then try a new match starting at offset <i>n+1</i> in the first buffer.
+doing it is to retain some or all of the segment and try a new complete match,
+as described for <b>pcre2_match()</b> above. Another possibility is to work with
+two buffers. If a partial match at offset <i>n</i> in the first buffer is
+followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
+can then try a new match starting at offset <i>n+1</i> in the first buffer.
 </P>
 <br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
 <P>
@@ -396,7 +399,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC8" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 07 August 2019
+Last updated: 04 September 2019
 <br>
 Copyright © 1997-2019 University of Cambridge.
 <br>
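The revised documentation notes that in UTF-8 or UTF-16, moving back a given number of characters means counting characters while stepping back through the code units. The following is a minimal sketch of that counting step for UTF-8 only, written in Python for brevity rather than against the PCRE2 C API. The function name `move_back_chars` and its parameters are hypothetical, and the sketch assumes the subject is valid UTF-8 and that `offset` falls on a character boundary. As the commit message warns, a count taken from PCRE2_INFO_MAXLOOKBEHIND may be too small when lookbehinds are nested, so the result is only a heuristic lower bound on what to retain.

```python
def move_back_chars(subject: bytes, offset: int, nchars: int) -> int:
    """Step back nchars UTF-8 characters from a byte offset.

    Hypothetical helper, not part of PCRE2. Assumes `subject` is valid
    UTF-8 and `offset` is on a character boundary. The result is clamped
    at the start of the subject.
    """
    pos = offset
    for _ in range(nchars):
        if pos == 0:
            break  # reached the start of the retained text
        pos -= 1
        # Continuation bytes have the bit pattern 10xxxxxx; keep moving
        # back until the lead byte of the character is reached.
        while pos > 0 and (subject[pos] & 0xC0) == 0x80:
            pos -= 1
    return pos


# ASCII case from the removed documentation text: a partial match starting
# at offset 5 in "xx123ab" with a lookbehind count of 3.
ascii_subject = "xx123ab".encode("utf-8")
ascii_result = move_back_chars(ascii_subject, 5, 3)  # 2

# Multi-byte case: "é" is 2 bytes and "€" is 3 bytes, so "a" is at offset 8.
utf8_subject = "é€123ab".encode("utf-8")
utf8_result = move_back_chars(utf8_subject, 8, 4)  # 2, the lead byte of "€"
```

In the ASCII case the loop reduces to a plain subtraction, which is the non-UTF behaviour the documentation describes. Bytes before the returned offset are the candidates for discarding, subject to the nested-lookbehind caveat above.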