Back off failed attempt to handle nested lookbehinds for estimating how much of a partial match to retain for multi-segment matching. Document the current difficulty if the whole first segment cannot be retained.

git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@1163 6239d852-aaf2-0410-a92c-79f79f948069
author: ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2019-09-04 18:14:54 +0000
committer: ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2019-09-04 18:14:54 +0000
commit: d291ead6117132882f9f34207818b87dd64de66f (patch)
tree: 8a8453983730199fb9655aa015f28554a35da5de /doc/html/pcre2partial.html
parent: 57d867b34f1dceb380fb4b5a54cc7744ff4968e7 (diff)
download: pcre2-d291ead6117132882f9f34207818b87dd64de66f.tar.gz
1 files changed, 59 insertions, 56 deletions
diff --git a/doc/html/pcre2partial.html b/doc/html/pcre2partial.html
index e0f37ea..1a88c1d 100644
--- a/doc/html/pcre2partial.html
+++ b/doc/html/pcre2partial.html
@@ -49,7 +49,7 @@ complete match, though the details differ between the two types of matching
 function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
 </P>
 <P>
-If you want to use partial matching with just-in-time optimized code, as well 
+If you want to use partial matching with just-in-time optimized code, as well
 as setting a partial match option for the matching function, you must also call
 <b>pcre2_jit_compile()</b> with one or both of these options:
 <pre>
@@ -101,7 +101,7 @@ matched string.
 </P>
 <P>
 (2) The pattern contains one or more lookbehind assertions. This condition
-exists in case there is a lookbehind that inspects characters before the start 
+exists in case there is a lookbehind that inspects characters before the start
 of the match.
 </P>
 <P>
@@ -171,7 +171,7 @@ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
 partial match is found, without continuing to search for possible complete
 matches. This option is "hard" because it prefers an earlier partial match over
 a later complete match. For this reason, the assumption is made that the end of
-the supplied subject string is not the true end of the available data, which is 
+the supplied subject string is not the true end of the available data, which is
 why \z, \Z, \b, \B, and $ always give a partial match.
 </P>
 <P>
@@ -226,7 +226,7 @@ date:
   data&#62; 3juj\=ph
   No match
 </pre>
-This example gives the same results for both hard and soft partial matching 
+This example gives the same results for both hard and soft partial matching
 options. Here is an example where there is a difference:
 <pre>
     re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -234,7 +234,7 @@ options. Here is an example where there is a difference:
    0: 25jun04
    1: jun
   data&#62; 25jun04\=ph
-  Partial match: 25jun04 
+  Partial match: 25jun04
 </pre>
 With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
 PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
@@ -244,9 +244,12 @@ there is only a partial match.
 <P>
 PCRE was not originally designed with multi-segment matching in mind. However,
 over time, features (including partial matching) that make multi-segment
-matching possible have been added. The string is searched segment by segment by
-calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same 
-results that would happen if the entire string was available for searching.
+matching possible have been added. A very long string can be searched segment
+by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
+the same results that would happen if the entire string was available for
+searching all the time. Normally, the strings that are being sought are much
+shorter than each individual segment, and are in the middle of very long
+strings, so the pattern is normally not anchored.
 </P>
 <P>
 Special logic must be implemented to handle a matched substring that spans a
@@ -256,11 +259,10 @@ changing the match by adding more characters. The PCRE2_NOTBOL option should
 also be set for all but the first segment.
 </P>
 <P>
-When a partial match occurs, the next segment must be added to the current 
-subject and the match re-run, using the <i>startoffset</i> argument of 
+When a partial match occurs, the next segment must be added to the current
+subject and the match re-run, using the <i>startoffset</i> argument of
 <b>pcre2_match()</b> to begin at the point where the partial match started.
-Multi-segment matching is usually used to search for substrings in the middle
-of very long sequences, so the patterns are normally not anchored. For example:
+For example:
 <pre>
     re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
   data&#62; ...the date is 23ja\=ph
@@ -269,51 +271,52 @@ of very long sequences, so the patterns are normally not anchored. For example:
    0: 23jan19
    1: jan
 </pre>
-Note the use of the <b>offset</b> modifier to start the new match where the 
-partial match was found.
-</P>
-<P>
-In this simple example, the next segment was just added to the one in which the 
-partial match was found. However, if there are memory constraints, it may be 
-necessary to discard text that precedes the partial match before adding the 
-next segment. In cases such as the above, where the pattern does not contain
-any lookbehinds, it is sufficient to retain only the partially matched
-substring. However, if a pattern contains a lookbehind assertion, characters
+Note the use of the <b>offset</b> modifier to start the new match where the
+partial match was found. In this example, the next segment was added to the one
+in which the partial match was found. This is the most straightforward
+approach, typically using a memory buffer that is twice the size of each
+segment. After a partial match, the first half of the buffer is discarded, the 
+second half is moved to the start of the buffer, and a new segment is added 
+before repeating the match as in the example above. After a no match, the 
+entire buffer can be discarded.
+</P>
+<P>
+If there are memory constraints, you may want to discard text that precedes a
+partial match before adding the next segment. Unfortunately, this is not at
+present straightforward. In cases such as the above, where the pattern does not
+contain any lookbehinds, it is sufficient to retain only the partially matched
+substring. However, if the pattern contains a lookbehind assertion, characters
 that precede the start of the partial match may have been inspected during the
-matching process.
-</P>
-<P>
-The only lookbehind information that is available is the length of the longest
-lookbehind in a pattern. This may not, of course, be at the start of the
-pattern, but retaining that many characters before the partial match is
-sufficient, if not always strictly necessary. The way to do this is as follows:
-</P>
-<P>
-Before doing any matching, find the length of the longest lookbehind in the
-pattern by calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_MAXLOOKBEHIND
-option. Note that the resulting count is in characters, not code units. After a
-partial match, moving back from the ovector[0] offset in the subject by the
-number of characters given for the maximum lookbehind gets you to the earliest
-character that must be retained. In a non-UTF or a 32-bit situation, moving
-back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
-while moving back through the code units. Characters before the point you have
-now reached can be discarded.
-</P>
-<P>
-For example, if the pattern "(?&#60;=123)abc" is partially matched against the
-string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
-lookbehind count is 3, so all characters before offset 2 can be discarded. The
-value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
-displays a partial match, it indicates the lookbehind characters with '&#60;'
-characters if the <b>allusedtext</b> modifier is set:
+matching process. When <b>pcre2test</b> displays a partial match, it indicates
+these characters with '&#60;' if the <b>allusedtext</b> modifier is set:
 <pre>
     re&#62; "(?&#60;=123)abc"
   data&#62; xx123ab\=ph,allusedtext
   Partial match: 123ab
                  &#60;&#60;&#60;
 </pre>
-Note that the \fPallusedtext\fP modifier is not available for JIT matching,
-because JIT matching does not maintain the first and last consulted characters.
+However, the \fPallusedtext\fP modifier is not available for JIT matching,
+because JIT matching does not record the first (or last) consulted characters.
+For this reason, this information is not available via the API. It is therefore
+not possible in general to obtain the exact number of characters that must be
+retained in order to get the right match result. If you cannot retain the
+entire segment, you must find some heuristic way of choosing.
+</P>
+<P>
+If you know the approximate length of the matching substrings, you can use that
+to decide how much text to retain. The only lookbehind information that is
+currently available via the API is the length of the longest individual
+lookbehind in a pattern, but this can be misleading if there are nested
+lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
+PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
+units) that any individual lookbehind moves back when it is processed. A
+pattern such as "(?&#60;=(?&#60;!b)a)" has a maximum lookbehind value of one, but
+inspects two characters before its starting point.
+</P>
+<P>
+In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
+UTF-8 or UTF-16 you have to count characters while moving back through the code
+units.
 </P>
 <br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
 <P>
@@ -379,11 +382,11 @@ want.
 </P>
 <P>
 If you do want to allow for starting again at the next character, one way of
-doing it is to retain the matched part of the segment and try a new complete
-match, as described for <b>pcre2_match()</b> above. Another possibility is to
-work with two buffers. If a partial match at offset <i>n</i> in the first buffer
-is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
-you can then try a new match starting at offset <i>n+1</i> in the first buffer.
+doing it is to retain some or all of the segment and try a new complete match,
+as described for <b>pcre2_match()</b> above. Another possibility is to work with
+two buffers. If a partial match at offset <i>n</i> in the first buffer is
+followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
+can then try a new match starting at offset <i>n+1</i> in the first buffer.
 </P>
 <br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
 <P>
@@ -396,7 +399,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC8" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 07 August 2019
+Last updated: 04 September 2019
 <br>
 Copyright &copy; 1997-2019 University of Cambridge.
 <br>
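
Note on the revised text: the new paragraph says that in UTF-8 or UTF-16 "you have to count characters while moving back through the code units". A minimal sketch of that step for UTF-8 follows, assuming the subject is valid UTF-8 held in a byte buffer. The function name `utf8_move_back` and its parameters are hypothetical and are not part of the PCRE2 API. As the revised documentation warns, a count taken from PCRE2_INFO_MAXLOOKBEHIND may be too small when lookbehinds are nested, so the offset this returns is only a heuristic lower bound on what to retain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): starting at code-unit offset
   `pos` in a valid UTF-8 buffer, step back by up to `nchars` characters and
   return the resulting code-unit offset. If fewer than `nchars` characters
   precede `pos`, the result is 0. */
static size_t utf8_move_back(const uint8_t *buf, size_t pos, uint32_t nchars)
{
    assert(buf != NULL);
    while (nchars > 0 && pos > 0) {
        /* Step onto the last code unit of the previous character. */
        pos--;
        /* Continuation bytes have the form 10xxxxxx; skip back over them
           until the lead byte of the character is reached. */
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;
        nchars--;
    }
    return pos;
}
```

In a non-UTF or 32-bit subject the same step is a plain subtraction, clamped at zero. For UTF-16 the analogous loop would skip back over a low surrogate (0xDC00 to 0xDFFF) to reach its high surrogate.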