author ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2019-07-21 16:48:13 +0000
committer ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2019-07-21 16:48:13 +0000
commit ecfcb91119a8c10bd0f4a790ccff1075538760c8 (patch)
tree 9a31747e4fdaa64c485247b1651052733d982370 /doc/pcre2partial.3
parent 361e123dd630523e542dfca697ca84f23dc148f8 (diff)
Update definition of partial match and fix \z and \Z (as documented).
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@1140 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'doc/pcre2partial.3')
-rw-r--r-- doc/pcre2partial.3 85
1 files changed, 36 insertions, 49 deletions
diff --git a/doc/pcre2partial.3 b/doc/pcre2partial.3
index 7b3bf24..7af75e2 100644
--- a/doc/pcre2partial.3
+++ b/doc/pcre2partial.3
@@ -1,4 +1,4 @@
-.TH PCRE2PARTIAL 3 "21 June 2019" "PCRE2 10.34"
+.TH PCRE2PARTIAL 3 "21 July 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2"
@@ -22,7 +22,7 @@ as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered. Partial matching can also be useful when the subject string is very
-long and is not all available at once.
+long and is not all available at once, as discussed below.
.P
PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
@@ -55,13 +55,17 @@ is also disabled for partial matching.
.sp
A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
subject string is reached successfully, but matching cannot continue because
-more characters are needed. However, at least one character in the subject must
-have been inspected. This character need not form part of the final matched
-string; lookbehind assertions and the \eK escape sequence provide ways of
-inspecting characters before the start of a matched string. The requirement for
-inspecting at least one character exists because an empty string can always be
-matched; without such a restriction there would always be a partial match of an
-empty string at the end of the subject.
+more characters are needed, and in addition, either at least one character in
+the subject has been inspected or the pattern contains a lookbehind. An
+inspected character need not form part of the final matched string; lookbehind
+assertions and the \eK escape sequence provide ways of inspecting characters
+before the start of a matched string.
+.P
+The two additional requirements define the cases where adding more characters
+to the existing subject may complete the match. Without these conditions there
+would be a partial match of an empty string at the end of the subject for all
+unanchored patterns (and also for anchored patterns if the subject itself is
+empty).
.P
When a partial match is returned, the first two elements in the ovector point
to the portion of the subject that was matched, but the values in the rest of
@@ -78,7 +82,7 @@ these characters are needed for a subsequent re-match with additional
characters.
.P
What happens when a partial match is identified depends on which of the two
-partial matching options are set.
+partial matching options is set.
.
.
.SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()"
@@ -100,12 +104,12 @@ the data that is returned. Consider this pattern:
.sp
/123\ew+X|dogY/
.sp
-If this is matched against the subject string "abc123dog", both
-alternatives fail to match, but the end of the subject is reached during
-matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
-identifying "123dog" as the first partial match that was found. (In this
-example, there are two partial matches, because "dog" on its own partially
-matches the second alternative.)
+If this is matched against the subject string "abc123dog", both alternatives
+fail to match, but the end of the subject is reached during matching, so
+PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
+"123dog" as the first partial match that was found. (In this example, there are
+two partial matches, because "dog" on its own partially matches the second
+alternative.)
.
.
.SS "PCRE2_PARTIAL_HARD WITH pcre2_match()"
@@ -117,8 +121,8 @@ possible complete matches. This option is "hard" because it prefers an earlier
partial match over a later complete match. For this reason, the assumption is
made that the end of the supplied subject string may not be the true end of the
available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end
-of the subject, the result is PCRE2_ERROR_PARTIAL, provided that at least one
-character in the subject has been inspected.
+of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
+characters have been inspected.
.
.
.SS "Comparing hard and soft partial matching"
@@ -319,40 +323,23 @@ string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
displays a partial match, it indicates the lookbehind characters with '<'
-characters:
+characters if the "allusedtext" modifier is set:
.sp
re> "(?<=123)abc"
- data> xx123ab\e=ph
+ data> xx123ab\e=ph,allusedtext
Partial match: 123ab
<<<
+However, the "allusedtext" modifier is not available for JIT matching, because
+JIT matching does not maintain the first and last consulted characters.
.P
-3. The maximum lookbehind count is also important when the result of a partial
-match attempt is "no match". In this case, the maximum lookbehind characters
-from the end of the current segment must be retained at the start of the next
-segment, in case the lookbehind is at the start of the pattern. Matching the
-next segment must then start at the appropriate offset.
-.P
-4. Because a partial match must always contain at least one character, what
-might be considered a partial match of an empty string actually gives a "no
-match" result. For example:
-.sp
- re> /c(?<=abc)x/
- data> ab\e=ps
- No match
-.sp
-If the next segment begins "cx", a match should be found, but this will only
-happen if characters from the previous segment are retained. For this reason, a
-"no match" result should be interpreted as "partial match of an empty string"
-when the pattern contains lookbehinds.
-.P
-5. Matching a subject string that is split into multiple segments may not
-always produce exactly the same result as matching over one single long string,
-especially when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and
-Word Boundaries" above describes an issue that arises if the pattern ends with
-\eb or \eB. Another kind of difference may occur when there are multiple
-matching possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result
-is given only when there are no completed matches. This means that as soon as
-the shortest match has been found, continuation to a new subject segment is no
+3. Matching a subject string that is split into multiple segments may not
+always produce exactly the same result as matching over one single long string
+when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
+Boundaries" above describes an issue that arises if the pattern ends with \eb
+or \eB. Another kind of difference may occur when there are multiple matching
+possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
+only when there are no completed matches. This means that as soon as the
+shortest match has been found, continuation to a new subject segment is no
longer possible. Consider this \fBpcre2test\fP example:
.sp
re> /dog(sbody)?/
@@ -386,7 +373,7 @@ multi-segment data. The example above then behaves differently:
data> gsb\e=ph,dfa,dfa_restart
Partial match: gsb
.sp
-6. Patterns that contain alternatives at the top level which do not all start
+4. Patterns that contain alternatives at the top level which do not all start
with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
used. For example, consider this pattern:
.sp
@@ -435,6 +422,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 21 June 2019
+Last updated: 21 July 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi
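The discard and restart arithmetic in the lookbehind example above (ovector offsets 5 and 7, maximum lookbehind 3, discard everything before offset 2, restart at 3) can be sketched in C as follows. This is only an illustration and not part of the commit: `retain_plan` and `plan_retention` are hypothetical names, and the sketch assumes a non-UTF 8-bit subject, so that one character is one code unit. In a real program the inputs would presumably be taken from the first ovector element and from `pcre2_pattern_info()` with `PCRE2_INFO_MAXLOOKBEHIND`.

```c
#include <stddef.h>

/* Hypothetical helper type: how much of the old buffer can be dropped after
   a partial match, and where the next match attempt should start. */
typedef struct {
    size_t discard;      /* leading code units that are no longer needed */
    size_t startoffset;  /* startoffset to use in the trimmed buffer */
} retain_plan;

/* match_start is the first ovector offset of the partial match;
   max_lookbehind is the pattern's maximum lookbehind count.
   Assumption: one character equals one code unit (non-UTF, 8-bit). */
retain_plan plan_retention(size_t match_start, size_t max_lookbehind)
{
    retain_plan p;

    /* Keep at most max_lookbehind characters before the partial match,
       but never more than actually exist in the buffer. */
    size_t keep_before =
        match_start < max_lookbehind ? match_start : max_lookbehind;

    p.discard = match_start - keep_before;
    p.startoffset = keep_before;
    return p;
}
```

For the documented example, `plan_retention(5, 3)` gives a discard count of 2 and a start offset of 3, so the retained text "123ab" is followed by the next segment and matching resumes at "ab".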