author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2010-11-17 17:55:57 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2010-11-17 17:55:57 +0000
commit 2d7951ff79f04b55111e98a79584acf4403749c7
tree 6ba7c66d320bd7836488fbe3988a2f78e4fad0c9 /doc/html
parent 5074457046234fb2ab52c28c5f811fabead33b45
Documentation updates and tidies.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@572 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'doc/html')
-rw-r--r-- doc/html/pcre.html 24
-rw-r--r-- doc/html/pcreapi.html 125
-rw-r--r-- doc/html/pcregrep.html 55
-rw-r--r-- doc/html/pcrematching.html 16
-rw-r--r-- doc/html/pcrepartial.html 14
-rw-r--r-- doc/html/pcrepattern.html 93
-rw-r--r-- doc/html/pcretest.html 13
7 files changed, 199 insertions, 141 deletions
diff --git a/doc/html/pcre.html b/doc/html/pcre.html
index b7e37f0..f2ef9dd 100644
--- a/doc/html/pcre.html
+++ b/doc/html/pcre.html
@@ -30,11 +30,10 @@ support for one or two .NET and Oniguruma syntax items, and there is an option
for requesting some minor changes that give better JavaScript compatibility.
</P>
<P>
-The current implementation of PCRE corresponds approximately with Perl
-5.10/5.11, including support for UTF-8 encoded strings and Unicode general
-category properties. However, UTF-8 and Unicode support has to be explicitly
-enabled; it is not the default. The Unicode tables correspond to Unicode
-release 5.2.0.
+The current implementation of PCRE corresponds approximately with Perl 5.12,
+including support for UTF-8 encoded strings and Unicode general category
+properties. However, UTF-8 and Unicode support has to be explicitly enabled; it
+is not the default. The Unicode tables correspond to Unicode release 5.2.0.
</P>
<P>
In addition to the Perl-compatible matching function, PCRE contains an
@@ -276,9 +275,9 @@ documentation.
low-valued characters, unless the PCRE_UCP option is set.
</P>
<P>
-8. However, the Perl 5.10 horizontal and vertical whitespace matching escapes
-(\h, \H, \v, and \V) do match all the appropriate Unicode characters,
-whether or not PCRE_UCP is set.
+8. However, the horizontal and vertical whitespace matching escapes (\h, \H,
+\v, and \V) do match all the appropriate Unicode characters, whether or not
+PCRE_UCP is set.
</P>
<P>
9. Case-insensitive matching applies only to characters whose values are less
@@ -286,10 +285,9 @@ than 128, unless PCRE is built with Unicode property support. Even when Unicode
property support is available, PCRE still uses its own character tables when
checking the case of low-valued characters, so as not to degrade performance.
The Unicode property information is used only for characters with higher
-values. Even when Unicode property support is available, PCRE supports
-case-insensitive matching only when there is a one-to-one mapping between a
-letter's cases. There are a small number of many-to-one mappings in Unicode;
-these are not supported by PCRE.
+values. Furthermore, PCRE supports case-insensitive matching only when there is
+a one-to-one mapping between a letter's cases. There are a small number of
+many-to-one mappings in Unicode; these are not supported by PCRE.
</P>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
@@ -307,7 +305,7 @@ two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 22 October 2010
+Last updated: 13 November 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
diff --git a/doc/html/pcreapi.html b/doc/html/pcreapi.html
index b849f52..707eef1 100644
--- a/doc/html/pcreapi.html
+++ b/doc/html/pcreapi.html
@@ -443,12 +443,17 @@ If <i>errptr</i> is NULL, <b>pcre_compile()</b> returns NULL immediately.
Otherwise, if compilation of a pattern fails, <b>pcre_compile()</b> returns
NULL, and sets the variable pointed to by <i>errptr</i> to point to a textual
error message. This is a static string that is part of the library. You must
-not try to free it. The byte offset from the start of the pattern to the
-character that was being processed when the error was discovered is placed in
-the variable pointed to by <i>erroffset</i>, which must not be NULL. If it is,
-an immediate error is given. Some errors are not detected until checks are
-carried out when the whole pattern has been scanned; in this case the offset is
-set to the end of the pattern.
+not try to free it. The offset from the start of the pattern to the byte that
+was being processed when the error was discovered is placed in the variable
+pointed to by <i>erroffset</i>, which must not be NULL. If it is, an immediate
+error is given. Some errors are not detected until checks are carried out when
+the whole pattern has been scanned; in this case the offset is set to the end
+of the pattern.
+</P>
+<P>
+Note that the offset is in bytes, not characters, even in UTF-8 mode. It may
+point into the middle of a UTF-8 character (for example, when
+PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string).
</P>
<P>
If <b>pcre_compile2()</b> is used instead of <b>pcre_compile()</b>, and the
@@ -528,12 +533,13 @@ pattern.
<pre>
PCRE_DOTALL
</pre>
-If this bit is set, a dot metacharater in the pattern matches all characters,
-including those that indicate newline. Without it, a dot does not match when
-the current position is at a newline. This option is equivalent to Perl's /s
-option, and it can be changed within a pattern by a (?s) option setting. A
-negative class such as [^a] always matches newline characters, independent of
-the setting of this option.
+If this bit is set, a dot metacharacter in the pattern matches a character of
+any value, including one that indicates a newline. However, it only ever
+matches one character, even if newlines are coded as CRLF. Without this option,
+a dot does not match when the current position is at a newline. This option is
+equivalent to Perl's /s option, and it can be changed within a pattern by a
+(?s) option setting. A negative class such as [^a] always matches newline
+characters, independent of the setting of this option.
<pre>
PCRE_DUPNAMES
</pre>
@@ -554,10 +560,19 @@ ignored. This is equivalent to Perl's /x option, and it can be changed within a
pattern by a (?x) option setting.
</P>
<P>
+Which characters are interpreted as newlines
+is controlled by the options passed to <b>pcre_compile()</b> or by a special
+sequence at the start of the pattern, as described in the section entitled
+<a href="pcrepattern.html#newlines">"Newline conventions"</a>
+in the <b>pcrepattern</b> documentation. Note that the end of this type of
+comment is a literal newline sequence in the pattern; escape sequences that
+happen to represent a newline do not count.
+</P>
+<P>
This option makes it possible to include comments inside complicated patterns.
Note, however, that this applies only to data characters. Whitespace characters
may never appear within special character sequences in a pattern, for example
-within the sequence (?( which introduces a conditional subpattern.
+within the sequence (?( that introduces a conditional subpattern.
<pre>
PCRE_EXTRA
</pre>
@@ -637,12 +652,12 @@ PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to PCRE_NEWLINE_CRLF, but
other combinations may yield unused numbers and cause an error.
</P>
<P>
-The only time that a line break is specially recognized when compiling a
-pattern is if PCRE_EXTENDED is set, and an unescaped # outside a character
-class is encountered. This indicates a comment that lasts until after the next
-line break sequence. In other circumstances, line break sequences are treated
-as literal data, except that in PCRE_EXTENDED mode, both CR and LF are treated
-as whitespace characters and are therefore ignored.
+The only time that a line break in a pattern is specially recognized when
+compiling is when PCRE_EXTENDED is set. CR and LF are whitespace characters,
+and so are ignored in this mode. Also, an unescaped # outside a character class
+indicates a comment that lasts until after the next line break sequence. In
+other circumstances, line break sequences in patterns are treated as literal
+data.
</P>
<P>
The newline option that is set at compile time becomes the default that is used
@@ -658,10 +673,10 @@ in Perl.
<pre>
PCRE_UCP
</pre>
-This option changes the way PCRE processes \b, \d, \s, \w, and some of the
-POSIX character classes. By default, only ASCII characters are recognized, but
-if PCRE_UCP is set, Unicode properties are used instead to classify characters.
-More details are given in the section on
+This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
+\w, and some of the POSIX character classes. By default, only ASCII characters
+are recognized, but if PCRE_UCP is set, Unicode properties are used instead to
+classify characters. More details are given in the section on
<a href="pcre.html#genericchartypes">generic character types</a>
in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
@@ -851,8 +866,8 @@ matching.
The two optimizations just described can be disabled by setting the
PCRE_NO_START_OPTIMIZE option when calling <b>pcre_exec()</b> or
<b>pcre_dfa_exec()</b>. You might want to do this if your pattern contains
-callouts, or make use of (*MARK), and you make use of these in cases where
-matching fails. See the discussion of PCRE_NO_START_OPTIMIZE
+callouts or (*MARK), and you want to make use of these facilities in cases
+where matching fails. See the discussion of PCRE_NO_START_OPTIMIZE
<a href="#execoptions">below.</a>
<a name="localesupport"></a></P>
<br><a name="SEC10" href="#TOC1">LOCALE SUPPORT</a><br>
@@ -1443,8 +1458,8 @@ if that fails, by advancing the starting offset (see below) and trying an
ordinary match again. There is some code that demonstrates how to do this in
the
<a href="pcredemo.html"><b>pcredemo</b></a>
-sample program. In the most general case, you have to check to see if the
-newline convention recognizes CRLF as a newline, and if so, and the current
+sample program. In the most general case, you have to check to see if the
+newline convention recognizes CRLF as a newline, and if so, and the current
character is CR followed by LF, advance the starting offset by two characters
instead of one.
<pre>
@@ -1504,9 +1519,11 @@ strings in the
in the main
<a href="pcre.html"><b>pcre</b></a>
page. If an invalid UTF-8 sequence of bytes is found, <b>pcre_exec()</b> returns
-the error PCRE_ERROR_BADUTF8. If <i>startoffset</i> contains a value that does
-not point to the start of a UTF-8 character (or to the end of the subject),
-PCRE_ERROR_BADUTF8_OFFSET is returned.
+the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is
+a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. If
+<i>startoffset</i> contains a value that does not point to the start of a UTF-8
+character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is
+returned.
</P>
<P>
If you already know that your subject is valid, and you want to skip these
@@ -1536,7 +1553,7 @@ but only if no complete match can be found.
If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a
partial match is found, <b>pcre_exec()</b> immediately returns
PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words,
-when PCRE_PARTIAL_HARD is set, a partial match is considered to be more
+when PCRE_PARTIAL_HARD is set, a partial match is considered to be more
important than an alternative complete match.
</P>
<P>
@@ -1552,15 +1569,12 @@ The string to be matched by <b>pcre_exec()</b>
<P>
The subject string is passed to <b>pcre_exec()</b> as a pointer in
<i>subject</i>, a length (in bytes) in <i>length</i>, and a starting byte offset
-in <i>startoffset</i>. If this is negative or greater than the length of the
-subject, <b>pcre_exec()</b> returns PCRE_ERROR_BADOFFSET.
-</P>
-<P>
-In UTF-8 mode, the byte offset must point to the start of a UTF-8 character (or
-the end of the subject). Unlike the pattern string, the subject may contain
-binary zero bytes. When the starting offset is zero, the search for a match
-starts at the beginning of the subject, and this is by far the most common
-case.
+in <i>startoffset</i>. If this is negative or greater than the length of the
+subject, <b>pcre_exec()</b> returns PCRE_ERROR_BADOFFSET. When the starting
+offset is zero, the search for a match starts at the beginning of the subject,
+and this is by far the most common case. In UTF-8 mode, the byte offset must
+point to the start of a UTF-8 character (or the end of the subject). Unlike the
+pattern string, the subject may contain binary zero bytes.
</P>
<P>
A non-zero starting offset is useful when searching for another match in the
@@ -1589,8 +1603,8 @@ PCRE_ANCHORED options, and then if that fails, advancing the starting offset
and trying an ordinary match again. There is some code that demonstrates how to
do this in the
<a href="pcredemo.html"><b>pcredemo</b></a>
-sample program. In the most general case, you have to check to see if the
-newline convention recognizes CRLF as a newline, and if so, and the current
+sample program. In the most general case, you have to check to see if the
+newline convention recognizes CRLF as a newline, and if so, and the current
character is CR followed by LF, advance the starting offset by two characters
instead of one.
</P>
@@ -1675,9 +1689,15 @@ Offset values that correspond to unused subpatterns at the end of the
expression are also set to -1. For example, if the string "abc" is matched
against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The
return from the function is 2, because the highest used capturing subpattern
-number is 1. However, you can refer to the offsets for the second and third
-capturing subpatterns if you wish (assuming the vector is large enough, of
-course).
+number is 1, and the offsets for the second and third capturing subpatterns
+(assuming the vector is large enough, of course) are set to -1.
+</P>
+<P>
+<b>Note</b>: Elements of <i>ovector</i> that do not correspond to capturing
+parentheses in the pattern are never changed. That is, if a pattern contains
+<i>n</i> capturing parentheses, no more than <i>ovector[0]</i> to
+<i>ovector[2n+1]</i> are set by <b>pcre_exec()</b>. The other elements retain
+whatever values they previously had.
</P>
<P>
Some convenience functions are provided for extracting the captured substrings
@@ -1752,11 +1772,14 @@ documentation for details.
PCRE_ERROR_BADUTF8 (-10)
</pre>
A string that contains an invalid UTF-8 byte sequence was passed as a subject.
+However, if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8
+character at the end of the subject, PCRE_ERROR_SHORTUTF8 is used instead.
<pre>
PCRE_ERROR_BADUTF8_OFFSET (-11)
</pre>
The UTF-8 byte sequence that was passed as a subject was valid, but the value
-of <i>startoffset</i> did not point to the beginning of a UTF-8 character.
+of <i>startoffset</i> did not point to the beginning of a UTF-8 character or the
+end of the subject.
<pre>
PCRE_ERROR_PARTIAL (-12)
</pre>
@@ -1792,8 +1815,14 @@ An invalid combination of PCRE_NEWLINE_<i>xxx</i> options was given.
<pre>
PCRE_ERROR_BADOFFSET (-24)
</pre>
-The value of <i>startoffset</i> was negative or greater than the length of the
+The value of <i>startoffset</i> was negative or greater than the length of the
subject, that is, the value in <i>length</i>.
+<pre>
+ PCRE_ERROR_SHORTUTF8 (-25)
+</pre>
+The subject string ended with an incomplete (truncated) UTF-8 character, and
+the PCRE_PARTIAL_HARD option was set. Without this option, PCRE_ERROR_BADUTF8
+is returned in this situation.
</P>
<P>
Error numbers -16 to -20 and -22 are not used by <b>pcre_exec()</b>.
@@ -2203,7 +2232,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 06 November 2010
+Last updated: 13 November 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
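Note on the pcreapi.html hunks above: the revised text says that startoffset must point to the start of a UTF-8 character (or to the end of the subject), and that after an empty match the starting offset should advance by two when the newline convention recognizes CRLF and the current position is at CR followed by LF. A minimal C sketch of both rules follows. The function names at_utf8_boundary and advance_after_empty_match are hypothetical and are not part of the PCRE API; the logic follows the description in the text and is not taken from the library source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Return 1 if offset is a valid starting offset in UTF-8 mode: it is in
   range and points at the start of a character or at the end of the
   subject. UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
int at_utf8_boundary(const char *subject, int length, int offset)
{
    assert(subject != NULL && length >= 0);
    if (offset < 0 || offset > length) return 0;   /* out of range */
    if (offset == length) return 1;                /* end of subject */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}

/* Hypothetical helper, not a PCRE function.
   Compute the next starting offset after an empty match at offset.
   The caller must have checked that offset < length; an empty match at
   the end of the subject means there is nothing more to search. */
int advance_after_empty_match(const char *subject, int length, int offset,
                              int crlf_is_newline, int utf8)
{
    assert(subject != NULL && offset >= 0 && offset < length);

    /* Step over CRLF as a unit when the newline convention recognizes it. */
    if (crlf_is_newline && offset < length - 1 &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    /* Otherwise advance by one character: one byte, plus any UTF-8
       continuation bytes that belong to the same character. */
    offset++;
    if (utf8) {
        while (offset < length &&
               ((unsigned char)subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

A caller would pass the returned value as the next startoffset when retrying an ordinary match.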
diff --git a/doc/html/pcregrep.html b/doc/html/pcregrep.html
index 0dcd738..1adbee9 100644
--- a/doc/html/pcregrep.html
+++ b/doc/html/pcregrep.html
@@ -224,20 +224,20 @@ that matched.
When <b>pcregrep</b> is searching the files in a directory as a consequence of
the <b>-r</b> (recursive search) option, any regular files whose names match the
pattern are excluded. Subdirectories are not excluded by this option; they are
-searched recursively, subject to the <b>--exclude_dir</b> and
+searched recursively, subject to the <b>--exclude-dir</b> and
<b>--include-dir</b> options. The pattern is a PCRE regular expression, and is
matched against the final component of the file name (not the entire path). If
a file name matches both <b>--include</b> and <b>--exclude</b>, it is excluded.
There is no short form for this option.
</P>
<P>
-<b>--exclude_dir</b>=<i>pattern</i>
+<b>--exclude-dir</b>=<i>pattern</i>
When <b>pcregrep</b> is searching the contents of a directory as a consequence
of the <b>-r</b> (recursive search) option, any subdirectories whose names match
the pattern are excluded. (Note that the <b>--exclude</b> option does not affect
subdirectories.) The pattern is a PCRE regular expression, and is matched
against the final component of the name (not the entire path). If a
-subdirectory name matches both <b>--include_dir</b> and <b>--exclude_dir</b>, it
+subdirectory name matches both <b>--include-dir</b> and <b>--exclude-dir</b>, it
is excluded. There is no short form for this option.
</P>
<P>
@@ -299,20 +299,20 @@ Ignore upper/lower case distinctions during comparisons.
When <b>pcregrep</b> is searching the files in a directory as a consequence of
the <b>-r</b> (recursive search) option, only those regular files whose names
match the pattern are included. Subdirectories are always included and searched
+recursively, subject to the <b>--include-dir</b> and <b>--exclude-dir</b>
+recursively, subject to the \fP--include-dir\fP and <b>--exclude-dir</b>
options. The pattern is a PCRE regular expression, and is matched against the
final component of the file name (not the entire path). If a file name matches
both <b>--include</b> and <b>--exclude</b>, it is excluded. There is no short
form for this option.
</P>
<P>
-<b>--include_dir</b>=<i>pattern</i>
+<b>--include-dir</b>=<i>pattern</i>
When <b>pcregrep</b> is searching the contents of a directory as a consequence
of the <b>-r</b> (recursive search) option, only those subdirectories whose
names match the pattern are included. (Note that the <b>--include</b> option
does not affect subdirectories.) The pattern is a PCRE regular expression, and
is matched against the final component of the name (not the entire path). If a
-subdirectory name matches both <b>--include_dir</b> and <b>--exclude_dir</b>, it
+subdirectory name matches both <b>--include-dir</b> and <b>--exclude-dir</b>, it
is excluded. There is no short form for this option.
</P>
<P>
@@ -529,25 +529,38 @@ convert this to an appropriate sequence if the output is sent to a file.
</P>
<br><a name="SEC7" href="#TOC1">OPTIONS COMPATIBILITY</a><br>
<P>
-The majority of short and long forms of <b>pcregrep</b>'s options are the same
-as in the GNU <b>grep</b> program. Any long option of the form
+Many of the short and long forms of <b>pcregrep</b>'s options are the same
+as in the GNU <b>grep</b> program (version 2.5.4). Any long option of the form
<b>--xxx-regexp</b> (GNU terminology) is also available as <b>--xxx-regex</b>
-(PCRE terminology). However, the <b>--locale</b>, <b>-M</b>, <b>--multiline</b>,
-<b>-u</b>, and <b>--utf-8</b> options are specific to <b>pcregrep</b>. If both the
+(PCRE terminology). However, the <b>--file-offsets</b>, <b>--include-dir</b>,
+<b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>, <b>-M</b>,
+<b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--recursion-limit</b>,
+<b>-u</b>, and <b>--utf-8</b> options are specific to <b>pcregrep</b>, as is the
+use of the <b>--only-matching</b> option with a capturing parentheses number.
+</P>
+<P>
+Although most of the common options work the same way, a few are different in
+<b>pcregrep</b>. For example, the <b>--include</b> option's argument is a glob
+for GNU <b>grep</b>, but a regular expression for <b>pcregrep</b>. If both the
<b>-c</b> and <b>-l</b> options are given, GNU grep lists only file names,
without counts, but <b>pcregrep</b> gives the counts.
</P>
<br><a name="SEC8" href="#TOC1">OPTIONS WITH DATA</a><br>
<P>
There are four different ways in which an option with data can be specified.
-If a short form option is used, the data may follow immediately, or in the next
-command line item. For example:
+If a short form option is used, the data may follow immediately, or (with one
+exception) in the next command line item. For example:
<pre>
-f/some/file
-f /some/file
</pre>
+The exception is the <b>-o</b> option, which may appear with or without data.
+Because of this, if data is present, it must follow immediately in the same
+item, for example -o3.
+</P>
+<P>
If a long form option is used, the data may appear in the same command line
-item, separated by an equals character, or (with one exception) it may appear
+item, separated by an equals character, or (with two exceptions) it may appear
in the next command line item. For example:
<pre>
--file=/some/file
@@ -559,10 +572,10 @@ separate the file name from the option, because the shell does not treat ~
specially unless it is at the start of an item.
</P>
<P>
-The exception to the above is the <b>--colour</b> (or <b>--color</b>) option,
-for which the data is optional. If this option does have data, it must be given
-in the first form, using an equals character. Otherwise it will be assumed that
-it has no data.
+The exceptions to the above are the <b>--colour</b> (or <b>--color</b>) and
+<b>--only-matching</b> options, for which the data is optional. If one of these
+options does have data, it must be given in the first form, using an equals
+character. Otherwise <b>pcregrep</b> will assume that it has no data.
</P>
<br><a name="SEC9" href="#TOC1">MATCHING ERRORS</a><br>
<P>
@@ -574,6 +587,12 @@ in these circumstances. If this happens, <b>pcregrep</b> outputs an error
message and the line that caused the problem to the standard error stream. If
there are more than 20 such errors, <b>pcregrep</b> gives up.
</P>
+<P>
+The <b>--match-limit</b> option of <b>pcregrep</b> can be used to set the overall
+resource limit; there is a second option called <b>--recursion-limit</b> that
+sets a limit on the amount of memory (usually stack) that is used (see the
+discussion of these options above).
+</P>
<br><a name="SEC10" href="#TOC1">DIAGNOSTICS</a><br>
<P>
Exit status is 0 if any matches were found, 1 if no matches were found, and 2
@@ -597,7 +616,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 31 October 2010
+Last updated: 16 November 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
diff --git a/doc/html/pcrematching.html b/doc/html/pcrematching.html
index 4d0b3fb..80945ca 100644
--- a/doc/html/pcrematching.html
+++ b/doc/html/pcrematching.html
@@ -106,17 +106,18 @@ The scan continues until either the end of the subject is reached, or there are
no more unterminated paths. At this point, terminated paths represent the
different matching possibilities (if there are none, the match has failed).
Thus, if there is more than one possible match, this algorithm finds all of
-them, and in particular, it finds the longest. There is an option to stop the
-algorithm after the first match (which is necessarily the shortest) is found.
+them, and in particular, it finds the longest. The matches are returned in
+decreasing order of length. There is an option to stop the algorithm after the
+first match (which is necessarily the shortest) is found.
</P>
<P>
Note that all the matches that are found start at the same point in the
subject. If the pattern
<pre>
- cat(er(pillar)?)
+ cat(er(pillar)?)?
</pre>
is matched against the string "the caterpillar catchment", the result will be
-the three strings "cat", "cater", and "caterpillar" that start at the fourth
+the three strings "caterpillar", "cater", and "cat" that start at the fifth
character of the subject. The algorithm does not automatically move on to find
matches that start at later positions.
</P>
@@ -185,8 +186,9 @@ callouts.
2. Because the alternative algorithm scans the subject string just once, and
never needs to backtrack, it is possible to pass very long subject strings to
the matching function in several pieces, checking for partial matching each
-time. It is possible to do multi-segment matching using <b>pcre_exec()</b> (by
-retaining partially matched substrings), but it is more complicated. The
+time. Although it is possible to do multi-segment matching using the standard
+algorithm (<b>pcre_exec()</b>), by retaining partially matched substrings, it is
+more complicated. The
<a href="pcrepartial.html"><b>pcrepartial</b></a>
documentation gives details of partial matching and discusses multi-segment
matching.
@@ -218,7 +220,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 22 October 2010
+Last updated: 17 November 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
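Note on the pcrematching.html hunk above: the corrected example says that matching cat(er(pillar)?)? against "the caterpillar catchment" yields "caterpillar", "cater", and "cat", longest first, all starting at the fifth character (byte offset 4). The C sketch below shows how such a result could be read from an offset vector laid out as (start, end) pairs. It does not call PCRE: example_ovector is filled in by hand from the values the text implies, and copy_match is a hypothetical helper. A real program would obtain the vector from pcre_dfa_exec().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Subject string from the documentation example. */
const char example_subject[] = "the caterpillar catchment";

/* Hand-written stand-in for the vector a DFA match would fill in:
   one (start, end) pair per match, longest match first. */
const int example_ovector[] = { 4, 15,   4, 9,   4, 7 };

/* Hypothetical helper, not a PCRE function.
   Copy match number i into buf as a NUL-terminated string.
   Returns the match length, or -1 if buf is too small. */
int copy_match(const char *subject, const int *ovector, int i,
               char *buf, size_t bufsize)
{
    int start, len;

    assert(subject != NULL && ovector != NULL && buf != NULL && i >= 0);
    start = ovector[2 * i];
    len = ovector[2 * i + 1] - start;
    assert(start >= 0 && len >= 0);

    if ((size_t)len + 1 > bufsize) return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Every pair shares the same start offset, which reflects the statement that the algorithm does not move on to matches starting at later positions.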
diff --git a/doc/html/pcrepartial.html b/doc/html/pcrepartial.html
index 48473c7..d9229c0 100644
--- a/doc/html/pcrepartial.html
+++ b/doc/html/pcrepartial.html
@@ -143,6 +143,13 @@ assumption is made that the end of the supplied subject string may not be the
true end of the available data, and so, if \z, \Z, \b, \B, or $ are
encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
</P>
+<P>
+Setting PCRE_PARTIAL_HARD also affects the way <b>pcre_exec()</b> checks UTF-8
+subject strings for validity. Normally, an invalid UTF-8 sequence causes the
+error PCRE_ERROR_BADUTF8. However, in the special case of a truncated UTF-8
+character at the end of the subject, PCRE_ERROR_SHORTUTF8 is returned when
+PCRE_PARTIAL_HARD is set.
+</P>
<br><b>
Comparing hard and soft partial matching
</b><br>
@@ -380,10 +387,7 @@ multi-segment data. The example above then behaves differently:
Partial match: do
data&#62; gsb\R\P\P\D
Partial match: gsb
-
-</PRE>
-</P>
-<P>
+</pre>
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected when
PCRE_DFA_RESTART is used with <b>pcre_dfa_exec()</b>. For example, consider this
@@ -430,7 +434,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC11" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 22 October 2010
+Last updated: 07 November 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
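Note on the pcrepartial.html hunk above: the new paragraph distinguishes a truncated UTF-8 character at the end of the subject (PCRE_ERROR_SHORTUTF8 when PCRE_PARTIAL_HARD is set) from other invalid sequences (PCRE_ERROR_BADUTF8). The C sketch below illustrates what "truncated at the end" means at the byte level. It is not PCRE's validity check: both function names are hypothetical, and the sketch assumes the RFC 3629 form of UTF-8, in which a character occupies at most four bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Number of bytes a character needs, judged from its lead byte.
   Returns 0 if the byte cannot start a character (assumes RFC 3629). */
int utf8_expected_length(unsigned char lead)
{
    if (lead < 0x80) return 1;              /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;    /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;    /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;    /* 11110xxx */
    return 0;
}

/* Hypothetical helper, not a PCRE function.
   Return 1 if the subject ends with a lead byte followed by fewer
   continuation bytes than that lead byte calls for. Only the final
   character is examined; the rest of the subject is not validated. */
int ends_with_truncated_utf8(const char *subject, int length)
{
    int i = length;
    int cont = 0;
    int need;

    assert(subject != NULL && length >= 0);

    /* Step back over at most three trailing continuation bytes. */
    while (i > 0 && cont < 3 &&
           ((unsigned char)subject[i - 1] & 0xC0) == 0x80) {
        i--;
        cont++;
    }
    if (i == 0) return 0;   /* empty, or no lead byte: not a truncation */

    need = utf8_expected_length((unsigned char)subject[i - 1]);
    return need > cont + 1;
}
```

When feeding a subject in segments, a check of this kind indicates that the last few bytes should be held back and prepended to the next segment instead of being treated as invalid data.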
diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html
index 076c4a0..7ab17be 100644
--- a/doc/html/pcrepattern.html
+++ b/doc/html/pcrepattern.html
@@ -421,10 +421,11 @@ any Unicode letter, and underscore. Note also that PCRE_UCP affects \b, and
is noticeably slower when PCRE_UCP is set.
</P>
<P>
-The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the
-other sequences, which match only ASCII characters by default, these always
-match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is
-set. The horizontal space characters are:
+The sequences \h, \H, \v, and \V are features that were added to Perl at
+release 5.10. In contrast to the other sequences, which match only ASCII
+characters by default, these always match certain high-valued codepoints in
+UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters
+are:
<pre>
U+0009 Horizontal tab
U+0020 Space
@@ -462,8 +463,7 @@ Newline sequences
</b><br>
<P>
Outside a character class, by default, the escape sequence \R matches any
-Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is
-equivalent to the following:
+Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the following:
<pre>
(?&#62;\r\n|\n|\x0b|\f|\r|\x85)
</pre>
@@ -769,9 +769,8 @@ same characters as Xan, plus underscore.
Resetting the match start
</b><br>
<P>
-The escape sequence \K, which is a Perl 5.10 feature, causes any previously
-matched characters not to be included in the final matched sequence. For
-example, the pattern:
+The escape sequence \K causes any previously matched characters not to be
+included in the final matched sequence. For example, the pattern:
<pre>
foo\Kbar
</pre>
@@ -941,17 +940,17 @@ dollar, the only relationship being that they both involve newlines. Dot has no
special meaning in a character class.
</P>
<P>
-The escape sequence \N always behaves as a dot does when PCRE_DOTALL is not
-set. In other words, it matches any one character except one that signifies the
-end of a line.
+The escape sequence \N behaves like a dot, except that it is not affected by
+the PCRE_DOTALL option. In other words, it matches any character except one
+that signifies the end of a line.
</P>
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
<P>
Outside a character class, the escape sequence \C matches any one byte, both
in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
characters. The feature is provided in Perl in order to match individual bytes
-in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes,
-what remains in the string may be a malformed UTF-8 string. For this reason,
+in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the
+rest of the string may start with a malformed UTF-8 character. For this reason,
the \C escape sequence is best avoided.
</P>
<P>
@@ -1166,7 +1165,7 @@ extracted by the <b>pcre_fullinfo()</b> function).
</P>
<P>
An option change within a subpattern (see below for a description of
-subpatterns) affects only that part of the current pattern that follows it, so
+subpatterns) affects only that part of the subpattern that follows it, so
<pre>
(a(?i)b)c
</pre>
@@ -1203,18 +1202,16 @@ Turning part of a pattern into a subpattern does two things:
<pre>
cat(aract|erpillar|)
</pre>
-matches one of the words "cat", "cataract", or "caterpillar". Without the
-parentheses, it would match "cataract", "erpillar" or an empty string.
+matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
+match "cataract", "erpillar" or an empty string.
<br>
<br>
2. It sets up the subpattern as a capturing subpattern. This means that, when
the whole pattern matches, that portion of the subject string that matched the
subpattern is passed back to the caller via the <i>ovector</i> argument of
<b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting
-from 1) to obtain numbers for the capturing subpatterns.
-</P>
-<P>
-For example, if the string "the red king" is matched against the pattern
+from 1) to obtain numbers for the capturing subpatterns. For example, if the
+string "the red king" is matched against the pattern
<pre>
the ((red|white) (king|queen))
</pre>
@@ -1262,10 +1259,9 @@ at captured substring number one, whichever alternative matched. This construct
is useful when you want to capture part, but not all, of one of a number of
alternatives. Inside a (?| group, parentheses are numbered as usual, but the
number is reset at the start of each branch. The numbers of any capturing
-buffers that follow the subpattern start after the highest number used in any
-branch. The following example is taken from the Perl documentation.
-The numbers underneath show in which buffer the captured content will be
-stored.
+parentheses that follow the subpattern start after the highest number used in
+any branch. The following example is taken from the Perl documentation. The
+numbers underneath show in which buffer the captured content will be stored.
<pre>
# before ---------------branch-reset----------- after
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
@@ -1377,7 +1373,7 @@ items:
the \C escape sequence
the \X escape sequence (in UTF-8 mode with Unicode properties)
the \R escape sequence
- an escape such as \d that matches a single character
+ an escape such as \d or \pL that matches a single character
a character class
a back reference (see next section)
a parenthesized subpattern (unless it is an assertion)
@@ -1418,8 +1414,10 @@ The quantifier {0} is permitted, causing the expression to behave as if the
previous item and the quantifier were not present. This may be useful for
subpatterns that are referenced as
<a href="#subpatternsassubroutines">subroutines</a>
-from elsewhere in the pattern. Items other than subpatterns that have a {0}
-quantifier are omitted from the compiled pattern.
+from elsewhere in the pattern (but see also the section entitled
+<a href="#subdefine">"Defining subpatterns for use by reference only"</a>
+below). Items other than subpatterns that have a {0} quantifier are omitted
+from the compiled pattern.
</P>
<P>
For convenience, the three most common quantifiers have single-character
@@ -1655,9 +1653,9 @@ subpattern is possible using named parentheses (see below).
</P>
<P>
Another way of avoiding the ambiguity inherent in the use of digits following a
-backslash is to use the \g escape sequence, which is a feature introduced in
-Perl 5.10. This escape must be followed by an unsigned number or a negative
-number, optionally enclosed in braces. These examples are all identical:
+backslash is to use the \g escape sequence. This escape must be followed by an
+unsigned number or a negative number, optionally enclosed in braces. These
+examples are all identical:
<pre>
(ring), \1
(ring), \g1
@@ -1804,8 +1802,8 @@ lookbehind assertion is needed to achieve the other effect.
If you want to force a matching failure at some point in a pattern, the most
convenient way to do it is with (?!) because an empty string always matches, so
an assertion that requires there not to be an empty string must always fail.
-The Perl 5.10 backtracking control verb (*FAIL) or (*F) is essentially a
-synonym for (?!).
+The backtracking control verb (*FAIL) or (*F) is essentially a synonym for
+(?!).
<a name="lookbehind"></a></P>
<br><b>
Lookbehind assertions
@@ -1829,8 +1827,8 @@ is permitted, but
</pre>
causes an error at compile time. Branches that match different length strings
are permitted only at the top level of a lookbehind assertion. This is an
-extension compared with Perl (5.8 and 5.10), which requires all branches to
-match the same length of string. An assertion such as
+extension compared with Perl, which requires all branches to match the same
+length of string. An assertion such as
<pre>
(?&#60;=ab(c|de))
</pre>
@@ -1840,7 +1838,7 @@ branches:
<pre>
(?&#60;=abc|abde)
</pre>
-In some cases, the Perl 5.10 escape sequence \K
+In some cases, the escape sequence \K
<a href="#resetmatchstart">(see above)</a>
can be used instead of a lookbehind assertion to get round the fixed-length
restriction.
@@ -2035,7 +2033,7 @@ the most recent recursion.
At "top level", all these recursion test conditions are false.
<a href="#recursion">The syntax for recursive patterns</a>
is described below.
-</P>
+<a name="subdefine"></a></P>
<br><b>
Defining subpatterns for use by reference only
</b><br>
@@ -2094,11 +2092,11 @@ this case continues to immediately after the next newline character or
character sequence in the pattern. Which characters are interpreted as newlines
is controlled by the options passed to <b>pcre_compile()</b> or by a special
sequence at the start of the pattern, as described in the section entitled
-<a href="#recursion">"Newline conventions"</a>
-above. Note that end of this type of comment is a literal newline sequence in
-the pattern; escape sequences that happen to represent a newline do not count.
-For example, consider this pattern when PCRE_EXTENDED is set, and the default
-newline convention is in force:
+<a href="#newlines">"Newline conventions"</a>
+above. Note that the end of this type of comment is a literal newline sequence
+in the pattern; escape sequences that happen to represent a newline do not
+count. For example, consider this pattern when PCRE_EXTENDED is set, and the
+default newline convention is in force:
<pre>
abc #comment \n still comment
</pre>
@@ -2163,11 +2161,10 @@ them instead of the whole pattern.
</P>
<P>
In a larger pattern, keeping track of parenthesis numbers can be tricky. This
-is made easier by the use of relative references (a Perl 5.10 feature).
-Instead of (?1) in the pattern above you can write (?-2) to refer to the second
-most recently opened parentheses preceding the recursion. In other words, a
-negative number counts capturing parentheses leftwards from the point at which
-it is encountered.
+is made easier by the use of relative references. Instead of (?1) in the
+pattern above you can write (?-2) to refer to the second most recently opened
+parentheses preceding the recursion. In other words, a negative number counts
+capturing parentheses leftwards from the point at which it is encountered.
</P>
<P>
It is also possible to refer to subsequently opened parentheses, by writing
@@ -2676,7 +2673,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 31 October 2010
+Last updated: 17 November 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
diff --git a/doc/html/pcretest.html b/doc/html/pcretest.html
index 8317ba8..a48a79f 100644
--- a/doc/html/pcretest.html
+++ b/doc/html/pcretest.html
@@ -385,7 +385,8 @@ recognized:
\t tab (\x09)
\v vertical tab (\x0b)
\nnn octal character (up to 3 octal digits)
- \xhh hexadecimal character (up to 2 hex digits)
+ always a byte unless &#62; 255 in UTF-8 mode
+ \xhh hexadecimal byte (up to 2 hex digits)
\x{hh...} hexadecimal character, any number of digits in UTF-8 mode
\A pass the PCRE_ANCHORED option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
\B pass the PCRE_NOTBOL option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
@@ -423,6 +424,14 @@ recognized:
\&#60;anycrlf&#62; pass the PCRE_NEWLINE_ANYCRLF option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
\&#60;any&#62; pass the PCRE_NEWLINE_ANY option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
</pre>
+Note that \xhh always specifies one byte, even in UTF-8 mode; this makes it
+possible to construct invalid UTF-8 sequences for testing purposes. On the
+other hand, \x{hh} is interpreted as a UTF-8 character in UTF-8 mode,
+generating more than one byte if the value is greater than 127. When not in
+UTF-8 mode, it generates one byte for values less than 256, and causes an error
+for greater values.
+</P>
+<P>
The escapes that specify line ending sequences are literal strings, exactly as
shown. No more than one newline setting should be present in any data line.
</P>
@@ -747,7 +756,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 06 November 2010
+Last updated: 07 November 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>