author ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2015-06-18 16:39:25 +0000
committer ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2015-06-18 16:39:25 +0000
commit e9a99c8b4a2cce0cdf8b3f8e4e87649d703fdd16 (patch)
tree 15ea422f2f5886fd0db4c9d93ced760544351d1b /doc/html
parent 1c894d888dbae3a4972c7b98c7a722dabb6ead09 (diff)
Source and document file tidies for 10.20-RC1.
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@288 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'doc/html')
-rw-r--r-- doc/html/README.txt 5
-rw-r--r-- doc/html/pcre2.html 20
-rw-r--r-- doc/html/pcre2_callout_enumerate.html 8
-rw-r--r-- doc/html/pcre2_compile.html 2
-rw-r--r-- doc/html/pcre2api.html 46
-rw-r--r-- doc/html/pcre2build.html 38
-rw-r--r-- doc/html/pcre2callout.html 26
-rw-r--r-- doc/html/pcre2pattern.html 150
-rw-r--r-- doc/html/pcre2syntax.html 53
-rw-r--r-- doc/html/pcre2test.html 35
10 files changed, 253 insertions, 130 deletions
diff --git a/doc/html/README.txt b/doc/html/README.txt
index 508fd1e..7367924 100644
--- a/doc/html/README.txt
+++ b/doc/html/README.txt
@@ -294,6 +294,9 @@ library. They are also documented in the pcre2build man page.
which specifies that the code value for the EBCDIC NL character is 0x25
instead of the default 0x15.
+. If you specify --enable-debug, additional debugging code is included in the
+ build. This option is intended for use by the PCRE2 maintainers.
+
. In environments where valgrind is installed, if you specify
--enable-valgrind
@@ -829,4 +832,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 26 January 2015
+Last updated: 24 April 2015
diff --git a/doc/html/pcre2.html b/doc/html/pcre2.html
index 2c2b106..e94b355 100644
--- a/doc/html/pcre2.html
+++ b/doc/html/pcre2.html
@@ -108,8 +108,14 @@ lose performance.
<P>
One way of guarding against this possibility is to use the
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
-UTF. Alternatively, you can set the PCRE2_NEVER_UTF option at compile time.
-This causes an compile time error if a pattern contains a UTF-setting sequence.
+PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
+<b>pcre2_compile()</b>. This causes a compile-time error if a pattern contains
+a UTF-setting sequence.
+</P>
+<P>
+The use of Unicode properties for character types such as \d can also be
+enabled from within the pattern, by specifying "(*UCP)". This feature can be
+disallowed by setting the PCRE2_NEVER_UCP option.
</P>
<P>
If your application is one that supports UTF, be aware that validity checking
@@ -118,6 +124,12 @@ the PCRE2_NO_UTF_CHECK option for the second and subsequent matches to avoid
running redundant checks.
</P>
<P>
+The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead to
+problems, because it may leave the current matching point in the middle of a
+multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used to
+lock out the use of \C, causing a compile-time error if it is encountered.
+</P>
+<P>
Another way that performance can be hit is by running a pattern that has a very
large search tree against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection
@@ -175,9 +187,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 18 November 2014
+Last updated: 13 April 2015
<br>
-Copyright &copy; 1997-2014 University of Cambridge.
+Copyright &copy; 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2_callout_enumerate.html b/doc/html/pcre2_callout_enumerate.html
index 8344ead..6c2cdb8 100644
--- a/doc/html/pcre2_callout_enumerate.html
+++ b/doc/html/pcre2_callout_enumerate.html
@@ -33,7 +33,7 @@ for success and non-zero otherwise. The arguments are:
<pre>
<i>code</i> Points to the compiled pattern
<i>callback</i> The callback function
- <i>callout_data</i> User data that is passed to the callback
+ <i>callout_data</i> User data that is passed to the callback
</pre>
The <i>callback()</i> function is passed a pointer to a data block containing
the following fields:
@@ -46,9 +46,9 @@ the following fields:
<i>callout_string_length</i> Length of callout string
<i>callout_string</i> Points to callout string or is NULL
</pre>
-The second argument is the callout data that was passed to
-<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero
-for success. Any other value causes the pattern scan to stop, with the value
+The second argument is the callout data that was passed to
+<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero
+for success. Any other value causes the pattern scan to stop, with the value
being passed back as the result of <b>pcre2_callout_enumerate()</b>.
</P>
<P>
diff --git a/doc/html/pcre2_compile.html b/doc/html/pcre2_compile.html
index d833ebd..544f4fe 100644
--- a/doc/html/pcre2_compile.html
+++ b/doc/html/pcre2_compile.html
@@ -49,6 +49,7 @@ or provide an external function for stack size checking. The option bits are:
<pre>
PCRE2_ANCHORED Force pattern anchoring
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
+ PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
PCRE2_AUTO_CALLOUT Compile automatic callouts
PCRE2_CASELESS Do caseless matching
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
@@ -58,6 +59,7 @@ or provide an external function for stack size checking. The option bits are:
PCRE2_FIRSTLINE Force matching to be before newline
PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MULTILINE ^ and $ match newlines within data
+ PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF)
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
index 6fbf183..60d2bf5 100644
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -1075,6 +1075,15 @@ to match. By default, as in Perl, a hexadecimal number is always expected after
\x, but it may have zero, one, or two digits (so, for example, \xz matches a
binary zero character followed by z).
<pre>
+ PCRE2_ALT_CIRCUMFLEX
+</pre>
+In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
+matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
+after any internal newline. However, it does not match after a newline at the
+end of the subject, for compatibility with Perl. If you want a multiline
+circumflex also to match after a terminating newline, you must set
+PCRE2_ALT_CIRCUMFLEX.
+<pre>
PCRE2_AUTO_CALLOUT
</pre>
If this bit is set, <b>pcre2_compile()</b> automatically inserts callout items,
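Note on PCRE2_ALT_CIRCUMFLEX: the rule described in this hunk can be modelled in a few lines. The sketch below is an illustration in Python, not PCRE2 code. The function name `circumflex_positions` is hypothetical, LF is assumed to be the only newline, and PCRE2_NOTBOL is not modelled.

```python
def circumflex_positions(subject, multiline=False, alt_circumflex=False):
    """Return the subject offsets at which ^ can match (model only).

    Assumes LF is the only newline sequence and PCRE2_NOTBOL is unset.
    """
    positions = [0]  # ^ always matches at the very start of the subject
    if not multiline:
        return positions
    for i, ch in enumerate(subject):
        if ch != "\n":
            continue
        after = i + 1
        # After an internal newline ^ always matches in multiline mode.
        # After a newline that ends the subject it matches only when the
        # alternative circumflex behaviour is requested.
        if after < len(subject) or alt_circumflex:
            positions.append(after)
    return positions


subject = "abc\ndef\n"
print(circumflex_positions(subject))
print(circumflex_positions(subject, multiline=True))
print(circumflex_positions(subject, multiline=True, alt_circumflex=True))
```

For the subject "abc\ndef\n" the only difference between the two multiline calls is the extra offset after the terminating newline.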
@@ -1174,8 +1183,19 @@ When PCRE2_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before internal newlines
in the subject string, respectively, as well as at the very start and end. This
is equivalent to Perl's /m option, and it can be changed within a pattern by a
-(?m) option setting. If there are no newlines in a subject string, or no
-occurrences of ^ or $ in a pattern, setting PCRE2_MULTILINE has no effect.
+(?m) option setting. Note that the "start of line" metacharacter does not match
+after a newline at the end of the subject, for compatibility with Perl.
+However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
+there are no newlines in a subject string, or no occurrences of ^ or $ in a
+pattern, setting PCRE2_MULTILINE has no effect.
+<pre>
+ PCRE2_NEVER_BACKSLASH_C
+</pre>
+This option locks out the use of \C in the pattern that is being compiled.
+This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
+it may leave the current matching point in the middle of a multi-code-unit
+character. This option may be useful in applications that process patterns from
+external sources.
<pre>
PCRE2_NEVER_UCP
</pre>
@@ -1183,17 +1203,17 @@ This option locks out the use of Unicode properties for handling \B, \b, \D,
\d, \S, \s, \W, \w, and some of the POSIX character classes, as described
for the PCRE2_UCP option below. In particular, it prevents the creator of the
pattern from enabling this facility by starting the pattern with (*UCP). This
-may be useful in applications that process patterns from external sources. The
-option combination PCRE_UCP and PCRE_NEVER_UCP causes an error.
+option may be useful in applications that process patterns from external
+sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
<pre>
PCRE2_NEVER_UTF
</pre>
This option locks out interpretation of the pattern as UTF-8, UTF-16, or
UTF-32, depending on which library is in use. In particular, it prevents the
creator of the pattern from switching to UTF interpretation by starting the
-pattern with (*UTF). This may be useful in applications that process patterns
-from external sources. The combination of PCRE2_UTF and PCRE2_NEVER_UTF causes
-an error.
+pattern with (*UTF). This option may be useful in applications that process
+patterns from external sources. The combination of PCRE2_UTF and
+PCRE2_NEVER_UTF causes an error.
<pre>
PCRE2_NO_AUTO_CAPTURE
</pre>
@@ -1735,14 +1755,14 @@ compiler does not alter the value returned by this option.
<b> void *<i>user_data</i>);</b>
<br>
<br>
-A script language that supports the use of string arguments in callouts might
-like to scan all the callouts in a pattern before running the match. This can
-be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
+A script language that supports the use of string arguments in callouts might
+like to scan all the callouts in a pattern before running the match. This can
+be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
pointer to a compiled pattern, the second points to a callback function, and
the third is arbitrary user data. The callback function is called for every
callout in the pattern in the order in which they appear. Its first argument is
a pointer to a callout enumeration block, and its second argument is the
-<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
+<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
contents of the callout enumeration block are described in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation, which also gives further details about callouts.
@@ -2273,7 +2293,7 @@ of the subject.
PCRE2_ERROR_CALLOUT
</pre>
This error is never generated by <b>pcre2_match()</b> itself. It is provided for
-use by callout functions that want to cause <b>pcre2_match()</b> or
+use by callout functions that want to cause <b>pcre2_match()</b> or
<b>pcre2_callout_enumerate()</b> to return a distinctive error code. See the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details.
@@ -2863,7 +2883,7 @@ Cambridge, England.
</P>
<br><a name="SEC40" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 23 March 2015
+Last updated: 22 April 2015
<br>
Copyright &copy; 1997-2015 University of Cambridge.
<br>
diff --git a/doc/html/pcre2build.html b/doc/html/pcre2build.html
index 13204d4..8d9f9ce 100644
--- a/doc/html/pcre2build.html
+++ b/doc/html/pcre2build.html
@@ -29,11 +29,12 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC14" href="#SEC14">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
<li><a name="TOC15" href="#SEC15">PCRE2GREP BUFFER SIZE</a>
<li><a name="TOC16" href="#SEC16">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
-<li><a name="TOC17" href="#SEC17">DEBUGGING WITH VALGRIND SUPPORT</a>
-<li><a name="TOC18" href="#SEC18">CODE COVERAGE REPORTING</a>
-<li><a name="TOC19" href="#SEC19">SEE ALSO</a>
-<li><a name="TOC20" href="#SEC20">AUTHOR</a>
-<li><a name="TOC21" href="#SEC21">REVISION</a>
+<li><a name="TOC17" href="#SEC17">INCLUDING DEBUGGING CODE</a>
+<li><a name="TOC18" href="#SEC18">DEBUGGING WITH VALGRIND SUPPORT</a>
+<li><a name="TOC19" href="#SEC19">CODE COVERAGE REPORTING</a>
+<li><a name="TOC20" href="#SEC20">SEE ALSO</a>
+<li><a name="TOC21" href="#SEC21">AUTHOR</a>
+<li><a name="TOC22" href="#SEC22">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br>
<P>
@@ -147,6 +148,12 @@ properties. The application can request that they do by setting the PCRE2_UCP
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
request this by starting with (*UCP).
</P>
+<P>
+The \C escape sequence, which matches a single code unit, even in a UTF mode,
+can cause unpredictable behaviour because it may leave the current matching
+point in the middle of a multi-code-unit character. It can be locked out by
+setting the PCRE2_NEVER_BACKSLASH_C option.
+</P>
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
Just-in-time compiler support is included in the build by specifying
@@ -397,7 +404,16 @@ automatically included, you may need to add something like
</pre>
immediately before the <b>configure</b> command.
</P>
-<br><a name="SEC17" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
+<br><a name="SEC17" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
+<P>
+If you add
+<pre>
+ --enable-debug
+</pre>
+to the <b>configure</b> command, additional debugging code is included in the
+build. This feature is intended for use by the PCRE2 maintainers.
+</P>
+<br><a name="SEC18" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<P>
If you add
<pre>
@@ -407,7 +423,7 @@ to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
certain memory regions as unaddressable. This allows it to detect invalid
memory accesses, and is mostly useful for debugging PCRE2 itself.
</P>
-<br><a name="SEC18" href="#TOC1">CODE COVERAGE REPORTING</a><br>
+<br><a name="SEC19" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<P>
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
code coverage report for its test suite. To enable this, you must install
@@ -464,11 +480,11 @@ This cleans all coverage data including the generated coverage report. For more
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
documentation.
</P>
-<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
+<br><a name="SEC20" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2-config</b>(3).
</P>
-<br><a name="SEC20" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC21" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@@ -477,9 +493,9 @@ University Computing Service
Cambridge, England.
<br>
</P>
-<br><a name="SEC21" href="#TOC1">REVISION</a><br>
+<br><a name="SEC22" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 26 January 2015
+Last updated: 24 April 2015
<br>
Copyright &copy; 1997-2015 University of Cambridge.
<br>
diff --git a/doc/html/pcre2callout.html b/doc/html/pcre2callout.html
index 5815d00..7e85c9a 100644
--- a/doc/html/pcre2callout.html
+++ b/doc/html/pcre2callout.html
@@ -219,11 +219,11 @@ documentation). The callout block structure contains the following fields:
PCRE2_SIZE <i>pattern_position</i>;
PCRE2_SIZE <i>next_item_length</i>;
PCRE2_SIZE <i>callout_string_offset</i>;
- PCRE2_SIZE <i>callout_string_length</i>;
- PCRE2_SPTR <i>callout_string</i>;
+ PCRE2_SIZE <i>callout_string_length</i>;
+ PCRE2_SPTR <i>callout_string</i>;
</pre>
The <i>version</i> field contains the version number of the block format. The
-current version is 1; the three callout string fields were added for this
+current version is 1; the three callout string fields were added for this
version. If you are writing an application that might use an earlier release of
PCRE2, you should check the version number before accessing any of these
fields. The version number will increase in future if more fields are added,
@@ -263,7 +263,7 @@ need to report errors in the callout string within the pattern.
Fields for all callouts
</b><br>
<P>
-The remaining fields in the callout block are the same for both kinds of
+The remaining fields in the callout block are the same for both kinds of
callout.
</P>
<P>
@@ -306,7 +306,7 @@ always the case for the DFA matching functions.
</P>
<P>
The <i>pattern_position</i> field contains the offset in the pattern string to
-the next item to be matched.
+the next item to be matched.
</P>
<P>
The <i>next_item_length</i> field contains the length of the next item to be
@@ -318,8 +318,8 @@ of the entire subpattern.
<P>
The <i>pattern_position</i> and <i>next_item_length</i> fields are intended to
help in distinguishing between different automatic callouts, which all have the
-same callout number. However, they are set for all callouts, and are used by
-<b>pcre2test</b> to show the next item to be matched when displaying callout
+same callout number. However, they are set for all callouts, and are used by
+<b>pcre2test</b> to show the next item to be matched when displaying callout
information.
</P>
<P>
@@ -351,9 +351,9 @@ functions; it will never be used by PCRE2 itself.
<b> void *<i>user_data</i>);</b>
<br>
<br>
-A script language that supports the use of string arguments in callouts might
-like to scan all the callouts in a pattern before running the match. This can
-be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
+A script language that supports the use of string arguments in callouts might
+like to scan all the callouts in a pattern before running the match. This can
+be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
pointer to a compiled pattern, the second points to a callback function, and
the third is arbitrary user data. The callback function is called for every
callout in the pattern in the order in which they appear. Its first argument is
@@ -369,7 +369,7 @@ data block contains the following fields:
<i>callout_string_length</i> Length of callout string
<i>callout_string</i> Points to callout string or is NULL
</pre>
-The version number is currently 0. It will increase if new fields are ever
+The version number is currently 0. It will increase if new fields are ever
added to the block. The remaining fields are the same as their namesakes in the
<b>pcre2_callout</b> block that is used for callouts during matching, as
described
@@ -384,8 +384,8 @@ pattern. For example, a pattern such as /(a){2}/ is compiled as if it were
with the same value for <i>pattern_position</i> in each case.
</P>
<P>
-The callback function should normally return zero. If it returns a non-zero
-value, scanning the pattern stops, and that value is returned from
+The callback function should normally return zero. If it returns a non-zero
+value, scanning the pattern stops, and that value is returned from
<b>pcre2_callout_enumerate()</b>.
</P>
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 184ee25..a9ca60e 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -357,10 +357,11 @@ A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters in a pattern, but when a pattern is being prepared by
text editing, it is often easier to use one of the following escape sequences
-than the binary character it represents:
+than the binary character it represents. In an ASCII or Unicode environment,
+these escapes are as follows:
<pre>
\a alarm, that is, the BEL character (hex 07)
- \cx "control-x", where x is any ASCII character
+ \cx "control-x", where x is any printable ASCII character
\e escape (hex 1B)
\f form feed (hex 0C)
\n linefeed (hex 0A)
@@ -377,23 +378,38 @@ The precise effect of \cx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the
-code unit following \c has a value greater than 127, a compile-time error
-occurs. This locks out non-ASCII characters in all modes.
+code unit following \c has a value less than 32 or greater than 126, a
+compile-time error occurs. This locks out non-printable ASCII characters in all
+modes.
</P>
<P>
-The \c facility was designed for use with ASCII characters, but with the
-extension to Unicode it is even less useful than it once was. It is, however,
-recognized when PCRE2 is compiled in EBCDIC mode, where data items are always
-bytes. In this mode, all values are valid after \c. If the next character is a
-lower case letter, it is converted to upper case. Then the 0xc0 bits of the
-byte are inverted. Thus \cA becomes hex 01, as in ASCII (A is C1), but because
-the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other
-characters also generate different values.
+When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
+generate the appropriate EBCDIC code values. The \c escape is processed
+as specified for Perl in the <b>perlebcdic</b> document. The only characters
+that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
+other character provokes a compile-time error. The sequence \c@ encodes
+character code 0; the letters (in either case) encode characters 1-26 (hex 01
+to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
+\c? becomes either 255 (hex FF) or 95 (hex 5F).
+</P>
+<P>
+Thus, apart from \c?, these escapes generate the same character code values as
+they do in an ASCII environment, though the meanings of the values mostly
+differ. For example, \cG always generates code value 7, which is BEL in ASCII
+but DEL in EBCDIC.
+</P>
+<P>
+The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
+because 127 is not a control character in EBCDIC, Perl makes it generate the
+APC character. Unfortunately, there are several variants of EBCDIC. In most of
+them the APC character has the value 255 (hex FF), but in the one Perl calls
+POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
+values, PCRE2 makes \c? generate 95; otherwise it generates 255.
</P>
<P>
After \0 up to two further octal digits are read. If there are fewer than two
-digits, just those that are present are used. Thus the sequence \0\x\07
-specifies two binary zeros followed by a BEL character (code value 7). Make
+digits, just those that are present are used. Thus the sequence \0\x\015
+specifies two binary zeros followed by a CR character (code value 13). Make
sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit.
</P>
@@ -412,21 +428,24 @@ describe the old, ambiguous syntax.
</P>
<P>
The handling of a backslash followed by a digit other than 0 is complicated,
-and Perl has changed in recent releases, causing PCRE2 also to change. Outside
-a character class, PCRE2 reads the digit and any following digits as a decimal
-number. If the number is less than 8, or if there have been at least that many
-previous capturing left parentheses in the expression, the entire sequence is
-taken as a <i>back reference</i>. A description of how this works is given
+and Perl has changed over time, causing PCRE2 also to change.
+</P>
+<P>
+Outside a character class, PCRE2 reads the digit and any following digits as a
+decimal number. If the number is less than 10, begins with the digit 8 or 9, or
+if there are at least that many previous capturing left parentheses in the
+expression, the entire sequence is taken as a <i>back reference</i>. A
+description of how this works is given
<a href="#backreferences">later,</a>
following the discussion of
<a href="#subpattern">parenthesized subpatterns.</a>
+Otherwise, up to three octal digits are read to form a character code.
</P>
<P>
-Inside a character class, or if the decimal number following \ is greater than
-7 and there have not been that many capturing subpatterns, PCRE2 handles \8
-and \9 as the literal characters "8" and "9", and otherwise re-reads up to
-three octal digits following the backslash, using them to generate a data
-character. Any subsequent digits stand for themselves. For example:
+Inside a character class, PCRE2 handles \8 and \9 as the literal characters
+"8" and "9", and otherwise reads up to three octal digits following the
+backslash, using them to generate a data character. Any subsequent digits stand
+for themselves. For example, outside a character class:
<pre>
\040 is another way of writing an ASCII space
\40 is the same, provided there are fewer than 40 previous capturing subpatterns
@@ -436,7 +455,7 @@ character. Any subsequent digits stand for themselves. For example:
\0113 is a tab followed by the character "3"
\113 might be a back reference, otherwise the character with octal code 113
\377 might be a back reference, otherwise the value 255 (decimal)
- \81 is either a back reference, or the two characters "8" and "1"
+ \81 is always a back reference
</pre>
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
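Note on backslash followed by a non-zero digit: the revised rule for use outside a character class is compact enough to express as a decision function. The sketch below is a Python model of the wording in this hunk, not PCRE2 code. The name `classify_backslash_digits` and its return shape are hypothetical, and limits on the resulting character value are not modelled.

```python
import re


def classify_backslash_digits(digits, capture_groups_so_far):
    """Classify the digits after a backslash, outside a character class.

    Returns ("backref", number, "") or ("char", code, trailing_digits).
    The first digit must be 1-9; sequences starting with 0 follow the
    separate \\0 rule.
    """
    if not re.fullmatch(r"[1-9][0-9]*", digits):
        raise ValueError("expected decimal digits starting with 1-9")
    number = int(digits)
    if (
        number < 10
        or digits[0] in "89"
        or number <= capture_groups_so_far
    ):
        return ("backref", number, "")
    # Otherwise up to three octal digits form a character code, and any
    # remaining digits stand for themselves.
    octal = re.match(r"[0-7]{1,3}", digits).group()
    return ("char", int(octal, 8), digits[len(octal):])


print(classify_backslash_digits("40", 0))
print(classify_backslash_digits("81", 0))
print(classify_backslash_digits("11", 11))
```

With no capturing groups seen so far, "40" is read as the octal code for an ASCII space, while "81" is a back reference regardless of the group count.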
@@ -1105,15 +1124,19 @@ regular expression.
<P>
The circumflex and dollar metacharacters are zero-width assertions. That is,
they test for a particular condition being true without consuming any
-characters from the subject string.
+characters from the subject string. These two metacharacters are concerned with
+matching the starts and ends of lines. If the newline convention is set so that
+only the two-character sequence CRLF is recognized as a newline, isolated CR
+and LF characters are treated as ordinary data characters, and are not
+recognized as newlines.
</P>
<P>
Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching point is at
the start of the subject string. If the <i>startoffset</i> argument of
-<b>pcre2_match()</b> is non-zero, circumflex can never match if the
-PCRE2_MULTILINE option is unset. Inside a character class, circumflex has an
-entirely different meaning
+<b>pcre2_match()</b> is non-zero, or if PCRE2_NOTBOL is set, circumflex can
+never match if the PCRE2_MULTILINE option is unset. Inside a character class,
+circumflex has an entirely different meaning
<a href="#characterclass">(see below).</a>
</P>
<P>
@@ -1128,10 +1151,11 @@ to be anchored.)
<P>
The dollar character is an assertion that is true only if the current matching
point is at the end of the subject string, or immediately before a newline at
-the end of the string (by default). Note, however, that it does not actually
-match the newline. Dollar need not be the last character of the pattern if a
-number of alternatives are involved, but it should be the last item in any
-branch in which it appears. Dollar has no special meaning in a character class.
+the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
+that it does not actually match the newline. Dollar need not be the last
+character of the pattern if a number of alternatives are involved, but it
+should be the last item in any branch in which it appears. Dollar has no
+special meaning in a character class.
</P>
<P>
The meaning of dollar can be changed so that it matches only at the very end of
@@ -1139,13 +1163,13 @@ the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
does not affect the \Z assertion.
</P>
<P>
-The meanings of the circumflex and dollar characters are changed if the
-PCRE2_MULTILINE option is set. When this is the case, a circumflex matches
-immediately after internal newlines as well as at the start of the subject
-string. It does not match after a newline that ends the string. A dollar
-matches before any newlines in the string, as well as at the very end, when
-PCRE2_MULTILINE is set. When newline is specified as the two-character
-sequence CRLF, isolated CR and LF characters do not indicate newlines.
+The meanings of the circumflex and dollar metacharacters are changed if the
+PCRE2_MULTILINE option is set. When this is the case, a dollar character
+matches before any newlines in the string, as well as at the very end, and a
+circumflex matches immediately after internal newlines as well as at the start
+of the subject string. It does not match after a newline that ends the string,
+for compatibility with Perl. However, this can be changed by setting the
+PCRE2_ALT_CIRCUMFLEX option.
</P>
<P>
For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
@@ -1198,12 +1222,16 @@ whether or not a UTF mode is set. In the 8-bit library, one code unit is one
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
32-bit unit. Unlike a dot, \C always matches line-ending characters. The
feature is provided in Perl in order to match individual bytes in UTF-8 mode,
-but it is unclear how it can usefully be used. Because \C breaks up characters
-into individual code units, matching one unit with \C in a UTF mode means that
-the rest of the string may start with a malformed UTF character. This has
-undefined results, because PCRE2 assumes that it is dealing with valid UTF
-strings (and by default it checks this at the start of processing unless the
-PCRE2_NO_UTF_CHECK option is used).
+but it is unclear how it can usefully be used.
+</P>
+<P>
+Because \C breaks up characters into individual code units, matching one unit
+with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
+with a malformed UTF character. This has undefined results, because PCRE2
+assumes that it is matching character by character in a valid UTF string (by
+default it checks the subject string's validity at the start of processing
+unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
+use of \C by setting the PCRE2_NEVER_BACKSLASH_C option.
</P>
<P>
PCRE2 does not allow \C to appear in lookbehind assertions
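Note on \C in UTF modes: the hazard described in this hunk can be seen without PCRE2 at all. The Python sketch below only illustrates what "the rest of the string may start with a malformed UTF character" means for UTF-8. The helper name `tail_is_valid_utf8` is hypothetical, and consuming one code unit stands in for a successful \C match.

```python
def tail_is_valid_utf8(subject, code_units_consumed):
    """Report whether the rest of a UTF-8 subject is still valid UTF-8."""
    try:
        subject[code_units_consumed:].decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "é" occupies two code units (bytes) in UTF-8: C3 A9.
subject = "é!".encode("utf-8")

# Consuming the whole character leaves a valid tail.
print(tail_is_valid_utf8(subject, 2))
# Consuming a single code unit, as \C would, leaves a tail that starts
# with a continuation byte.
print(tail_is_valid_utf8(subject, 1))
```

An application that cannot rule out such patterns can set PCRE2_NEVER_BACKSLASH_C, as the new text says, so that \C is rejected at compile time.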
@@ -1475,7 +1503,8 @@ unset these options by preceding the letter with a hyphen, and a combined
setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and
PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
permitted. If a letter appears both before and after the hyphen, the option is
-unset.
+unset. An empty options setting "(?)" is allowed. Needless to say, it has no
+effect.
</P>
<P>
The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
@@ -1508,11 +1537,20 @@ option settings happen at compile time. There would be some very weird
behaviour otherwise.
</P>
<P>
+As a convenient shorthand, if any option settings are required at the start of
+a non-capturing subpattern (see the next section), the option letters may
+appear between the "?" and the ":". Thus the two patterns
+<pre>
+ (?i:saturday|sunday)
+ (?:(?i)saturday|sunday)
+</pre>
+match exactly the same set of strings.
+</P>
+<P>
<b>Note:</b> There are other PCRE2-specific options that can be set by the
-application when the compiling function is called.
-The pattern can contain special leading sequences such as (*CRLF) to override
-what the application has set or what has been defaulted. Details are given in
-the section entitled
+application when the compiling function is called. The pattern can contain
+special leading sequences such as (*CRLF) to override what the application has
+set or what has been defaulted. Details are given in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
above. There are also the (*UTF) and (*UCP) leading sequences that can be used
to set UTF and Unicode property modes; they are equivalent to setting the
@@ -2841,10 +2879,10 @@ condition.
Callouts with string arguments
</b><br>
<P>
-A delimited string may be used instead of a number as a callout argument. The
-starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
-the same as the start, except for {, where the ending delimiter is }. If the
-ending delimiter is needed within the string, it must be doubled. For
+A delimited string may be used instead of a number as a callout argument. The
+starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
+the same as the start, except for {, where the ending delimiter is }. If the
+ending delimiter is needed within the string, it must be doubled. For
example:
<pre>
(?C'ab ''c'' d')xyz(?C{any text})pqr
@@ -3285,7 +3323,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 15 March 2015
+Last updated: 13 June 2015
<br>
Copyright &copy; 1997-2015 University of Cambridge.
<br>
diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html
index a0a1b99..28ba023 100644
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
-<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
+<li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
@@ -55,11 +55,12 @@ documentation. This document contains a quick-reference summary of the syntax.
\Q...\E treat enclosed characters as literal
</PRE>
</P>
-<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
+<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
<P>
+This table applies to ASCII and Unicode environments.
<pre>
\a alarm, that is, the BEL character (hex 07)
- \cx "control-x", where x is any ASCII character
+ \cx "control-x", where x is any ASCII printing character
\e escape (hex 1B)
\f form feed (hex 0C)
\n newline (hex 0A)
@@ -68,18 +69,32 @@ documentation. This document contains a quick-reference summary of the syntax.
\0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd..
+ \U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
+ \uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..
</pre>
-Note that \0dd is always an octal code, and that \8 and \9 are the literal
-characters "8" and "9".
+Note that \0dd is always an octal code. The treatment of backslash followed by
+a non-zero digit is complicated; for details see the section
+<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
+in the
+<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
+documentation, where details of escape processing in EBCDIC environments are
+also given.
+</P>
+<P>
+When \x is not followed by {, from zero to two hexadecimal digits are read,
+but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
+be recognized as a hexadecimal escape; otherwise it matches a literal "x".
+Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
+it matches a literal "u".
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
. any character except newline;
in dotall mode, any character whatsoever
- \C one data unit, even in UTF mode (best avoided)
+ \C one code unit, even in UTF mode (best avoided)
\d a decimal digit
\D a character that is not a decimal digit
\h a horizontal white space character
@@ -96,6 +111,11 @@ characters "8" and "9".
\W a "non-word" character
\X a Unicode extended grapheme cluster
</pre>
+The application can lock out the use of \C by setting the
+PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
+current matching point in the middle of a UTF-8 or UTF-16 character.
+</P>
+<P>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
@@ -348,13 +368,14 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
\b word boundary
\B not a word boundary
^ start of subject
- also after internal newline in multiline mode
+ also after an internal newline in multiline mode
+ (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
\A start of subject
$ end of subject
- also before newline at end of subject
- also before internal newline in multiline mode
+ also before newline at end of subject
+ also before internal newline in multiline mode
\Z end of subject
- also before newline at end of subject
+ also before newline at end of subject
\z end of subject
\G first matching position in subject
</PRE>
@@ -423,7 +444,9 @@ appear.
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
-limits set by the caller of pcre2_match(), not increase them.
+limits set by the caller of pcre2_match(), not increase them. The application
+can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
+PCRE2_NEVER_UCP options, respectively, at compile time.
</P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
@@ -539,9 +562,9 @@ pattern is not anchored.
(?Cn) callout with numerical data n
(?C"text") callout with string data
</pre>
-The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
-start and the end), and the starting delimiter { matched with the ending
-delimiter }. To encode the ending delimiter within the string, double it.
+The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
+start and the end), and the starting delimiter { matched with the ending
+delimiter }. To encode the ending delimiter within the string, double it.
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
@@ -559,7 +582,7 @@ Cambridge, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 15 March 2015
+Last updated: 13 June 2015
<br>
Copyright &copy; 1997-2015 University of Cambridge.
<br>
diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html
index aee6edc..5165c1e 100644
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@@ -94,7 +94,7 @@ below). The input is processed using using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
treats any bytes other than newline as data characters. In some Windows
environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+further data is read.
</P>
<P>
For maximum portability, therefore, it is safest to avoid non-printing
@@ -284,13 +284,20 @@ following commands are recognized:
#forbid_utf
</pre>
Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP
-options set, which locks out the use of UTF and Unicode property features. This
-is a trigger guard that is used in test files to ensure that UTF or Unicode
-property tests are not accidentally added to files that are used when Unicode
-support is not included in the library. This effect can also be obtained by the
-use of <b>#pattern</b>; the difference is that <b>#forbid_utf</b> cannot be
-unset, and the automatic options are not displayed in pattern information, to
-avoid cluttering up test output.
+options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and
+the use of (*UTF) and (*UCP) at the start of patterns. This command also forces
+an error if a subsequent pattern contains any occurrences of \P, \p, or \X,
+which are still supported when PCRE2_UTF is not set, but which require Unicode
+property support to be included in the library.
+</P>
+<P>
+This is a trigger guard that is used in test files to ensure that UTF or
+Unicode property tests are not accidentally added to files that are used when
+Unicode support is not included in the library. Setting PCRE2_NEVER_UTF and
+PCRE2_NEVER_UCP as a default can also be obtained by the use of <b>#pattern</b>;
+the difference is that <b>#forbid_utf</b> cannot be unset, and the automatic
+options are not displayed in pattern information, to avoid cluttering up test
+output.
<pre>
#load &#60;filename&#62;
</pre>
@@ -471,6 +478,7 @@ for a description of their effects.
<pre>
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
alt_bsux set PCRE2_ALT_BSUX
+ alt_circumflex set PCRE2_ALT_CIRCUMFLEX
anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT
/i caseless set PCRE2_CASELESS
@@ -481,6 +489,7 @@ for a description of their effects.
firstline set PCRE2_FIRSTLINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
/m multiline set PCRE2_MULTILINE
+ never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP
never_utf set PCRE2_NEVER_UTF
no_auto_capture set PCRE2_NO_AUTO_CAPTURE
@@ -506,7 +515,7 @@ about the pattern:
<pre>
bsr=[anycrlf|unicode] specify \R handling
/B bincode show binary code without lengths
- callout_info show callout information
+ callout_info show callout information
debug same as info,fullbincode
fullbincode show binary code with lengths
/I info show info about compiled pattern
@@ -589,9 +598,9 @@ not necessarily the last character. These lines are omitted if no starting or
ending code units are recorded.
</P>
<P>
-The <b>callout_info</b> modifier requests information about all the callouts in
-the pattern. A list of them is output at the end of any other information that
-is requested. For each callout, either its number or string is given, followed
+The <b>callout_info</b> modifier requests information about all the callouts in
+the pattern. A list of them is output at the end of any other information that
+is requested. For each callout, either its number or string is given, followed
by the item that follows it in the pattern.
</P>
<br><b>
@@ -1460,7 +1469,7 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 22 March 2015
+Last updated: 20 May 2015
<br>
Copyright &copy; 1997-2015 University of Cambridge.
<br>