author ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2017-12-22 15:56:27 +0000
committer ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2017-12-22 15:56:27 +0000
commit 1c19b1fe61481390f7c5b33d5a67cd7b9978f4ba (patch)
tree b5d9ef472dc977ae6bdbf731b2c0a2d90635a2e8
parent a0ed1419b31b7a3c778223d6ab45bec4dc491bda (diff)
download pcre2-1c19b1fe61481390f7c5b33d5a67cd7b9978f4ba.tar.gz
Add callout_flags to callout blocks, and set bits within it from pcre2_match() interpretation.

git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@893 6239d852-aaf2-0410-a92c-79f79f948069
-rw-r--r-- ChangeLog 54
-rw-r--r-- doc/html/pcre2_match.html 8
-rw-r--r-- doc/html/pcre2_pattern_info.html 4
-rw-r--r-- doc/html/pcre2api.html 62
-rw-r--r-- doc/html/pcre2callout.html 57
-rw-r--r-- doc/html/pcre2grep.html 56
-rw-r--r-- doc/html/pcre2test.html 161
-rw-r--r-- doc/pcre2.txt 1398
-rw-r--r-- doc/pcre2api.3 2
-rw-r--r-- doc/pcre2callout.3 56
-rw-r--r-- doc/pcre2grep.txt 809
-rw-r--r-- doc/pcre2test.1 154
-rw-r--r-- doc/pcre2test.txt 170
-rw-r--r-- src/pcre2.h 7
-rw-r--r-- src/pcre2.h.in 7
-rw-r--r-- src/pcre2_dfa_match.c 6
-rw-r--r-- src/pcre2_jit_compile.c 3
-rw-r--r-- src/pcre2_match.c 15
-rw-r--r-- src/pcre2test.c 65
-rw-r--r-- testdata/testinput2 12
-rw-r--r-- testdata/testoutput15 24
-rw-r--r-- testdata/testoutput2 146
-rw-r--r-- testdata/testoutput5 4
-rw-r--r-- testdata/testoutput6 22
24 files changed, 1898 insertions, 1404 deletions
diff --git a/ChangeLog b/ChangeLog
index 3c4442d..da4f2e3 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -16,10 +16,10 @@ that is called by both pcre2_match() and pcre2_dfa_match().
4. Add new pcre2_config() options: PCRE2_CONFIG_NEVER_BACKSLASH_C and
PCRE2_CONFIG_COMPILED_WIDTHS.
-5. Cut out \C tests in the JIT regression tests when NEVER_BACKSLASH_C is
+5. Cut out \C tests in the JIT regression tests when NEVER_BACKSLASH_C is
defined (e.g. by --enable-never-backslash-C).
-6. Defined public names for all the pcre2_compile() error numbers, and used
+6. Defined public names for all the pcre2_compile() error numbers, and used
the public names in pcre2_convert.c.
7. Fixed a small memory leak in pcre2test (convert contexts).
@@ -30,8 +30,8 @@ the public names in pcre2_convert.c.
PCRE2GREP_RC to the exit status, because VMS does not distinguish between
exit(0) and exit(1).
-10. Added the -LM (list modifiers) option to pcre2test. Also made -C complain
-about a bad option only if the following argument item does not start with a
+10. Added the -LM (list modifiers) option to pcre2test. Also made -C complain
+about a bad option only if the following argument item does not start with a
hyphen.
11. pcre2grep was truncating components of file names to 128 characters when
@@ -39,20 +39,20 @@ processing files with the -r option, and also (some very odd code) truncating
path names to 512 characters. There is now a check on the absolute length of
full path file names, which may be up to 2047 characters long.
-12. When an assertion contained (*ACCEPT) it caused all open capturing groups
-to be closed (as for a non-assertion ACCEPT), which was wrong and could lead to
-misbehaviour for subsequent references to groups that started outside the
-recursion. ACCEPT in an assertion now closes only those groups that were
+12. When an assertion contained (*ACCEPT) it caused all open capturing groups
+to be closed (as for a non-assertion ACCEPT), which was wrong and could lead to
+misbehaviour for subsequent references to groups that started outside the
+recursion. ACCEPT in an assertion now closes only those groups that were
started within that assertion. Fixes oss-fuzz issues 3852 and 3891.
-13. Multiline matching in pcre2grep was misbehaving if the pattern matched
-within a line, and then matched again at the end of the line and over into
-subsequent lines. Behaviour was different with and without colouring, and
-sometimes context lines were incorrectly printed and/or line endings were lost.
+13. Multiline matching in pcre2grep was misbehaving if the pattern matched
+within a line, and then matched again at the end of the line and over into
+subsequent lines. Behaviour was different with and without colouring, and
+sometimes context lines were incorrectly printed and/or line endings were lost.
All these issues should now be fixed.
-14. If --line-buffered was specified for pcre2grep when input was from a
-compressed file (.gz or .bz2) a segfault occurred. (Line buffering should be
+14. If --line-buffered was specified for pcre2grep when input was from a
+compressed file (.gz or .bz2) a segfault occurred. (Line buffering should be
ignored for compressed files.)
15. Although pcre2_jit_match checks whether the pattern is compiled
@@ -60,26 +60,26 @@ in a given mode, it was also expected that at least one mode is available.
This is fixed and pcre2_jit_match returns with PCRE2_ERROR_JIT_BADOPTION
when the pattern is not optimized by JIT at all.
-16. The line number and related variables such as match counts in pcre2grep
-were all int variables, causing overflow when files with more than 2147483647
-lines were processed (assuming 32-bit ints). They have all been changed to
+16. The line number and related variables such as match counts in pcre2grep
+were all int variables, causing overflow when files with more than 2147483647
+lines were processed (assuming 32-bit ints). They have all been changed to
unsigned long ints.
-17. If a backreference with a minimum repeat count of zero was first in a
-pattern, apart from assertions, an incorrect first matching character could be
-recorded. For example, for the pattern /(?=(a))\1?b/, "b" was incorrectly set
+17. If a backreference with a minimum repeat count of zero was first in a
+pattern, apart from assertions, an incorrect first matching character could be
+recorded. For example, for the pattern /(?=(a))\1?b/, "b" was incorrectly set
as the first character of a match.
18. Characters in a leading positive assertion are considered for recording a
first character of a match when the rest of the pattern does not provide one.
However, a character in a non-assertive group within a leading assertion such
-as in the pattern /(?=(a))\1?b/ caused this process to fail. This was an
-infelicity rather than an outright bug, because it did not affect the result of
-a match, just its speed. (In fact, in this case, the starting 'a' was
+as in the pattern /(?=(a))\1?b/ caused this process to fail. This was an
+infelicity rather than an outright bug, because it did not affect the result of
+a match, just its speed. (In fact, in this case, the starting 'a' was
subsequently picked up in the study.)
19. A minor tidy in pcre2_match(): making all PCRE2_ERROR_ returns use "return"
-instead of "RRETURN" saves unwinding the backtracks in these cases (only one
+instead of "RRETURN" saves unwinding the backtracks in these cases (only one
didn't).
20. Allocate a single callout block on the stack at the start of pcre2_match()
@@ -89,6 +89,12 @@ and set its never-changing fields once only.
compiled pattern (they were not previously saved), add PCRE2_INFO_EXTRAOPTIONS
to retrieve them, and update pcre2test to show them.
+22. Added PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK bits to a new
+field callout_flags in callout blocks. The bits are set by pcre2_match(), but
+not by JIT or pcre2_dfa_match(). Their settings are shown in pcre2test callouts
+if the callout_extra subject modifier is set. These bits are provided to help
+with tracking how a backtracking match is proceeding.
+
Version 10.30 14-August-2017
----------------------------
diff --git a/doc/html/pcre2_match.html b/doc/html/pcre2_match.html
index 5f6f0b1..ced70bb 100644
--- a/doc/html/pcre2_match.html
+++ b/doc/html/pcre2_match.html
@@ -30,7 +30,13 @@ DESCRIPTION
<P>
This function matches a compiled regular expression against a given subject
string, using a matching algorithm that is similar to Perl's. It returns
-offsets to captured substrings. Its arguments are:
+offsets to what it has matched and to captured substrings via the
+<b>match_data</b> block, which can be processed by functions with names that
+start with <b>pcre2_get_ovector_...()</b> or <b>pcre2_substring_...()</b>. The
+return from <b>pcre2_match()</b> is one more than the highest numbered capturing
+pair that has been set (for example, 1 if there are no captures), zero if the
+vector of offsets is too small, or a negative error code for no match and other
+errors. The function arguments are:
<pre>
<i>code</i> Points to the compiled pattern
<i>subject</i> Points to the subject string
diff --git a/doc/html/pcre2_pattern_info.html b/doc/html/pcre2_pattern_info.html
index ae3e7ff..1ebf90b 100644
--- a/doc/html/pcre2_pattern_info.html
+++ b/doc/html/pcre2_pattern_info.html
@@ -27,7 +27,7 @@ DESCRIPTION
<P>
This function returns information about a compiled pattern. Its arguments are:
<pre>
- <i>code</i> Pointer to a compiled regular expression
+ <i>code</i> Pointer to a compiled regular expression pattern
<i>what</i> What information is required
<i>where</i> Where to put the information
</pre>
@@ -42,6 +42,8 @@ request are as follows:
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
PCRE2_INFO_DEPTHLIMIT Backtracking depth limit if set, otherwise PCRE2_ERROR_UNSET
+ PCRE2_INFO_EXTRAOPTIONS Extra options that were passed in the
+ compile context
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
0 nothing set
diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
index 4e98cd6..abdd0ff 100644
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -920,11 +920,15 @@ The <i>offset_limit</i> parameter limits how far an unanchored search can
advance in the subject string. The default value is PCRE2_UNSET. The
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> functions return
PCRE2_ERROR_NOMATCH if a match with a starting point before or at the given
-offset is not found. For example, if the pattern /abc/ is matched against
-"123abc" with an offset limit less than 3, the result is PCRE2_ERROR_NO_MATCH.
-A match can never be found if the <i>startoffset</i> argument of
-<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> is greater than the offset
-limit.
+offset is not found. The <b>pcre2_substitute()</b> function makes no more
+substitutions.
+</P>
+<P>
+For example, if the pattern /abc/ is matched against "123abc" with an offset
+limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match can never be
+found if the <i>startoffset</i> argument of <b>pcre2_match()</b>,
+<b>pcre2_dfa_match()</b>, or <b>pcre2_substitute()</b> is greater than the offset
+limit set in the match context.
</P>
<P>
When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT option when
@@ -934,10 +938,11 @@ PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
</P>
<P>
The offset limit facility can be used to track progress when searching large
-subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to
-start within the first line of the subject. If this is set with an offset
-limit, a match must occur in the first line and also within the offset limit.
-In other words, whichever limit comes first is used.
+subject strings or to limit the extent of global substitutions. See also the
+PCRE2_FIRSTLINE option, which requires a match to start within the first line
+of the subject. If this is set with an offset limit, a match must occur in the
+first line and also within the offset limit. In other words, whichever limit
+comes first is used.
<br>
<br>
<b>int pcre2_set_heap_limit(pcre2_match_context *<i>mcontext</i>,</b>
@@ -1940,12 +1945,15 @@ are as follows:
<pre>
PCRE2_INFO_ALLOPTIONS
PCRE2_INFO_ARGOPTIONS
+ PCRE2_INFO_EXTRAOPTIONS
</pre>
-Return a copy of the pattern's options. The third argument should point to a
+Return copies of the pattern's options. The third argument should point to a
<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
the compile options as modified by any top-level (*XXX) option settings such as
-(*UTF) at the start of the pattern itself.
+(*UTF) at the start of the pattern itself. PCRE2_INFO_EXTRAOPTIONS returns the
+extra options that were set in the compile context by calling the
+pcre2_set_compile_extra_options() function.
</P>
<P>
For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
@@ -3157,13 +3165,27 @@ options can be set in the <i>options</i> argument of <b>pcre2_substitute()</b>.
</P>
<P>
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
-replacing every matching substring. If this is not set, only the first matching
-substring is replaced. If any matched substring has zero length, after the
-substitution has happened, an attempt to find a non-empty match at the same
-position is performed. If this is not successful, the current position is
-advanced by one character except when CRLF is a valid newline sequence and the
-next two characters are CR, LF. In this case, the current position is advanced
-by two characters.
+replacing every matching substring. If this option is not set, only the first
+matching substring is replaced. The search for matches takes place in the
+original subject string (that is, previous replacements do not affect it).
+Iteration is implemented by advancing the <i>startoffset</i> value for each
+search, which is always passed the entire subject string. If an offset limit is
+set in the match context, searching stops when that limit is reached.
+</P>
+<P>
+You can restrict the effect of a global substitution to a portion of the
+subject string by setting either or both of <i>startoffset</i> and an offset
+limit. Here is a <b>pcre2test</b> example:
+<pre>
+ /B/g,replace=!,use_offset_limit
+ ABC ABC ABC ABC\=offset=3,offset_limit=12
+ 2: ABC A!C A!C ABC
+</pre>
+When continuing with global substitutions after matching a substring with zero
+length, an attempt to find a non-empty match at the same offset is performed.
+If this is not successful, the offset is advanced by one character except when
+CRLF is a valid newline sequence and the next two characters are CR, LF. In
+this case, the offset is advanced by two characters.
</P>
<P>
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
@@ -3398,7 +3420,7 @@ Here is an example of a simple call to <b>pcre2_dfa_match()</b>:
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
- match_data, /* the match data block */
+ md, /* the match data block */
NULL, /* a match context; NULL means use defaults */
wspace, /* working space vector */
20); /* number of elements (NOT size in bytes) */
@@ -3567,7 +3589,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 13 October 2017
+Last updated: 16 December 2017
<br>
Copyright &copy; 1997-2017 University of Cambridge.
<br>
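
Note on the pcre2api.html hunks above: the documented pcre2test example (offset=3, offset_limit=12 on "ABC ABC ABC ABC") can be reproduced with a toy model. The sketch below is not PCRE2 code and makes strong simplifying assumptions: the "pattern" is a single literal character, the replacement is one character, and demo_global_replace is a hypothetical name. It only illustrates that matches are sought in the original subject and that a match must start at or after the start offset and at or before the offset limit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of a windowed global substitution. Every occurrence of `needle`
   whose position lies in [start_offset, offset_limit] is replaced by `repl`.
   The search always looks at the original subject, never at the output, so
   earlier replacements cannot influence later matches. `out` must have room
   for strlen(subject) + 1 bytes. */
void demo_global_replace(const char *subject, char needle, char repl,
                         size_t start_offset, size_t offset_limit, char *out)
{
  size_t len;
  size_t i;

  assert(subject != NULL && out != NULL);
  len = strlen(subject);
  memcpy(out, subject, len + 1);            /* start from an unchanged copy */

  for (i = start_offset; i < len && i <= offset_limit; i++) {
    if (subject[i] == needle) out[i] = repl;
  }
}
```

With the subject "ABC ABC ABC ABC", needle 'B', replacement '!', start offset 3 and offset limit 12, the B characters at offsets 5 and 9 fall inside the window and the ones at offsets 1 and 13 do not, which matches the documented result "ABC A!C A!C ABC".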
diff --git a/doc/html/pcre2callout.html b/doc/html/pcre2callout.html
index 5d24ebe..2adf21a 100644
--- a/doc/html/pcre2callout.html
+++ b/doc/html/pcre2callout.html
@@ -206,18 +206,20 @@ callouts such as the example above are obeyed.
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
<P>
During matching, when PCRE2 reaches a callout point, if an external function is
-provided in the match context, it is called. This applies to both normal and
-DFA matching. The first argument to the callout function is a pointer to a
-<b>pcre2_callout</b> block. The second argument is the void * callout data that
-was supplied when the callout was set up by calling <b>pcre2_set_callout()</b>
-(see the
+provided in the match context, it is called. This applies to normal,
+DFA, and JIT matching. The first argument to the callout function is a pointer
+to a <b>pcre2_callout</b> block. The second argument is the void * callout data
+that was supplied when the callout was set up by calling
+<b>pcre2_set_callout()</b> (see the
<a href="pcre2api.html"><b>pcre2api</b></a>
-documentation). The callout block structure contains the following fields:
+documentation). The callout block structure contains the following fields, not
+necessarily in this order:
<pre>
uint32_t <i>version</i>;
uint32_t <i>callout_number</i>;
uint32_t <i>capture_top</i>;
uint32_t <i>capture_last</i>;
+ uint32_t <i>callout_flags</i>;
PCRE2_SIZE *<i>offset_vector</i>;
PCRE2_SPTR <i>mark</i>;
PCRE2_SPTR <i>subject</i>;
@@ -231,11 +233,12 @@ documentation). The callout block structure contains the following fields:
PCRE2_SPTR <i>callout_string</i>;
</pre>
The <i>version</i> field contains the version number of the block format. The
-current version is 1; the three callout string fields were added for this
-version. If you are writing an application that might use an earlier release of
-PCRE2, you should check the version number before accessing any of these
-fields. The version number will increase in future if more fields are added,
-but the intention is never to remove any of the existing fields.
+current version is 2; the three callout string fields were added for version 1,
+and the <i>callout_flags</i> field for version 2. If you are writing an
+application that might use an earlier release of PCRE2, you should check the
+version number before accessing any of these fields. The version number will
+increase in future if more fields are added, but the intention is never to
+remove any of the existing fields.
</P>
<br><b>
Fields for numerical callouts
@@ -358,6 +361,36 @@ the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
of (*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
callouts from the DFA matching function this field always contains NULL.
</P>
+<P>
+The <i>callout_flags</i> field is always zero in callouts from
+<b>pcre2_dfa_match()</b> or when JIT is being used. When <b>pcre2_match()</b>
+without JIT is used, the following bits may be set:
+<pre>
+ PCRE2_CALLOUT_STARTMATCH
+</pre>
+This is set for the first callout after the start of matching for each new
+starting position in the subject.
+<pre>
+ PCRE2_CALLOUT_BACKTRACK
+</pre>
+This is set if there has been a matching backtrack since the previous callout,
+or since the start of matching if this is the first callout from a
+<b>pcre2_match()</b> run.
+</P>
+<P>
+Both bits are set when a backtrack has caused a "bumpalong" to a new starting
+position in the subject. Output from <b>pcre2test</b> does not indicate the
+presence of these bits unless the <b>callout_extra</b> modifier is set.
+</P>
+<P>
+The information in the <b>callout_flags</b> field is provided so that
+applications can track and tell their users how matching with backtracking is
+done. This can be useful when trying to optimize patterns, or just to
+understand how PCRE2 works. There is no support in <b>pcre2_dfa_match()</b>
+because there is no backtracking in DFA matching, and there is no support in
+JIT because JIT is all about maximizing matching performance. In both these
+cases the <b>callout_flags</b> field is always zero.
+</P>
<br><a name="SEC5" href="#TOC1">RETURN VALUES FROM CALLOUTS</a><br>
<P>
The external callout function returns an integer to PCRE2. If the value is
@@ -428,7 +461,7 @@ Cambridge, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 14 April 2017
+Last updated: 22 December 2017
<br>
Copyright &copy; 1997-2017 University of Cambridge.
<br>
diff --git a/doc/html/pcre2grep.html b/doc/html/pcre2grep.html
index 0db66ff..27889d8 100644
--- a/doc/html/pcre2grep.html
+++ b/doc/html/pcre2grep.html
@@ -133,11 +133,13 @@ The <b>--locale</b> option can be used to override this.
<br><a name="SEC3" href="#TOC1">SUPPORT FOR COMPRESSED FILES</a><br>
<P>
It is possible to compile <b>pcre2grep</b> so that it uses <b>libz</b> or
-<b>libbz2</b> to read files whose names end in <b>.gz</b> or <b>.bz2</b>,
-respectively. You can find out whether your binary has support for one or both
-of these file types by running it with the <b>--help</b> option. If the
-appropriate support is not present, files are treated as plain text. The
-standard input is always so treated.
+<b>libbz2</b> to read compressed files whose names end in <b>.gz</b> or
+<b>.bz2</b>, respectively. You can find out whether your <b>pcre2grep</b> binary
+has support for one or both of these file types by running it with the
+<b>--help</b> option. If the appropriate support is not present, all files are
+treated as plain text. The standard input is always so treated. When input is
+from a compressed .gz or .bz2 file, the <b>--line-buffered</b> option is
+ignored.
</P>
<br><a name="SEC4" href="#TOC1">BINARY FILES</a><br>
<P>
@@ -151,7 +153,7 @@ of changing the way binary files are handled.
<br><a name="SEC5" href="#TOC1">OPTIONS</a><br>
<P>
The order in which some of the options appear can affect the output. For
-example, both the <b>-h</b> and <b>-l</b> options affect the printing of file
+example, both the <b>-H</b> and <b>-l</b> options affect the printing of file
names. Whichever comes later in the command line will be the one that takes
effect. Similarly, except where noted below, if an option is given twice, the
later setting is used. Numerical values for options may be followed by K or M,
@@ -396,14 +398,16 @@ searching a single file. By default, the file name is not shown in this case.
For matching lines, the file name is followed by a colon; for context lines, a
hyphen separator is used. If a line number is also being output, it follows the
file name. When the <b>-M</b> option causes a pattern to match more than one
-line, only the first is preceded by the file name.
+line, only the first is preceded by the file name. This option overrides any
+previous <b>-h</b>, <b>-l</b>, or <b>-L</b> options.
</P>
<P>
<b>-h</b>, <b>--no-filename</b>
Suppress the output file names when searching multiple files. By default,
file names are shown when multiple files are searched. For matching lines, the
file name is followed by a colon; for context lines, a hyphen separator is used.
-If a line number is also being output, it follows the file name.
+If a line number is also being output, it follows the file name. This option
+overrides any previous <b>-H</b>, <b>-L</b>, or <b>-l</b> options.
</P>
<P>
<b>--heap-limit</b>=<i>number</i>
@@ -460,17 +464,19 @@ given any number of times. If a directory matches both <b>--include-dir</b> and
<b>-L</b>, <b>--files-without-match</b>
Instead of outputting lines from the files, just output the names of the files
that do not contain any lines that would have been output. Each file name is
-output once, on a separate line.
+output once, on a separate line. This option overrides any previous <b>-H</b>,
+<b>-h</b>, or <b>-l</b> options.
</P>
<P>
<b>-l</b>, <b>--files-with-matches</b>
Instead of outputting lines from the files, just output the names of the files
-containing lines that would have been output. Each file name is output
-once, on a separate line. Searching normally stops as soon as a matching line
-is found in a file. However, if the <b>-c</b> (count) option is also used,
-matching continues in order to obtain the correct count, and those files that
-have at least one match are listed along with their counts. Using this option
-with <b>-c</b> is a way of suppressing the listing of files with no matches.
+containing lines that would have been output. Each file name is output once, on
+a separate line. Searching normally stops as soon as a matching line is found
+in a file. However, if the <b>-c</b> (count) option is also used, matching
+continues in order to obtain the correct count, and those files that have at
+least one match are listed along with their counts. Using this option with
+<b>-c</b> is a way of suppressing the listing of files with no matches. This
+option overrides any previous <b>-H</b>, <b>-h</b>, or <b>-L</b> options.
</P>
<P>
<b>--label</b>=<i>name</i>
@@ -480,14 +486,16 @@ short form for this option.
</P>
<P>
<b>--line-buffered</b>
-When this option is given, input is read and processed line by line, and the
-output is flushed after each write. By default, input is read in large chunks,
-unless <b>pcre2grep</b> can determine that it is reading from a terminal (which
-is currently possible only in Unix-like environments). Output to terminal is
-normally automatically flushed by the operating system. This option can be
-useful when the input or output is attached to a pipe and you do not want
-<b>pcre2grep</b> to buffer up large amounts of data. However, its use will
-affect performance, and the <b>-M</b> (multiline) option ceases to work.
+When this option is given, non-compressed input is read and processed line by
+line, and the output is flushed after each write. By default, input is read in
+large chunks, unless <b>pcre2grep</b> can determine that it is reading from a
+terminal (which is currently possible only in Unix-like environments). Output
+to terminal is normally automatically flushed by the operating system. This
+option can be useful when the input or output is attached to a pipe and you do
+not want <b>pcre2grep</b> to buffer up large amounts of data. However, its use
+will affect performance, and the <b>-M</b> (multiline) option ceases to work.
+When input is from a compressed .gz or .bz2 file, <b>--line-buffered</b> is
+ignored.
</P>
<P>
<b>--line-offsets</b>
@@ -941,7 +949,7 @@ Cambridge, England.
</P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 11 October 2017
+Last updated: 13 November 2017
<br>
Copyright &copy; 1997-2017 University of Cambridge.
<br>
diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html
index 3427787..7d98d90 100644
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@@ -159,6 +159,12 @@ Behave as if each pattern has the <b>auto_callout</b> modifier, that is, insert
automatic callouts into every pattern that is compiled.
</P>
<P>
+<b>-AC</b>
+As for <b>-ac</b>, but in addition behave as if each subject line has the
+<b>callout_extra</b> modifier, that is, show additional information from
+callouts.
+</P>
+<P>
<b>-b</b>
Behave as if each pattern has the <b>fullbincode</b> modifier; the full
internal binary form of the pattern is output after compilation.
@@ -243,8 +249,8 @@ available, and the use of JIT is verified.
</P>
<P>
<b>-LM</b>
-List modifiers: write a list of available pattern and subject modifiers to the
-standard output, then exit with zero exit code. All other options are ignored.
+List modifiers: write a list of available pattern and subject modifiers to the
+standard output, then exit with zero exit code. All other options are ignored.
If both -C and -LM are present, whichever is first is recognized.
</P>
<P>
@@ -1182,6 +1188,7 @@ pattern.
callout_capture show captures at callout time
callout_data=&#60;n&#62; set a value to pass via callouts
callout_error=&#60;n&#62;[:&#60;m&#62;] control callout error
+ callout_extra show extra callout information
callout_fail=&#60;n&#62;[:&#60;m&#62;] control callout failure
callout_no_where do not show position of a callout
callout_none do not supply a callout function
@@ -1694,49 +1701,10 @@ documentation.
<br><a name="SEC16" href="#TOC1">CALLOUTS</a><br>
<P>
If the pattern contains any callout requests, <b>pcre2test</b>'s callout
-function is called during matching unless <b>callout_none</b> is specified.
-This works with both matching functions.
-</P>
-<P>
-The callout function in <b>pcre2test</b> returns zero (carry on matching) by
-default, but you can use a <b>callout_fail</b> modifier in a subject line to
-change this and other parameters of the callout.
-</P>
-<P>
-If <b>callout_capture</b> is set, the current captured groups are output when a
-callout occurs. By default, the callout function then generates output that
-indicates where the current match start and matching points are in the subject,
-and what the next pattern item is. This output is suppressed if the
-<b>callout_no_where</b> modifier is set.
-</P>
-<P>
-The default return from the callout function is zero, which allows matching to
-continue. The <b>callout_fail</b> modifier can be given one or two numbers. If
-there is only one number, 1 is returned instead of 0 (causing matching to
-backtrack) when a callout of that number is reached. If two numbers (&#60;n&#62;:&#60;m&#62;)
-are given, 1 is returned when callout &#60;n&#62; is reached and there have been at
-least &#60;m&#62; callouts. The <b>callout_error</b> modifier is similar, except that
-PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
-aborted. If both these modifiers are set for the same callout number,
-<b>callout_error</b> takes precedence. Note that callouts with string arguments
-are always given the number zero. See
-</P>
-<P>
-The <b>callout_data</b> modifier can be given an unsigned or a negative number.
-This is set as the "user data" that is passed to the matching function, and
-passed back when the callout function is invoked. Any value other than zero is
-used as a return from <b>pcre2test</b>'s callout function.
-</P>
-<P>
-Inserting callouts can be helpful when using <b>pcre2test</b> to check
-complicated regular expressions. For further information about callouts, see
-the
-<a href="pcre2callout.html"><b>pcre2callout</b></a>
-documentation.
-</P>
-<P>
-The output for callouts with numerical arguments and those with string
-arguments is slightly different.
+function is called during matching unless <b>callout_none</b> is specified. This
+works with both matching functions, and with JIT, though there are some
+differences in behaviour. The output for callouts with numerical arguments and
+those with string arguments is slightly different.
</P>
<br><b>
Callouts with numerical arguments
@@ -1811,6 +1779,107 @@ example:
</PRE>
</P>
+<br><b>
+Callout modifiers
+</b><br>
+<P>
+The callout function in <b>pcre2test</b> returns zero (carry on matching) by
+default, but you can use a <b>callout_fail</b> modifier in a subject line to
+change this and other parameters of the callout (see below).
+</P>
+<P>
+If the <b>callout_capture</b> modifier is set, the current captured groups are
+output when a callout occurs. This is useful only for non-DFA matching, as
+<b>pcre2_dfa_match()</b> does not support capturing, so no captures are ever
+shown.
+</P>
+<P>
+The normal callout output, showing the callout number or pattern offset (as
+described above) is suppressed if the <b>callout_no_where</b> modifier is set.
+</P>
+<P>
+When using the interpretive matching function <b>pcre2_match()</b> without JIT,
+setting the <b>callout_extra</b> modifier causes additional output from
+<b>pcre2test</b>'s callout function to be generated. For the first callout in a
+match attempt at a new starting position in the subject, "New match attempt" is
+output. If there has been a backtrack since the last callout (or start of
+matching if this is the first callout), "Backtrack" is output, followed by "No
+other matching paths" if the backtrack ended the previous match attempt. For
+example:
+<pre>
+ re&#62; /(a+)b/auto_callout,no_start_optimize,no_auto_possess
+ data&#62; aac\=callout_extra
+ New match attempt
+ ---&#62;aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^ ^ )
+ +4 ^ ^ b
+ Backtrack
+ ---&#62;aac
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ ---&#62;aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ ---&#62;aac
+ +0 ^ (
+ +1 ^ a+
+ Backtrack
+ No other matching paths
+ New match attempt
+ ---&#62;aac
+ +0 ^ (
+ +1 ^ a+
+ No match
+</pre>
+Notice that various optimizations must be turned off if you want all possible
+matching paths to be scanned. If <b>no_start_optimize</b> is not used, there is
+an immediate "no match", without any callouts, because the starting
+optimization fails to find "b" in the subject, which it knows must be present
+for any match. If <b>no_auto_possess</b> is not used, the "a+" item is turned
+into "a++", which reduces the number of backtracks.
+</P>
+<P>
+The <b>callout_extra</b> modifier has no effect if used with the DFA matching
+function, or with JIT.
+</P>
+<br><b>
+Return values from callouts
+</b><br>
+<P>
+The default return from the callout function is zero, which allows matching to
+continue. The <b>callout_fail</b> modifier can be given one or two numbers. If
+there is only one number, 1 is returned instead of 0 (causing matching to
+backtrack) when a callout of that number is reached. If two numbers (&#60;n&#62;:&#60;m&#62;)
+are given, 1 is returned when callout &#60;n&#62; is reached and there have been at
+least &#60;m&#62; callouts. The <b>callout_error</b> modifier is similar, except that
+PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
+aborted. If both these modifiers are set for the same callout number,
+<b>callout_error</b> takes precedence. Note that callouts with string arguments
+are always given the number zero.
+</P>
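The return-value rules in the paragraph above can be sketched as a small decision function. This is an illustrative model only, not the pcre2test source: the names `sketch_trigger`, `sketch_callout_return`, and `SKETCH_ERROR_CALLOUT` are hypothetical, and it assumes the callout count includes the current callout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for PCRE2_ERROR_CALLOUT; real code would take it from pcre2.h. */
#define SKETCH_ERROR_CALLOUT (-1)

/* Hypothetical holder for one "<n>" or "<n>:<m>" modifier setting. */
typedef struct {
  int      set;        /* non-zero if the modifier was given */
  uint32_t number;     /* <n>: the callout number to act on */
  uint32_t min_count;  /* <m>: minimum number of callouts; 0 if omitted */
} sketch_trigger;

/* True when this setting applies to the current callout. */
static int sketch_triggered(const sketch_trigger *t, uint32_t number,
                            uint32_t count)
{
  return t->set && number == t->number && count >= t->min_count;
}

/* Models the documented rules: callout_error wins over callout_fail,
   and the default return is zero (carry on matching). */
static int sketch_callout_return(const sketch_trigger *fail,
                                 const sketch_trigger *error,
                                 uint32_t number, uint32_t count)
{
  assert(fail != NULL && error != NULL);
  if (sketch_triggered(error, number, count)) return SKETCH_ERROR_CALLOUT;
  if (sketch_triggered(fail, number, count)) return 1;  /* backtrack */
  return 0;
}
```

With `callout_fail=2:3` modelled as `{1, 2, 3}`, callout 2 returns 0 on the second callout and 1 from the third callout onwards.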
+<P>
+The <b>callout_data</b> modifier can be given an unsigned or a negative number.
+This is set as the "user data" that is passed to the matching function, and
+passed back when the callout function is invoked. Any value other than zero is
+used as a return from <b>pcre2test</b>'s callout function.
+</P>
+<P>
+Inserting callouts can be helpful when using <b>pcre2test</b> to check
+complicated regular expressions. For further information about callouts, see
+the
+<a href="pcre2callout.html"><b>pcre2callout</b></a>
+documentation.
+</P>
<br><a name="SEC17" href="#TOC1">NON-PRINTING CHARACTERS</a><br>
<P>
When <b>pcre2test</b> is outputting text in the compiled version of a pattern,
@@ -1913,7 +1982,7 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 17 October 2017
+Last updated: 21 December 2017
<br>
Copyright &copy; 1997-2017 University of Cambridge.
<br>
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index d2b9216..efd4247 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -929,35 +929,38 @@ PCRE2 CONTEXTS
advance in the subject string. The default value is PCRE2_UNSET. The
pcre2_match() and pcre2_dfa_match() functions return
PCRE2_ERROR_NOMATCH if a match with a starting point before or at the
- given offset is not found. For example, if the pattern /abc/ is matched
- against "123abc" with an offset limit less than 3, the result is
- PCRE2_ERROR_NO_MATCH. A match can never be found if the startoffset
- argument of pcre2_match() or pcre2_dfa_match() is greater than the off-
- set limit.
+ given offset is not found. The pcre2_substitute() function makes no
+ more substitutions.
- When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT
+ For example, if the pattern /abc/ is matched against "123abc" with an
+ offset limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match
+ can never be found if the startoffset argument of pcre2_match(),
+ pcre2_dfa_match(), or pcre2_substitute() is greater than the offset
+ limit set in the match context.
+
+ When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT
option when calling pcre2_compile() so that when JIT is in use, differ-
- ent code can be compiled. If a match is started with a non-default
- match limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is gener-
+ ent code can be compiled. If a match is started with a non-default
+ offset limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is gener-
ated.
- The offset limit facility can be used to track progress when searching
- large subject strings. See also the PCRE2_FIRSTLINE option, which
- requires a match to start within the first line of the subject. If this
- is set with an offset limit, a match must occur in the first line and
- also within the offset limit. In other words, whichever limit comes
- first is used.
+ The offset limit facility can be used to track progress when searching
+ large subject strings or to limit the extent of global substitutions.
+ See also the PCRE2_FIRSTLINE option, which requires a match to start
+ within the first line of the subject. If this is set with an offset
+ limit, a match must occur in the first line and also within the offset
+ limit. In other words, whichever limit comes first is used.
int pcre2_set_heap_limit(pcre2_match_context *mcontext,
uint32_t value);
- The heap_limit parameter specifies, in units of kilobytes, the maximum
- amount of heap memory that pcre2_match() may use to hold backtracking
- information when running an interpretive match. This limit does not
- apply to matching with the JIT optimization, which has its own memory
+ The heap_limit parameter specifies, in units of kilobytes, the maximum
+ amount of heap memory that pcre2_match() may use to hold backtracking
+ information when running an interpretive match. This limit does not
+ apply to matching with the JIT optimization, which has its own memory
control arrangements (see the pcre2jit documentation for more details),
- nor does it apply to pcre2_dfa_match(). If the limit is reached, the
- negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default
+ nor does it apply to pcre2_dfa_match(). If the limit is reached, the
+ negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default
limit is set when PCRE2 is built; the default default is very large and
is essentially "unlimited".
@@ -966,83 +969,83 @@ PCRE2 CONTEXTS
(*LIMIT_HEAP=ddd)
- where ddd is a decimal number. However, such a setting is ignored
- unless ddd is less than the limit set by the caller of pcre2_match()
+ where ddd is a decimal number. However, such a setting is ignored
+ unless ddd is less than the limit set by the caller of pcre2_match()
or, if no such limit is set, less than the default.
- The pcre2_match() function starts out using a 20K vector on the system
- stack for recording backtracking points. The more nested backtracking
+ The pcre2_match() function starts out using a 20K vector on the system
+ stack for recording backtracking points. The more nested backtracking
points there are (that is, the deeper the search tree), the more memory
- is needed. Heap memory is used only if the initial vector is too
+ is needed. Heap memory is used only if the initial vector is too
small. If the heap limit is set to a value less than 21 (in particular,
- zero) no heap memory will be used. In this case, only patterns that do
+ zero) no heap memory will be used. In this case, only patterns that do
not have a lot of nested backtracking can be successfully processed.
int pcre2_set_match_limit(pcre2_match_context *mcontext,
uint32_t value);
- The match_limit parameter provides a means of preventing PCRE2 from
+ The match_limit parameter provides a means of preventing PCRE2 from
using up too many computing resources when processing patterns that are
not going to match, but which have a very large number of possibilities
- in their search trees. The classic example is a pattern that uses
+ in their search trees. The classic example is a pattern that uses
nested unlimited repeats.
- There is an internal counter in pcre2_match() that is incremented each
- time round its main matching loop. If this value reaches the match
+ There is an internal counter in pcre2_match() that is incremented each
+ time round its main matching loop. If this value reaches the match
limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
- This has the effect of limiting the amount of backtracking that can
+ This has the effect of limiting the amount of backtracking that can
take place. For patterns that are not anchored, the count restarts from
- zero for each position in the subject string. This limit also applies
+ zero for each position in the subject string. This limit also applies
to pcre2_dfa_match(), though the counting is done in a different way.
- When pcre2_match() is called with a pattern that was successfully pro-
+ When pcre2_match() is called with a pattern that was successfully pro-
cessed by pcre2_jit_compile(), the way in which matching is executed is
- entirely different. However, there is still the possibility of runaway
- matching that goes on for a very long time, and so the match_limit
- value is also used in this case (but in a different way) to limit how
+ entirely different. However, there is still the possibility of runaway
+ matching that goes on for a very long time, and so the match_limit
+ value is also used in this case (but in a different way) to limit how
long the matching can continue.
- The default value for the limit can be set when PCRE2 is built; the
- default default is 10 million, which handles all but the most extreme
- cases. A value for the match limit may also be supplied by an item at
+ The default value for the limit can be set when PCRE2 is built; the
+ default default is 10 million, which handles all but the most extreme
+ cases. A value for the match limit may also be supplied by an item at
the start of a pattern of the form
(*LIMIT_MATCH=ddd)
- where ddd is a decimal number. However, such a setting is ignored
+ where ddd is a decimal number. However, such a setting is ignored
unless ddd is less than the limit set by the caller of pcre2_match() or
pcre2_dfa_match() or, if no such limit is set, less than the default.
int pcre2_set_depth_limit(pcre2_match_context *mcontext,
uint32_t value);
- This parameter limits the depth of nested backtracking in
- pcre2_match(). Each time a nested backtracking point is passed, a new
+ This parameter limits the depth of nested backtracking in
+ pcre2_match(). Each time a nested backtracking point is passed, a new
memory "frame" is used to remember the state of matching at that point.
- Thus, this parameter indirectly limits the amount of memory that is
- used in a match. However, because the size of each memory "frame"
+ Thus, this parameter indirectly limits the amount of memory that is
+ used in a match. However, because the size of each memory "frame"
depends on the number of capturing parentheses, the actual memory limit
- varies from pattern to pattern. This limit was more useful in versions
+ varies from pattern to pattern. This limit was more useful in versions
before 10.30, where function recursion was used for backtracking.
- The depth limit is not relevant, and is ignored, when matching is done
+ The depth limit is not relevant, and is ignored, when matching is done
using JIT compiled code. However, it is supported by pcre2_dfa_match(),
- which uses it to limit the depth of internal recursive function calls
+ which uses it to limit the depth of internal recursive function calls
that implement atomic groups, lookaround assertions, and pattern recur-
- sions. This is, therefore, an indirect limit on the amount of system
+ sions. This is, therefore, an indirect limit on the amount of system
stack that is used. A recursive pattern such as /(.)(?1)/, when matched
- to a very long string using pcre2_dfa_match(), can use a great deal of
+ to a very long string using pcre2_dfa_match(), can use a great deal of
stack.
- The default value for the depth limit can be set when PCRE2 is built;
- the default default is the same value as the default for the match
- limit. If the limit is exceeded, pcre2_match() or pcre2_dfa_match()
+ The default value for the depth limit can be set when PCRE2 is built;
+ the default default is the same value as the default for the match
+ limit. If the limit is exceeded, pcre2_match() or pcre2_dfa_match()
returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be
supplied by an item at the start of a pattern of the form
(*LIMIT_DEPTH=ddd)
- where ddd is a decimal number. However, such a setting is ignored
+ where ddd is a decimal number. However, such a setting is ignored
unless ddd is less than the limit set by the caller of pcre2_match() or
pcre2_dfa_match() or, if no such limit is set, less than the default.
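The rule shared by (*LIMIT_HEAP=ddd), (*LIMIT_MATCH=ddd), and (*LIMIT_DEPTH=ddd) is that a pattern can lower a limit but never raise it. A minimal sketch of that rule follows; the function name and parameters are hypothetical and are not part of the PCRE2 API.

```c
#include <stdint.h>

/* Returns the limit that would be in force, following the documented rule:
   start from the caller's limit if one was set, otherwise the build-time
   default, then apply the pattern's value only if it is smaller. */
static uint32_t sketch_effective_limit(uint32_t build_default,
                                       int caller_set, uint32_t caller_limit,
                                       int pattern_set, uint32_t pattern_limit)
{
  uint32_t limit = caller_set ? caller_limit : build_default;
  if (pattern_set && pattern_limit < limit) limit = pattern_limit;
  return limit;
}
```

For example, with a build-time default of 10 million and no caller setting, a pattern starting with (*LIMIT_MATCH=5000) gives 5000, whereas the same pattern under a caller limit of 1000 stays at 1000.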
@@ -1051,95 +1054,95 @@ CHECKING BUILD-TIME OPTIONS
int pcre2_config(uint32_t what, void *where);
- The function pcre2_config() makes it possible for a PCRE2 client to
- discover which optional features have been compiled into the PCRE2
- library. The pcre2build documentation has more details about these
+ The function pcre2_config() makes it possible for a PCRE2 client to
+ discover which optional features have been compiled into the PCRE2
+ library. The pcre2build documentation has more details about these
optional features.
- The first argument for pcre2_config() specifies which information is
- required. The second argument is a pointer to memory into which the
- information is placed. If NULL is passed, the function returns the
- amount of memory that is needed for the requested information. For
- calls that return numerical values, the value is in bytes; when
- requesting these values, where should point to appropriately aligned
- memory. For calls that return strings, the required length is given in
+ The first argument for pcre2_config() specifies which information is
+ required. The second argument is a pointer to memory into which the
+ information is placed. If NULL is passed, the function returns the
+ amount of memory that is needed for the requested information. For
+ calls that return numerical values, the value is in bytes; when
+ requesting these values, where should point to appropriately aligned
+ memory. For calls that return strings, the required length is given in
code units, not counting the terminating zero.
- When requesting information, the returned value from pcre2_config() is
- non-negative on success, or the negative error code PCRE2_ERROR_BADOP-
- TION if the value in the first argument is not recognized. The follow-
+ When requesting information, the returned value from pcre2_config() is
+ non-negative on success, or the negative error code PCRE2_ERROR_BADOP-
+ TION if the value in the first argument is not recognized. The follow-
ing information is available:
PCRE2_CONFIG_BSR
- The output is a uint32_t integer whose value indicates what character
- sequences the \R escape sequence matches by default. A value of
+ The output is a uint32_t integer whose value indicates what character
+ sequences the \R escape sequence matches by default. A value of
PCRE2_BSR_UNICODE means that \R matches any Unicode line ending
- sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
+ sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
LF, or CRLF. The default can be overridden when a pattern is compiled.
PCRE2_CONFIG_COMPILED_WIDTHS
- The output is a uint32_t integer whose lower bits indicate which code
- unit widths were selected when PCRE2 was built. The 1-bit indicates
- 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
+ The output is a uint32_t integer whose lower bits indicate which code
+ unit widths were selected when PCRE2 was built. The 1-bit indicates
+ 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
port, respectively.
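As an illustration of the bit layout just described, the value could be decoded as below. The helper name is hypothetical; in real code the `widths` value would be obtained by calling pcre2_config() with PCRE2_CONFIG_COMPILED_WIDTHS.

```c
#include <stdint.h>

/* Returns non-zero if the given code unit width (8, 16, or 32) is flagged
   in a PCRE2_CONFIG_COMPILED_WIDTHS value: 1-bit = 8, 2-bit = 16,
   4-bit = 32. Any other width is reported as unsupported. */
static int sketch_width_supported(uint32_t widths, unsigned code_unit_bits)
{
  switch (code_unit_bits) {
    case 8:  return (widths & 1u) != 0;
    case 16: return (widths & 2u) != 0;
    case 32: return (widths & 4u) != 0;
    default: return 0;
  }
}
```

A value of 5, for instance, would indicate that the 8-bit and 32-bit libraries were built but the 16-bit library was not.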
PCRE2_CONFIG_DEPTHLIMIT
- The output is a uint32_t integer that gives the default limit for the
- depth of nested backtracking in pcre2_match() or the depth of nested
- recursions and lookarounds in pcre2_dfa_match(). Further details are
+ The output is a uint32_t integer that gives the default limit for the
+ depth of nested backtracking in pcre2_match() or the depth of nested
+ recursions and lookarounds in pcre2_dfa_match(). Further details are
given with pcre2_set_depth_limit() above.
PCRE2_CONFIG_HEAPLIMIT
- The output is a uint32_t integer that gives, in kilobytes, the default
- limit for the amount of heap memory used by pcre2_match(). Further
+ The output is a uint32_t integer that gives, in kilobytes, the default
+ limit for the amount of heap memory used by pcre2_match(). Further
details are given with pcre2_set_heap_limit() above.
PCRE2_CONFIG_JIT
- The output is a uint32_t integer that is set to one if support for
+ The output is a uint32_t integer that is set to one if support for
just-in-time compiling is available; otherwise it is set to zero.
PCRE2_CONFIG_JITTARGET
- The where argument should point to a buffer that is at least 48 code
- units long. (The exact length required can be found by calling
- pcre2_config() with where set to NULL.) The buffer is filled with a
- string that contains the name of the architecture for which the JIT
- compiler is configured, for example "x86 32bit (little endian +
- unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
- returned, otherwise the number of code units used is returned. This is
+ The where argument should point to a buffer that is at least 48 code
+ units long. (The exact length required can be found by calling
+ pcre2_config() with where set to NULL.) The buffer is filled with a
+ string that contains the name of the architecture for which the JIT
+ compiler is configured, for example "x86 32bit (little endian +
+ unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
+ returned, otherwise the number of code units used is returned. This is
the length of the string, plus one unit for the terminating zero.
PCRE2_CONFIG_LINKSIZE
The output is a uint32_t integer that contains the number of bytes used
- for internal linkage in compiled regular expressions. When PCRE2 is
- configured, the value can be set to 2, 3, or 4, with the default being
- 2. This is the value that is returned by pcre2_config(). However, when
- the 16-bit library is compiled, a value of 3 is rounded up to 4, and
- when the 32-bit library is compiled, internal linkages always use 4
+ for internal linkage in compiled regular expressions. When PCRE2 is
+ configured, the value can be set to 2, 3, or 4, with the default being
+ 2. This is the value that is returned by pcre2_config(). However, when
+ the 16-bit library is compiled, a value of 3 is rounded up to 4, and
+ when the 32-bit library is compiled, internal linkages always use 4
bytes, so the configured value is not relevant.
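The difference between the configured link size (which is what pcre2_config() reports) and the size actually used by each library can be sketched as follows. The function is hypothetical and simply restates the rounding rules given above.

```c
#include <assert.h>

/* Given the configured link size (2, 3, or 4) and the code unit width of
   the library (8, 16, or 32), returns the number of bytes actually used
   for internal linkage according to the documented rules. */
static int sketch_link_bytes(int configured, int code_unit_bits)
{
  assert(configured >= 2 && configured <= 4);
  if (code_unit_bits == 32) return 4;                     /* always 4 */
  if (code_unit_bits == 16 && configured == 3) return 4;  /* 3 rounds up */
  return configured;
}
```

So a build configured with a link size of 3 would use 3 bytes in the 8-bit library and 4 bytes in both the 16-bit and 32-bit libraries.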
The default value of 2 for the 8-bit and 16-bit libraries is sufficient
- for all but the most massive patterns, since it allows the size of the
+ for all but the most massive patterns, since it allows the size of the
compiled pattern to be up to 64K code units. Larger values allow larger
- regular expressions to be compiled by those two libraries, but at the
+ regular expressions to be compiled by those two libraries, but at the
expense of slower matching.
PCRE2_CONFIG_MATCHLIMIT
The output is a uint32_t integer that gives the default match limit for
- pcre2_match(). Further details are given with pcre2_set_match_limit()
+ pcre2_match(). Further details are given with pcre2_set_match_limit()
above.
PCRE2_CONFIG_NEWLINE
- The output is a uint32_t integer whose value specifies the default
- character sequence that is recognized as meaning "newline". The values
+ The output is a uint32_t integer whose value specifies the default
+ character sequence that is recognized as meaning "newline". The values
are:
PCRE2_NEWLINE_CR Carriage return (CR)
@@ -1149,23 +1152,23 @@ CHECKING BUILD-TIME OPTIONS
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
PCRE2_NEWLINE_NUL The NUL character (binary zero)
- The default should normally correspond to the standard sequence for
+ The default should normally correspond to the standard sequence for
your operating system.
PCRE2_CONFIG_NEVER_BACKSLASH_C
- The output is a uint32_t integer that is set to one if the use of \C
- was permanently disabled when PCRE2 was built; otherwise it is set to
+ The output is a uint32_t integer that is set to one if the use of \C
+ was permanently disabled when PCRE2 was built; otherwise it is set to
zero.
PCRE2_CONFIG_PARENSLIMIT
- The output is a uint32_t integer that gives the maximum depth of nest-
+ The output is a uint32_t integer that gives the maximum depth of nest-
ing of parentheses (of any kind) in a pattern. This limit is imposed to
- cap the amount of system stack used when a pattern is compiled. It is
- specified when PCRE2 is built; the default is 250. This limit does not
- take into account the stack that may already be used by the calling
- application. For finer control over compilation stack usage, see
+ cap the amount of system stack used when a pattern is compiled. It is
+ specified when PCRE2 is built; the default is 250. This limit does not
+ take into account the stack that may already be used by the calling
+ application. For finer control over compilation stack usage, see
pcre2_set_compile_recursion_guard().
PCRE2_CONFIG_STACKRECURSE
@@ -1175,25 +1178,25 @@ CHECKING BUILD-TIME OPTIONS
PCRE2_CONFIG_UNICODE_VERSION
- The where argument should point to a buffer that is at least 24 code
- units long. (The exact length required can be found by calling
- pcre2_config() with where set to NULL.) If PCRE2 has been compiled
- without Unicode support, the buffer is filled with the text "Unicode
- not supported". Otherwise, the Unicode version string (for example,
- "8.0.0") is inserted. The number of code units used is returned. This
+ The where argument should point to a buffer that is at least 24 code
+ units long. (The exact length required can be found by calling
+ pcre2_config() with where set to NULL.) If PCRE2 has been compiled
+ without Unicode support, the buffer is filled with the text "Unicode
+ not supported". Otherwise, the Unicode version string (for example,
+ "8.0.0") is inserted. The number of code units used is returned. This
is the length of the string plus one unit for the terminating zero.
PCRE2_CONFIG_UNICODE
- The output is a uint32_t integer that is set to one if Unicode support
- is available; otherwise it is set to zero. Unicode support implies UTF
+ The output is a uint32_t integer that is set to one if Unicode support
+ is available; otherwise it is set to zero. Unicode support implies UTF
support.
PCRE2_CONFIG_VERSION
- The where argument should point to a buffer that is at least 24 code
- units long. (The exact length required can be found by calling
- pcre2_config() with where set to NULL.) The buffer is filled with the
+ The where argument should point to a buffer that is at least 24 code
+ units long. (The exact length required can be found by calling
+ pcre2_config() with where set to NULL.) The buffer is filled with the
PCRE2 version string, zero-terminated. The number of code units used is
returned. This is the length of the string plus one unit for the termi-
nating zero.
@@ -1211,96 +1214,96 @@ COMPILING A PATTERN
pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
- The pcre2_compile() function compiles a pattern into an internal form.
- The pattern is defined by a pointer to a string of code units and a
- length (in code units). If the pattern is zero-terminated, the length
- can be specified as PCRE2_ZERO_TERMINATED. The function returns a
- pointer to a block of memory that contains the compiled pattern and
+ The pcre2_compile() function compiles a pattern into an internal form.
+ The pattern is defined by a pointer to a string of code units and a
+ length (in code units). If the pattern is zero-terminated, the length
+ can be specified as PCRE2_ZERO_TERMINATED. The function returns a
+ pointer to a block of memory that contains the compiled pattern and
related data, or NULL if an error occurred.
- If the compile context argument ccontext is NULL, memory for the com-
- piled pattern is obtained by calling malloc(). Otherwise, it is
- obtained from the same memory function that was used for the compile
- context. The caller must free the memory by calling pcre2_code_free()
+ If the compile context argument ccontext is NULL, memory for the com-
+ piled pattern is obtained by calling malloc(). Otherwise, it is
+ obtained from the same memory function that was used for the compile
+ context. The caller must free the memory by calling pcre2_code_free()
when it is no longer needed.
The function pcre2_code_copy() makes a copy of the compiled code in new
- memory, using the same memory allocator as was used for the original.
- However, if the code has been processed by the JIT compiler (see
- below), the JIT information cannot be copied (because it is position-
+ memory, using the same memory allocator as was used for the original.
+ However, if the code has been processed by the JIT compiler (see
+ below), the JIT information cannot be copied (because it is position-
dependent). The new copy can initially be used only for non-JIT match-
ing, though it can be passed to pcre2_jit_compile() if required.
The pcre2_code_copy() function provides a way for individual threads in
- a multithreaded application to acquire a private copy of shared com-
- piled code. However, it does not make a copy of the character tables
- used by the compiled pattern; the new pattern code points to the same
- tables as the original code. (See "Locale Support" below for details
- of these character tables.) In many applications the same tables are
- used throughout, so this behaviour is appropriate. Nevertheless, there
+ a multithreaded application to acquire a private copy of shared com-
+ piled code. However, it does not make a copy of the character tables
+ used by the compiled pattern; the new pattern code points to the same
+ tables as the original code. (See "Locale Support" below for details
+ of these character tables.) In many applications the same tables are
+ used throughout, so this behaviour is appropriate. Nevertheless, there
are occasions when a copy of a compiled pattern and the relevant tables
- are needed. The pcre2_code_copy_with_tables() provides this facility.
- Copies of both the code and the tables are made, with the new code
- pointing to the new tables. The memory for the new tables is automati-
- cally freed when pcre2_code_free() is called for the new copy of the
+ are needed. The pcre2_code_copy_with_tables() provides this facility.
+ Copies of both the code and the tables are made, with the new code
+ pointing to the new tables. The memory for the new tables is automati-
+ cally freed when pcre2_code_free() is called for the new copy of the
compiled code.
- NOTE: When one of the matching functions is called, pointers to the
+ NOTE: When one of the matching functions is called, pointers to the
compiled pattern and the subject string are set in the match data block
- so that they can be referenced by the substring extraction functions.
- After running a match, you must not free a compiled pattern (or a sub-
- ject string) until after all operations on the match data block have
+ so that they can be referenced by the substring extraction functions.
+ After running a match, you must not free a compiled pattern (or a sub-
+ ject string) until after all operations on the match data block have
taken place.
- The options argument for pcre2_compile() contains various bit settings
- that affect the compilation. It should be zero if no options are
- required. The available options are described below. Some of them (in
- particular, those that are compatible with Perl, but some others as
- well) can also be set and unset from within the pattern (see the
+ The options argument for pcre2_compile() contains various bit settings
+ that affect the compilation. It should be zero if no options are
+ required. The available options are described below. Some of them (in
+ particular, those that are compatible with Perl, but some others as
+ well) can also be set and unset from within the pattern (see the
detailed description in the pcre2pattern documentation).
- For those options that can be different in different parts of the pat-
- tern, the contents of the options argument specifies their settings at
- the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
- PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
+ For those options that can be different in different parts of the pat-
+ tern, the contents of the options argument specifies their settings at
+ the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
+ PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
as at compile time.
- Other, less frequently required compile-time parameters (for example,
+ Other, less frequently required compile-time parameters (for example,
the newline setting) can be provided in a compile context (as described
above).
If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
- diately. Otherwise, the variables to which these point are set to an
- error code and an offset (number of code units) within the pattern,
- respectively, when pcre2_compile() returns NULL because a compilation
+ diately. Otherwise, the variables to which these point are set to an
+ error code and an offset (number of code units) within the pattern,
+ respectively, when pcre2_compile() returns NULL because a compilation
error has occurred. The values are not defined when compilation is suc-
cessful and pcre2_compile() returns a non-NULL value.
- There are nearly 100 positive error codes that pcre2_compile() may
- return if it finds an error in the pattern. There are also some nega-
- tive error codes that are used for invalid UTF strings. These are the
+ There are nearly 100 positive error codes that pcre2_compile() may
+ return if it finds an error in the pattern. There are also some nega-
+ tive error codes that are used for invalid UTF strings. These are the
same as given by pcre2_match() and pcre2_dfa_match(), and are described
- in the pcre2unicode page. There is no separate documentation for the
- positive error codes, because the textual error messages that are
- obtained by calling the pcre2_get_error_message() function (see
- "Obtaining a textual error message" below) should be self-explanatory.
- Macro names starting with PCRE2_ERROR_ are defined for both positive
+ in the pcre2unicode page. There is no separate documentation for the
+ positive error codes, because the textual error messages that are
+ obtained by calling the pcre2_get_error_message() function (see
+ "Obtaining a textual error message" below) should be self-explanatory.
+ Macro names starting with PCRE2_ERROR_ are defined for both positive
and negative error codes in pcre2.h.
The value returned in erroroffset is an indication of where in the pat-
- tern the error occurred. It is not necessarily the furthest point in
- the pattern that was read. For example, after the error "lookbehind
+ tern the error occurred. It is not necessarily the furthest point in
+ the pattern that was read. For example, after the error "lookbehind
assertion is not fixed length", the error offset points to the start of
- the failing assertion. For an invalid UTF-8 or UTF-16 string, the off-
+ the failing assertion. For an invalid UTF-8 or UTF-16 string, the off-
set is that of the first code unit of the failing character.
- Some errors are not detected until the whole pattern has been scanned;
- in these cases, the offset passed back is the length of the pattern.
- Note that the offset is in code units, not characters, even in a UTF
+ Some errors are not detected until the whole pattern has been scanned;
+ in these cases, the offset passed back is the length of the pattern.
+ Note that the offset is in code units, not characters, even in a UTF
mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
acter.
- This code fragment shows a typical straightforward call to pcre2_com-
+ This code fragment shows a typical straightforward call to pcre2_com-
pile():
pcre2_code *re;
@@ -1314,476 +1317,476 @@ COMPILING A PATTERN
&erroffset, /* for error offset */
NULL); /* no compile context */
- The following names for option bits are defined in the pcre2.h header
+ The following names for option bits are defined in the pcre2.h header
file:
PCRE2_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that is, it
- is constrained to match only at the first matching point in the string
- that is being searched (the "subject string"). This effect can also be
- achieved by appropriate constructs in the pattern itself, which is the
+ is constrained to match only at the first matching point in the string
+ that is being searched (the "subject string"). This effect can also be
+ achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.
PCRE2_ALLOW_EMPTY_CLASS
- By default, for compatibility with Perl, a closing square bracket that
- immediately follows an opening one is treated as a data character for
- the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
+ By default, for compatibility with Perl, a closing square bracket that
+ immediately follows an opening one is treated as a data character for
+ the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
class, which therefore contains no characters and so can never match.
PCRE2_ALT_BSUX
- This option request alternative handling of three escape sequences,
- which makes PCRE2's behaviour more like ECMAscript (aka JavaScript).
+ This option requests alternative handling of three escape sequences,
+ which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
When it is set:
(1) \U matches an upper case "U" character; by default \U causes a com-
pile time error (Perl uses \U to upper case subsequent characters).
(2) \u matches a lower case "u" character unless it is followed by four
- hexadecimal digits, in which case the hexadecimal number defines the
- code point to match. By default, \u causes a compile time error (Perl
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, \u causes a compile time error (Perl
uses it to upper case the following character).
- (3) \x matches a lower case "x" character unless it is followed by two
- hexadecimal digits, in which case the hexadecimal number defines the
- code point to match. By default, as in Perl, a hexadecimal number is
+ (3) \x matches a lower case "x" character unless it is followed by two
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, as in Perl, a hexadecimal number is
always expected after \x, but it may have zero, one, or two digits (so,
for example, \xz matches a binary zero character followed by z).
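The difference between the two \x rules in item (3) can be sketched as follows. This is an illustration of the rules as stated above, not PCRE2's own parser: the function names are hypothetical, an ASCII-compatible character set is assumed, and the braced form \x{...} is deliberately left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit; the caller guarantees isxdigit(c). */
int hex_digit_value(unsigned char c)
{
  assert(isxdigit(c));
  if (isdigit(c)) return c - '0';
  return tolower(c) - 'a' + 10;
}

/* Default rule: after "\x", take zero, one, or two hex digits.
   Returns the code point and stores the digit count in *consumed. */
unsigned int parse_default_hex_escape(const char *after_x, size_t *consumed)
{
  unsigned int value = 0;
  size_t n = 0;
  while (n < 2 && isxdigit((unsigned char)after_x[n]))
    {
    value = value * 16 +
      (unsigned int)hex_digit_value((unsigned char)after_x[n]);
    n++;
    }
  *consumed = n;
  return value;
}

/* PCRE2_ALT_BSUX rule: the hex form needs exactly two digits; otherwise
   "\x" is a literal "x" and nothing after it is consumed. */
unsigned int parse_alt_bsux_hex_escape(const char *after_x, size_t *consumed)
{
  if (isxdigit((unsigned char)after_x[0]) &&
      isxdigit((unsigned char)after_x[1]))
    {
    *consumed = 2;
    return (unsigned int)(hex_digit_value((unsigned char)after_x[0]) * 16 +
                          hex_digit_value((unsigned char)after_x[1]));
    }
  *consumed = 0;
  return 'x';
}
```

With the default rule, the text "z" after \x yields code point zero with no digits consumed, which is why \xz matches a binary zero followed by z.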
PCRE2_ALT_CIRCUMFLEX
In multiline mode (when PCRE2_MULTILINE is set), the circumflex
- metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
- is set), and also after any internal newline. However, it does not
+ metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
+ is set), and also after any internal newline. However, it does not
match after a newline at the end of the subject, for compatibility with
- Perl. If you want a multiline circumflex also to match after a termi-
+ Perl. If you want a multiline circumflex also to match after a termi-
nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
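The rule for a multiline circumflex can be written as a small predicate. This is a sketch under stated assumptions: the newline convention is a single LF, PCRE2_NOTBOL is not set, and the function name is hypothetical rather than anything PCRE2 provides.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if ^ can match at offset pos when PCRE2_MULTILINE is set.
   alt_circumflex is nonzero when PCRE2_ALT_CIRCUMFLEX is also set. */
int multiline_circumflex_matches(const char *subject, size_t length,
                                 size_t pos, int alt_circumflex)
{
  assert(pos <= length);
  if (pos == 0) return 1;                  /* start of the subject */
  if (subject[pos - 1] != '\n') return 0;  /* not preceded by a newline */
  if (pos < length) return 1;              /* after an internal newline */
  return alt_circumflex != 0;              /* after a terminating newline */
}
```

Only the last case depends on the option: the position after a newline that ends the subject.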
PCRE2_ALT_VERBNAMES
- By default, for compatibility with Perl, the name in any verb sequence
- such as (*MARK:NAME) is any sequence of characters that does not
- include a closing parenthesis. The name is not processed in any way,
- and it is not possible to include a closing parenthesis in the name.
- However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
- processing is applied to verb names and only an unescaped closing
- parenthesis terminates the name. A closing parenthesis can be included
- in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED or
- PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names
- is skipped and #-comments are recognized in this mode, exactly as in
+ By default, for compatibility with Perl, the name in any verb sequence
+ such as (*MARK:NAME) is any sequence of characters that does not
+ include a closing parenthesis. The name is not processed in any way,
+ and it is not possible to include a closing parenthesis in the name.
+ However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
+ processing is applied to verb names and only an unescaped closing
+ parenthesis terminates the name. A closing parenthesis can be included
+ in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED or
+ PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names
+ is skipped and #-comments are recognized in this mode, exactly as in
the rest of the pattern.
PCRE2_AUTO_CALLOUT
- If this bit is set, pcre2_compile() automatically inserts callout
- items, all with number 255, before each pattern item, except immedi-
- ately before or after an explicit callout in the pattern. For discus-
+ If this bit is set, pcre2_compile() automatically inserts callout
+ items, all with number 255, before each pattern item, except immedi-
+ ately before or after an explicit callout in the pattern. For discus-
sion of the callout facility, see the pcre2callout documentation.
PCRE2_CASELESS
- If this bit is set, letters in the pattern match both upper and lower
- case letters in the subject. It is equivalent to Perl's /i option, and
- it can be changed within a pattern by a (?i) option setting. If
- PCRE2_UTF is set, Unicode properties are used for all characters with
- more than one other case, and for all characters whose code points are
- greater than U+007f. For lower valued characters with only one other
- case, a lookup table is used for speed. When PCRE2_UTF is not set, a
+ If this bit is set, letters in the pattern match both upper and lower
+ case letters in the subject. It is equivalent to Perl's /i option, and
+ it can be changed within a pattern by a (?i) option setting. If
+ PCRE2_UTF is set, Unicode properties are used for all characters with
+ more than one other case, and for all characters whose code points are
+ greater than U+007f. For lower valued characters with only one other
+ case, a lookup table is used for speed. When PCRE2_UTF is not set, a
lookup table is used for all code points less than 256, and higher code
- points (available only in 16-bit or 32-bit mode) are treated as not
+ points (available only in 16-bit or 32-bit mode) are treated as not
having another case.
PCRE2_DOLLAR_ENDONLY
- If this bit is set, a dollar metacharacter in the pattern matches only
- at the end of the subject string. Without this option, a dollar also
- matches immediately before a newline at the end of the string (but not
- before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
- if PCRE2_MULTILINE is set. There is no equivalent to this option in
+ If this bit is set, a dollar metacharacter in the pattern matches only
+ at the end of the subject string. Without this option, a dollar also
+ matches immediately before a newline at the end of the string (but not
+ before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
+ if PCRE2_MULTILINE is set. There is no equivalent to this option in
Perl, and no way to set it within a pattern.
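The positions at which a dollar can match outside multiline mode can be sketched in the same way. The sketch assumes PCRE2_MULTILINE is not set and that the newline convention is a single LF; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if $ can match at offset pos when PCRE2_MULTILINE is not set.
   dollar_endonly is nonzero when PCRE2_DOLLAR_ENDONLY is set. */
int dollar_matches(const char *subject, size_t length, size_t pos,
                   int dollar_endonly)
{
  assert(pos <= length);
  if (pos == length) return 1;   /* very end of the subject */
  if (dollar_endonly) return 0;  /* the option allows nothing else */
  /* Otherwise also immediately before a newline that ends the subject. */
  return pos + 1 == length && subject[pos] == '\n';
}
```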
PCRE2_DOTALL
- If this bit is set, a dot metacharacter in the pattern matches any
- character, including one that indicates a newline. However, it only
+ If this bit is set, a dot metacharacter in the pattern matches any
+ character, including one that indicates a newline. However, it only
ever matches one character, even if newlines are coded as CRLF. Without
this option, a dot does not match when the current position in the sub-
- ject is at a newline. This option is equivalent to Perl's /s option,
+ ject is at a newline. This option is equivalent to Perl's /s option,
and it can be changed within a pattern by a (?s) option setting. A neg-
ative class such as [^a] always matches newline characters, independent
of the setting of this option.
PCRE2_DUPNAMES
- If this bit is set, names used to identify capturing subpatterns need
+ If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it
- is known that only one instance of the named subpattern can ever be
- matched. There are more details of named subpatterns below; see also
+ is known that only one instance of the named subpattern can ever be
+ matched. There are more details of named subpatterns below; see also
the pcre2pattern documentation.
PCRE2_ENDANCHORED
- If this bit is set, the end of any pattern match must be right at the
+ If this bit is set, the end of any pattern match must be right at the
end of the string being searched (the "subject string"). If the pattern
match succeeds by reaching (*ACCEPT), but does not reach the end of the
- subject, the match fails at the current starting point. For unanchored
- patterns, a new match is then tried at the next starting point. How-
+ subject, the match fails at the current starting point. For unanchored
+ patterns, a new match is then tried at the next starting point. How-
ever, if the match succeeds by reaching the end of the pattern, but not
- the end of the subject, backtracking occurs and an alternative match
+ the end of the subject, backtracking occurs and an alternative match
may be found. Consider these two patterns:
.(*ACCEPT)|..
.|..
- If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
- "c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED
- can also be achieved by appropriate constructs in the pattern itself,
+ If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
+ "c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED
+ can also be achieved by appropriate constructs in the pattern itself,
which is the only way to do it in Perl.
For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
- to the first (that is, the longest) matched string. Other parallel
- matches, which are necessarily substrings of the first one, must obvi-
+ to the first (that is, the longest) matched string. Other parallel
+ matches, which are necessarily substrings of the first one, must obvi-
ously end before the end of the subject.
PCRE2_EXTENDED
- If this bit is set, most white space characters in the pattern are
- totally ignored except when escaped or inside a character class. How-
- ever, white space is not allowed within sequences such as (?> that
+ If this bit is set, most white space characters in the pattern are
+ totally ignored except when escaped or inside a character class. How-
+ ever, white space is not allowed within sequences such as (?> that
introduce various parenthesized subpatterns, nor within numerical quan-
- tifiers such as {1,3}. Ignorable white space is permitted between an
- item and a following quantifier and between a quantifier and a follow-
+ tifiers such as {1,3}. Ignorable white space is permitted between an
+ item and a following quantifier and between a quantifier and a follow-
ing + that indicates possessiveness.
- PCRE2_EXTENDED also causes characters between an unescaped # outside a
- character class and the next newline, inclusive, to be ignored, which
+ PCRE2_EXTENDED also causes characters between an unescaped # outside a
+ character class and the next newline, inclusive, to be ignored, which
makes it possible to include comments inside complicated patterns. Note
- that the end of this type of comment is a literal newline sequence in
+ that the end of this type of comment is a literal newline sequence in
the pattern; escape sequences that happen to represent a newline do not
- count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
+ count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
changed within a pattern by a (?x) option setting.
Which characters are interpreted as newlines can be specified by a set-
- ting in the compile context that is passed to pcre2_compile() or by a
- special sequence at the start of the pattern, as described in the sec-
- tion entitled "Newline conventions" in the pcre2pattern documentation.
+ ting in the compile context that is passed to pcre2_compile() or by a
+ special sequence at the start of the pattern, as described in the sec-
+ tion entitled "Newline conventions" in the pcre2pattern documentation.
A default is defined when PCRE2 is built.
PCRE2_EXTENDED_MORE
- This option has the effect of PCRE2_EXTENDED, but, in addition,
- unescaped space and horizontal tab characters are ignored inside a
- character class. PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx
- option, and it can be changed within a pattern by a (?xx) option set-
+ This option has the effect of PCRE2_EXTENDED, but, in addition,
+ unescaped space and horizontal tab characters are ignored inside a
+ character class. PCRE2_EXTENDED_MORE is equivalent to the /xx option of
+ Perl 5.26, and it can be changed within a pattern by a (?xx) option
setting.
PCRE2_FIRSTLINE
If this option is set, the start of an unanchored pattern match must be
- before or at the first newline in the subject string, though the
- matched text may continue over the newline. See also PCRE2_USE_OFF-
- SET_LIMIT, which provides a more general limiting facility. If
- PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the
- first line and also within the offset limit. In other words, whichever
+ before or at the first newline in the subject string, though the
+ matched text may continue over the newline. See also PCRE2_USE_OFF-
+ SET_LIMIT, which provides a more general limiting facility. If
+ PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the
+ first line and also within the offset limit. In other words, whichever
limit comes first is used.
PCRE2_LITERAL
If this option is set, all meta-characters in the pattern are disabled,
- and it is treated as a literal string. Matching literal strings with a
+ and it is treated as a literal string. Matching literal strings with a
regular expression engine is not the most efficient way of doing it. If
- you are doing a lot of literal matching and are worried about effi-
+ you are doing a lot of literal matching and are worried about effi-
ciency, you should consider using other approaches. The only other main
options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
- PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
- PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
+ PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
+ PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
error.
PCRE2_MATCH_UNSET_BACKREF
- If this option is set, a back reference to an unset subpattern group
- matches an empty string (by default this causes the current matching
- alternative to fail). A pattern such as (\1)(a) succeeds when this
- option is set (assuming it can find an "a" in the subject), whereas it
- fails by default, for Perl compatibility. Setting this option makes
+ If this option is set, a back reference to an unset subpattern group
+ matches an empty string (by default this causes the current matching
+ alternative to fail). A pattern such as (\1)(a) succeeds when this
+ option is set (assuming it can find an "a" in the subject), whereas it
+ fails by default, for Perl compatibility. Setting this option makes
PCRE2 behave more like ECMAScript (aka JavaScript).
PCRE2_MULTILINE
- By default, for the purposes of matching "start of line" and "end of
- line", PCRE2 treats the subject string as consisting of a single line
- of characters, even if it actually contains newlines. The "start of
- line" metacharacter (^) matches only at the start of the string, and
- the "end of line" metacharacter ($) matches only at the end of the
+ By default, for the purposes of matching "start of line" and "end of
+ line", PCRE2 treats the subject string as consisting of a single line
+ of characters, even if it actually contains newlines. The "start of
+ line" metacharacter (^) matches only at the start of the string, and
+ the "end of line" metacharacter ($) matches only at the end of the
string, or before a terminating newline (except when PCRE2_DOL-
- LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
+ LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
the "any character" metacharacter (.) does not match at a newline. This
behaviour (for ^, $, and dot) is the same as Perl.
- When PCRE2_MULTILINE it is set, the "start of line" and "end of line"
- constructs match immediately following or immediately before internal
- newlines in the subject string, respectively, as well as at the very
- start and end. This is equivalent to Perl's /m option, and it can be
+ When PCRE2_MULTILINE is set, the "start of line" and "end of line"
+ constructs match immediately following or immediately before internal
+ newlines in the subject string, respectively, as well as at the very
+ start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. Note that the "start
of line" metacharacter does not match after a newline at the end of the
- subject, for compatibility with Perl. However, you can change this by
- setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
- subject string, or no occurrences of ^ or $ in a pattern, setting
+ subject, for compatibility with Perl. However, you can change this by
+ setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
+ subject string, or no occurrences of ^ or $ in a pattern, setting
PCRE2_MULTILINE has no effect.
PCRE2_NEVER_BACKSLASH_C
- This option locks out the use of \C in the pattern that is being com-
- piled. This escape can cause unpredictable behaviour in UTF-8 or
- UTF-16 modes, because it may leave the current matching point in the
- middle of a multi-code-unit character. This option may be useful in
- applications that process patterns from external sources. Note that
+ This option locks out the use of \C in the pattern that is being com-
+ piled. This escape can cause unpredictable behaviour in UTF-8 or
+ UTF-16 modes, because it may leave the current matching point in the
+ middle of a multi-code-unit character. This option may be useful in
+ applications that process patterns from external sources. Note that
there is also a build-time option that permanently locks out the use of
\C.
PCRE2_NEVER_UCP
- This option locks out the use of Unicode properties for handling \B,
+ This option locks out the use of Unicode properties for handling \B,
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
- described for the PCRE2_UCP option below. In particular, it prevents
- the creator of the pattern from enabling this facility by starting the
- pattern with (*UCP). This option may be useful in applications that
+ described for the PCRE2_UCP option below. In particular, it prevents
+ the creator of the pattern from enabling this facility by starting the
+ pattern with (*UCP). This option may be useful in applications that
process patterns from external sources. The option combination PCRE2_UCP
and PCRE2_NEVER_UCP causes an error.
PCRE2_NEVER_UTF
- This option locks out interpretation of the pattern as UTF-8, UTF-16,
+ This option locks out interpretation of the pattern as UTF-8, UTF-16,
or UTF-32, depending on which library is in use. In particular, it pre-
- vents the creator of the pattern from switching to UTF interpretation
- by starting the pattern with (*UTF). This option may be useful in
- applications that process patterns from external sources. The combina-
+ vents the creator of the pattern from switching to UTF interpretation
+ by starting the pattern with (*UTF). This option may be useful in
+ applications that process patterns from external sources. The combina-
tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
PCRE2_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing paren-
- theses in the pattern. Any opening parenthesis that is not followed by
- ? behaves as if it were followed by ?: but named parentheses can still
+ theses in the pattern. Any opening parenthesis that is not followed by
+ ? behaves as if it were followed by ?: but named parentheses can still
be used for capturing (and they acquire numbers in the usual way). This
- is the same as Perl's /n option. Note that, when this option is set,
+ is the same as Perl's /n option. Note that, when this option is set,
references to capturing groups (back references or recursion/subroutine
- calls) may only refer to named groups, though the reference can be by
+ calls) may only refer to named groups, though the reference can be by
name or by number.
PCRE2_NO_AUTO_POSSESS
If this option is set, it disables "auto-possessification", which is an
- optimization that, for example, turns a+b into a++b in order to avoid
- backtracks into a+ that can never be successful. However, if callouts
- are in use, auto-possessification means that some callouts are never
+ optimization that, for example, turns a+b into a++b in order to avoid
+ backtracks into a+ that can never be successful. However, if callouts
+ are in use, auto-possessification means that some callouts are never
taken. You can set this option if you want the matching functions to do
- a full unoptimized search and run all the callouts, but it is mainly
+ a full unoptimized search and run all the callouts, but it is mainly
provided for testing purposes.
PCRE2_NO_DOTSTAR_ANCHOR
If this option is set, it disables an optimization that is applied when
- .* is the first significant item in a top-level branch of a pattern,
- and all the other branches also start with .* or with \A or \G or ^.
- The optimization is automatically disabled for .* if it is inside an
- atomic group or a capturing group that is the subject of a back refer-
- ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
- mization is not disabled, such a pattern is automatically anchored if
+ .* is the first significant item in a top-level branch of a pattern,
+ and all the other branches also start with .* or with \A or \G or ^.
+ The optimization is automatically disabled for .* if it is inside an
+ atomic group or a capturing group that is the subject of a back refer-
+ ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
+ mization is not disabled, such a pattern is automatically anchored if
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
- for any ^ items. Otherwise, the fact that any match must start either
- at the start of the subject or following a newline is remembered. Like
+ for any ^ items. Otherwise, the fact that any match must start either
+ at the start of the subject or following a newline is remembered. Like
other optimizations, this can cause callouts to be skipped.
PCRE2_NO_START_OPTIMIZE
- This is an option whose main effect is at matching time. It does not
+ This is an option whose main effect is at matching time. It does not
change what pcre2_compile() generates, but it does affect the output of
the JIT compiler.
- There are a number of optimizations that may occur at the start of a
- match, in order to speed up the process. For example, if it is known
- that an unanchored match must start with a specific code unit value,
- the matching code searches the subject for that value, and fails imme-
- diately if it cannot find it, without actually running the main match-
- ing function. This means that a special item such as (*COMMIT) at the
- start of a pattern is not considered until after a suitable starting
- point for the match has been found. Also, when callouts or (*MARK)
- items are in use, these "start-up" optimizations can cause them to be
- skipped if the pattern is never actually used. The start-up optimiza-
- tions are in effect a pre-scan of the subject that takes place before
+ There are a number of optimizations that may occur at the start of a
+ match, in order to speed up the process. For example, if it is known
+ that an unanchored match must start with a specific code unit value,
+ the matching code searches the subject for that value, and fails imme-
+ diately if it cannot find it, without actually running the main match-
+ ing function. This means that a special item such as (*COMMIT) at the
+ start of a pattern is not considered until after a suitable starting
+ point for the match has been found. Also, when callouts or (*MARK)
+ items are in use, these "start-up" optimizations can cause them to be
+ skipped if the pattern is never actually used. The start-up optimiza-
+ tions are in effect a pre-scan of the subject that takes place before
the pattern is run.
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
- possibly causing performance to suffer, but ensuring that in cases
- where the result is "no match", the callouts do occur, and that items
+ possibly causing performance to suffer, but ensuring that in cases
+ where the result is "no match", the callouts do occur, and that items
such as (*COMMIT) and (*MARK) are considered at every possible starting
position in the subject string.
- Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
+ Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
operation. Consider the pattern
(*COMMIT)ABC
- When this is compiled, PCRE2 records the fact that a match must start
- with the character "A". Suppose the subject string is "DEFABC". The
- start-up optimization scans along the subject, finds "A" and runs the
- first match attempt from there. The (*COMMIT) item means that the pat-
- tern must match the current starting position, which in this case, it
- does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
- set, the initial scan along the subject string does not happen. The
- first match attempt is run starting from "D" and when this fails,
- (*COMMIT) prevents any further matches being tried, so the overall
+ When this is compiled, PCRE2 records the fact that a match must start
+ with the character "A". Suppose the subject string is "DEFABC". The
+ start-up optimization scans along the subject, finds "A" and runs the
+ first match attempt from there. The (*COMMIT) item means that the pat-
+ tern must match the current starting position, which in this case, it
+ does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
+ set, the initial scan along the subject string does not happen. The
+ first match attempt is run starting from "D" and when this fails,
+ (*COMMIT) prevents any further matches being tried, so the overall
result is "no match".
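The reasoning in this example can be traced with a toy model. The code below does not call PCRE2; it hard-codes the pattern (*COMMIT)ABC and imitates only the two behaviours described above (the scan for the first "A", and the fact that (*COMMIT) makes the first attempt final). The function name and return convention are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of matching (*COMMIT)ABC against a zero-terminated subject.
   Returns the offset at which the match starts, or -1 for "no match". */
long match_commit_abc(const char *subject, int start_optimize)
{
  size_t length, start = 0;
  assert(subject != NULL);
  length = strlen(subject);

  if (start_optimize)
    {
    /* Start-up optimization: skip ahead to the first "A"; if there is
       none, report "no match" without running a match attempt. */
    const char *first_a = memchr(subject, 'A', length);
    if (first_a == NULL) return -1;
    start = (size_t)(first_a - subject);
    }

  /* (*COMMIT) is passed during the first attempt, so only one starting
     position is ever tried: a failure here is final. */
  if (length - start >= 3 && memcmp(subject + start, "ABC", 3) == 0)
    return (long)start;
  return -1;
}
```

For the subject "DEFABC" the model reports a match at offset 3 when the optimization is enabled and "no match" when it is disabled, mirroring the two outcomes described in the text.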
- There are also other start-up optimizations. For example, a minimum
+ There are also other start-up optimizations. For example, a minimum
length for the subject may be recorded. Consider the pattern
(*MARK:A)(X|Y)
- The minimum length for a match is one character. If the subject is
+ The minimum length for a match is one character. If the subject is
"ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
to match an empty string at the end of the subject does not take place,
- because PCRE2 knows that the subject is now too short, and so the
- (*MARK) is never encountered. In this case, the optimization does not
+ because PCRE2 knows that the subject is now too short, and so the
+ (*MARK) is never encountered. In this case, the optimization does not
affect the overall match result, which is still "no match", but it does
affect the auxiliary information that is returned.
PCRE2_NO_UTF_CHECK
- When PCRE2_UTF is set, the validity of the pattern as a UTF string is
- automatically checked. There are discussions about the validity of
- UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
- document. If an invalid UTF sequence is found, pcre2_compile() returns
+ When PCRE2_UTF is set, the validity of the pattern as a UTF string is
+ automatically checked. There are discussions about the validity of
+ UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
+ document. If an invalid UTF sequence is found, pcre2_compile() fails and
passes back a negative error code.
- If you know that your pattern is a valid UTF string, and you want to
- skip this check for performance reasons, you can set the
- PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
+ If you know that your pattern is a valid UTF string, and you want to
+ skip this check for performance reasons, you can set the
+ PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
invalid UTF string as a pattern is undefined. It may cause your program
to crash or loop.
Note that this option can also be passed to pcre2_match() and
- pcre_dfa_match(), to suppress UTF validity checking of the subject
+ pcre2_dfa_match(), to suppress UTF validity checking of the subject
string.
Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
- able the error that is given if an escape sequence for an invalid Uni-
- code code point is encountered in the pattern. In particular, the so-
- called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you
- want to allow escape sequences such as \x{d800} you can set the
- PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
- section entitled "Extra compile options" below. However, this is pos-
+ able the error that is given if an escape sequence for an invalid Uni-
+ code code point is encountered in the pattern. In particular, the so-
+ called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you
+ want to allow escape sequences such as \x{d800} you can set the
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
+ section entitled "Extra compile options" below. However, this is pos-
sible only in UTF-8 and UTF-32 modes, because these values are not rep-
resentable in UTF-16.
PCRE2_UCP
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
- \w, and some of the POSIX character classes. By default, only ASCII
- characters are recognized, but if PCRE2_UCP is set, Unicode properties
- are used instead to classify characters. More details are given in the
+ \w, and some of the POSIX character classes. By default, only ASCII
+ characters are recognized, but if PCRE2_UCP is set, Unicode properties
+ are used instead to classify characters. More details are given in the
section on generic character types in the pcre2pattern page. If you set
- PCRE2_UCP, matching one of the items it affects takes much longer. The
- option is available only if PCRE2 has been compiled with Unicode sup-
+ PCRE2_UCP, matching one of the items it affects takes much longer. The
+ option is available only if PCRE2 has been compiled with Unicode sup-
port (which is the default).
PCRE2_UNGREEDY
- This option inverts the "greediness" of the quantifiers so that they
- are not greedy by default, but become greedy if followed by "?". It is
- not compatible with Perl. It can also be set by a (?U) option setting
+ This option inverts the "greediness" of the quantifiers so that they
+ are not greedy by default, but become greedy if followed by "?". It is
+ not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.
PCRE2_USE_OFFSET_LIMIT
This option must be set for pcre2_compile() if pcre2_set_offset_limit()
- is going to be used to set a non-default offset limit in a match con-
- text for matches that use this pattern. An error is generated if an
- offset limit is set without this option. For more details, see the
- description of pcre2_set_offset_limit() in the section that describes
+ is going to be used to set a non-default offset limit in a match con-
+ text for matches that use this pattern. An error is generated if an
+ offset limit is set without this option. For more details, see the
+ description of pcre2_set_offset_limit() in the section that describes
match contexts. See also the PCRE2_FIRSTLINE option above.
PCRE2_UTF
- This option causes PCRE2 to regard both the pattern and the subject
- strings that are subsequently processed as strings of UTF characters
- instead of single-code-unit strings. It is available when PCRE2 is
- built to include Unicode support (which is the default). If Unicode
- support is not available, the use of this option provokes an error.
- Details of how PCRE2_UTF changes the behaviour of PCRE2 are given in
+ This option causes PCRE2 to regard both the pattern and the subject
+ strings that are subsequently processed as strings of UTF characters
+ instead of single-code-unit strings. It is available when PCRE2 is
+ built to include Unicode support (which is the default). If Unicode
+ support is not available, the use of this option provokes an error.
+ Details of how PCRE2_UTF changes the behaviour of PCRE2 are given in
the pcre2unicode page.
Extra compile options
- Unlike the main compile-time options, the extra options are not saved
+ Unlike the main compile-time options, the extra options are not saved
with the compiled pattern. The option bits that can be set in a compile
- context by calling the pcre2_set_compile_extra_options() function are
+ context by calling the pcre2_set_compile_extra_options() function are
as follows:
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
- This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
- It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
+ This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
+ It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
"surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
- in UTF-16 to encode code points with values in the range 0x10000 to
- 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
+ in UTF-16 to encode code points with values in the range 0x10000 to
+ 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
They can be represented in UTF-8 and UTF-32, but are defined as invalid
- code points, and cause errors if encountered in a UTF-8 or UTF-32
+ code points, and cause errors if encountered in a UTF-8 or UTF-32
string that is being checked for validity by PCRE2.
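The pairing arithmetic described above can be sketched in a few lines
of C. This is only an illustration of the standard UTF-16 encoding, not
code taken from PCRE2; the function name utf16_surrogates() is
hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function. Splits a code point in the
   range 0x10000 to 0x10ffff into its UTF-16 surrogate pair. Returns 1 on
   success, or 0 if the code point is outside that range (in which case
   the outputs are left untouched). */
int utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
  if (cp < 0x10000 || cp > 0x10ffff) return 0;
  cp -= 0x10000;                              /* 20 significant bits remain */
  *high = (uint16_t)(0xd800 + (cp >> 10));    /* top 10 bits */
  *low = (uint16_t)(0xdc00 + (cp & 0x3ff));   /* bottom 10 bits */
  return 1;
}
```

Every value this function can produce lies in the range 0xd800 to
0xdfff, which is why those code points have no UTF-16 representation of
their own.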
- These values also cause errors if encountered in escape sequences such
+ These values also cause errors if encountered in escape sequences such
as \x{d912} within a pattern. However, it seems that some applications,
- when using PCRE2 to check for unwanted characters in UTF-8 strings,
- explicitly test for the surrogates using escape sequences. The
- PCRE2_NO_UTF_CHECK option does not disable the error that occurs,
- because it applies only to the testing of input strings for UTF valid-
+ when using PCRE2 to check for unwanted characters in UTF-8 strings,
+ explicitly test for the surrogates using escape sequences. The
+ PCRE2_NO_UTF_CHECK option does not disable the error that occurs,
+ because it applies only to the testing of input strings for UTF valid-
ity.
- If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
- gate code point values in UTF-8 and UTF-32 patterns no longer provoke
- errors and are incorporated in the compiled pattern. However, they can
- only match subject characters if the matching function is called with
+ If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
+ gate code point values in UTF-8 and UTF-32 patterns no longer provoke
+ errors and are incorporated in the compiled pattern. However, they can
+ only match subject characters if the matching function is called with
PCRE2_NO_UTF_CHECK set.
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
- This is a dangerous option. Use with care. By default, an unrecognized
- escape such as \j or a malformed one such as \x{2z} causes a compile-
+ This is a dangerous option. Use with care. By default, an unrecognized
+ escape such as \j or a malformed one such as \x{2z} causes a compile-
time error when detected by pcre2_compile(). Perl is somewhat inconsis-
- tent in handling such items: for example, \j is treated as a literal
- "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
- ings are given in both cases if Perl's warning switch is enabled. How-
- ever, a malformed octal number after \o{ always causes an error in
+ tent in handling such items: for example, \j is treated as a literal
+ "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
+ ings are given in both cases if Perl's warning switch is enabled. How-
+ ever, a malformed octal number after \o{ always causes an error in
Perl.
- If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
- pcre2_compile(), all unrecognized or erroneous escape sequences are
- treated as single-character escapes. For example, \j is a literal "j"
- and \x{2z} is treated as the literal string "x{2z}". Setting this
- option means that typos in patterns may go undetected and have unex-
+ If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
+ pcre2_compile(), all unrecognized or erroneous escape sequences are
+ treated as single-character escapes. For example, \j is a literal "j"
+ and \x{2z} is treated as the literal string "x{2z}". Setting this
+ option means that typos in patterns may go undetected and have unex-
pected results. This is a dangerous option. Use with care.
PCRE2_EXTRA_MATCH_LINE
- This option is provided for use by the -x option of pcre2grep. It
- causes the pattern only to match complete lines. This is achieved by
- automatically inserting the code for "^(?:" at the start of the com-
- piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
- the matched line may be in the middle of the subject string. This
+ This option is provided for use by the -x option of pcre2grep. It
+ causes the pattern only to match complete lines. This is achieved by
+ automatically inserting the code for "^(?:" at the start of the com-
+ piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
+ the matched line may be in the middle of the subject string. This
option can be used with PCRE2_LITERAL.
PCRE2_EXTRA_MATCH_WORD
- This option is provided for use by the -w option of pcre2grep. It
- causes the pattern only to match strings that have a word boundary at
- the start and the end. This is achieved by automatically inserting the
- code for "\b(?:" at the start of the compiled pattern and ")\b" at the
- end. The option may be used with PCRE2_LITERAL. However, it is ignored
+ This option is provided for use by the -w option of pcre2grep. It
+ causes the pattern only to match strings that have a word boundary at
+ the start and the end. This is achieved by automatically inserting the
+ code for "\b(?:" at the start of the compiled pattern and ")\b" at the
+ end. The option may be used with PCRE2_LITERAL. However, it is ignored
if PCRE2_EXTRA_MATCH_LINE is also set.
@@ -1806,53 +1809,53 @@ JUST-IN-TIME (JIT) COMPILATION
void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
- These functions provide support for JIT compilation, which, if the
- just-in-time compiler is available, further processes a compiled pat-
+ These functions provide support for JIT compilation, which, if the
+ just-in-time compiler is available, further processes a compiled pat-
tern into machine code that executes much faster than the pcre2_match()
- interpretive matching function. Full details are given in the pcre2jit
+ interpretive matching function. Full details are given in the pcre2jit
documentation.
- JIT compilation is a heavyweight optimization. It can take some time
- for patterns to be analyzed, and for one-off matches and simple pat-
- terns the benefit of faster execution might be offset by a much slower
- compilation time. Most (but not all) patterns can be optimized by the
+ JIT compilation is a heavyweight optimization. It can take some time
+ for patterns to be analyzed, and for one-off matches and simple pat-
+ terns the benefit of faster execution might be offset by a much slower
+ compilation time. Most (but not all) patterns can be optimized by the
JIT compiler.
LOCALE SUPPORT
- PCRE2 handles caseless matching, and determines whether characters are
- letters, digits, or whatever, by reference to a set of tables, indexed
- by character code point. This applies only to characters whose code
- points are less than 256. By default, higher-valued code points never
- match escapes such as \w or \d. However, if PCRE2 is built with Uni-
+ PCRE2 handles caseless matching, and determines whether characters are
+ letters, digits, or whatever, by reference to a set of tables, indexed
+ by character code point. This applies only to characters whose code
+ points are less than 256. By default, higher-valued code points never
+ match escapes such as \w or \d. However, if PCRE2 is built with Uni-
code support, all characters can be tested with \p and \P, or, alterna-
- tively, the PCRE2_UCP option can be set when a pattern is compiled;
- this causes \w and friends to use Unicode property support instead of
+ tively, the PCRE2_UCP option can be set when a pattern is compiled;
+ this causes \w and friends to use Unicode property support instead of
the built-in tables.
- The use of locales with Unicode is discouraged. If you are handling
- characters with code points greater than 127, you should either use
+ The use of locales with Unicode is discouraged. If you are handling
+ characters with code points greater than 127, you should either use
Unicode support, or use locales, but not try to mix the two.
- PCRE2 contains an internal set of character tables that are used by
- default. These are sufficient for many applications. Normally, the
+ PCRE2 contains an internal set of character tables that are used by
+ default. These are sufficient for many applications. Normally, the
internal tables recognize only ASCII characters. However, when PCRE2 is
built, it is possible to cause the internal tables to be rebuilt in the
default "C" locale of the local system, which may cause them to be dif-
ferent.
- The internal tables can be overridden by tables supplied by the appli-
- cation that calls PCRE2. These may be created in a different locale
- from the default. As more and more applications change to using Uni-
+ The internal tables can be overridden by tables supplied by the appli-
+ cation that calls PCRE2. These may be created in a different locale
+ from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away.
- External tables are built by calling the pcre2_maketables() function,
- in the relevant locale. The result can be passed to pcre2_compile() as
- often as necessary, by creating a compile context and calling
- pcre2_set_character_tables() to set the tables pointer therein. For
- example, to build and use tables that are appropriate for the French
- locale (where accented characters with values greater than 127 are
+ External tables are built by calling the pcre2_maketables() function,
+ in the relevant locale. The result can be passed to pcre2_compile() as
+ often as necessary, by creating a compile context and calling
+ pcre2_set_character_tables() to set the tables pointer therein. For
+ example, to build and use tables that are appropriate for the French
+ locale (where accented characters with values greater than 127 are
treated as letters), the following code could be used:
setlocale(LC_CTYPE, "fr_FR");
@@ -1861,15 +1864,15 @@ LOCALE SUPPORT
pcre2_set_character_tables(ccontext, tables);
re = pcre2_compile(..., ccontext);
- The locale name "fr_FR" is used on Linux and other Unix-like systems;
- if you are using Windows, the name for the French locale is "french".
- It is the caller's responsibility to ensure that the memory containing
+ The locale name "fr_FR" is used on Linux and other Unix-like systems;
+ if you are using Windows, the name for the French locale is "french".
+ It is the caller's responsibility to ensure that the memory containing
the tables remains available for as long as it is needed.
The pointer that is passed (via the compile context) to pcre2_compile()
- is saved with the compiled pattern, and the same tables are used by
- pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern, com-
- pilation and matching both happen in the same locale, but different
+ is saved with the compiled pattern, and the same tables are used by
+ pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern, com-
+ pilation and matching both happen in the same locale, but different
patterns can be processed in different locales.
@@ -1877,13 +1880,13 @@ INFORMATION ABOUT A COMPILED PATTERN
int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
- The pcre2_pattern_info() function returns general information about a
+ The pcre2_pattern_info() function returns general information about a
compiled pattern. For information about callouts, see the next section.
- The first argument for pcre2_pattern_info() is a pointer to the com-
+ The first argument for pcre2_pattern_info() is a pointer to the com-
piled pattern. The second argument specifies which piece of information
- is required, and the third argument is a pointer to a variable to
- receive the data. If the third argument is NULL, the first argument is
- ignored, and the function returns the size in bytes of the variable
+ is required, and the third argument is a pointer to a variable to
+ receive the data. If the third argument is NULL, the first argument is
+ ignored, and the function returns the size in bytes of the variable
that is required for the information requested. Otherwise, the yield of
the function is zero for success, or one of the following negative num-
bers:
@@ -1893,9 +1896,9 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_ERROR_BADOPTION the value of what was invalid
PCRE2_ERROR_UNSET the requested field is not set
- The "magic number" is placed at the start of each compiled pattern as
- a simple check against passing an arbitrary memory pointer. Here is a
- typical call of pcre2_pattern_info(), to obtain the length of the com-
+ The "magic number" is placed at the start of each compiled pattern as
+ a simple check against passing an arbitrary memory pointer. Here is a
+ typical call of pcre2_pattern_info(), to obtain the length of the com-
piled pattern:
int rc;
@@ -1910,12 +1913,16 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_ALLOPTIONS
PCRE2_INFO_ARGOPTIONS
-
- Return a copy of the pattern's options. The third argument should point
- to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
- options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
- TIONS returns the compile options as modified by any top-level (*XXX)
- option settings such as (*UTF) at the start of the pattern itself.
+ PCRE2_INFO_EXTRAOPTIONS
+
+ Return copies of the pattern's options. The third argument should point
+ to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
+ options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
+ TIONS returns the compile options as modified by any top-level (*XXX)
+ option settings such as (*UTF) at the start of the pattern itself.
+ PCRE2_INFO_EXTRAOPTIONS returns the extra options that were set in the
+ compile context by calling the pcre2_set_compile_extra_options() func-
+ tion.
For example, if the pattern /(*UTF)abc/ is compiled with the
PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is
@@ -3062,88 +3069,103 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
options can be set in the options argument of pcre2_substitute().
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
- string, replacing every matching substring. If this is not set, only
- the first matching substring is replaced. If any matched substring has
- zero length, after the substitution has happened, an attempt to find a
- non-empty match at the same position is performed. If this is not suc-
- cessful, the current position is advanced by one character except when
- CRLF is a valid newline sequence and the next two characters are CR,
- LF. In this case, the current position is advanced by two characters.
-
- PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
+ string, replacing every matching substring. If this option is not set,
+ only the first matching substring is replaced. The search for matches
+ takes place in the original subject string (that is, previous replace-
+ ments do not affect it). Iteration is implemented by advancing the
+ startoffset value for each search, which is always passed the entire
+ subject string. If an offset limit is set in the match context, search-
+ ing stops when that limit is reached.
+
+ You can restrict the effect of a global substitution to a portion of
+ the subject string by setting either or both of startoffset and an off-
+ set limit. Here is a pcre2test example:
+
+ /B/g,replace=!,use_offset_limit
+ ABC ABC ABC ABC\=offset=3,offset_limit=12
+ 2: ABC A!C A!C ABC
+
+ When continuing with global substitutions after matching a substring
+ with zero length, an attempt to find a non-empty match at the same off-
+ set is performed. If this is not successful, the offset is advanced by
+ one character except when CRLF is a valid newline sequence and the next
+ two characters are CR, LF. In this case, the offset is advanced by two
+ characters.
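The advance rule for empty matches can be sketched as a small helper.
This is an illustration under simplifying assumptions, not the code
that pcre2_substitute() uses: it assumes the 8-bit library in non-UTF
mode, so that one character is one code unit, and the function name
next_offset_after_empty_match() is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Given the offset at which
   an empty match was found and no non-empty match followed, return the
   offset for the next search. crlf_is_newline is nonzero when CRLF is a
   valid newline sequence for the pattern. */
size_t next_offset_after_empty_match(const char *subject, size_t length,
  size_t offset, int crlf_is_newline)
{
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;   /* step over CR and LF together */
  return offset + 1;     /* step over one character */
}
```

In a UTF mode the single-character step would have to skip a whole
multi-unit character rather than one code unit.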
+
+ PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
- ORY immediately. If this option is set, however, pcre2_substitute()
+ ORY immediately. If this option is set, however, pcre2_substitute()
continues to go through the motions of matching and substituting (with-
- out, of course, writing anything) in order to compute the size of buf-
- fer that is needed. This value is passed back via the outlengthptr
- variable, with the result of the function still being
+ out, of course, writing anything) in order to compute the size of buf-
+ fer that is needed. This value is passed back via the outlengthptr
+ variable, with the result of the function still being
PCRE2_ERROR_NOMEMORY.
- Passing a buffer size of zero is a permitted way of finding out how
- much memory is needed for a given substitution. However, this does mean
+ Passing a buffer size of zero is a permitted way of finding out how
+ much memory is needed for a given substitution. However, this does mean
that the entire operation is carried out twice. Depending on the appli-
- cation, it may be more efficient to allocate a large buffer and free
- the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
+ cation, it may be more efficient to allocate a large buffer and free
+ the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
FLOW_LENGTH.
- PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
- that do not appear in the pattern to be treated as unset groups. This
- option should be used with care, because it means that a typo in a
- group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
+ PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
+ that do not appear in the pattern to be treated as unset groups. This
+ option should be used with care, because it means that a typo in a
+ group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
error.
- PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
+ PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be
- treated as empty strings when inserted as described above. If this
- option is not set, an attempt to insert an unset group causes the
- PCRE2_ERROR_UNSET error. This option does not influence the extended
+ treated as empty strings when inserted as described above. If this
+ option is not set, an attempt to insert an unset group causes the
+ PCRE2_ERROR_UNSET error. This option does not influence the extended
substitution syntax described below.
- PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
- replacement string. Without this option, only the dollar character is
- special, and only the group insertion forms listed above are valid.
+ PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
+ replacement string. Without this option, only the dollar character is
+ special, and only the group insertion forms listed above are valid.
When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
- Firstly, backslash in a replacement string is interpreted as an escape
+ Firstly, backslash in a replacement string is interpreted as an escape
character. The usual forms such as \n or \x{ddd} can be used to specify
- particular character codes, and backslash followed by any non-alphanu-
- meric character quotes that character. Extended quoting can be coded
+ particular character codes, and backslash followed by any non-alphanu-
+ meric character quotes that character. Extended quoting can be coded
using \Q...\E, exactly as in pattern strings.
- There are also four escape sequences for forcing the case of inserted
- letters. The insertion mechanism has three states: no case forcing,
+ There are also four escape sequences for forcing the case of inserted
+ letters. The insertion mechanism has three states: no case forcing,
force upper case, and force lower case. The escape sequences change the
current state: \U and \L change to upper or lower case forcing, respec-
- tively, and \E (when not terminating a \Q quoted sequence) reverts to
- no case forcing. The sequences \u and \l force the next character (if
- it is a letter) to upper or lower case, respectively, and then the
+ tively, and \E (when not terminating a \Q quoted sequence) reverts to
+ no case forcing. The sequences \u and \l force the next character (if
+ it is a letter) to upper or lower case, respectively, and then the
state automatically reverts to no case forcing. Case forcing applies to
all inserted characters, including those from captured groups and let-
ters within \Q...\E quoted sequences.
Note that case forcing sequences such as \U...\E do not nest. For exam-
- ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
+ ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
\E has no effect.
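The three-state behaviour described above can be modelled with a toy
state machine. This is a sketch for illustration, not the
pcre2_substitute() implementation: it assumes a plain ASCII input,
handles only the escapes \U, \L, \E, \u and \l (no \Q, no group
insertion), and the function name force_case() is hypothetical.

```c
#include <ctype.h>

enum case_state { CASE_NONE, CASE_UPPER, CASE_LOWER };

/* Hypothetical toy model, not a PCRE2 function. Copies "in" to "out",
   applying the case forcing escapes. "out" must have room for
   strlen(in) + 1 bytes. */
void force_case(const char *in, char *out)
{
  enum case_state state = CASE_NONE;   /* forcing applied to the next char */
  enum case_state after = CASE_NONE;   /* state to revert to after that char */

  while (*in != '\0')
    {
    if (in[0] == '\\' && in[1] != '\0')
      {
      switch (in[1])
        {
        case 'U': state = after = CASE_UPPER; in += 2; continue;
        case 'L': state = after = CASE_LOWER; in += 2; continue;
        case 'E': state = after = CASE_NONE; in += 2; continue;
        case 'u': state = CASE_UPPER; after = CASE_NONE; in += 2; continue;
        case 'l': state = CASE_LOWER; after = CASE_NONE; in += 2; continue;
        default: break;   /* not a case escape: copy the backslash as is */
        }
      }
    unsigned char c = (unsigned char)*in++;
    if (state == CASE_UPPER) c = (unsigned char)toupper(c);
    else if (state == CASE_LOWER) c = (unsigned char)tolower(c);
    *out++ = (char)c;
    state = after;        /* one-shot forcing from \u or \l ends here */
    }
  *out = '\0';
}
```

Because there is a single current state rather than a stack, \L simply
replaces the forcing started by \U, and the first \E returns to no case
forcing. That is why the sequences do not nest.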
- The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
- flexibility to group substitution. The syntax is similar to that used
+ The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
+ flexibility to group substitution. The syntax is similar to that used
by Bash:
${<n>:-<string>}
${<n>:+<string1>:<string2>}
- As before, <n> may be a group number or a name. The first form speci-
- fies a default value. If group <n> is set, its value is inserted; if
- not, <string> is expanded and the result inserted. The second form
- specifies strings that are expanded and inserted when group <n> is set
- or unset, respectively. The first form is just a convenient shorthand
+ As before, <n> may be a group number or a name. The first form speci-
+ fies a default value. If group <n> is set, its value is inserted; if
+ not, <string> is expanded and the result inserted. The second form
+ specifies strings that are expanded and inserted when group <n> is set
+ or unset, respectively. The first form is just a convenient shorthand
for
${<n>:+${<n>}:<string>}
- Backslash can be used to escape colons and closing curly brackets in
- the replacement strings. A change of the case forcing state within a
- replacement string remains in force afterwards, as shown in this
+ Backslash can be used to escape colons and closing curly brackets in
+ the replacement strings. A change of the case forcing state within a
+ replacement string remains in force afterwards, as shown in this
pcre2test example:
/(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
@@ -3152,41 +3174,41 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
somebody
1: HELLO
- The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
- substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
+ The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
+ substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
unknown groups in the extended syntax forms to be treated as unset.
- If successful, pcre2_substitute() returns the number of replacements
+ If successful, pcre2_substitute() returns the number of replacements
that were made. This may be zero if no matches were found, and is never
greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
In the event of an error, a negative error code is returned. Except for
- PCRE2_ERROR_NOMATCH (which is never returned), errors from
+ PCRE2_ERROR_NOMATCH (which is never returned), errors from
pcre2_match() are passed straight back.
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
- ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
+ ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
when the simple (non-extended) syntax is used and PCRE2_SUBSTI-
TUTE_UNSET_EMPTY is not set.
- PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
+ PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
- of buffer that is needed is returned via outlengthptr. Note that this
+ of buffer that is needed is returned via outlengthptr. Note that this
does not happen by default.
- PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
+ PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
the replacement string, with more particular errors being
- PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
- MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI-
+ PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
+ MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI-
TUTION (syntax error in extended group substitution), and
- PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started,
+ PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started,
which can happen if \K is used in an assertion).
As for all PCRE2 errors, a text message that describes the error can be
- obtained by calling the pcre2_get_error_message() function (see
+ obtained by calling the pcre2_get_error_message() function (see
"Obtaining a textual error message" above).
@@ -3195,56 +3217,56 @@ DUPLICATE SUBPATTERN NAMES
int pcre2_substring_nametable_scan(const pcre2_code *code,
PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
- When a pattern is compiled with the PCRE2_DUPNAMES option, names for
- subpatterns are not required to be unique. Duplicate names are always
- allowed for subpatterns with the same number, created by using the (?|
- feature. Indeed, if such subpatterns are named, they are required to
+ When a pattern is compiled with the PCRE2_DUPNAMES option, names for
+ subpatterns are not required to be unique. Duplicate names are always
+ allowed for subpatterns with the same number, created by using the (?|
+ feature. Indeed, if such subpatterns are named, they are required to
use the same names.
Normally, patterns with duplicate names are such that in any one match,
- only one of the named subpatterns participates. An example is shown in
+ only one of the named subpatterns participates. An example is shown in
the pcre2pattern documentation.
- When duplicates are present, pcre2_substring_copy_byname() and
- pcre2_substring_get_byname() return the first substring corresponding
- to the given name that is set. Only if none are set is
- PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
+ When duplicates are present, pcre2_substring_copy_byname() and
+ pcre2_substring_get_byname() return the first substring corresponding
+ to the given name that is set. Only if none are set is
+ PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
duplicate names.
- If you want to get full details of all captured substrings for a given
- name, you must use the pcre2_substring_nametable_scan() function. The
- first argument is the compiled pattern, and the second is the name. If
- the third and fourth arguments are NULL, the function returns a group
+ If you want to get full details of all captured substrings for a given
+ name, you must use the pcre2_substring_nametable_scan() function. The
+ first argument is the compiled pattern, and the second is the name. If
+ the third and fourth arguments are NULL, the function returns a group
number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
When the third and fourth arguments are not NULL, they must be pointers
- to variables that are updated by the function. After it has run, they
+ to variables that are updated by the function. After it has run, they
point to the first and last entries in the name-to-number table for the
- given name, and the function returns the length of each entry in code
- units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
+ given name, and the function returns the length of each entry in code
+ units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
no entries for the given name.
The format of the name table is described above in the section entitled
- Information about a pattern. Given all the relevant entries for the
- name, you can extract each of their numbers, and hence the captured
+ Information about a pattern. Given all the relevant entries for the
+ name, you can extract each of their numbers, and hence the captured
data.
FINDING ALL POSSIBLE MATCHES AT ONE POSITION
- The traditional matching function uses a similar algorithm to Perl,
- which stops when it finds the first match at a given point in the sub-
+ The traditional matching function uses a similar algorithm to Perl,
+ which stops when it finds the first match at a given point in the sub-
ject. If you want to find all possible matches, or the longest possible
- match at a given position, consider using the alternative matching
- function (see below) instead. If you cannot use the alternative func-
+ match at a given position, consider using the alternative matching
+ function (see below) instead. If you cannot use the alternative func-
tion, you can kludge it up by making use of the callout facility, which
is described in the pcre2callout documentation.
What you have to do is to insert a callout right at the end of the pat-
- tern. When your callout function is called, extract and save the cur-
- rent matched substring. Then return 1, which forces pcre2_match() to
- backtrack and try other alternatives. Ultimately, when it runs out of
+ tern. When your callout function is called, extract and save the cur-
+ rent matched substring. Then return 1, which forces pcre2_match() to
+ backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
@@ -3256,26 +3278,26 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
pcre2_match_context *mcontext,
int *workspace, PCRE2_SIZE wscount);
- The function pcre2_dfa_match() is called to match a subject string
- against a compiled pattern, using a matching algorithm that scans the
+ The function pcre2_dfa_match() is called to match a subject string
+ against a compiled pattern, using a matching algorithm that scans the
subject string just once (not counting lookaround assertions), and does
- not backtrack. This has different characteristics to the normal algo-
- rithm, and is not compatible with Perl. Some of the features of PCRE2
- patterns are not supported. Nevertheless, there are times when this
- kind of matching can be useful. For a discussion of the two matching
+ not backtrack. This has different characteristics to the normal algo-
+ rithm, and is not compatible with Perl. Some of the features of PCRE2
+ patterns are not supported. Nevertheless, there are times when this
+ kind of matching can be useful. For a discussion of the two matching
algorithms, and a list of features that pcre2_dfa_match() does not sup-
port, see the pcre2matching documentation.
- The arguments for the pcre2_dfa_match() function are the same as for
+ The arguments for the pcre2_dfa_match() function are the same as for
pcre2_match(), plus two extras. The ovector within the match data block
is used in a different way, and this is described below. The other com-
- mon arguments are used in the same way as for pcre2_match(), so their
+ mon arguments are used in the same way as for pcre2_match(), so their
description is not repeated here.
- The two additional arguments provide workspace for the function. The
- workspace vector should contain at least 20 elements. It is used for
+ The two additional arguments provide workspace for the function. The
+ workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More
- workspace is needed for patterns and subjects where there are a lot of
+ workspace is needed for patterns and subjects where there are a lot of
potential matches.
Here is an example of a simple call to pcre2_dfa_match():
@@ -3288,52 +3310,52 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
- match_data, /* the match data block */
+ md, /* the match data block */
NULL, /* a match context; NULL means use defaults */
wspace, /* working space vector */
20); /* number of elements (NOT size in bytes) */
Option bits for pcre2_dfa_match()
- The unused bits of the options argument for pcre2_dfa_match() must be
- zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN-
- CHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
+ The unused bits of the options argument for pcre2_dfa_match() must be
+ zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN-
+ CHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
- PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but
- the last four of these are exactly the same as for pcre2_match(), so
+ PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but
+ the last four of these are exactly the same as for pcre2_match(), so
their description is not repeated here.
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
- These have the same general effect as they do for pcre2_match(), but
- the details are slightly different. When PCRE2_PARTIAL_HARD is set for
- pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
+ These have the same general effect as they do for pcre2_match(), but
+ the details are slightly different. When PCRE2_PARTIAL_HARD is set for
+ pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
subject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete
- matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
- return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
- if the end of the subject is reached, there have been no complete
+ matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
+ return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
+ if the end of the subject is reached, there have been no complete
matches, but there is still at least one matching possibility. The por-
- tion of the string that was inspected when the longest partial match
+ tion of the string that was inspected when the longest partial match
was found is set as the first matching string in both cases. There is a
- more detailed discussion of partial and multi-segment matching, with
+ more detailed discussion of partial and multi-segment matching, with
examples, in the pcre2partial documentation.
PCRE2_DFA_SHORTEST
- Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
+ Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna-
- tive algorithm works, this is necessarily the shortest possible match
+ tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string.
PCRE2_DFA_RESTART
- When pcre2_dfa_match() returns a partial match, it is possible to call
+ When pcre2_dfa_match() returns a partial match, it is possible to call
it again, with additional subject characters, and have it continue with
the same match. The PCRE2_DFA_RESTART option requests this action; when
- it is set, the workspace and wscount options must reference the same
- vector as before because data about the match so far is left in them
+ it is set, the workspace and wscount options must reference the same
+ vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the
pcre2partial documentation.
@@ -3341,8 +3363,8 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
When pcre2_dfa_match() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run
- of the function start at the same point in the subject. The shorter
- matches are all initial substrings of the longer matches. For example,
+ of the function start at the same point in the subject. The shorter
+ matches are all initial substrings of the longer matches. For example,
if the pattern
<.*>
@@ -3357,17 +3379,17 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
<something> <something else>
<something>
- On success, the yield of the function is a number greater than zero,
- which is the number of matched substrings. The offsets of the sub-
- strings are returned in the ovector, and can be extracted by number in
- the same way as for pcre2_match(), but the numbers bear no relation to
- any capturing groups that may exist in the pattern, because DFA match-
+ On success, the yield of the function is a number greater than zero,
+ which is the number of matched substrings. The offsets of the sub-
+ strings are returned in the ovector, and can be extracted by number in
+ the same way as for pcre2_match(), but the numbers bear no relation to
+ any capturing groups that may exist in the pattern, because DFA match-
ing does not support group capture.
- Calls to the convenience functions that extract substrings by name
- return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used
+ Calls to the convenience functions that extract substrings by name
+ return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used
after a DFA match. The convenience functions that extract substrings by
- number never return PCRE2_ERROR_NOSUBSTRING, and the meanings of some
+ number never return PCRE2_ERROR_NOSUBSTRING, and the meanings of some
other errors are slightly different:
PCRE2_ERROR_UNAVAILABLE
@@ -3377,64 +3399,64 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
PCRE2_ERROR_UNSET
- There is a slot in the ovector for this substring, but there were
+ There is a slot in the ovector for this substring, but there were
insufficient matches to fill it.
- The matched strings are stored in the ovector in reverse order of
- length; that is, the longest matching string is first. If there were
- too many matches to fit into the ovector, the yield of the function is
+ The matched strings are stored in the ovector in reverse order of
+ length; that is, the longest matching string is first. If there were
+ too many matches to fit into the ovector, the yield of the function is
zero, and the vector is filled with the longest matches.
- NOTE: PCRE2's "auto-possessification" optimization usually applies to
- character repeats at the end of a pattern (as well as internally). For
- example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
- matching, this means that only one possible match is found. If you
- really do want multiple matches in such cases, either use an ungreedy
- repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
+ NOTE: PCRE2's "auto-possessification" optimization usually applies to
+ character repeats at the end of a pattern (as well as internally). For
+ example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
+ matching, this means that only one possible match is found. If you
+ really do want multiple matches in such cases, either use an ungreedy
+ repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
compiling.
Error returns from pcre2_dfa_match()
The pcre2_dfa_match() function returns a negative number when it fails.
- Many of the errors are the same as for pcre2_match(), as described
+ Many of the errors are the same as for pcre2_match(), as described
above. There are in addition the following errors that are specific to
pcre2_dfa_match():
PCRE2_ERROR_DFA_UITEM
- This return is given if pcre2_dfa_match() encounters an item in the
- pattern that it does not support, for instance, the use of \C in a UTF
+ This return is given if pcre2_dfa_match() encounters an item in the
+ pattern that it does not support, for instance, the use of \C in a UTF
mode or a back reference.
PCRE2_ERROR_DFA_UCOND
- This return is given if pcre2_dfa_match() encounters a condition item
- that uses a back reference for the condition, or a test for recursion
+ This return is given if pcre2_dfa_match() encounters a condition item
+ that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.
PCRE2_ERROR_DFA_WSSIZE
- This return is given if pcre2_dfa_match() runs out of space in the
+ This return is given if pcre2_dfa_match() runs out of space in the
workspace vector.
PCRE2_ERROR_DFA_RECURSE
- When a recursive subpattern is processed, the matching function calls
+ When a recursive subpattern is processed, the matching function calls
itself recursively, using private memory for the ovector and workspace.
- This error is given if the internal ovector is not large enough. This
+ This error is given if the internal ovector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.
PCRE2_ERROR_DFA_BADRESTART
- When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
- some plausibility checks are made on the contents of the workspace,
- which should contain data about the previous partial match. If any of
+ When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
+ some plausibility checks are made on the contents of the workspace,
+ which should contain data about the previous partial match. If any of
these checks fail, this error is given.
SEE ALSO
- pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
+ pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
@@ -3447,7 +3469,7 @@ AUTHOR
REVISION
- Last updated: 13 October 2017
+ Last updated: 16 December 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -4183,16 +4205,18 @@ THE CALLOUT INTERFACE
During matching, when PCRE2 reaches a callout point, if an external
function is provided in the match context, it is called. This applies
- to both normal and DFA matching. The first argument to the callout
- function is a pointer to a pcre2_callout block. The second argument is
- the void * callout data that was supplied when the callout was set up
- by calling pcre2_set_callout() (see the pcre2api documentation). The
- callout block structure contains the following fields:
+ to normal, DFA, and JIT matching. The first argument to the call-
+ out function is a pointer to a pcre2_callout block. The second argument
+ is the void * callout data that was supplied when the callout was set
+ up by calling pcre2_set_callout() (see the pcre2api documentation). The
+ callout block structure contains the following fields, not necessarily
+ in this order:
uint32_t version;
uint32_t callout_number;
uint32_t capture_top;
uint32_t capture_last;
+ uint32_t callout_flags;
PCRE2_SIZE *offset_vector;
PCRE2_SPTR mark;
PCRE2_SPTR subject;
@@ -4205,133 +4229,161 @@ THE CALLOUT INTERFACE
PCRE2_SIZE callout_string_length;
PCRE2_SPTR callout_string;
- The version field contains the version number of the block format. The
- current version is 1; the three callout string fields were added for
- this version. If you are writing an application that might use an ear-
- lier release of PCRE2, you should check the version number before
- accessing any of these fields. The version number will increase in
- future if more fields are added, but the intention is never to remove
- any of the existing fields.
+ The version field contains the version number of the block format. The
+ current version is 2; the three callout string fields were added for
+ version 1, and the callout_flags field for version 2. If you are writ-
+ ing an application that might use an earlier release of PCRE2, you
+ should check the version number before accessing any of these fields.
+ The version number will increase in future if more fields are added,
+ but the intention is never to remove any of the existing fields.
Fields for numerical callouts
- For a numerical callout, callout_string is NULL, and callout_number
- contains the number of the callout, in the range 0-255. This is the
- number that follows (?C for callouts that part of the pattern; it is
+ For a numerical callout, callout_string is NULL, and callout_number
+ contains the number of the callout, in the range 0-255. This is the
+ number that follows (?C for callouts that are part of the pattern; it is
255 for automatically generated callouts.
Fields for string callouts
- For callouts with string arguments, callout_number is always zero, and
- callout_string points to the string that is contained within the com-
+ For callouts with string arguments, callout_number is always zero, and
+ callout_string points to the string that is contained within the com-
piled pattern. Its length is given by callout_string_length. Duplicated
ending delimiters that were present in the original pattern string have
been turned into single characters, but there is no other processing of
- the callout string argument. An additional code unit containing binary
- zero is present after the string, but is not included in the length.
- The delimiter that was used to start the string is also stored within
- the pattern, immediately before the string itself. You can access this
+ the callout string argument. An additional code unit containing binary
+ zero is present after the string, but is not included in the length.
+ The delimiter that was used to start the string is also stored within
+ the pattern, immediately before the string itself. You can access this
delimiter as callout_string[-1] if you need it.
The callout_string_offset field is the code unit offset to the start of
the callout argument string within the original pattern string. This is
- provided for the benefit of applications such as script languages that
+ provided for the benefit of applications such as script languages that
might need to report errors in the callout string within the pattern.
Fields for all callouts
- The remaining fields in the callout block are the same for both kinds
+ The remaining fields in the callout block are the same for both kinds
of callout.
- The offset_vector field is a pointer to a vector of capturing offsets
+ The offset_vector field is a pointer to a vector of capturing offsets
(the "ovector"). You may read the elements in this vector, but you must
not change any of them.
- For calls to pcre2_match(), the offset_vector field is not (since
- release 10.30) a pointer to the actual ovector that was passed to the
- matching function in the match data block. Instead it points to an
- internal ovector of a size large enough to hold all possible captured
+ For calls to pcre2_match(), the offset_vector field is not (since
+ release 10.30) a pointer to the actual ovector that was passed to the
+ matching function in the match data block. Instead it points to an
+ internal ovector of a size large enough to hold all possible captured
substrings in the pattern. Note that whenever a recursion or subroutine
- call within a pattern completes, the capturing state is reset to what
+ call within a pattern completes, the capturing state is reset to what
it was before.
- The capture_last field contains the number of the most recently cap-
- tured substring, and the capture_top field contains one more than the
- number of the highest numbered captured substring so far. If no sub-
- strings have yet been captured, the value of capture_last is 0 and the
- value of capture_top is 1. The values of these fields do not always
- differ by one; for example, when the callout in the pattern
+ The capture_last field contains the number of the most recently cap-
+ tured substring, and the capture_top field contains one more than the
+ number of the highest numbered captured substring so far. If no sub-
+ strings have yet been captured, the value of capture_last is 0 and the
+ value of capture_top is 1. The values of these fields do not always
+ differ by one; for example, when the callout in the pattern
((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
- The contents of ovector[2] to ovector[<capture_top>*2-1] can be
+ The contents of ovector[2] to ovector[<capture_top>*2-1] can be
inspected in order to extract substrings that have been matched so far,
- in the same way as extracting substrings after a match has completed.
- The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
- the match is by definition not complete. Substrings that have not been
- captured but whose numbers are less than capture_top also have both of
+ in the same way as extracting substrings after a match has completed.
+ The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
+ the match is by definition not complete. Substrings that have not been
+ captured but whose numbers are less than capture_top also have both of
their ovector slots set to PCRE2_UNSET.
- For DFA matching, the offset_vector field points to the ovector that
- was passed to the matching function in the match data block, but it
- holds no useful information at callout time because pcre2_dfa_match()
- does not support substring capturing. The value of capture_top is
+ For DFA matching, the offset_vector field points to the ovector that
+ was passed to the matching function in the match data block, but it
+ holds no useful information at callout time because pcre2_dfa_match()
+ does not support substring capturing. The value of capture_top is
always 1 and the value of capture_last is always 0 for DFA matching.
The subject and subject_length fields contain copies of the values that
were passed to the matching function.
- The start_match field normally contains the offset within the subject
- at which the current match attempt started. However, if the escape
- sequence \K has been encountered, this value is changed to reflect the
- modified starting point. If the pattern is not anchored, the callout
+ The start_match field normally contains the offset within the subject
+ at which the current match attempt started. However, if the escape
+ sequence \K has been encountered, this value is changed to reflect the
+ modified starting point. If the pattern is not anchored, the callout
function may be called several times from the same point in the pattern
for different starting points in the subject.
- The current_position field contains the offset within the subject of
+ The current_position field contains the offset within the subject of
the current match pointer.
The pattern_position field contains the offset in the pattern string to
the next item to be matched.
- The next_item_length field contains the length of the next item to be
- processed in the pattern string. When the callout is at the end of the
- pattern, the length is zero. When the callout precedes an opening
+ The next_item_length field contains the length of the next item to be
+ processed in the pattern string. When the callout is at the end of the
+ pattern, the length is zero. When the callout precedes an opening
parenthesis, the length includes meta characters that follow the paren-
- thesis. For example, in a callout before an assertion such as (?=ab)
- the length is 3. For an an alternation bar or a closing parenthesis,
- the length is one, unless a closing parenthesis is followed by a quan-
+ thesis. For example, in a callout before an assertion such as (?=ab)
+ the length is 3. For an alternation bar or a closing parenthesis,
+ the length is one, unless a closing parenthesis is followed by a quan-
tifier, in which case its length is included. (This changed in release
- 10.23. In earlier releases, before an opening parenthesis the length
- was that of the entire subpattern, and before an alternation bar or a
+ 10.23. In earlier releases, before an opening parenthesis the length
+ was that of the entire subpattern, and before an alternation bar or a
closing parenthesis the length was zero.)
- The pattern_position and next_item_length fields are intended to help
- in distinguishing between different automatic callouts, which all have
- the same callout number. However, they are set for all callouts, and
+ The pattern_position and next_item_length fields are intended to help
+ in distinguishing between different automatic callouts, which all have
+ the same callout number. However, they are set for all callouts, and
are used by pcre2test to show the next item to be matched when display-
ing callout information.
In callouts from pcre2_match() the mark field contains a pointer to the
- zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
- (*THEN) item in the match, or NULL if no such items have been passed.
- Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
+ zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
+ (*THEN) item in the match, or NULL if no such items have been passed.
+ Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
previous (*MARK). In callouts from the DFA matching function this field
always contains NULL.
+ The callout_flags field is always zero in callouts from
+ pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
+ JIT is used, the following bits may be set:
+
+ PCRE2_CALLOUT_STARTMATCH
+
+ This is set for the first callout after the start of matching for each
+ new starting position in the subject.
+
+ PCRE2_CALLOUT_BACKTRACK
+
+ This is set if there has been a matching backtrack since the previous
+ callout, or since the start of matching if this is the first callout
+ from a pcre2_match() run.
+
+ Both bits are set when a backtrack has caused a "bumpalong" to a new
+ starting position in the subject. Output from pcre2test does not indi-
+ cate the presence of these bits unless the callout_extra modifier is
+ set.
+
+ The information in the callout_flags field is provided so that applica-
+ tions can track and tell their users how matching with backtracking is
+ done. This can be useful when trying to optimize patterns, or just to
+ understand how PCRE2 works. There is no support in pcre2_dfa_match()
+ because there is no backtracking in DFA matching, and there is no sup-
+ port in JIT because JIT is all about maximizing matching performance.
+ In both these cases the callout_flags field is always zero.
+
RETURN VALUES FROM CALLOUTS
The external callout function returns an integer to PCRE2. If the value
- is zero, matching proceeds as normal. If the value is greater than
- zero, matching fails at the current point, but the testing of other
+ is zero, matching proceeds as normal. If the value is greater than
+ zero, matching fails at the current point, but the testing of other
matching possibilities goes ahead, just as if a lookahead assertion had
failed. If the value is less than zero, the match is abandoned, and the
matching function returns the negative value.
- Negative values should normally be chosen from the set of
- PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
- standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
- reserved for use by callout functions; it will never be used by PCRE2
+ Negative values should normally be chosen from the set of
+ PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
+ standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
+ reserved for use by callout functions; it will never be used by PCRE2
itself.
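The version check, the callout_flags bits, and the return-value convention documented in this hunk can be sketched together as below. This is a hedged illustration, not the library's code: demo_flags_block is a hypothetical stand-in for the two pcre2_callout_block fields it reads, demo_trace is invented application data, and the DEMO_ bit values are assumptions. Real code should include pcre2.h and use PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit values, defined here only so the sketch compiles without
   pcre2.h; use the PCRE2_CALLOUT_xxx names from the header in real code. */
#define DEMO_CALLOUT_STARTMATCH 0x00000001u
#define DEMO_CALLOUT_BACKTRACK  0x00000002u

/* Hypothetical stand-in for the pcre2_callout_block fields used below. */
typedef struct {
  uint32_t version;         /* block format version */
  uint32_t callout_flags;   /* present from version 2 onwards */
} demo_flags_block;

/* Hypothetical application counters, passed as the callout data. */
typedef struct {
  unsigned start_positions;  /* callouts that began a new start position */
  unsigned backtracks;       /* callouts reached after a backtrack */
  unsigned bumpalongs;       /* both bits set at once */
} demo_trace;

/* Count what the flags report, then return 0 so matching proceeds as
   normal. A positive return would fail the current path; a negative
   return would abandon the match with that value. */
int demo_trace_callout(demo_flags_block *cb, void *data)
{
  demo_trace *trace = data;
  uint32_t flags;

  assert(cb != NULL && trace != NULL);

  /* Older block formats have no callout_flags field, so treat it as 0. */
  flags = (cb->version >= 2) ? cb->callout_flags : 0;

  if (flags & DEMO_CALLOUT_STARTMATCH) trace->start_positions++;
  if (flags & DEMO_CALLOUT_BACKTRACK) trace->backtracks++;
  if ((flags & DEMO_CALLOUT_STARTMATCH) && (flags & DEMO_CALLOUT_BACKTRACK))
    trace->bumpalongs++;

  return 0;
}
```

Because the flags are always zero for pcre2_dfa_match() and for JIT matching, a tracer like this only reports anything useful when pcre2_match() runs through the interpreter.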
@@ -4342,14 +4394,14 @@ CALLOUT ENUMERATION
void *user_data);
A script language that supports the use of string arguments in callouts
- might like to scan all the callouts in a pattern before running the
+ might like to scan all the callouts in a pattern before running the
match. This can be done by calling pcre2_callout_enumerate(). The first
- argument is a pointer to a compiled pattern, the second points to a
- callback function, and the third is arbitrary user data. The callback
- function is called for every callout in the pattern in the order in
+ argument is a pointer to a compiled pattern, the second points to a
+ callback function, and the third is arbitrary user data. The callback
+ function is called for every callout in the pattern in the order in
which they appear. Its first argument is a pointer to a callout enumer-
- ation block, and its second argument is the user_data value that was
- passed to pcre2_callout_enumerate(). The data block contains the fol-
+ ation block, and its second argument is the user_data value that was
+ passed to pcre2_callout_enumerate(). The data block contains the fol-
lowing fields:
version Block version number
@@ -4360,17 +4412,17 @@ CALLOUT ENUMERATION
callout_string_length Length of callout string
callout_string Points to callout string or is NULL
- The version number is currently 0. It will increase if new fields are
- ever added to the block. The remaining fields are the same as their
- namesakes in the pcre2_callout block that is used for callouts during
+ The version number is currently 0. It will increase if new fields are
+ ever added to the block. The remaining fields are the same as their
+ namesakes in the pcre2_callout block that is used for callouts during
matching, as described above.
- Note that the value of pattern_position is unique for each callout.
- However, if a callout occurs inside a group that is quantified with a
+ Note that the value of pattern_position is unique for each callout.
+ However, if a callout occurs inside a group that is quantified with a
non-zero minimum or a fixed maximum, the group is replicated inside the
- compiled pattern. For example, a pattern such as /(a){2}/ is compiled
- as if it were /(a)(a)/. This means that the callout will be enumerated
- more than once, but with the same value for pattern_position in each
+ compiled pattern. For example, a pattern such as /(a){2}/ is compiled
+ as if it were /(a)(a)/. This means that the callout will be enumerated
+ more than once, but with the same value for pattern_position in each
case.
The callback function should normally return zero. If it returns a non-
@@ -4387,7 +4439,7 @@ AUTHOR
REVISION
- Last updated: 14 April 2017
+ Last updated: 22 December 2017
Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
diff --git a/doc/pcre2api.3 b/doc/pcre2api.3
index 6925df1..15d817b 100644
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@@ -3185,7 +3185,7 @@ subject string by setting either or both of \fIstartoffset\fP and an offset
limit. Here is a \fPpcre2test\fP example:
.sp
/B/g,replace=!,use_offset_limit
- ABC ABC ABC ABC\=offset=3,offset_limit=12
+ ABC ABC ABC ABC\e=offset=3,offset_limit=12
2: ABC A!C A!C ABC
.sp
When continuing with global substitutions after matching a substring with zero
diff --git a/doc/pcre2callout.3 b/doc/pcre2callout.3
index a0b635a..e3fd600 100644
--- a/doc/pcre2callout.3
+++ b/doc/pcre2callout.3
@@ -1,4 +1,4 @@
-.TH PCRE2CALLOUT 3 "14 April 2017" "PCRE2 10.30"
+.TH PCRE2CALLOUT 3 "22 December 2017" "PCRE2 10.31"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -191,20 +191,22 @@ callouts such as the example above are obeyed.
.rs
.sp
During matching, when PCRE2 reaches a callout point, if an external function is
-provided in the match context, it is called. This applies to both normal and
-DFA matching. The first argument to the callout function is a pointer to a
-\fBpcre2_callout\fP block. The second argument is the void * callout data that
-was supplied when the callout was set up by calling \fBpcre2_set_callout()\fP
-(see the
+provided in the match context, it is called. This applies to normal,
+DFA, and JIT matching. The first argument to the callout function is a pointer
+to a \fBpcre2_callout\fP block. The second argument is the void * callout data
+that was supplied when the callout was set up by calling
+\fBpcre2_set_callout()\fP (see the
.\" HREF
\fBpcre2api\fP
.\"
-documentation). The callout block structure contains the following fields:
+documentation). The callout block structure contains the following fields, not
+necessarily in this order:
.sp
uint32_t \fIversion\fP;
uint32_t \fIcallout_number\fP;
uint32_t \fIcapture_top\fP;
uint32_t \fIcapture_last\fP;
+ uint32_t \fIcallout_flags\fP;
PCRE2_SIZE *\fIoffset_vector\fP;
PCRE2_SPTR \fImark\fP;
PCRE2_SPTR \fIsubject\fP;
@@ -218,11 +220,12 @@ documentation). The callout block structure contains the following fields:
PCRE2_SPTR \fIcallout_string\fP;
.sp
The \fIversion\fP field contains the version number of the block format. The
-current version is 1; the three callout string fields were added for this
-version. If you are writing an application that might use an earlier release of
-PCRE2, you should check the version number before accessing any of these
-fields. The version number will increase in future if more fields are added,
-but the intention is never to remove any of the existing fields.
+current version is 2; the three callout string fields were added for version 1,
+and the \fIcallout_flags\fP field for version 2. If you are writing an
+application that might use an earlier release of PCRE2, you should check the
+version number before accessing any of these fields. The version number will
+increase in future if more fields are added, but the intention is never to
+remove any of the existing fields.
.
.
.SS "Fields for numerical callouts"
@@ -331,6 +334,33 @@ the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
(*THEN) item in the match, or NULL if no such items have been passed. Instances
of (*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
callouts from the DFA matching function this field always contains NULL.
+.P
+The \fIcallout_flags\fP field is always zero in callouts from
+\fBpcre2_dfa_match()\fP or when JIT is being used. When \fBpcre2_match()\fP
+without JIT is used, the following bits may be set:
+.sp
+ PCRE2_CALLOUT_STARTMATCH
+.sp
+This is set for the first callout after the start of matching for each new
+starting position in the subject.
+.sp
+ PCRE2_CALLOUT_BACKTRACK
+.sp
+This is set if there has been a matching backtrack since the previous callout,
+or since the start of matching if this is the first callout from a
+\fBpcre2_match()\fP run.
+.P
+Both bits are set when a backtrack has caused a "bumpalong" to a new starting
+position in the subject. Output from \fBpcre2test\fP does not indicate the
+presence of these bits unless the \fBcallout_extra\fP modifier is set.
+.P
+The information in the \fBcallout_flags\fP field is provided so that
+applications can track and tell their users how matching with backtracking is
+done. This can be useful when trying to optimize patterns, or just to
+understand how PCRE2 works. There is no support in \fBpcre2_dfa_match()\fP
+because there is no backtracking in DFA matching, and there is no support in
+JIT because JIT is all about maximizing matching performance. In both these
+cases the \fBcallout_flags\fP field is always zero.
.
.
.SH "RETURN VALUES FROM CALLOUTS"
@@ -411,6 +441,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 14 April 2017
+Last updated: 22 December 2017
Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2grep.txt b/doc/pcre2grep.txt
index 5067827..30517b4 100644
--- a/doc/pcre2grep.txt
+++ b/doc/pcre2grep.txt
@@ -103,157 +103,158 @@ DESCRIPTION
SUPPORT FOR COMPRESSED FILES
It is possible to compile pcre2grep so that it uses libz or libbz2 to
- read files whose names end in .gz or .bz2, respectively. You can find
- out whether your binary has support for one or both of these file types
- by running it with the --help option. If the appropriate support is not
- present, files are treated as plain text. The standard input is always
- so treated.
+ read compressed files whose names end in .gz or .bz2, respectively. You
+ can find out whether your pcre2grep binary has support for one or both
+ of these file types by running it with the --help option. If the appro-
+ priate support is not present, all files are treated as plain text. The
+ standard input is always so treated. When input is from a compressed
+ .gz or .bz2 file, the --line-buffered option is ignored.
BINARY FILES
- By default, a file that contains a binary zero byte within the first
- 1024 bytes is identified as a binary file, and is processed specially.
+ By default, a file that contains a binary zero byte within the first
+ 1024 bytes is identified as a binary file, and is processed specially.
(GNU grep identifies binary files in this manner.) However, if the new-
- line type is specified as "nul", that is, the line terminator is a
- binary zero, the test for a binary file is not applied. See the
- --binary-files option for a means of changing the way binary files are
+ line type is specified as "nul", that is, the line terminator is a
+ binary zero, the test for a binary file is not applied. See the
+ --binary-files option for a means of changing the way binary files are
handled.
OPTIONS
- The order in which some of the options appear can affect the output.
- For example, both the -h and -l options affect the printing of file
- names. Whichever comes later in the command line will be the one that
- takes effect. Similarly, except where noted below, if an option is
- given twice, the later setting is used. Numerical values for options
- may be followed by K or M, to signify multiplication by 1024 or
+ The order in which some of the options appear can affect the output.
+ For example, both the -H and -l options affect the printing of file
+ names. Whichever comes later in the command line will be the one that
+ takes effect. Similarly, except where noted below, if an option is
+ given twice, the later setting is used. Numerical values for options
+ may be followed by K or M, to signify multiplication by 1024 or
1024*1024 respectively.
-- This terminates the list of options. It is useful if the next
- item on the command line starts with a hyphen but is not an
- option. This allows for the processing of patterns and file
+ item on the command line starts with a hyphen but is not an
+ option. This allows for the processing of patterns and file
names that start with hyphens.
-A number, --after-context=number
- Output up to number lines of context after each matching
- line. Fewer lines are output if the next match or the end of
- the file is reached, or if the processing buffer size has
- been set too small. If file names and/or line numbers are
- being output, a hyphen separator is used instead of a colon
- for the context lines. A line containing "--" is output
+ Output up to number lines of context after each matching
+ line. Fewer lines are output if the next match or the end of
+ the file is reached, or if the processing buffer size has
+ been set too small. If file names and/or line numbers are
+ being output, a hyphen separator is used instead of a colon
+ for the context lines. A line containing "--" is output
between each group of lines, unless they are in fact contigu-
- ous in the input file. The value of number is expected to be
+ ous in the input file. The value of number is expected to be
relatively small. When -c is used, -A is ignored.
-a, --text
- Treat binary files as text. This is equivalent to --binary-
+ Treat binary files as text. This is equivalent to --binary-
files=text.
-B number, --before-context=number
- Output up to number lines of context before each matching
- line. Fewer lines are output if the previous match or the
- start of the file is within number lines, or if the process-
- ing buffer size has been set too small. If file names and/or
- line numbers are being output, a hyphen separator is used
- instead of a colon for the context lines. A line containing
- "--" is output between each group of lines, unless they are
- in fact contiguous in the input file. The value of number is
- expected to be relatively small. When -c is used, -B is
+ Output up to number lines of context before each matching
+ line. Fewer lines are output if the previous match or the
+ start of the file is within number lines, or if the process-
+ ing buffer size has been set too small. If file names and/or
+ line numbers are being output, a hyphen separator is used
+ instead of a colon for the context lines. A line containing
+ "--" is output between each group of lines, unless they are
+ in fact contiguous in the input file. The value of number is
+ expected to be relatively small. When -c is used, -B is
ignored.
--binary-files=word
- Specify how binary files are to be processed. If the word is
- "binary" (the default), pattern matching is performed on
- binary files, but the only output is "Binary file <name>
- matches" when a match succeeds. If the word is "text", which
- is equivalent to the -a or --text option, binary files are
- processed in the same way as any other file. In this case,
- when a match succeeds, the output may be binary garbage,
- which can have nasty effects if sent to a terminal. If the
- word is "without-match", which is equivalent to the -I
- option, binary files are not processed at all; they are
+ Specify how binary files are to be processed. If the word is
+ "binary" (the default), pattern matching is performed on
+ binary files, but the only output is "Binary file <name>
+ matches" when a match succeeds. If the word is "text", which
+ is equivalent to the -a or --text option, binary files are
+ processed in the same way as any other file. In this case,
+ when a match succeeds, the output may be binary garbage,
+ which can have nasty effects if sent to a terminal. If the
+ word is "without-match", which is equivalent to the -I
+ option, binary files are not processed at all; they are
assumed not to be of interest and are skipped without causing
any output or affecting the return code.
--buffer-size=number
- Set the parameter that controls how much memory is obtained
+ Set the parameter that controls how much memory is obtained
at the start of processing for buffering files that are being
scanned. See also --max-buffer-size below.
-C number, --context=number
- Output number lines of context both before and after each
- matching line. This is equivalent to setting both -A and -B
+ Output number lines of context both before and after each
+ matching line. This is equivalent to setting both -A and -B
to the same value.
-c, --count
- Do not output lines from the files that are being scanned;
- instead output the number of lines that would have been
+ Do not output lines from the files that are being scanned;
+ instead output the number of lines that would have been
shown, either because they matched, or, if -v is set, because
- they failed to match. By default, this count is exactly the
- same as the number of lines that would have been output, but
- if the -M (multiline) option is used (without -v), there may
- be more suppressed lines than the count (that is, the number
+ they failed to match. By default, this count is exactly the
+ same as the number of lines that would have been output, but
+ if the -M (multiline) option is used (without -v), there may
+ be more suppressed lines than the count (that is, the number
of matches).
- If no lines are selected, the number zero is output. If sev-
- eral files are being scanned, a count is output for each
- of them and the -t option can be used to cause a total to be
- output at the end. However, if the --files-with-matches
- option is also used, only those files whose counts are
- greater than zero are listed. When -c is used, the -A, -B,
+ If no lines are selected, the number zero is output. If sev-
+ eral files are being scanned, a count is output for each
+ of them and the -t option can be used to cause a total to be
+ output at the end. However, if the --files-with-matches
+ option is also used, only those files whose counts are
+ greater than zero are listed. When -c is used, the -A, -B,
and -C options are ignored.
--colour, --color
If this option is given without any data, it is equivalent to
- "--colour=auto". If data is required, it must be given in
+ "--colour=auto". If data is required, it must be given in
the same shell item, separated by an equals sign.
--colour=value, --color=value
This option specifies under what circumstances the parts of a
line that matched a pattern should be coloured in the output.
- By default, the output is not coloured. The value (which is
- optional, see above) may be "never", "always", or "auto". In
- the latter case, colouring happens only if the standard out-
- put is connected to a terminal. More resources are used when
+ By default, the output is not coloured. The value (which is
+ optional, see above) may be "never", "always", or "auto". In
+ the latter case, colouring happens only if the standard out-
+ put is connected to a terminal. More resources are used when
colouring is enabled, because pcre2grep has to search for all
- possible matches in a line, not just one, in order to colour
+ possible matches in a line, not just one, in order to colour
them all.
- The colour that is used can be specified by setting one of
- the environment variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR,
+ The colour that is used can be specified by setting one of
+ the environment variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR,
PCREGREP_COLOUR, or PCREGREP_COLOR, which are checked in that
order. If none of these are set, pcre2grep looks for
- GREP_COLORS or GREP_COLOR (in that order). The value of the
- variable should be a string of two numbers, separated by a
- semicolon, except in the case of GREP_COLORS, which must
+ GREP_COLORS or GREP_COLOR (in that order). The value of the
+ variable should be a string of two numbers, separated by a
+ semicolon, except in the case of GREP_COLORS, which must
start with "ms=" or "mt=" followed by two semicolon-separated
- colours, terminated by the end of the string or by a colon.
- If GREP_COLORS does not start with "ms=" or "mt=" it is
+ colours, terminated by the end of the string or by a colon.
+ If GREP_COLORS does not start with "ms=" or "mt=" it is
ignored, and GREP_COLOR is checked.
- If the string obtained from one of the above variables con-
+ If the string obtained from one of the above variables con-
tains any characters other than semicolon or digits, the set-
ting is ignored and the default colour is used. The string is
copied directly into the control string for setting colour on
- a terminal, so it is your responsibility to ensure that the
- values make sense. If no relevant environment variable is
+ a terminal, so it is your responsibility to ensure that the
+ values make sense. If no relevant environment variable is
set, the default is "1;31", which gives red.
-D action, --devices=action
- If an input path is not a regular file or a directory,
- "action" specifies how it is to be processed. Valid values
+ If an input path is not a regular file or a directory,
+ "action" specifies how it is to be processed. Valid values
are "read" (the default) or "skip" (silently skip the path).
-d action, --directories=action
If an input path is a directory, "action" specifies how it is
- to be processed. Valid values are "read" (the default in
- non-Windows environments, for compatibility with GNU grep),
- "recurse" (equivalent to the -r option), or "skip" (silently
- skip the path, the default in Windows environments). In the
- "read" case, directories are read as if they were ordinary
- files. In some operating systems the effect of reading a
+ to be processed. Valid values are "read" (the default in
+ non-Windows environments, for compatibility with GNU grep),
+ "recurse" (equivalent to the -r option), or "skip" (silently
+ skip the path, the default in Windows environments). In the
+ "read" case, directories are read as if they were ordinary
+ files. In some operating systems the effect of reading a
directory like this is an immediate end-of-file; in others it
may provoke an error.
@@ -263,180 +264,183 @@ OPTIONS
-e pattern, --regex=pattern, --regexp=pattern
Specify a pattern to be matched. This option can be used mul-
tiple times in order to specify several patterns. It can also
- be used as a way of specifying a single pattern that starts
- with a hyphen. When -e is used, no argument pattern is taken
- from the command line; all arguments are treated as file
- names. There is no limit to the number of patterns. They are
- applied to each line in the order in which they are defined
+ be used as a way of specifying a single pattern that starts
+ with a hyphen. When -e is used, no argument pattern is taken
+ from the command line; all arguments are treated as file
+ names. There is no limit to the number of patterns. They are
+ applied to each line in the order in which they are defined
until one matches.
- If -f is used with -e, the command line patterns are matched
+ If -f is used with -e, the command line patterns are matched
first, followed by the patterns from the file(s), independent
- of the order in which these options are specified. Note that
- multiple use of -e is not the same as a single pattern with
+ of the order in which these options are specified. Note that
+ multiple use of -e is not the same as a single pattern with
alternatives. For example, X|Y finds the first character in a
- line that is X or Y, whereas if the two patterns are given
+ line that is X or Y, whereas if the two patterns are given
separately, with X first, pcre2grep finds X if it is present,
even if it follows Y in the line. It finds Y only if there is
- no X in the line. This matters only if you are using -o or
+ no X in the line. This matters only if you are using -o or
--colo(u)r to show the part(s) of the line that matched.
--exclude=pattern
Files (but not directories) whose names match the pattern are
- skipped without being processed. This applies to all files,
- whether listed on the command line, obtained from --file-
+ skipped without being processed. This applies to all files,
+ whether listed on the command line, obtained from --file-
list, or by scanning a directory. The pattern is a PCRE2 reg-
- ular expression, and is matched against the final component
- of the file name, not the entire path. The -F, -w, and -x
+ ular expression, and is matched against the final component
+ of the file name, not the entire path. The -F, -w, and -x
options do not apply to this pattern. The option may be given
any number of times in order to specify multiple patterns. If
- a file name matches both an --include and an --exclude pat-
+ a file name matches both an --include and an --exclude pat-
tern, it is excluded. There is no short form for this option.
--exclude-from=filename
- Treat each non-empty line of the file as the data for an
+ Treat each non-empty line of the file as the data for an
--exclude option. What constitutes a newline when reading the
- file is the operating system's default. The --newline option
- has no effect on this option. This option may be given more
+ file is the operating system's default. The --newline option
+ has no effect on this option. This option may be given more
than once in order to specify a number of files to read.
--exclude-dir=pattern
Directories whose names match the pattern are skipped without
- being processed, whatever the setting of the --recursive
- option. This applies to all directories, whether listed on
+ being processed, whatever the setting of the --recursive
+ option. This applies to all directories, whether listed on
the command line, obtained from --file-list, or by scanning a
- parent directory. The pattern is a PCRE2 regular expression,
- and is matched against the final component of the directory
- name, not the entire path. The -F, -w, and -x options do not
- apply to this pattern. The option may be given any number of
- times in order to specify more than one pattern. If a direc-
- tory matches both --include-dir and --exclude-dir, it is
+ parent directory. The pattern is a PCRE2 regular expression,
+ and is matched against the final component of the directory
+ name, not the entire path. The -F, -w, and -x options do not
+ apply to this pattern. The option may be given any number of
+ times in order to specify more than one pattern. If a direc-
+ tory matches both --include-dir and --exclude-dir, it is
excluded. There is no short form for this option.
-F, --fixed-strings
- Interpret each data-matching pattern as a list of fixed
- strings, separated by newlines, instead of as a regular
- expression. What constitutes a newline for this purpose is
- controlled by the --newline option. The -w (match as a word)
- and -x (match whole line) options can be used with -F. They
+ Interpret each data-matching pattern as a list of fixed
+ strings, separated by newlines, instead of as a regular
+ expression. What constitutes a newline for this purpose is
+ controlled by the --newline option. The -w (match as a word)
+ and -x (match whole line) options can be used with -F. They
apply to each of the fixed strings. A line is selected if any
of the fixed strings are found in it (subject to -w or -x, if
- present). This option applies only to the patterns that are
- matched against the contents of files; it does not apply to
- patterns specified by any of the --include or --exclude
+ present). This option applies only to the patterns that are
+ matched against the contents of files; it does not apply to
+ patterns specified by any of the --include or --exclude
options.
-f filename, --file=filename
- Read patterns from the file, one per line, and match them
- against each line of input. What constitutes a newline when
- reading the file is the operating system's default. The
- --newline option has no effect on this option. Trailing
- white space is removed from each line, and blank lines are
- ignored. An empty file contains no patterns and therefore
- matches nothing. See also the comments about multiple pat-
- terns versus a single pattern with alternatives in the
+ Read patterns from the file, one per line, and match them
+ against each line of input. What constitutes a newline when
+ reading the file is the operating system's default. The
+ --newline option has no effect on this option. Trailing
+ white space is removed from each line, and blank lines are
+ ignored. An empty file contains no patterns and therefore
+ matches nothing. See also the comments about multiple pat-
+ terns versus a single pattern with alternatives in the
description of -e above.
- If this option is given more than once, all the specified
- files are read. A data line is output if any of the patterns
- match it. A file name can be given as "-" to refer to the
- standard input. When -f is used, patterns specified on the
- command line using -e may also be present; they are tested
- before the file's patterns. However, no other pattern is
+ If this option is given more than once, all the specified
+ files are read. A data line is output if any of the patterns
+ match it. A file name can be given as "-" to refer to the
+ standard input. When -f is used, patterns specified on the
+ command line using -e may also be present; they are tested
+ before the file's patterns. However, no other pattern is
taken from the command line; all arguments are treated as the
names of paths to be searched.
--file-list=filename
- Read a list of files and/or directories that are to be
- scanned from the given file, one per line. Trailing white
+ Read a list of files and/or directories that are to be
+ scanned from the given file, one per line. Trailing white
space is removed from each line, and blank lines are ignored.
- These paths are processed before any that are listed on the
- command line. The file name can be given as "-" to refer to
+ These paths are processed before any that are listed on the
+ command line. The file name can be given as "-" to refer to
the standard input. If --file and --file-list are both spec-
- ified as "-", patterns are read first. This is useful only
- when the standard input is a terminal, from which further
- lines (the list of files) can be read after an end-of-file
- indication. If this option is given more than once, all the
+ ified as "-", patterns are read first. This is useful only
+ when the standard input is a terminal, from which further
+ lines (the list of files) can be read after an end-of-file
+ indication. If this option is given more than once, all the
specified files are read.
--file-offsets
- Instead of showing lines or parts of lines that match, show
- each match as an offset from the start of the file and a
- length, separated by a comma. In this mode, no context is
- shown. That is, the -A, -B, and -C options are ignored. If
+ Instead of showing lines or parts of lines that match, show
+ each match as an offset from the start of the file and a
+ length, separated by a comma. In this mode, no context is
+ shown. That is, the -A, -B, and -C options are ignored. If
there is more than one match in a line, each of them is shown
- separately. This option is mutually exclusive with --output,
+ separately. This option is mutually exclusive with --output,
--line-offsets, and --only-matching.
-H, --with-filename
- Force the inclusion of the file name at the start of output
+ Force the inclusion of the file name at the start of output
lines when searching a single file. By default, the file name
is not shown in this case. For matching lines, the file name
is followed by a colon; for context lines, a hyphen separator
- is used. If a line number is also being output, it follows
- the file name. When the -M option causes a pattern to match
- more than one line, only the first is preceded by the file
- name.
+ is used. If a line number is also being output, it follows
+ the file name. When the -M option causes a pattern to match
+ more than one line, only the first is preceded by the file
+ name. This option overrides any previous -h, -l, or -L
+ options.
-h, --no-filename
Suppress the output file names when searching multiple files.
By default, file names are shown when multiple files are
searched. For matching lines, the file name is followed by a
colon; for context lines, a hyphen separator is used. If a
- line number is also being output, it follows the file name.
+ line number is also being output, it follows the file name.
+ This option overrides any previous -H, -L, or -l options.
--heap-limit=number
See --match-limit below.
- --help Output a help message, giving brief details of the command
- options and file type support, and then exit. Anything else
+ --help Output a help message, giving brief details of the command
+ options and file type support, and then exit. Anything else
on the command line is ignored.
- -I Ignore binary files. This is equivalent to --binary-
+ -I Ignore binary files. This is equivalent to --binary-
files=without-match.
-i, --ignore-case
Ignore upper/lower case distinctions during comparisons.
--include=pattern
- If any --include patterns are specified, the only files that
- are processed are those that match one of the patterns (and
- do not match an --exclude pattern). This option does not
- affect directories, but it applies to all files, whether
- listed on the command line, obtained from --file-list, or by
- scanning a directory. The pattern is a PCRE2 regular expres-
- sion, and is matched against the final component of the file
- name, not the entire path. The -F, -w, and -x options do not
- apply to this pattern. The option may be given any number of
- times. If a file name matches both an --include and an
- --exclude pattern, it is excluded. There is no short form
+ If any --include patterns are specified, the only files that
+ are processed are those that match one of the patterns (and
+ do not match an --exclude pattern). This option does not
+ affect directories, but it applies to all files, whether
+ listed on the command line, obtained from --file-list, or by
+ scanning a directory. The pattern is a PCRE2 regular expres-
+ sion, and is matched against the final component of the file
+ name, not the entire path. The -F, -w, and -x options do not
+ apply to this pattern. The option may be given any number of
+ times. If a file name matches both an --include and an
+ --exclude pattern, it is excluded. There is no short form
for this option.
--include-from=filename
- Treat each non-empty line of the file as the data for an
+ Treat each non-empty line of the file as the data for an
--include option. What constitutes a newline for this purpose
- is the operating system's default. The --newline option has
+ is the operating system's default. The --newline option has
no effect on this option. This option may be given any number
of times; all the files are read.
--include-dir=pattern
- If any --include-dir patterns are specified, the only direc-
- tories that are processed are those that match one of the
- patterns (and do not match an --exclude-dir pattern). This
- applies to all directories, whether listed on the command
- line, obtained from --file-list, or by scanning a parent
- directory. The pattern is a PCRE2 regular expression, and is
- matched against the final component of the directory name,
- not the entire path. The -F, -w, and -x options do not apply
+ If any --include-dir patterns are specified, the only direc-
+ tories that are processed are those that match one of the
+ patterns (and do not match an --exclude-dir pattern). This
+ applies to all directories, whether listed on the command
+ line, obtained from --file-list, or by scanning a parent
+ directory. The pattern is a PCRE2 regular expression, and is
+ matched against the final component of the directory name,
+ not the entire path. The -F, -w, and -x options do not apply
to this pattern. The option may be given any number of times.
- If a directory matches both --include-dir and --exclude-dir,
+ If a directory matches both --include-dir and --exclude-dir,
it is excluded. There is no short form for this option.
-L, --files-without-match
- Instead of outputting lines from the files, just output the
- names of the files that do not contain any lines that would
- have been output. Each file name is output once, on a sepa-
- rate line.
+ Instead of outputting lines from the files, just output the
+ names of the files that do not contain any lines that would
+ have been output. Each file name is output once, on a sepa-
+ rate line. This option overrides any previous -H, -h, or -l
+ options.
-l, --files-with-matches
Instead of outputting lines from the files, just output the
@@ -447,7 +451,8 @@ OPTIONS
matching continues in order to obtain the correct count, and
those files that have at least one match are listed along
with their counts. Using this option with -c is a way of sup-
- pressing the listing of files with no matches.
+ pressing the listing of files with no matches. This option
+ overrides any previous -H, -h, or -L options.
--label=name
This option supplies a name to be used for the standard input
@@ -455,293 +460,295 @@ OPTIONS
input)" is used. There is no short form for this option.
--line-buffered
- When this option is given, input is read and processed line
- by line, and the output is flushed after each write. By
- default, input is read in large chunks, unless pcre2grep can
- determine that it is reading from a terminal (which is cur-
- rently possible only in Unix-like environments). Output to
- terminal is normally automatically flushed by the operating
- system. This option can be useful when the input or output is
- attached to a pipe and you do not want pcre2grep to buffer up
- large amounts of data. However, its use will affect perfor-
- mance, and the -M (multiline) option ceases to work.
+ When this option is given, non-compressed input is read and
+ processed line by line, and the output is flushed after each
+ write. By default, input is read in large chunks, unless
+ pcre2grep can determine that it is reading from a terminal
+ (which is currently possible only in Unix-like environments).
+ Output to terminal is normally automatically flushed by the
+ operating system. This option can be useful when the input or
+ output is attached to a pipe and you do not want pcre2grep to
+ buffer up large amounts of data. However, its use will affect
+ performance, and the -M (multiline) option ceases to work.
+ When input is from a compressed .gz or .bz2 file, --line-
+ buffered is ignored.
--line-offsets
- Instead of showing lines or parts of lines that match, show
+ Instead of showing lines or parts of lines that match, show
each match as a line number, the offset from the start of the
- line, and a length. The line number is terminated by a colon
- (as usual; see the -n option), and the offset and length are
- separated by a comma. In this mode, no context is shown.
- That is, the -A, -B, and -C options are ignored. If there is
- more than one match in a line, each of them is shown sepa-
- rately. This option is mutually exclusive with --output,
+ line, and a length. The line number is terminated by a colon
+ (as usual; see the -n option), and the offset and length are
+ separated by a comma. In this mode, no context is shown.
+ That is, the -A, -B, and -C options are ignored. If there is
+ more than one match in a line, each of them is shown sepa-
+ rately. This option is mutually exclusive with --output,
--file-offsets, and --only-matching.
--locale=locale-name
- This option specifies a locale to be used for pattern match-
- ing. It overrides the value in the LC_ALL or LC_CTYPE envi-
- ronment variables. If no locale is specified, the PCRE2
- library's default (usually the "C" locale) is used. There is
+ This option specifies a locale to be used for pattern match-
+ ing. It overrides the value in the LC_ALL or LC_CTYPE envi-
+ ronment variables. If no locale is specified, the PCRE2
+ library's default (usually the "C" locale) is used. There is
no short form for this option.
--match-limit=number
- Processing some regular expression patterns may take a very
+ Processing some regular expression patterns may take a very
long time to search for all possible matching strings. Others
- may require a very large amount of memory. There are three
+ may require a very large amount of memory. There are three
options that set resource limits for matching.
The --match-limit option provides a means of limiting comput-
- ing resource usage when processing patterns that are not
- going to match, but which have a very large number of possi-
+ ing resource usage when processing patterns that are not
+ going to match, but which have a very large number of possi-
bilities in their search trees. The classic example is a pat-
- tern that uses nested unlimited repeats. Internally, PCRE2
- has a counter that is incremented each time around its main
+ tern that uses nested unlimited repeats. Internally, PCRE2
+ has a counter that is incremented each time around its main
processing loop. If the value set by --match-limit is
reached, an error occurs.
- The --heap-limit option specifies, as a number of kilobytes,
+ The --heap-limit option specifies, as a number of kilobytes,
the amount of heap memory that may be used for matching. Heap
memory is needed only if matching the pattern requires a sig-
- nificant number of nested backtracking points to be remem-
+ nificant number of nested backtracking points to be remem-
bered. This parameter can be set to zero to forbid the use of
heap memory altogether.
- The --depth-limit option limits the depth of nested back-
+ The --depth-limit option limits the depth of nested back-
tracking points, which indirectly limits the amount of memory
that is used. The amount of memory needed for each backtrack-
- ing point depends on the number of capturing parentheses in
+ ing point depends on the number of capturing parentheses in
the pattern, so the amount of memory that is used before this
- limit acts varies from pattern to pattern. This limit is of
+ limit acts varies from pattern to pattern. This limit is of
use only if it is set smaller than --match-limit.
- There are no short forms for these options. The default set-
- tings are specified when the PCRE2 library is compiled, with
- the default defaults being very large and so effectively
+ There are no short forms for these options. The default set-
+ tings are specified when the PCRE2 library is compiled, with
+ the default defaults being very large and so effectively
unlimited.
--max-buffer-size=number
- This limits the expansion of the processing buffer, whose
- initial size can be set by --buffer-size. The maximum buffer
- size is silently forced to be no smaller than the starting
+ This limits the expansion of the processing buffer, whose
+ initial size can be set by --buffer-size. The maximum buffer
+ size is silently forced to be no smaller than the starting
buffer size.
-M, --multiline
- Allow patterns to match more than one line. When this option
+ Allow patterns to match more than one line. When this option
is set, the PCRE2 library is called in "multiline" mode. This
- allows a matched string to extend past the end of a line and
- continue on one or more subsequent lines. Patterns used with
+ allows a matched string to extend past the end of a line and
+ continue on one or more subsequent lines. Patterns used with
-M may usefully contain literal newline characters and inter-
- nal occurrences of ^ and $ characters. The output for a suc-
- cessful match may consist of more than one line. The first
- line is the line in which the match started, and the last
- line is the line in which the match ended. If the matched
- string ends with a newline sequence, the output ends at the
- end of that line. If -v is set, none of the lines in a
- multi-line match are output. Once a match has been handled,
- scanning restarts at the beginning of the line after the one
+ nal occurrences of ^ and $ characters. The output for a suc-
+ cessful match may consist of more than one line. The first
+ line is the line in which the match started, and the last
+ line is the line in which the match ended. If the matched
+ string ends with a newline sequence, the output ends at the
+ end of that line. If -v is set, none of the lines in a
+ multi-line match are output. Once a match has been handled,
+ scanning restarts at the beginning of the line after the one
in which the match ended.
- The newline sequence that separates multiple lines must be
- matched as part of the pattern. For example, to find the
- phrase "regular expression" in a file where "regular" might
- be at the end of a line and "expression" at the start of the
+ The newline sequence that separates multiple lines must be
+ matched as part of the pattern. For example, to find the
+ phrase "regular expression" in a file where "regular" might
+ be at the end of a line and "expression" at the start of the
next line, you could use this command:
pcre2grep -M 'regular\s+expression' <file>
- The \s escape sequence matches any white space character,
- including newlines, and is followed by + so as to match
- trailing white space on the first line as well as possibly
+ The \s escape sequence matches any white space character,
+ including newlines, and is followed by + so as to match
+ trailing white space on the first line as well as possibly
handling a two-character newline sequence.
- There is a limit to the number of lines that can be matched,
- imposed by the way that pcre2grep buffers the input file as
- it scans it. With a sufficiently large processing buffer,
+ There is a limit to the number of lines that can be matched,
+ imposed by the way that pcre2grep buffers the input file as
+ it scans it. With a sufficiently large processing buffer,
this should not be a problem, but the -M option does not work
when input is read line by line (see --line-buffered.)
-N newline-type, --newline=newline-type
- The PCRE2 library supports five different conventions for
- indicating the ends of lines. They are the single-character
- sequences CR (carriage return) and LF (linefeed), the two-
- character sequence CRLF, an "anycrlf" convention, which rec-
- ognizes any of the preceding three types, and an "any" con-
+ The PCRE2 library supports five different conventions for
+ indicating the ends of lines. They are the single-character
+ sequences CR (carriage return) and LF (linefeed), the two-
+ character sequence CRLF, an "anycrlf" convention, which rec-
+ ognizes any of the preceding three types, and an "any" con-
vention, in which any Unicode line ending sequence is assumed
- to end a line. The Unicode sequences are the three just men-
- tioned, plus VT (vertical tab, U+000B), FF (form feed,
- U+000C), NEL (next line, U+0085), LS (line separator,
+ to end a line. The Unicode sequences are the three just men-
+ tioned, plus VT (vertical tab, U+000B), FF (form feed,
+ U+000C), NEL (next line, U+0085), LS (line separator,
U+2028), and PS (paragraph separator, U+2029).
- When the PCRE2 library is built, a default line-ending
- sequence is specified. This is normally the standard
+ When the PCRE2 library is built, a default line-ending
+ sequence is specified. This is normally the standard
sequence for the operating system. Unless otherwise specified
- by this option, pcre2grep uses the library's default. The
+ by this option, pcre2grep uses the library's default. The
possible values for this option are CR, LF, CRLF, ANYCRLF, or
- ANY. This makes it possible to use pcre2grep to scan files
+ ANY. This makes it possible to use pcre2grep to scan files
that have come from other environments without having to mod-
- ify their line endings. If the data that is being scanned
- does not agree with the convention set by this option,
- pcre2grep may behave in strange ways. Note that this option
- does not apply to files specified by the -f, --exclude-from,
- or --include-from options, which are expected to use the
+ ify their line endings. If the data that is being scanned
+ does not agree with the convention set by this option,
+ pcre2grep may behave in strange ways. Note that this option
+ does not apply to files specified by the -f, --exclude-from,
+ or --include-from options, which are expected to use the
operating system's standard newline sequence.
-n, --line-number
Precede each output line by its line number in the file, fol-
- lowed by a colon for matching lines or a hyphen for context
+ lowed by a colon for matching lines or a hyphen for context
lines. If the file name is also being output, it precedes the
- line number. When the -M option causes a pattern to match
- more than one line, only the first is preceded by its line
+ line number. When the -M option causes a pattern to match
+ more than one line, only the first is preceded by its line
number. This option is forced if --line-offsets is used.
- --no-jit If the PCRE2 library is built with support for just-in-time
+ --no-jit If the PCRE2 library is built with support for just-in-time
compiling (which speeds up matching), pcre2grep automatically
makes use of this, unless it was explicitly disabled at build
- time. This option can be used to disable the use of JIT at
- run time. It is provided for testing and working round prob-
+ time. This option can be used to disable the use of JIT at
+ run time. It is provided for testing and working round prob-
lems. It should never be needed in normal use.
-O text, --output=text
- When there is a match, instead of outputting the whole line
- that matched, output just the given text. This option is
- mutually exclusive with --only-matching, --file-offsets, and
+ When there is a match, instead of outputting the whole line
+ that matched, output just the given text. This option is
+ mutually exclusive with --only-matching, --file-offsets, and
--line-offsets. Escape sequences starting with a dollar char-
- acter may be used to insert the contents of the matched part
+ acter may be used to insert the contents of the matched part
of the line and/or captured substrings into the text.
- $<digits> or ${<digits>} is replaced by the captured sub-
- string of the given decimal number; zero substitutes the
+ $<digits> or ${<digits>} is replaced by the captured sub-
+ string of the given decimal number; zero substitutes the
whole match. If the number is greater than the number of cap-
- turing substrings, or if the capture is unset, the replace-
+ turing substrings, or if the capture is unset, the replace-
ment is empty.
- $a is replaced by bell; $b by backspace; $e by escape; $f by
- form feed; $n by newline; $r by carriage return; $t by tab;
+ $a is replaced by bell; $b by backspace; $e by escape; $f by
+ form feed; $n by newline; $r by carriage return; $t by tab;
$v by vertical tab.
- $o<digits> is replaced by the character represented by the
+ $o<digits> is replaced by the character represented by the
given octal number; up to three digits are processed.
- $x<digits> is replaced by the character represented by the
+ $x<digits> is replaced by the character represented by the
given hexadecimal number; up to two digits are processed.
- Any other character is substituted by itself. In particular,
+ Any other character is substituted by itself. In particular,
$$ is replaced by a single dollar.
-o, --only-matching
Show only the part of the line that matched a pattern instead
- of the whole line. In this mode, no context is shown. That
- is, the -A, -B, and -C options are ignored. If there is more
- than one match in a line, each of them is shown separately,
- on a separate line of output. If -o is combined with -v
- (invert the sense of the match to find non-matching lines),
- no output is generated, but the return code is set appropri-
- ately. If the matched portion of the line is empty, nothing
- is output unless the file name or line number are being
- printed, in which case they are shown on an otherwise empty
+ of the whole line. In this mode, no context is shown. That
+ is, the -A, -B, and -C options are ignored. If there is more
+ than one match in a line, each of them is shown separately,
+ on a separate line of output. If -o is combined with -v
+ (invert the sense of the match to find non-matching lines),
+ no output is generated, but the return code is set appropri-
+ ately. If the matched portion of the line is empty, nothing
+ is output unless the file name or line number are being
+ printed, in which case they are shown on an otherwise empty
line. This option is mutually exclusive with --output,
--file-offsets and --line-offsets.
-onumber, --only-matching=number
- Show only the part of the line that matched the capturing
+ Show only the part of the line that matched the capturing
parentheses of the given number. Up to 32 capturing parenthe-
ses are supported, and -o0 is equivalent to -o without a num-
- ber. Because these options can be given without an argument
- (see above), if an argument is present, it must be given in
- the same shell item, for example, -o3 or --only-matching=2.
+ ber. Because these options can be given without an argument
+ (see above), if an argument is present, it must be given in
+ the same shell item, for example, -o3 or --only-matching=2.
The comments given for the non-argument case above also apply
to this option. If the specified capturing parentheses do not
- exist in the pattern, or were not set in the match, nothing
- is output unless the file name or line number are being out-
+ exist in the pattern, or were not set in the match, nothing
+ is output unless the file name or line number are being out-
put.
- If this option is given multiple times, multiple substrings
- are output for each match, in the order the options are
- given, and all on one line. For example, -o3 -o1 -o3 causes
- the substrings matched by capturing parentheses 3 and 1 and
- then 3 again to be output. By default, there is no separator
+ If this option is given multiple times, multiple substrings
+ are output for each match, in the order the options are
+ given, and all on one line. For example, -o3 -o1 -o3 causes
+ the substrings matched by capturing parentheses 3 and 1 and
+ then 3 again to be output. By default, there is no separator
(but see the next option).
--om-separator=text
- Specify a separating string for multiple occurrences of -o.
- The default is an empty string. Separating strings are never
+ Specify a separating string for multiple occurrences of -o.
+ The default is an empty string. Separating strings are never
coloured.
-q, --quiet
Work quietly, that is, display nothing except error messages.
- The exit status indicates whether or not any matches were
+ The exit status indicates whether or not any matches were
found.
-r, --recursive
- If any given path is a directory, recursively scan the files
- it contains, taking note of any --include and --exclude set-
- tings. By default, a directory is read as a normal file; in
- some operating systems this gives an immediate end-of-file.
- This option is a shorthand for setting the -d option to
+ If any given path is a directory, recursively scan the files
+ it contains, taking note of any --include and --exclude set-
+ tings. By default, a directory is read as a normal file; in
+ some operating systems this gives an immediate end-of-file.
+ This option is a shorthand for setting the -d option to
"recurse".
--recursion-limit=number
See --match-limit above.
-s, --no-messages
- Suppress error messages about non-existent or unreadable
- files. Such files are quietly skipped. However, the return
+ Suppress error messages about non-existent or unreadable
+ files. Such files are quietly skipped. However, the return
code is still 2, even if matches were found in other files.
-t, --total-count
- This option is useful when scanning more than one file. If
- used on its own, -t suppresses all output except for a grand
- total number of matching lines (or non-matching lines if -v
- is used) in all the files. If -t is used with -c, a grand
- total is output except when the previous output is just one
- line. In other words, it is not output when just one file's
- count is listed. If file names are being output, the grand
- total is preceded by "TOTAL:". Otherwise, it appears as just
- another number. The -t option is ignored when used with -L
- (list files without matches), because the grand total would
+ This option is useful when scanning more than one file. If
+ used on its own, -t suppresses all output except for a grand
+ total number of matching lines (or non-matching lines if -v
+ is used) in all the files. If -t is used with -c, a grand
+ total is output except when the previous output is just one
+ line. In other words, it is not output when just one file's
+ count is listed. If file names are being output, the grand
+ total is preceded by "TOTAL:". Otherwise, it appears as just
+ another number. The -t option is ignored when used with -L
+ (list files without matches), because the grand total would
always be zero.
-u, --utf-8
Operate in UTF-8 mode. This option is available only if PCRE2
has been compiled with UTF-8 support. All patterns (including
- those for any --exclude and --include options) and all sub-
- ject lines that are scanned must be valid strings of UTF-8
+ those for any --exclude and --include options) and all sub-
+ ject lines that are scanned must be valid strings of UTF-8
characters.
-V, --version
- Write the version numbers of pcre2grep and the PCRE2 library
- to the standard output and then exit. Anything else on the
+ Write the version numbers of pcre2grep and the PCRE2 library
+ to the standard output and then exit. Anything else on the
command line is ignored.
-v, --invert-match
- Invert the sense of the match, so that lines which do not
+ Invert the sense of the match, so that lines which do not
match any of the patterns are the ones that are found.
-w, --word-regex, --word-regexp
Force the patterns only to match "words". That is, there must
- be a word boundary at the start and end of each matched
- string. This is equivalent to having "\b(?:" at the start of
- each pattern, and ")\b" at the end. This option applies only
- to the patterns that are matched against the contents of
- files; it does not apply to patterns specified by any of the
+ be a word boundary at the start and end of each matched
+ string. This is equivalent to having "\b(?:" at the start of
+ each pattern, and ")\b" at the end. This option applies only
+ to the patterns that are matched against the contents of
+ files; it does not apply to patterns specified by any of the
--include or --exclude options.
-x, --line-regex, --line-regexp
- Force the patterns to start matching only at the beginnings
- of lines, and in addition, require them to match entire
+ Force the patterns to start matching only at the beginnings
+ of lines, and in addition, require them to match entire
lines. In multiline mode the match may be more than one line.
This is equivalent to having "^(?:" at the start of each pat-
- tern and ")$" at the end. This option applies only to the
- patterns that are matched against the contents of files; it
- does not apply to patterns specified by any of the --include
+ tern and ")$" at the end. This option applies only to the
+ patterns that are matched against the contents of files; it
+ does not apply to patterns specified by any of the --include
or --exclude options.
ENVIRONMENT VARIABLES
- The environment variables LC_ALL and LC_CTYPE are examined, in that
- order, for a locale. The first one that is set is used. This can be
- overridden by the --locale option. If no locale is set, the PCRE2
+ The environment variables LC_ALL and LC_CTYPE are examined, in that
+ order, for a locale. The first one that is set is used. This can be
+ overridden by the --locale option. If no locale is set, the PCRE2
library's default (usually the "C" locale) is used.
@@ -749,99 +756,99 @@ NEWLINES
The -N (--newline) option allows pcre2grep to scan files with different
newline conventions from the default. Any parts of the input files that
- are written to the standard output are copied identically, with what-
- ever newline sequences they have in the input. However, the setting of
- this option does not affect the interpretation of files specified by
+ are written to the standard output are copied identically, with what-
+ ever newline sequences they have in the input. However, the setting of
+ this option does not affect the interpretation of files specified by
the -f, --exclude-from, or --include-from options, which are assumed to
- use the operating system's standard newline sequence, nor does it
- affect the way in which pcre2grep writes informational messages to the
+ use the operating system's standard newline sequence, nor does it
+ affect the way in which pcre2grep writes informational messages to the
standard error and output streams. For these it uses the string "\n" to
- indicate newlines, relying on the C I/O library to convert this to an
+ indicate newlines, relying on the C I/O library to convert this to an
appropriate sequence.
OPTIONS COMPATIBILITY
Many of the short and long forms of pcre2grep's options are the same as
- in the GNU grep program. Any long option of the form --xxx-regexp (GNU
+ in the GNU grep program. Any long option of the form --xxx-regexp (GNU
terminology) is also available as --xxx-regex (PCRE2 terminology). How-
- ever, the --depth-limit, --file-list, --file-offsets, --heap-limit,
- --include-dir, --line-offsets, --locale, --match-limit, -M, --multi-
- line, -N, --newline, --om-separator, --output, -u, and --utf-8 options
- are specific to pcre2grep, as is the use of the --only-matching option
+ ever, the --depth-limit, --file-list, --file-offsets, --heap-limit,
+ --include-dir, --line-offsets, --locale, --match-limit, -M, --multi-
+ line, -N, --newline, --om-separator, --output, -u, and --utf-8 options
+ are specific to pcre2grep, as is the use of the --only-matching option
with a capturing parentheses number.
- Although most of the common options work the same way, a few are dif-
- ferent in pcre2grep. For example, the --include option's argument is a
- glob for GNU grep, but a regular expression for pcre2grep. If both the
- -c and -l options are given, GNU grep lists only file names, without
+ Although most of the common options work the same way, a few are dif-
+ ferent in pcre2grep. For example, the --include option's argument is a
+ glob for GNU grep, but a regular expression for pcre2grep. If both the
+ -c and -l options are given, GNU grep lists only file names, without
counts, but pcre2grep gives the counts as well.
OPTIONS WITH DATA
There are four different ways in which an option with data can be spec-
- ified. If a short form option is used, the data may follow immedi-
+ ified. If a short form option is used, the data may follow immedi-
ately, or (with one exception) in the next command line item. For exam-
ple:
-f/some/file
-f /some/file
- The exception is the -o option, which may appear with or without data.
- Because of this, if data is present, it must follow immediately in the
+ The exception is the -o option, which may appear with or without data.
+ Because of this, if data is present, it must follow immediately in the
same item, for example -o3.
- If a long form option is used, the data may appear in the same command
- line item, separated by an equals character, or (with two exceptions)
+ If a long form option is used, the data may appear in the same command
+ line item, separated by an equals character, or (with two exceptions)
it may appear in the next command line item. For example:
--file=/some/file
--file /some/file
- Note, however, that if you want to supply a file name beginning with ~
- as data in a shell command, and have the shell expand ~ to a home
+ Note, however, that if you want to supply a file name beginning with ~
+ as data in a shell command, and have the shell expand ~ to a home
directory, you must separate the file name from the option, because the
shell does not treat ~ specially unless it is at the start of an item.
- The exceptions to the above are the --colour (or --color) and --only-
- matching options, for which the data is optional. If one of these
- options does have data, it must be given in the first form, using an
+ The exceptions to the above are the --colour (or --color) and --only-
+ matching options, for which the data is optional. If one of these
+ options does have data, it must be given in the first form, using an
equals character. Otherwise pcre2grep will assume that it has no data.
USING PCRE2'S CALLOUT FACILITY
- pcre2grep has, by default, support for calling external programs or
- scripts or echoing specific strings during matching by making use of
- PCRE2's callout facility. However, this support can be disabled when
- pcre2grep is built. You can find out whether your binary has support
- for callouts by running it with the --help option. If the support is
+ pcre2grep has, by default, support for calling external programs or
+ scripts or echoing specific strings during matching by making use of
+ PCRE2's callout facility. However, this support can be disabled when
+ pcre2grep is built. You can find out whether your binary has support
+ for callouts by running it with the --help option. If the support is
not enabled, all callouts in patterns are ignored by pcre2grep.
- A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu-
- ment is either a number or a quoted string (see the pcre2callout docu-
- mentation for details). Numbered callouts are ignored by pcre2grep;
+ A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu-
+ ment is either a number or a quoted string (see the pcre2callout docu-
+ mentation for details). Numbered callouts are ignored by pcre2grep;
only callouts with string arguments are useful.
Calling external programs or scripts
If the callout string does not start with a pipe (vertical bar) charac-
- ter, it is parsed into a list of substrings separated by pipe charac-
- ters. The first substring must be an executable name, with the follow-
+ ter, it is parsed into a list of substrings separated by pipe charac-
+ ters. The first substring must be an executable name, with the follow-
ing substrings specifying arguments:
executable_name|arg1|arg2|...
- Any substring (including the executable name) may contain escape
- sequences started by a dollar character: $<digits> or ${<digits>} is
- replaced by the captured substring of the given decimal number, which
- must be greater than zero. If the number is greater than the number of
- capturing substrings, or if the capture is unset, the replacement is
+ Any substring (including the executable name) may contain escape
+ sequences started by a dollar character: $<digits> or ${<digits>} is
+ replaced by the captured substring of the given decimal number, which
+ must be greater than zero. If the number is greater than the number of
+ capturing substrings, or if the capture is unset, the replacement is
empty.
- Any other character is substituted by itself. In particular, $$ is
- replaced by a single dollar and $| is replaced by a pipe character.
+ Any other character is substituted by itself. In particular, $$ is
+ replaced by a single dollar and $| is replaced by a pipe character.
Here is an example:
echo -e "abcde\n12345" | pcre2grep \
@@ -857,54 +864,54 @@ USING PCRE2'S CALLOUT FACILITY
The parameters for the execv() system call that is used to run the pro-
gram or script are zero-terminated strings. This means that binary zero
- characters in the callout argument will cause premature termination of
- their substrings, and therefore should not be present. Any syntax
- errors in the string (for example, a dollar not followed by another
- character) cause the callout to be ignored. If running the program
+ characters in the callout argument will cause premature termination of
+ their substrings, and therefore should not be present. Any syntax
+ errors in the string (for example, a dollar not followed by another
+ character) cause the callout to be ignored. If running the program
fails for any reason (including the non-existence of the executable), a
- local matching failure occurs and the matcher backtracks in the normal
+ local matching failure occurs and the matcher backtracks in the normal
way.
Echoing a specific string
- If the callout string starts with a pipe (vertical bar) character, the
+ If the callout string starts with a pipe (vertical bar) character, the
rest of the string is written to the output, having been passed through
- the same escape processing as text from the --output option. This pro-
+ the same escape processing as text from the --output option. This pro-
vides a simple echoing facility that avoids calling an external program
- or script. No terminator is added to the string, so if you want a new-
- line, you must include it explicitly. Matching continues normally
- after the string is output. If you want to see only the callout output
- but not any output from an actual match, you should end the relevant
+ or script. No terminator is added to the string, so if you want a new-
+ line, you must include it explicitly. Matching continues normally
+ after the string is output. If you want to see only the callout output
+ but not any output from an actual match, you should end the relevant
pattern with (*FAIL).
MATCHING ERRORS
- It is possible to supply a regular expression that takes a very long
- time to fail to match certain lines. Such patterns normally involve
- nested indefinite repeats, for example: (a+)*\d when matched against a
- line of a's with no final digit. The PCRE2 matching function has a
- resource limit that causes it to abort in these circumstances. If this
- happens, pcre2grep outputs an error message and the line that caused
- the problem to the standard error stream. If there are more than 20
+ It is possible to supply a regular expression that takes a very long
+ time to fail to match certain lines. Such patterns normally involve
+ nested indefinite repeats, for example: (a+)*\d when matched against a
+ line of a's with no final digit. The PCRE2 matching function has a
+ resource limit that causes it to abort in these circumstances. If this
+ happens, pcre2grep outputs an error message and the line that caused
+ the problem to the standard error stream. If there are more than 20
such errors, pcre2grep gives up.
- The --match-limit option of pcre2grep can be used to set the overall
- resource limit. There are also other limits that affect the amount of
- memory used during matching; see the discussion of --heap-limit and
+ The --match-limit option of pcre2grep can be used to set the overall
+ resource limit. There are also other limits that affect the amount of
+ memory used during matching; see the discussion of --heap-limit and
--depth-limit above.
DIAGNOSTICS
Exit status is 0 if any matches were found, 1 if no matches were found,
- and 2 for syntax errors, overlong lines, non-existent or inaccessible
- files (even if matches were found in other files) or too many matching
+ and 2 for syntax errors, overlong lines, non-existent or inaccessible
+ files (even if matches were found in other files) or too many matching
errors. Using the -s option to suppress error messages about inaccessi-
ble files does not affect the return code.
- When run under VMS, the return code is placed in the symbol
- PCRE2GREP_RC because VMS does not distinguish between exit(0) and
+ When run under VMS, the return code is placed in the symbol
+ PCRE2GREP_RC because VMS does not distinguish between exit(0) and
exit(1).
@@ -922,5 +929,5 @@ AUTHOR
REVISION
- Last updated: 11 October 2017
+ Last updated: 13 November 2017
Copyright (c) 1997-2017 University of Cambridge.
diff --git a/doc/pcre2test.1 b/doc/pcre2test.1
index 4a5df24..ee78792 100644
--- a/doc/pcre2test.1
+++ b/doc/pcre2test.1
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "17 October 2017" "PCRE 10.31"
+.TH PCRE2TEST 1 "21 December 2017" "PCRE 10.31"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -129,6 +129,11 @@ has not been built, this option causes an error.
Behave as if each pattern has the \fBauto_callout\fP modifier, that is, insert
automatic callouts into every pattern that is compiled.
.TP 10
+\fB-AC\fP
+As for \fB-ac\fP, but in addition behave as if each subject line has the
+\fBcallout_extra\fP modifier, that is, show additional information from
+callouts.
+.TP 10
\fB-b\fP
Behave as if each pattern has the \fBfullbincode\fP modifier; the full
internal binary form of the pattern is output after compilation.
@@ -203,8 +208,8 @@ successful compilation, each pattern is passed to the just-in-time compiler, if
available, and the use of JIT is verified.
.TP 10
\fB-LM\fP
-List modifiers: write a list of available pattern and subject modifiers to the
-standard output, then exit with zero exit code. All other options are ignored.
+List modifiers: write a list of available pattern and subject modifiers to the
+standard output, then exit with zero exit code. All other options are ignored.
If both -C and -LM are present, whichever is first is recognized.
.TP 10
\fB-pattern\fB \fImodifier-list\fP
@@ -1152,6 +1157,7 @@ pattern.
callout_capture show captures at callout time
callout_data=<n> set a value to pass via callouts
callout_error=<n>[:<m>] control callout error
+ callout_extra show extra callout information
callout_fail=<n>[:<m>] control callout failure
callout_no_where do not show position of a callout
callout_none do not supply a callout function
@@ -1664,45 +1670,10 @@ documentation.
.rs
.sp
If the pattern contains any callout requests, \fBpcre2test\fP's callout
-function is called during matching unless \fBcallout_none\fP is specified.
-This works with both matching functions.
-.P
-The callout function in \fBpcre2test\fP returns zero (carry on matching) by
-default, but you can use a \fBcallout_fail\fP modifier in a subject line to
-change this and other parameters of the callout.
-.P
-If \fBcallout_capture\fP is set, the current captured groups are output when a
-callout occurs. By default, the callout function then generates output that
-indicates where the current match start and matching points are in the subject,
-and what the next pattern item is. This output is suppressed if the
-\fBcallout_no_where\fP modifier is set.
-.P
-The default return from the callout function is zero, which allows matching to
-continue. The \fBcallout_fail\fP modifier can be given one or two numbers. If
-there is only one number, 1 is returned instead of 0 (causing matching to
-backtrack) when a callout of that number is reached. If two numbers (<n>:<m>)
-are given, 1 is returned when callout <n> is reached and there have been at
-least <m> callouts. The \fBcallout_error\fP modifier is similar, except that
-PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
-aborted. If both these modifiers are set for the same callout number,
-\fBcallout_error\fP takes precedence. Note that callouts with string arguments
-are always given the number zero. See
-.P
-The \fBcallout_data\fP modifier can be given an unsigned or a negative number.
-This is set as the "user data" that is passed to the matching function, and
-passed back when the callout function is invoked. Any value other than zero is
-used as a return from \fBpcre2test\fP's callout function.
-.P
-Inserting callouts can be helpful when using \fBpcre2test\fP to check
-complicated regular expressions. For further information about callouts, see
-the
-.\" HREF
-\fBpcre2callout\fP
-.\"
-documentation.
-.P
-The output for callouts with numerical arguments and those with string
-arguments is slightly different.
+function is called during matching unless \fBcallout_none\fP is specified. This
+works with both matching functions, and with JIT, though there are some
+differences in behaviour. The output for callouts with numerical arguments and
+those with string arguments is slightly different.
.
.
.SS "Callouts with numerical arguments"
@@ -1776,6 +1747,103 @@ example:
.sp
.
.
+.SS "Callout modifiers"
+.rs
+.sp
+The callout function in \fBpcre2test\fP returns zero (carry on matching) by
+default, but you can use a \fBcallout_fail\fP modifier in a subject line to
+change this and other parameters of the callout (see below).
+.P
+If the \fBcallout_capture\fP modifier is set, the current captured groups are
+output when a callout occurs. This is useful only for non-DFA matching, as
+\fBpcre2_dfa_match()\fP does not support capturing, so no captures are ever
+shown.
+.P
+The normal callout output, showing the callout number or pattern offset (as
+described above) is suppressed if the \fBcallout_no_where\fP modifier is set.
+.P
+When using the interpretive matching function \fBpcre2_match()\fP without JIT,
+setting the \fBcallout_extra\fP modifier causes additional output from
+\fBpcre2test\fP's callout function to be generated. For the first callout in a
+match attempt at a new starting position in the subject, "New match attempt" is
+output. If there has been a backtrack since the last callout (or start of
+matching if this is the first callout), "Backtrack" is output, followed by "No
+other matching paths" if the backtrack ended the previous match attempt. For
+example:
+.sp
+ re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
+ data> aac\e=callout_extra
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^ ^ )
+ +4 ^ ^ b
+ Backtrack
+ --->aac
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ No match
+.sp
+Notice that various optimizations must be turned off if you want all possible
+matching paths to be scanned. If \fBno_start_optimize\fP is not used, there is
+an immediate "no match", without any callouts, because the starting
+optimization fails to find "b" in the subject, which it knows must be present
+for any match. If \fBno_auto_possess\fP is not used, the "a+" item is turned
+into "a++", which reduces the number of backtracks.
+.P
+The \fBcallout_extra\fP modifier has no effect if used with the DFA matching
+function, or with JIT.
+.
+.
+.SS "Return values from callouts"
+.rs
+.sp
+The default return from the callout function is zero, which allows matching to
+continue. The \fBcallout_fail\fP modifier can be given one or two numbers. If
+there is only one number, 1 is returned instead of 0 (causing matching to
+backtrack) when a callout of that number is reached. If two numbers (<n>:<m>)
+are given, 1 is returned when callout <n> is reached and there have been at
+least <m> callouts. The \fBcallout_error\fP modifier is similar, except that
+PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
+aborted. If both these modifiers are set for the same callout number,
+\fBcallout_error\fP takes precedence. Note that callouts with string arguments
+are always given the number zero.
+.P
+The \fBcallout_data\fP modifier can be given an unsigned or a negative number.
+This is set as the "user data" that is passed to the matching function, and
+passed back when the callout function is invoked. Any value other than zero is
+used as a return from \fBpcre2test\fP's callout function.
+.P
+Inserting callouts can be helpful when using \fBpcre2test\fP to check
+complicated regular expressions. For further information about callouts, see
+the
+.\" HREF
+\fBpcre2callout\fP
+.\"
+documentation.
+.
+.
.
.SH "NON-PRINTING CHARACTERS"
.rs
@@ -1894,6 +1962,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 17 October 2017
+Last updated: 21 December 2017
Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2test.txt b/doc/pcre2test.txt
index 9e2bfe3..93efd24 100644
--- a/doc/pcre2test.txt
+++ b/doc/pcre2test.txt
@@ -120,6 +120,10 @@ COMMAND LINE OPTIONS
is, insert automatic callouts into every pattern that is com-
piled.
+ -AC As for -ac, but in addition behave as if each subject line
+ has the callout_extra modifier, that is, show additional
+ information from callouts.
+
-b Behave as if each pattern has the fullbincode modifier; the
full internal binary form of the pattern is output after com-
pilation.
@@ -1056,6 +1060,7 @@ SUBJECT MODIFIERS
callout_capture show captures at callout time
callout_data=<n> set a value to pass via callouts
callout_error=<n>[:<m>] control callout error
+ callout_extra show extra callout information
callout_fail=<n>[:<m>] control callout failure
callout_no_where do not show position of a callout
callout_none do not supply a callout function
@@ -1529,63 +1534,30 @@ RESTARTING AFTER A PARTIAL MATCH
CALLOUTS
If the pattern contains any callout requests, pcre2test's callout func-
- tion is called during matching unless callout_none is specified. This
- works with both matching functions.
-
- The callout function in pcre2test returns zero (carry on matching) by
- default, but you can use a callout_fail modifier in a subject line to
- change this and other parameters of the callout.
-
- If callout_capture is set, the current captured groups are output when
- a callout occurs. By default, the callout function then generates out-
- put that indicates where the current match start and matching points
- are in the subject, and what the next pattern item is. This output is
- suppressed if the callout_no_where modifier is set.
-
- The default return from the callout function is zero, which allows
- matching to continue. The callout_fail modifier can be given one or two
- numbers. If there is only one number, 1 is returned instead of 0 (caus-
- ing matching to backtrack) when a callout of that number is reached. If
- two numbers (<n>:<m>) are given, 1 is returned when callout <n> is
- reached and there have been at least <m> callouts. The callout_error
- modifier is similar, except that PCRE2_ERROR_CALLOUT is returned, caus-
- ing the entire matching process to be aborted. If both these modifiers
- are set for the same callout number, callout_error takes precedence.
- Note that callouts with string arguments are always given the number
- zero. See
-
- The callout_data modifier can be given an unsigned or a negative num-
- ber. This is set as the "user data" that is passed to the matching
- function, and passed back when the callout function is invoked. Any
- value other than zero is used as a return from pcre2test's callout
- function.
-
- Inserting callouts can be helpful when using pcre2test to check compli-
- cated regular expressions. For further information about callouts, see
- the pcre2callout documentation.
-
- The output for callouts with numerical arguments and those with string
- arguments is slightly different.
+ tion is called during matching unless callout_none is specified. This
+ works with both matching functions, and with JIT, though there are some
+ differences in behaviour. The output for callouts with numerical argu-
+ ments and those with string arguments is slightly different.
Callouts with numerical arguments
By default, the callout function displays the callout number, the start
- and current positions in the subject text at the callout time, and the
+ and current positions in the subject text at the callout time, and the
next pattern item to be tested. For example:
--->pqrabcdef
0 ^ ^ \d
- This output indicates that callout number 0 occurred for a match
- attempt starting at the fourth character of the subject string, when
- the pointer was at the seventh character, and when the next pattern
- item was \d. Just one circumflex is output if the start and current
- positions are the same, or if the current position precedes the start
+ This output indicates that callout number 0 occurred for a match
+ attempt starting at the fourth character of the subject string, when
+ the pointer was at the seventh character, and when the next pattern
+ item was \d. Just one circumflex is output if the start and current
+ positions are the same, or if the current position precedes the start
position, which can happen if the callout is in a lookbehind assertion.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
a result of the auto_callout pattern modifier. In this case, instead of
- showing the callout number, the offset in the pattern, preceded by a
+ showing the callout number, the offset in the pattern, preceded by a
plus, is output. For example:
re> /\d?[A-E]\*/auto_callout
@@ -1598,7 +1570,7 @@ CALLOUTS
0: E*
If a pattern contains (*MARK) items, an additional line is output when-
- ever a change of latest mark is passed to the callout function. For
+ ever a change of latest mark is passed to the callout function. For
example:
re> /a(*MARK:X)bc/auto_callout
@@ -1612,17 +1584,17 @@ CALLOUTS
+12 ^ ^
0: abc
- The mark changes between matching "a" and "b", but stays the same for
- the rest of the match, so nothing more is output. If, as a result of
- backtracking, the mark reverts to being unset, the text "<unset>" is
+ The mark changes between matching "a" and "b", but stays the same for
+ the rest of the match, so nothing more is output. If, as a result of
+ backtracking, the mark reverts to being unset, the text "<unset>" is
output.
Callouts with string arguments
The output for a callout with a string argument is similar, except that
- instead of outputting a callout number before the position indicators,
- the callout string and its offset in the pattern string are output
- before the reflection of the subject string, and the subject string is
+ instead of outputting a callout number before the position indicators,
+ the callout string and its offset in the pattern string are output
+ before the reflection of the subject string, and the subject string is
reflected for each callout. For example:
re> /^ab(?C'first')cd(?C"second")ef/
@@ -1636,6 +1608,100 @@ CALLOUTS
0: abcdef
+ Callout modifiers
+
+ The callout function in pcre2test returns zero (carry on matching) by
+ default, but you can use a callout_fail modifier in a subject line to
+ change this and other parameters of the callout (see below).
+
+ If the callout_capture modifier is set, the current captured groups are
+ output when a callout occurs. This is useful only for non-DFA matching,
+ as pcre2_dfa_match() does not support capturing, so no captures are
+ ever shown.
+
+ The normal callout output, showing the callout number or pattern offset
+ (as described above) is suppressed if the callout_no_where modifier is
+ set.
+
+ When using the interpretive matching function pcre2_match() without
+ JIT, setting the callout_extra modifier causes additional output from
+ pcre2test's callout function to be generated. For the first callout in
+ a match attempt at a new starting position in the subject, "New match
+ attempt" is output. If there has been a backtrack since the last call-
+ out (or start of matching if this is the first callout), "Backtrack" is
+ output, followed by "No other matching paths" if the backtrack ended
+ the previous match attempt. For example:
+
+ re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
+ data> aac\=callout_extra
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^ ^ )
+ +4 ^ ^ b
+ Backtrack
+ --->aac
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ No match
+
+ Notice that various optimizations must be turned off if you want all
+ possible matching paths to be scanned. If no_start_optimize is not
+ used, there is an immediate "no match", without any callouts, because
+ the starting optimization fails to find "b" in the subject, which it
+ knows must be present for any match. If no_auto_possess is not used,
+ the "a+" item is turned into "a++", which reduces the number of back-
+ tracks.
+
+ The callout_extra modifier has no effect if used with the DFA matching
+ function, or with JIT.
+
+ Return values from callouts
+
+ The default return from the callout function is zero, which allows
+ matching to continue. The callout_fail modifier can be given one or two
+ numbers. If there is only one number, 1 is returned instead of 0 (caus-
+ ing matching to backtrack) when a callout of that number is reached. If
+ two numbers (<n>:<m>) are given, 1 is returned when callout <n> is
+ reached and there have been at least <m> callouts. The callout_error
+ modifier is similar, except that PCRE2_ERROR_CALLOUT is returned, caus-
+ ing the entire matching process to be aborted. If both these modifiers
+ are set for the same callout number, callout_error takes precedence.
+ Note that callouts with string arguments are always given the number
+ zero.
+
+ The callout_data modifier can be given an unsigned or a negative num-
+ ber. This is set as the "user data" that is passed to the matching
+ function, and passed back when the callout function is invoked. Any
+ value other than zero is used as a return from pcre2test's callout
+ function.
+
+ Inserting callouts can be helpful when using pcre2test to check compli-
+ cated regular expressions. For further information about callouts, see
+ the pcre2callout documentation.
+
+
NON-PRINTING CHARACTERS
When pcre2test is outputting text in the compiled version of a pattern,
@@ -1733,5 +1799,5 @@ AUTHOR
REVISION
- Last updated: 17 October 2017
+ Last updated: 21 December 2017
Copyright (c) 1997-2017 University of Cambridge.
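
Editorial note: the "Return values from callouts" text above describes how callout_fail and callout_error choose the callout's return value. The sketch below restates that rule as standalone C. It is not the pcre2test source. The demo_* names are hypothetical, DEMO_ERROR_CALLOUT is a stand-in for PCRE2_ERROR_CALLOUT (real code should take the value from pcre2.h), and the sketch assumes that the callout count includes the current callout and that an omitted <m> behaves as zero.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for PCRE2_ERROR_CALLOUT. The value is believed to match pcre2.h,
   but real code should use the macro from the header, not this copy. */
#define DEMO_ERROR_CALLOUT (-37)

/* Hypothetical holder for one <n>[:<m>] setting; m is 0 when omitted. */
typedef struct demo_trigger {
  uint32_t n;       /* callout number that triggers the action */
  uint32_t m;       /* minimum number of callouts seen so far */
  int      is_set;  /* non-zero if the modifier was given */
} demo_trigger;

/* A trigger fires when callout <n> is reached after at least <m> callouts. */
static int demo_triggered(const demo_trigger *t, uint32_t number,
  uint32_t count)
{
  return t->is_set && number == t->n && count >= t->m;
}

/* Return value for one callout: the error setting takes precedence over the
   fail setting, and zero (carry on matching) is the default. */
int demo_callout_return(const demo_trigger *fail, const demo_trigger *error,
  uint32_t number, uint32_t count)
{
  assert(fail != 0 && error != 0);
  if (demo_triggered(error, number, count)) return DEMO_ERROR_CALLOUT;
  if (demo_triggered(fail, number, count)) return 1;  /* force a backtrack */
  return 0;
}
```

With callout_fail=1:3, for example, this sketch returns 0 for the first two callouts numbered 1 and 1 from the third onward; adding callout_error=1 makes it return the error code instead.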
diff --git a/src/pcre2.h b/src/pcre2.h
index bbb1771..17c5b47 100644
--- a/src/pcre2.h
+++ b/src/pcre2.h
@@ -494,6 +494,11 @@ without changing the API of the function, thereby allowing old clients to work
without modification. Define the generic version in a macro; the width-specific
versions are generated from this macro below. */
+/* Flags for the callout_flags field. These are cleared after a callout. */
+
+#define PCRE2_CALLOUT_STARTMATCH 0x00000001u /* Set for each bumpalong */
+#define PCRE2_CALLOUT_BACKTRACK 0x00000002u /* Set after a backtrack */
+
#define PCRE2_STRUCTURE_LIST \
typedef struct pcre2_callout_block { \
uint32_t version; /* Identifies version of block */ \
@@ -513,6 +518,8 @@ typedef struct pcre2_callout_block { \
PCRE2_SIZE callout_string_offset; /* Offset to string within pattern */ \
PCRE2_SIZE callout_string_length; /* Length of string compiled into pattern */ \
PCRE2_SPTR callout_string; /* String compiled into pattern */ \
+ /* ------------------- Added for Version 2 -------------------------- */ \
+ uint32_t callout_flags; /* See above for list */ \
/* ------------------------------------------------------------------ */ \
} pcre2_callout_block; \
\
diff --git a/src/pcre2.h.in b/src/pcre2.h.in
index 6718689..a3a3fa6 100644
--- a/src/pcre2.h.in
+++ b/src/pcre2.h.in
@@ -494,6 +494,11 @@ without changing the API of the function, thereby allowing old clients to work
without modification. Define the generic version in a macro; the width-specific
versions are generated from this macro below. */
+/* Flags for the callout_flags field. These are cleared after a callout. */
+
+#define PCRE2_CALLOUT_STARTMATCH 0x00000001u /* Set for each bumpalong */
+#define PCRE2_CALLOUT_BACKTRACK 0x00000002u /* Set after a backtrack */
+
#define PCRE2_STRUCTURE_LIST \
typedef struct pcre2_callout_block { \
uint32_t version; /* Identifies version of block */ \
@@ -513,6 +518,8 @@ typedef struct pcre2_callout_block { \
PCRE2_SIZE callout_string_offset; /* Offset to string within pattern */ \
PCRE2_SIZE callout_string_length; /* Length of string compiled into pattern */ \
PCRE2_SPTR callout_string; /* String compiled into pattern */ \
+ /* ------------------- Added for Version 2 -------------------------- */ \
+ uint32_t callout_flags; /* See above for list */ \
/* ------------------------------------------------------------------ */ \
} pcre2_callout_block; \
\
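
Editorial note: the header hunks above add callout_flags as a "Version 2" field, so a client callout function would presumably check the block's version before reading it. The sketch below shows one way to do that and to map the two flag bits onto the messages that the pcre2test documentation in this commit describes. It is an illustration only: demo_callout_block is a hypothetical cut-down stand-in for pcre2_callout_block, the two macros are copied from the diff so that the block compiles without pcre2.h, and demo_describe_flags is not a PCRE2 function.

```c
#include <assert.h>
#include <stdint.h>

/* Copies of the definitions added to pcre2.h in this commit. Real code
   should include <pcre2.h> instead of redefining them. */
#define PCRE2_CALLOUT_STARTMATCH    0x00000001u  /* Set for each bumpalong */
#define PCRE2_CALLOUT_BACKTRACK     0x00000002u  /* Set after a backtrack */

/* Hypothetical stand-in holding only the two fields this sketch reads. */
typedef struct demo_callout_block {
  uint32_t version;        /* Identifies version of block */
  uint32_t callout_flags;  /* Present only when version >= 2 */
} demo_callout_block;

/* Return the extra text to print before the normal callout output. */
const char *demo_describe_flags(const demo_callout_block *cb)
{
  assert(cb != 0);
  if (cb->version < 2) return "";  /* older blocks have no flags field */

  switch (cb->callout_flags &
          (PCRE2_CALLOUT_STARTMATCH | PCRE2_CALLOUT_BACKTRACK))
    {
    case PCRE2_CALLOUT_BACKTRACK:
    return "Backtrack\n";

    case PCRE2_CALLOUT_STARTMATCH | PCRE2_CALLOUT_BACKTRACK:
    return "Backtrack\nNo other matching paths\nNew match attempt\n";

    case PCRE2_CALLOUT_STARTMATCH:
    return "New match attempt\n";

    default:
    return "";
    }
}
```

Masking with the two known bits is a design choice of this sketch, so that any flag added later would not change which case is taken. The DFA and JIT paths in this commit always pass zero flags, so for them the function would return an empty string.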
diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c
index ca235fb..9cdbebb 100644
--- a/src/pcre2_dfa_match.c
+++ b/src/pcre2_dfa_match.c
@@ -2574,7 +2574,8 @@ for (;;)
if (mb->callout != NULL)
{
pcre2_callout_block cb;
- cb.version = 1;
+ cb.version = 2;
+ cb.callout_flags = 0;
cb.capture_top = 1;
cb.capture_last = 0;
cb.offset_vector = offsets;
@@ -2943,7 +2944,8 @@ for (;;)
if (mb->callout != NULL)
{
pcre2_callout_block cb;
- cb.version = 1;
+ cb.version = 2;
+ cb.callout_flags = 0;
cb.capture_top = 1;
cb.capture_last = 0;
cb.offset_vector = offsets;
diff --git a/src/pcre2_jit_compile.c b/src/pcre2_jit_compile.c
index 14e7dc6..2cdc324 100644
--- a/src/pcre2_jit_compile.c
+++ b/src/pcre2_jit_compile.c
@@ -7952,7 +7952,8 @@ oveccount = callout_block->capture_top;
SLJIT_ASSERT(oveccount >= 1);
-callout_block->version = 1;
+callout_block->version = 2;
+callout_block->callout_flags = 0;
/* Offsets in subject. */
callout_block->subject_length = arguments->end - arguments->begin;
diff --git a/src/pcre2_match.c b/src/pcre2_match.c
index e8b2000..265fade 100644
--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@@ -321,6 +321,7 @@ callout_ovector[0] = callout_ovector[1] = PCRE2_UNSET;
rc = mb->callout(cb, mb->callout_data);
callout_ovector[0] = save0;
callout_ovector[1] = save1;
+cb->callout_flags = 0;
return rc;
}
@@ -5919,8 +5920,9 @@ in rrc. */
#define LBL(val) case val: goto L_RM##val;
RETURN_SWITCH:
-if (Frdepth == 0) return rrc; /* Exit from the top level */
-F = (heapframe *)((char *)F - Fback_frame); /* Back track */
+if (Frdepth == 0) return rrc; /* Exit from the top level */
+F = (heapframe *)((char *)F - Fback_frame); /* Back track */
+mb->cb->callout_flags |= PCRE2_CALLOUT_BACKTRACK; /* Note for callouts */
#ifdef DEBUG_SHOW_RMATCH
fprintf(stderr, "++ RETURN %d to %d\n", rrc, Freturn_id);
@@ -6171,13 +6173,14 @@ startline = (re->flags & PCRE2_STARTLINE) != 0;
bumpalong_limit = (mcontext->offset_limit == PCRE2_UNSET)?
end_subject : subject + mcontext->offset_limit;
-/* Set up the fixed fields in the callout block, with a pointer in the
-match block. */
+/* Initialize and set up the fixed fields in the callout block, with a pointer
+in the match block. */
mb->cb = &cb;
-cb.version = 1;
+cb.version = 2;
cb.subject = subject;
cb.subject_length = (PCRE2_SIZE)(end_subject - subject);
+cb.callout_flags = 0;
/* Fill in the remaining fields in the match block. */
@@ -6644,6 +6647,8 @@ for(;;)
first starting point for which a partial match was found. */
cb.start_match = (PCRE2_SIZE)(start_match - subject);
+ cb.callout_flags |= PCRE2_CALLOUT_STARTMATCH;
+
mb->start_used_ptr = start_match;
mb->last_used_ptr = start_match;
mb->match_call_count = 0;
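
Editorial note: taken together, the pcre2_match.c hunks above give callout_flags a simple lifecycle. PCRE2_CALLOUT_STARTMATCH is ORed in at each new starting position, PCRE2_CALLOUT_BACKTRACK is ORed in whenever the interpreter backtracks, and the field is reset to zero after each callout returns. The sketch below models only that bookkeeping. The demo_* names are hypothetical and the macros are copied from the diff; none of this is the library's code.

```c
#include <assert.h>
#include <stdint.h>

/* Copies of the definitions added to pcre2.h in this commit. */
#define PCRE2_CALLOUT_STARTMATCH    0x00000001u  /* Set for each bumpalong */
#define PCRE2_CALLOUT_BACKTRACK     0x00000002u  /* Set after a backtrack */

/* Hypothetical model of the one piece of state involved. */
typedef struct demo_match_state {
  uint32_t callout_flags;
} demo_match_state;

/* A new match attempt begins at the next starting position. */
void demo_begin_attempt(demo_match_state *st)
{
  assert(st != 0);
  st->callout_flags |= PCRE2_CALLOUT_STARTMATCH;
}

/* The matcher has backtracked since the last callout. */
void demo_note_backtrack(demo_match_state *st)
{
  assert(st != 0);
  st->callout_flags |= PCRE2_CALLOUT_BACKTRACK;
}

/* Deliver a callout: report the accumulated flags, then clear them so the
   next callout sees only what happened after this one. */
uint32_t demo_do_callout(demo_match_state *st)
{
  uint32_t seen;
  assert(st != 0);
  seen = st->callout_flags;
  st->callout_flags = 0;
  return seen;
}
```

Because the flags accumulate between callouts, a backtrack that ends one attempt followed by the start of the next shows up as both bits set in the next callout, which is the case pcre2test reports as "Backtrack" followed by "No other matching paths".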
diff --git a/src/pcre2test.c b/src/pcre2test.c
index e0fead5..1206444 100644
--- a/src/pcre2test.c
+++ b/src/pcre2test.c
@@ -485,6 +485,7 @@ so many of them that they are split into two fields. */
#define CTL2_SUBSTITUTE_UNSET_EMPTY 0x00000008u
#define CTL2_SUBJECT_LITERAL 0x00000010u
#define CTL2_CALLOUT_NO_WHERE 0x00000020u
+#define CTL2_CALLOUT_EXTRA 0x00000040u
#define CTL2_NL_SET 0x40000000u /* Informational */
#define CTL2_BSR_SET 0x80000000u /* Informational */
@@ -598,6 +599,7 @@ static modstruct modlist[] = {
{ "callout_capture", MOD_DAT, MOD_CTL, CTL_CALLOUT_CAPTURE, DO(control) },
{ "callout_data", MOD_DAT, MOD_INS, 0, DO(callout_data) },
{ "callout_error", MOD_DAT, MOD_IN2, 0, DO(cerror) },
+ { "callout_extra", MOD_DAT, MOD_CTL, CTL2_CALLOUT_EXTRA, DO(control2) },
{ "callout_fail", MOD_DAT, MOD_IN2, 0, DO(cfail) },
{ "callout_info", MOD_PAT, MOD_CTL, CTL_CALLOUT_INFO, PO(control) },
{ "callout_no_where", MOD_DAT, MOD_CTL, CTL2_CALLOUT_NO_WHERE, DO(control2) },
@@ -3971,7 +3973,7 @@ Returns: nothing
static void
show_controls(uint32_t controls, uint32_t controls2, const char *before)
{
-fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
+fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before,
((controls & CTL_AFTERTEXT) != 0)? " aftertext" : "",
((controls & CTL_ALLAFTERTEXT) != 0)? " allaftertext" : "",
@@ -3981,6 +3983,7 @@ fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s
((controls & CTL_BINCODE) != 0)? " bincode" : "",
((controls2 & CTL2_BSR_SET) != 0)? " bsr" : "",
((controls & CTL_CALLOUT_CAPTURE) != 0)? " callout_capture" : "",
+ ((controls2 & CTL2_CALLOUT_EXTRA) != 0)? " callout_extra" : "",
((controls & CTL_CALLOUT_INFO) != 0)? " callout_info" : "",
((controls & CTL_CALLOUT_NONE) != 0)? " callout_none" : "",
((controls2 & CTL2_CALLOUT_NO_WHERE) != 0)? " callout_no_where" : "",
@@ -4409,7 +4412,7 @@ if ((pat_patctl.control & CTL_INFO) != 0)
pattern_info(PCRE2_INFO_ARGOPTIONS, &compile_options, FALSE);
pattern_info(PCRE2_INFO_ALLOPTIONS, &overall_options, FALSE);
- pattern_info(PCRE2_INFO_EXTRAOPTIONS, &extra_options, FALSE);
+ pattern_info(PCRE2_INFO_EXTRAOPTIONS, &extra_options, FALSE);
/* Remove UTF/UCP if they were there only because of forbid_utf. This saves
cluttering up the verification output of non-UTF test files. */
@@ -4436,9 +4439,9 @@ if ((pat_patctl.control & CTL_INFO) != 0)
show_compile_options(overall_options, "Overall options:", "\n");
}
}
-
- if (extra_options != 0)
- show_compile_extra_options(extra_options, "Extra options:", "\n");
+
+ if (extra_options != 0)
+ show_compile_extra_options(extra_options, "Extra options:", "\n");
if (jchanged) fprintf(outfile, "Duplicate name status changes\n");
@@ -5842,17 +5845,43 @@ Return:
static int
callout_function(pcre2_callout_block_8 *cb, void *callout_data_ptr)
{
+FILE *f, *fdefault;
uint32_t i, pre_start, post_start, subject_length;
PCRE2_SIZE current_position;
BOOL utf = (FLD(compiled_code, overall_options) & PCRE2_UTF) != 0;
BOOL callout_capture = (dat_datctl.control & CTL_CALLOUT_CAPTURE) != 0;
BOOL callout_where = (dat_datctl.control2 & CTL2_CALLOUT_NO_WHERE) == 0;
-/* This FILE is used for echoing the subject. This is done only once in simple
-cases. */
+/* The FILE f is used for echoing the subject string if it is non-NULL. This
+happens only once in simple cases, but we want to repeat after any additional
+output caused by CALLOUT_EXTRA. */
+
+fdefault = (!first_callout && !callout_capture && cb->callout_string == NULL)?
+ NULL : outfile;
+
+if ((dat_datctl.control2 & CTL2_CALLOUT_EXTRA) != 0)
+ {
+ f = outfile;
+ switch (cb->callout_flags)
+ {
+ case PCRE2_CALLOUT_BACKTRACK:
+ fprintf(f, "Backtrack\n");
+ break;
+
+ case PCRE2_CALLOUT_STARTMATCH|PCRE2_CALLOUT_BACKTRACK:
+ fprintf(f, "Backtrack\nNo other matching paths\n");
+ /* Fall through */
+
+ case PCRE2_CALLOUT_STARTMATCH:
+ fprintf(f, "New match attempt\n");
+ break;
-FILE *f = (first_callout || callout_capture || cb->callout_string != NULL)?
- outfile : NULL;
+ default:
+ f = fdefault;
+ break;
+ }
+ }
+else f = fdefault;
/* For a callout with a string argument, show the string first because there
isn't a tidy way to fit it in the rest of the data. */
@@ -5902,7 +5931,6 @@ lengths of the substrings. */
if (callout_where)
{
-
if (f != NULL) fprintf(f, "--->");
/* The subject before the match start. */
@@ -5931,9 +5959,10 @@ if (callout_where)
if (f != NULL) fprintf(f, "\n");
- /* For automatic callouts, show the pattern offset. Otherwise, for a numerical
- callout whose number has not already been shown with captured strings, show the
- number here. A callout with a string argument has been displayed above. */
+ /* For automatic callouts, show the pattern offset. Otherwise, for a
+ numerical callout whose number has not already been shown with captured
+ strings, show the number here. A callout with a string argument has been
+ displayed above. */
if (cb->callout_number == 255)
{
@@ -5963,6 +5992,8 @@ if (callout_where)
if (cb->next_item_length != 0)
fprintf(outfile, "%.*s", (int)(cb->next_item_length),
pbuffer8 + cb->pattern_position);
+ else
+ fprintf(outfile, "End of pattern");
fprintf(outfile, "\n");
}
@@ -7685,7 +7716,8 @@ printf(" -16 use the 16-bit library\n");
#ifdef SUPPORT_PCRE2_32
printf(" -32 use the 32-bit library\n");
#endif
-printf(" -ac set default pattern option PCRE2_AUTO_CALLOUT\n");
+printf(" -ac set default pattern modifier PCRE2_AUTO_CALLOUT\n");
+printf(" -AC as -ac, but also set subject 'callout_extra' modifier\n");
printf(" -b set default pattern modifier 'fullbincode'\n");
printf(" -C show PCRE2 compile-time options and exit\n");
printf(" -C arg show a specific compile-time option and exit with its\n");
@@ -8181,6 +8213,11 @@ while (argc > 1 && argv[op][0] == '-' && argv[op][1] != 0)
/* Set some common pattern and subject controls */
+ else if (strcmp(arg, "-AC") == 0)
+ {
+ def_patctl.options |= PCRE2_AUTO_CALLOUT;
+ def_datctl.control2 |= CTL2_CALLOUT_EXTRA;
+ }
else if (strcmp(arg, "-ac") == 0) def_patctl.options |= PCRE2_AUTO_CALLOUT;
else if (strcmp(arg, "-b") == 0) def_patctl.control |= CTL_FULLBINCODE;
else if (strcmp(arg, "-d") == 0) def_patctl.control |= CTL_DEBUG;
diff --git a/testdata/testinput2 b/testdata/testinput2
index d3bdc96..942ec45 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -5383,6 +5383,16 @@ a)"xI
"(?=(a))\1?b"I
ab
- aaab
+ aaab
+
+# JIT does not support callout_extra
+
+/(*NO_JIT)(a+)b/auto_callout,no_start_optimize,no_auto_possess
+\= Expect no match
+ aac\=callout_extra
+
+/(*NO_JIT)a+(?C'XXX')b/no_start_optimize,no_auto_possess
+\= Expect no match
+ aac\=callout_extra
# End of testinput2
diff --git a/testdata/testoutput15 b/testdata/testoutput15
index f4f68da..b2068d0 100644
--- a/testdata/testoutput15
+++ b/testdata/testoutput15
@@ -361,12 +361,12 @@ Starting code units: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Subject length lower bound = 1
abc\=callout_fail=1
--->abc
- 1 ^ ^
- 1 ^ ^
- 1 ^^
- 1 ^ ^
- 1 ^^
- 1 ^^
+ 1 ^ ^ End of pattern
+ 1 ^ ^ End of pattern
+ 1 ^^ End of pattern
+ 1 ^ ^ End of pattern
+ 1 ^^ End of pattern
+ 1 ^^ End of pattern
No match
/(*NO_AUTO_POSSESS)\w+(?C1)/BI
@@ -385,12 +385,12 @@ Starting code units: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Subject length lower bound = 1
abc\=callout_fail=1
--->abc
- 1 ^ ^
- 1 ^ ^
- 1 ^^
- 1 ^ ^
- 1 ^^
- 1 ^^
+ 1 ^ ^ End of pattern
+ 1 ^ ^ End of pattern
+ 1 ^^ End of pattern
+ 1 ^ ^ End of pattern
+ 1 ^^ End of pattern
+ 1 ^^ End of pattern
No match
# This test breaks the JIT stack limit
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index f3b1854..b7177ce 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -3832,7 +3832,7 @@ Subject length lower bound = 2
\= Expect no match
abbbbbccc\=callout_data=1
--->abbbbbccc
- 1 ^ ^
+ 1 ^ ^ End of pattern
Callout data = 1
No match
@@ -3844,21 +3844,21 @@ Subject length lower bound = 2
\= Expect no match
abbbbbccc\=callout_data=1
--->abbbbbccc
- 1 ^ ^
+ 1 ^ ^ End of pattern
Callout data = 1
- 1 ^ ^
+ 1 ^ ^ End of pattern
Callout data = 1
- 1 ^ ^
+ 1 ^ ^ End of pattern
Callout data = 1
- 1 ^ ^
+ 1 ^ ^ End of pattern
Callout data = 1
- 1 ^ ^
+ 1 ^ ^ End of pattern
Callout data = 1
- 1 ^ ^
+ 1 ^ ^ End of pattern
Callout data = 1
- 1 ^ ^
+ 1 ^ ^ End of pattern
Callout data = 1
- 1 ^ ^
+ 1 ^ ^ End of pattern
Callout data = 1
No match
@@ -4718,7 +4718,7 @@ Subject length lower bound = 5
+2 ^ ^ c
+3 ^ ^ d
+4 ^ ^ e
- +5 ^ ^
+ +5 ^ ^ End of pattern
0: abcde
\= Expect no match
abcdfe
@@ -4750,13 +4750,13 @@ Subject length lower bound = 1
--->ab
+0 ^ a*
+2 ^^ b
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: ab
aaaab
--->aaaab
+0 ^ a*
+2 ^ ^ b
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: aaaab
aaaacb
--->aaaacb
@@ -4770,7 +4770,7 @@ Subject length lower bound = 1
+2 ^^ b
+0 ^ a*
+2 ^ b
- +3 ^^
+ +3 ^^ End of pattern
0: b
/a*b/IB,auto_callout
@@ -4793,13 +4793,13 @@ Subject length lower bound = 1
--->ab
+0 ^ a*
+2 ^^ b
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: ab
aaaab
--->aaaab
+0 ^ a*
+2 ^ ^ b
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: aaaab
aaaacb
--->aaaacb
@@ -4813,7 +4813,7 @@ Subject length lower bound = 1
+2 ^^ b
+0 ^ a*
+2 ^ b
- +3 ^^
+ +3 ^^ End of pattern
0: b
/a+b/IB,auto_callout
@@ -4836,13 +4836,13 @@ Subject length lower bound = 2
--->ab
+0 ^ a+
+2 ^^ b
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: ab
aaaab
--->aaaab
+0 ^ a+
+2 ^ ^ b
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: aaaab
\= Expect no match
aaaacb
@@ -4897,7 +4897,7 @@ Subject length lower bound = 4
+3 ^ ^ c
+4 ^ ^ |
+9 ^ ^ x
-+10 ^ ^
++10 ^ ^ End of pattern
0: abcx
1: abc
defx
@@ -4909,7 +4909,7 @@ Subject length lower bound = 4
+7 ^ ^ f
+8 ^ ^ )
+9 ^ ^ x
-+10 ^ ^
++10 ^ ^ End of pattern
0: defx
1: def
\= Expect no match
@@ -4971,7 +4971,7 @@ Subject length lower bound = 4
+3 ^ ^ c
+4 ^ ^ |
+9 ^ ^ x
-+10 ^ ^
++10 ^ ^ End of pattern
0: abcx
1: abc
defx
@@ -4983,7 +4983,7 @@ Subject length lower bound = 4
+7 ^ ^ f
+8 ^ ^ )
+9 ^ ^ x
-+10 ^ ^
++10 ^ ^ End of pattern
0: defx
1: def
\= Expect no match
@@ -5024,7 +5024,7 @@ Subject length lower bound = 6
+3 ^ ^ |
+1 ^ ^ a
+4 ^ ^ c
-+12 ^ ^
++12 ^ ^ End of pattern
0: ababab
1: ab
abcdabcd
@@ -5044,7 +5044,7 @@ Subject length lower bound = 6
+4 ^ ^ c
+5 ^ ^ d
+6 ^ ^ ){3,4}
-+12 ^ ^
++12 ^ ^ End of pattern
0: abcdabcd
1: cd
abcdcdcdcdcd
@@ -5065,7 +5065,7 @@ Subject length lower bound = 6
+4 ^ ^ c
+5 ^ ^ d
+6 ^ ^ ){3,4}
-+12 ^ ^
++12 ^ ^ End of pattern
0: abcdcdcd
1: cd
@@ -5276,7 +5276,7 @@ Subject length lower bound = 11
+21 ^ ^ 1
+22 ^ ^ 2
+23 ^ ^ 3
-+24 ^ ^
++24 ^ ^ End of pattern
0: aacaacaacaacaac123
1: aac
@@ -8900,7 +8900,7 @@ Subject length lower bound = 0
+7 ^ b
+11 ^ ^
+12 ^ )
-+13 ^
++13 ^ End of pattern
0:
abc
--->abc
@@ -8921,7 +8921,7 @@ Subject length lower bound = 0
+8 ^^ )
+9 ^ b
+10 ^^ |
-+13 ^^
++13 ^^ End of pattern
0: b
/(?(?=b).*b|^d)/I
@@ -8938,14 +8938,14 @@ Subject length lower bound = 1
+0 ^ x
+1 ^^ y
+2 ^ ^ z
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: xyz
abcxyz
--->abcxyz
+0 ^ x
+1 ^^ y
+2 ^ ^ z
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: xyz
\= Expect no match
abc
@@ -8962,7 +8962,7 @@ No match
+0 ^ x
+1 ^^ y
+2 ^ ^ z
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: xyz
\= Expect no match
abc
@@ -8996,7 +8996,7 @@ No match
+15 ^ x
+16 ^^ y
+17 ^ ^ z
-+18 ^ ^
++18 ^ ^ End of pattern
0: xyz
/(*NO_AUTO_POSSESS)a+b/B
@@ -9017,7 +9017,7 @@ No match
+0 ^ x
+1 ^^ y
+2 ^ ^ z
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: xyz
/^"((?(?=[a])[^"])|b)*"$/auto_callout
@@ -9046,7 +9046,7 @@ No match
+17 ^ ^ |
+21 ^ ^ "
+22 ^ ^ $
-+23 ^ ^
++23 ^ ^ End of pattern
0: "ab"
1:
@@ -11136,7 +11136,7 @@ Latest Mark: A
+10 ^ ^ |
+18 ^ ^ z
+19 ^ ^ |
-+24 ^ ^
++24 ^ ^ End of pattern
0: adz
1: adz
2: d
@@ -11155,7 +11155,7 @@ Latest Mark: A
Latest Mark: B
+18 ^ ^ z
+19 ^ ^ |
-+24 ^ ^
++24 ^ ^ End of pattern
0: aez
1: aez
2: e
@@ -11177,7 +11177,7 @@ Latest Mark: B
+21 ^^ e
+22 ^ ^ q
+23 ^ ^ )
-+24 ^ ^
++24 ^ ^ End of pattern
0: aeq
1: aeq
@@ -11951,7 +11951,7 @@ Partial match: 123a
+11 ^ b
+12 ^^ b
+13 ^ ^ )
-+14 ^ ^
++14 ^ ^ End of pattern
0: bb
/(?C1)^(?C2)(?(?C99)(?=(?C3)a(?C4))(?C5)a(?C6)a(?C7)|(?C8)b(?C9)b(?C10))(?C11)/
@@ -11964,7 +11964,7 @@ Partial match: 123a
8 ^ b
9 ^^ b
10 ^ ^ )
- 11 ^ ^
+ 11 ^ ^ End of pattern
0: bb
# Perl seems to have a bug with this one.
@@ -15144,7 +15144,7 @@ Subject length lower bound = 0
+0 ^ (
+1 ^ )\Q\E*
+7 ^ ]
- +8 ^^
+ +8 ^^ End of pattern
0: ]
1:
@@ -15428,7 +15428,7 @@ Failed: error 125 at offset 13: lookbehind assertion is not fixed length
+0 ^ a
+1 ^^ b
1 ^ ^ c
- +8 ^ ^
+ +8 ^ ^ End of pattern
0: abc
/'ab(?C1)c'/hex,auto_callout
@@ -15437,7 +15437,7 @@ Failed: error 125 at offset 13: lookbehind assertion is not fixed length
+0 ^ a
+1 ^^ b
1 ^ ^ c
- +8 ^ ^
+ +8 ^ ^ End of pattern
0: abc
# Perl accepts these, but gives a warning. We can't warn, so give an error.
@@ -16256,7 +16256,7 @@ Failed: error 192 at offset 0: invalid option bits with PCRE2_LITERAL
+2 ^ ^ b
+3 ^ ^ (
+4 ^ ^ c
- +5 ^ ^
+ +5 ^ ^ End of pattern
0: a\b(c
/a\b(c/literal,auto_callout
@@ -16267,7 +16267,7 @@ Failed: error 192 at offset 0: invalid option bits with PCRE2_LITERAL
+2 ^ ^ b
+3 ^ ^ (
+4 ^ ^ c
- +5 ^ ^
+ +5 ^ ^ End of pattern
0: a\b(c
/(*CR)abc/literal
@@ -16380,9 +16380,65 @@ Subject length lower bound = 1
ab
0: ab
1: a
- aaab
+ aaab
0: ab
1: a
+
+# JIT does not support callout_extra
+
+/(*NO_JIT)(a+)b/auto_callout,no_start_optimize,no_auto_possess
+\= Expect no match
+ aac\=callout_extra
+New match attempt
+--->aac
+ +9 ^ (
++10 ^ a+
++12 ^ ^ )
++13 ^ ^ b
+Backtrack
+--->aac
++12 ^^ )
++13 ^^ b
+Backtrack
+No other matching paths
+New match attempt
+--->aac
+ +9 ^ (
++10 ^ a+
++12 ^^ )
++13 ^^ b
+Backtrack
+No other matching paths
+New match attempt
+--->aac
+ +9 ^ (
++10 ^ a+
+Backtrack
+No other matching paths
+New match attempt
+--->aac
+ +9 ^ (
++10 ^ a+
+No match
+
+/(*NO_JIT)a+(?C'XXX')b/no_start_optimize,no_auto_possess
+\= Expect no match
+ aac\=callout_extra
+New match attempt
+Callout (15): 'XXX'
+--->aac
+ ^ ^ b
+Backtrack
+Callout (15): 'XXX'
+--->aac
+ ^^ b
+Backtrack
+No other matching paths
+New match attempt
+Callout (15): 'XXX'
+--->aac
+ ^^ b
+No match
# End of testinput2
Error -65: PCRE2_ERROR_BADDATA (unknown error number)
diff --git a/testdata/testoutput5 b/testdata/testoutput5
index f80384c..911040a 100644
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@@ -3763,7 +3763,7 @@ No match
abcd
--->abcd
+0 ^ \w+
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: abcd
/[\p{N}]?+/B,no_auto_possess
@@ -4165,7 +4165,7 @@ Failed: error 125 at offset 2: lookbehind assertion is not fixed length
+0 ^ .
+0 ^ .
+1 ^ ^ .
- +2 ^ ^
+ +2 ^ ^ End of pattern
0: \x{123}\x{123}
# This tests processing wide characters in extended mode.
diff --git a/testdata/testoutput6 b/testdata/testoutput6
index b912944..60e8349 100644
--- a/testdata/testoutput6
+++ b/testdata/testoutput6
@@ -726,7 +726,7 @@ No match
+4 ^ ^ c
+2 ^ ^ b
+3 ^ ^ |
-+12 ^ ^
++12 ^ ^ End of pattern
+1 ^ ^ a
+4 ^ ^ c
0: ababab
@@ -745,12 +745,12 @@ No match
+4 ^ ^ c
+2 ^ ^ b
+3 ^ ^ |
-+12 ^ ^
++12 ^ ^ End of pattern
+1 ^ ^ a
+4 ^ ^ c
+5 ^ ^ d
+6 ^ ^ ){3,4}
-+12 ^ ^
++12 ^ ^ End of pattern
0: abcdabcd
1: abcdab
abcdcdcdcdcd
@@ -768,12 +768,12 @@ No match
+4 ^ ^ c
+5 ^ ^ d
+6 ^ ^ ){3,4}
-+12 ^ ^
++12 ^ ^ End of pattern
+1 ^ ^ a
+4 ^ ^ c
+5 ^ ^ d
+6 ^ ^ ){3,4}
-+12 ^ ^
++12 ^ ^ End of pattern
0: abcdcdcd
1: abcdcd
@@ -6610,14 +6610,14 @@ No match
+0 ^ x
+1 ^^ y
+2 ^ ^ z
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: xyz
abcxyz
--->abcxyz
+0 ^ x
+1 ^^ y
+2 ^ ^ z
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: xyz
\= Expect no match
abc
@@ -6634,7 +6634,7 @@ No match
+0 ^ x
+1 ^^ y
+2 ^ ^ z
- +3 ^ ^
+ +3 ^ ^ End of pattern
0: xyz
\= Expect no match
abc
@@ -6668,7 +6668,7 @@ No match
+15 ^ x
+16 ^^ y
+17 ^ ^ z
-+18 ^ ^
++18 ^ ^ End of pattern
0: xyz
/(?C)ab/
@@ -6684,7 +6684,7 @@ No match
--->ab
+0 ^ a
+1 ^^ b
- +2 ^ ^
+ +2 ^ ^ End of pattern
0: ab
ab\=callout_none
0: ab
@@ -6717,7 +6717,7 @@ No match
+8 ^ [a]
+17 ^ ^ |
+22 ^ ^ $
-+23 ^ ^
++23 ^ ^ End of pattern
0: "ab"
"ab"\=callout_none
0: "ab"