summaryrefslogtreecommitdiff
path: root/doc/pcre2test.txt
diff options
context:
space:
mode:
authorph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069>2017-06-16 17:57:18 +0000
committerph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069>2017-06-16 17:57:18 +0000
commit13c619fc2258ede7a99a1e64341f9631f609830b (patch)
tree958ca0f263c80520cb7139d1ee95740983715711 /doc/pcre2test.txt
parentde41ae21b02996d1167b98d481b28f34836dce75 (diff)
downloadpcre2-13c619fc2258ede7a99a1e64341f9631f609830b.tar.gz
Documentation update.
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@829 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'doc/pcre2test.txt')
-rw-r--r--doc/pcre2test.txt525
1 files changed, 275 insertions, 250 deletions
diff --git a/doc/pcre2test.txt b/doc/pcre2test.txt
index 60d81ac..32ac6d6 100644
--- a/doc/pcre2test.txt
+++ b/doc/pcre2test.txt
@@ -64,12 +64,12 @@ INPUT ENCODING
unless you really want that action.
The input is processed using using C's string functions, so must not
- contain binary zeroes, even though in Unix-like environments, fgets()
+ contain binary zeros, even though in Unix-like environments, fgets()
treats any bytes other than newline as data characters. An error is
- generated if a binary zero is encountered. Subject lines are processed
- for backslash escapes, which makes it possible to include any data
- value in strings that are passed to the library for matching. For pat-
- terns, there is a facility for specifying some or all of the 8-bit
+ generated if a binary zero is encountered. By default subject lines are
+ processed for backslash escapes, which makes it possible to include any
+ data value in strings that are passed to the library for matching. For
+ patterns, there is a facility for specifying some or all of the 8-bit
input characters as hexadecimal pairs, which makes it possible to
include binary zeros.
@@ -319,9 +319,9 @@ COMMAND LINES
When the POSIX API is being tested there is no way to override the
default newline convention, though it is possible to set the newline
- convention from within the pattern. A warning is given if the posix
- modifier is used when #newline_default would set a default for the non-
- POSIX API.
+ convention from within the pattern. A warning is given if the posix or
+ posix_nosub modifier is used when #newline_default would set a default
+ for the non-POSIX API.
#pattern <modifier-list>
@@ -424,8 +424,9 @@ SUBJECT LINE SYNTAX
Before each subject line is passed to pcre2_match() or
pcre2_dfa_match(), leading and trailing white space is removed, and the
- line is scanned for backslash escapes. The following provide a means of
- encoding non-printing characters in a visible way:
+ line is scanned for backslash escapes, unless the subject_literal modi-
+ fier was set for the pattern. The following provide a means of encoding
+ non-printing characters in a visible way:
\a alarm (BEL, \x07)
\b backspace (\x08)
@@ -442,23 +443,23 @@ SUBJECT LINE SYNTAX
\x{hh...} hexadecimal character (any number of hex digits)
The use of \x{hh...} is not dependent on the use of the utf modifier on
- the pattern. It is recognized always. There may be any number of hexa-
- decimal digits inside the braces; invalid values provoke error mes-
+ the pattern. It is recognized always. There may be any number of hexa-
+ decimal digits inside the braces; invalid values provoke error mes-
sages.
- Note that \xhh specifies one byte rather than one character in UTF-8
- mode; this makes it possible to construct invalid UTF-8 sequences for
- testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
- character in UTF-8 mode, generating more than one byte if the value is
- greater than 127. When testing the 8-bit library not in UTF-8 mode,
+ Note that \xhh specifies one byte rather than one character in UTF-8
+ mode; this makes it possible to construct invalid UTF-8 sequences for
+ testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
+ character in UTF-8 mode, generating more than one byte if the value is
+ greater than 127. When testing the 8-bit library not in UTF-8 mode,
\x{hh} generates one byte for values less than 256, and causes an error
for greater values.
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
possible to construct invalid UTF-16 sequences for testing purposes.
- In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
- makes it possible to construct invalid UTF-32 sequences for testing
+ In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
+ makes it possible to construct invalid UTF-32 sequences for testing
purposes.
There is a special backslash sequence that specifies replication of one
@@ -466,33 +467,38 @@ SUBJECT LINE SYNTAX
\[<characters>]{<count>}
- This makes it possible to test long strings without having to provide
+ This makes it possible to test long strings without having to provide
them as part of the file. For example:
\[abc]{4}
- is converted to "abcabcabcabc". This feature does not support nesting.
+ is converted to "abcabcabcabc". This feature does not support nesting.
To include a closing square bracket in the characters, code it as \x5D.
- A backslash followed by an equals sign marks the end of the subject
+ A backslash followed by an equals sign marks the end of the subject
string and the start of a modifier list. For example:
abc\=notbol,notempty
- If the subject string is empty and \= is followed by whitespace, the
- line is treated as a comment line, and is not used for matching. For
+ If the subject string is empty and \= is followed by whitespace, the
+ line is treated as a comment line, and is not used for matching. For
example:
\= This is a comment.
abc\= This is an invalid modifier list.
- A backslash followed by any other non-alphanumeric character just
+ A backslash followed by any other non-alphanumeric character just
escapes that character. A backslash followed by anything else causes an
- error. However, if the very last character in the line is a backslash
- (and there is no modifier list), it is ignored. This gives a way of
- passing an empty line as data, since a real empty line terminates the
+ error. However, if the very last character in the line is a backslash
+ (and there is no modifier list), it is ignored. This gives a way of
+ passing an empty line as data, since a real empty line terminates the
data input.
+ If the subject_literal modifier is set for a pattern, all subject lines
+ that follow are treated as literals, with no special treatment of back-
+ slashes. No replication is possible, and any subject modifiers must be
+ set as defaults by a #subject command.
+
PATTERN MODIFIERS
@@ -530,7 +536,10 @@ PATTERN MODIFIERS
/x extended set PCRE2_EXTENDED
/xx extended_more set PCRE2_EXTENDED_MORE
firstline set PCRE2_FIRSTLINE
+ literal set PCRE2_LITERAL
+ match_line set PCRE2_EXTRA_MATCH_LINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
+ match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP
@@ -580,6 +589,7 @@ PATTERN MODIFIERS
push push compiled pattern onto the stack
pushcopy push a copy onto the stack
stackguard=<number> test the stackguard feature
+ subject_literal treat all subject lines as literal
tables=[0|1|2] select internal tables
use_length do not zero-terminate the pattern
utf8_input treat input as UTF-8
@@ -659,16 +669,6 @@ PATTERN MODIFIERS
testing that pcre2_compile() behaves correctly in this case (it uses
default values).
- Specifying the pattern's length
-
- By default, patterns are passed to the compiling functions as zero-ter-
- minated strings. When using the POSIX wrapper API, there is no other
- option. However, when using PCRE2's native API, patterns can be passed
- by length instead of being zero-terminated. The use_length modifier
- causes this to happen. Using a length happens automatically (whether
- or not use_length is set) when hex is set, because patterns specified
- in hexadecimal may contain binary zeros.
-
Specifying pattern characters in hexadecimal
The hex modifier specifies that the characters of the pattern, except
@@ -690,61 +690,68 @@ PATTERN MODIFIERS
ing the delimiter within a substring. The hex and expand modifiers are
mutually exclusive.
- The POSIX API cannot be used with patterns specified in hexadecimal
- because they may contain binary zeros, which conflicts with regcomp()'s
- requirement for a zero-terminated string. Such patterns are always
- passed to pcre2_compile() as a string with a length, not as zero-termi-
- nated.
+ Specifying the pattern's length
+
+ By default, patterns are passed to the compiling functions as zero-ter-
+ minated strings but can be passed by length instead of being zero-ter-
+ minated. The use_length modifier causes this to happen. Using a length
+ happens automatically (whether or not use_length is set) when hex is
+ set, because patterns specified in hexadecimal may contain binary
+ zeros.
+
+ If hex or use_length is used with the POSIX wrapper API (see "Using the
+ POSIX wrapper API" below), the REG_PEND extension is used to pass the
+ pattern's length.
Specifying wide characters in 16-bit and 32-bit modes
In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
- and translated to UTF-16 or UTF-32 when the utf modifier is set. For
+ and translated to UTF-16 or UTF-32 when the utf modifier is set. For
testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
- modifier can be used. It is mutually exclusive with utf. Input lines
+ modifier can be used. It is mutually exclusive with utf. Input lines
are interpreted as UTF-8 as a means of specifying wide characters. More
details are given in "Input encoding" above.
Generating long repetitive patterns
- Some tests use long patterns that are very repetitive. Instead of cre-
- ating a very long input line for such a pattern, you can use a special
- repetition feature, similar to the one described for subject lines
- above. If the expand modifier is present on a pattern, parts of the
+ Some tests use long patterns that are very repetitive. Instead of cre-
+ ating a very long input line for such a pattern, you can use a special
+ repetition feature, similar to the one described for subject lines
+ above. If the expand modifier is present on a pattern, parts of the
pattern that have the form
\[<characters>]{<count>}
are expanded before the pattern is passed to pcre2_compile(). For exam-
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
- cannot be nested. An initial "\[" sequence is recognized only if "]{"
- followed by decimal digits and "}" is found later in the pattern. If
+ cannot be nested. An initial "\[" sequence is recognized only if "]{"
+ followed by decimal digits and "}" is found later in the pattern. If
not, the characters remain in the pattern unaltered. The expand and hex
modifiers are mutually exclusive.
- If part of an expanded pattern looks like an expansion, but is really
+ If part of an expanded pattern looks like an expansion, but is really
part of the actual pattern, unwanted expansion can be avoided by giving
two values in the quantifier. For example, \[AB]{6000,6000} is not rec-
ognized as an expansion item.
- If the info modifier is set on an expanded pattern, the result of the
+ If the info modifier is set on an expanded pattern, the result of the
expansion is included in the information that is output.
JIT compilation
- Just-in-time (JIT) compiling is a heavyweight optimization that can
- greatly speed up pattern matching. See the pcre2jit documentation for
- details. JIT compiling happens, optionally, after a pattern has been
- successfully compiled into an internal form. The JIT compiler converts
+ Just-in-time (JIT) compiling is a heavyweight optimization that can
+ greatly speed up pattern matching. See the pcre2jit documentation for
+ details. JIT compiling happens, optionally, after a pattern has been
+ successfully compiled into an internal form. The JIT compiler converts
this to optimized machine code. It needs to know whether the match-time
options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used,
- because different code is generated for the different cases. See the
- partial modifier in "Subject Modifiers" below for details of how these
+ because different code is generated for the different cases. See the
+ partial modifier in "Subject Modifiers" below for details of how these
options are specified for each match attempt.
- JIT compilation is requested by the /jit pattern modifier, which may
+ JIT compilation is requested by the jit pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to
- 7. The three bits that make up the number specify which of the three
+ 7. The three bits that make up the number specify which of the three
JIT operating modes are to be compiled:
1 compile JIT code for non-partial matching
@@ -761,31 +768,31 @@ PATTERN MODIFIERS
6 soft and hard partial matching only
7 all three modes
- If no number is given, 7 is assumed. The phrase "partial matching"
+ If no number is given, 7 is assumed. The phrase "partial matching"
means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the
- PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
+ PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
plete match; the options enable the possibility of a partial match, but
- do not require it. Note also that if you request JIT compilation only
- for partial matching (for example, /jit=2) but do not set the partial
- modifier on a subject line, that match will not use JIT code because
+ do not require it. Note also that if you request JIT compilation only
+ for partial matching (for example, jit=2) but do not set the partial
+ modifier on a subject line, that match will not use JIT code because
none was compiled for non-partial matching.
- If JIT compilation is successful, the compiled JIT code will automati-
- cally be used when an appropriate type of match is run, except when
- incompatible run-time options are specified. For more details, see the
- pcre2jit documentation. See also the jitstack modifier below for a way
+ If JIT compilation is successful, the compiled JIT code will automati-
+ cally be used when an appropriate type of match is run, except when
+ incompatible run-time options are specified. For more details, see the
+ pcre2jit documentation. See also the jitstack modifier below for a way
of setting the size of the JIT stack.
- If the jitfast modifier is specified, matching is done using the JIT
- "fast path" interface, pcre2_jit_match(), which skips some of the san-
- ity checks that are done by pcre2_match(), and of course does not work
- when JIT is not supported. If jitfast is specified without jit, jit=7
+ If the jitfast modifier is specified, matching is done using the JIT
+ "fast path" interface, pcre2_jit_match(), which skips some of the san-
+ ity checks that are done by pcre2_match(), and of course does not work
+ when JIT is not supported. If jitfast is specified without jit, jit=7
is assumed.
- If the jitverify modifier is specified, information about the compiled
- pattern shows whether JIT compilation was or was not successful. If
- jitverify is specified without jit, jit=7 is assumed. If JIT compila-
- tion is successful when jitverify is set, the text "(JIT)" is added to
+ If the jitverify modifier is specified, information about the compiled
+ pattern shows whether JIT compilation was or was not successful. If
+ jitverify is specified without jit, jit=7 is assumed. If JIT compila-
+ tion is successful when jitverify is set, the text "(JIT)" is added to
the first output line after a match or non match when JIT-compiled code
was actually used in the match.
@@ -796,19 +803,19 @@ PATTERN MODIFIERS
/pattern/locale=fr_FR
The given locale is set, pcre2_maketables() is called to build a set of
- character tables for the locale, and this is then passed to pcre2_com-
- pile() when compiling the regular expression. The same tables are used
- when matching the following subject lines. The locale modifier applies
+ character tables for the locale, and this is then passed to pcre2_com-
+ pile() when compiling the regular expression. The same tables are used
+ when matching the following subject lines. The locale modifier applies
only to the pattern on which it appears, but can be given in a #pattern
- command if a default is needed. Setting a locale and alternate charac-
+ command if a default is needed. Setting a locale and alternate charac-
ter tables are mutually exclusive.
Showing pattern memory
The memory modifier causes the size in bytes of the memory used to hold
- the compiled pattern to be output. This does not include the size of
- the pcre2_code block; it is just the actual compiled data. If the pat-
- tern is subsequently passed to the JIT compiler, the size of the JIT
+ the compiled pattern to be output. This does not include the size of
+ the pcre2_code block; it is just the actual compiled data. If the pat-
+ tern is subsequently passed to the JIT compiler, the size of the JIT
compiled code is also output. Here is an example:
re> /a(b)c/jit,memory
@@ -818,27 +825,27 @@ PATTERN MODIFIERS
Limiting nested parentheses
- The parens_nest_limit modifier sets a limit on the depth of nested
- parentheses in a pattern. Breaching the limit causes a compilation
- error. The default for the library is set when PCRE2 is built, but
- pcre2test sets its own default of 220, which is required for running
+ The parens_nest_limit modifier sets a limit on the depth of nested
+ parentheses in a pattern. Breaching the limit causes a compilation
+ error. The default for the library is set when PCRE2 is built, but
+ pcre2test sets its own default of 220, which is required for running
the standard test suite.
Limiting the pattern length
- The max_pattern_length modifier sets a limit, in code units, to the
+ The max_pattern_length modifier sets a limit, in code units, to the
length of pattern that pcre2_compile() will accept. Breaching the limit
- causes a compilation error. The default is the largest number a
+ causes a compilation error. The default is the largest number a
PCRE2_SIZE variable can hold (essentially unlimited).
Using the POSIX wrapper API
- The /posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
- the POSIX wrapper API rather than its native API. When posix_nosub is
- used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
- wrapper supports only the 8-bit library. Note that it does not imply
+ The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
+ the POSIX wrapper API rather than its native API. When posix_nosub is
+ used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
+ wrapper supports only the 8-bit library. Note that it does not imply
POSIX matching semantics; for more detail see the pcre2posix documenta-
- tion. The following pattern modifiers set options for the regcomp()
+ tion. The following pattern modifiers set options for the regcomp()
function:
caseless REG_ICASE
@@ -848,35 +855,39 @@ PATTERN MODIFIERS
ucp REG_UCP ) the POSIX standard
utf REG_UTF8 )
- The regerror_buffsize modifier specifies a size for the error buffer
- that is passed to regerror() in the event of a compilation error. For
+ The regerror_buffsize modifier specifies a size for the error buffer
+ that is passed to regerror() in the event of a compilation error. For
example:
/abc/posix,regerror_buffsize=20
- This provides a means of testing the behaviour of regerror() when the
- buffer is too small for the error message. If this modifier has not
+ This provides a means of testing the behaviour of regerror() when the
+ buffer is too small for the error message. If this modifier has not
been set, a large buffer is used.
- The aftertext and allaftertext subject modifiers work as described
- below. All other modifiers are either ignored, with a warning message,
+ The aftertext and allaftertext subject modifiers work as described
+ below. All other modifiers are either ignored, with a warning message,
or cause an error.
+ The pattern is passed to regcomp() as a zero-terminated string by
+ default, but if the use_length or hex modifiers are set, the REG_PEND
+ extension is used to pass it by length.
+
Testing the stack guard feature
- The stackguard modifier is used to test the use of pcre2_set_com-
- pile_recursion_guard(), a function that is provided to enable stack
- availability to be checked during compilation (see the pcre2api docu-
- mentation for details). If the number specified by the modifier is
+ The stackguard modifier is used to test the use of pcre2_set_com-
+ pile_recursion_guard(), a function that is provided to enable stack
+ availability to be checked during compilation (see the pcre2api docu-
+ mentation for details). If the number specified by the modifier is
greater than zero, pcre2_set_compile_recursion_guard() is called to set
- up callback from pcre2_compile() to a local function. The argument it
- receives is the current nesting parenthesis depth; if this is greater
+ up callback from pcre2_compile() to a local function. The argument it
+ receives is the current nesting parenthesis depth; if this is greater
than the value given by the modifier, non-zero is returned, causing the
compilation to be aborted.
Using alternative character tables
- The value specified for the tables modifier must be one of the digits
+ The value specified for the tables modifier must be one of the digits
0, 1, or 2. It causes a specific set of built-in character tables to be
passed to pcre2_compile(). This is used in the PCRE2 tests to check be-
haviour with different character tables. The digit specifies the tables
@@ -887,23 +898,25 @@ PATTERN MODIFIERS
pcre2_chartables.c.dist
2 a set of tables defining ISO 8859 characters
- In table 2, some characters whose codes are greater than 128 are iden-
- tified as letters, digits, spaces, etc. Setting alternate character
+ In table 2, some characters whose codes are greater than 128 are iden-
+ tified as letters, digits, spaces, etc. Setting alternate character
tables and a locale are mutually exclusive.
Setting certain match controls
The following modifiers are really subject modifiers, and are described
- below. However, they may be included in a pattern's modifier list, in
- which case they are applied to every subject line that is processed
- with that pattern. They may not appear in #pattern commands. These mod-
- ifiers do not affect the compilation process.
+ under "Subject Modifiers" below. However, they may be included in a
+ pattern's modifier list, in which case they are applied to every sub-
+ ject line that is processed with that pattern. They may not appear in
+ #pattern commands. These modifiers do not affect the compilation
+ process.
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text
/g global global matching
+ jitstack=<n> set size of JIT stack
mark show mark values
replace=<string> specify a replacement string
startchar show starting character when relevant
@@ -915,6 +928,14 @@ PATTERN MODIFIERS
These modifiers may not appear in a #pattern command. If you want them
as defaults, set them in a #subject command.
+ Specifying literal subject lines
+
+ If the subject_literal modifier is present on a pattern, all the sub-
+ ject lines that it matches are taken as literal strings, with no inter-
+ pretation of backslashes. It is not possible to set subject modifiers
+ on such lines, but any that are set as defaults by a #subject command
+ are recognized.
+
Saving a compiled pattern
When a pattern with the push modifier is successfully compiled, it is
@@ -959,11 +980,11 @@ SUBJECT MODIFIERS
The partial matching modifiers are provided with abbreviations because
they appear frequently in tests.
- If the posix modifier was present on the pattern, causing the POSIX
- wrapper API to be used, the only option-setting modifiers that have any
- effect are notbol, notempty, and noteol, causing REG_NOTBOL,
- REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec().
- The other modifiers are ignored, with a warning message.
+ If the posix or posix_nosub modifier was present on the pattern, caus-
+ ing the POSIX wrapper API to be used, the only option-setting modifiers
+ that have any effect are notbol, notempty, and noteol, causing REG_NOT-
+ BOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
+ regexec(). The other modifiers are ignored, with a warning message.
There is one additional modifier that can be used with the POSIX wrap-
per. It is ignored (with a warning) if used for non-POSIX matching.
@@ -971,16 +992,19 @@ SUBJECT MODIFIERS
posix_startend=<n>[:<m>]
This causes the subject string to be passed to regexec() using the
- REG_STARTEND option, which uses offsets to restrict which part of the
+ REG_STARTEND option, which uses offsets to specify which part of the
string is searched. If only one number is given, the end offset is
passed as the end of the subject string. For more detail of REG_STAR-
- TEND, see the pcre2posix documentation.
+ TEND, see the pcre2posix documentation. If the subject string contains
+ binary zeros (coded as escapes such as \x{00} because pcre2test does
+ not support actual binary zeros in its input), you must use posix_star-
+ tend to specify its length.
Setting match controls
- The following modifiers affect the matching process or request addi-
- tional information. Some of them may also be specified on a pattern
- line (see above), in which case they apply to every subject line that
+ The following modifiers affect the matching process or request addi-
+ tional information. Some of them may also be specified on a pattern
+ line (see above), in which case they apply to every subject line that
is matched against that pattern.
aftertext show text after match
@@ -1020,29 +1044,29 @@ SUBJECT MODIFIERS
zero_terminate pass the subject as zero-terminated
The effects of these modifiers are described in the following sections.
- When matching via the POSIX wrapper API, the aftertext, allaftertext,
- and ovector subject modifiers work as described below. All other modi-
+ When matching via the POSIX wrapper API, the aftertext, allaftertext,
+ and ovector subject modifiers work as described below. All other modi-
fiers are either ignored, with a warning message, or cause an error.
Showing more text
- The aftertext modifier requests that as well as outputting the part of
+ The aftertext modifier requests that as well as outputting the part of
the subject string that matched the entire pattern, pcre2test should in
addition output the remainder of the subject string. This is useful for
tests where the subject contains multiple copies of the same substring.
- The allaftertext modifier requests the same action for captured sub-
+ The allaftertext modifier requests the same action for captured sub-
strings as well as the main matched substring. In each case the remain-
der is output on the following line with a plus character following the
capture number.
- The allusedtext modifier requests that all the text that was consulted
- during a successful pattern match by the interpreter should be shown.
- This feature is not supported for JIT matching, and if requested with
- JIT it is ignored (with a warning message). Setting this modifier
+ The allusedtext modifier requests that all the text that was consulted
+ during a successful pattern match by the interpreter should be shown.
+ This feature is not supported for JIT matching, and if requested with
+ JIT it is ignored (with a warning message). Setting this modifier
affects the output if there is a lookbehind at the start of a match, or
- a lookahead at the end, or if \K is used in the pattern. Characters
- that precede or follow the start and end of the actual match are indi-
- cated in the output by '<' or '>' characters underneath them. Here is
+ a lookahead at the end, or if \K is used in the pattern. Characters
+ that precede or follow the start and end of the actual match are indi-
+ cated in the output by '<' or '>' characters underneath them. Here is
an example:
re> /(?<=pqr)abc(?=xyz)/
@@ -1050,16 +1074,16 @@ SUBJECT MODIFIERS
0: pqrabcxyz
<<< >>>
- This shows that the matched string is "abc", with the preceding and
- following strings "pqr" and "xyz" having been consulted during the
+ This shows that the matched string is "abc", with the preceding and
+ following strings "pqr" and "xyz" having been consulted during the
match (when processing the assertions).
- The startchar modifier requests that the starting character for the
- match be indicated, if it is different to the start of the matched
+ The startchar modifier requests that the starting character for the
+ match be indicated, if it is different to the start of the matched
string. The only time when this occurs is when \K has been processed as
part of the match. In this situation, the output for the matched string
- is displayed from the starting character instead of from the match
- point, with circumflex characters under the earlier characters. For
+ is displayed from the starting character instead of from the match
+ point, with circumflex characters under the earlier characters. For
example:
re> /abc\Kxyz/
@@ -1067,7 +1091,7 @@ SUBJECT MODIFIERS
0: abcxyz
^^^
- Unlike allusedtext, the startchar modifier can be used with JIT. How-
+ Unlike allusedtext, the startchar modifier can be used with JIT. How-
ever, these two modifiers are mutually exclusive.
Showing the value of all capture groups
@@ -1075,98 +1099,98 @@ SUBJECT MODIFIERS
The allcaptures modifier requests that the values of all potential cap-
tured parentheses be output after a match. By default, only those up to
the highest one actually used in the match are output (corresponding to
- the return code from pcre2_match()). Groups that did not take part in
- the match are output as "<unset>". This modifier is not relevant for
- DFA matching (which does no capturing); it is ignored, with a warning
+ the return code from pcre2_match()). Groups that did not take part in
+ the match are output as "<unset>". This modifier is not relevant for
+ DFA matching (which does no capturing); it is ignored, with a warning
message, if present.
Testing callouts
- A callout function is supplied when pcre2test calls the library match-
- ing functions, unless callout_none is specified. If callout_capture is
- set, the current captured groups are output when a callout occurs. The
+ A callout function is supplied when pcre2test calls the library match-
+ ing functions, unless callout_none is specified. If callout_capture is
+ set, the current captured groups are output when a callout occurs. The
default return from the callout function is zero, which allows matching
to continue.
- The callout_fail modifier can be given one or two numbers. If there is
- only one number, 1 is returned instead of 0 (causing matching to back-
- track) when a callout of that number is reached. If two numbers
- (<n>:<m>) are given, 1 is returned when callout <n> is reached and
- there have been at least <m> callouts. The callout_error modifier is
- similar, except that PCRE2_ERROR_CALLOUT is returned, causing the
- entire matching process to be aborted. If both these modifiers are set
+ The callout_fail modifier can be given one or two numbers. If there is
+ only one number, 1 is returned instead of 0 (causing matching to back-
+ track) when a callout of that number is reached. If two numbers
+ (<n>:<m>) are given, 1 is returned when callout <n> is reached and
+ there have been at least <m> callouts. The callout_error modifier is
+ similar, except that PCRE2_ERROR_CALLOUT is returned, causing the
+ entire matching process to be aborted. If both these modifiers are set
for the same callout number, callout_error takes precedence.
- Note that callouts with string arguments are always given the number
+ Note that callouts with string arguments are always given the number
zero. See "Callouts" below for a description of the output when a call-
out it taken.
- The callout_data modifier can be given an unsigned or a negative num-
- ber. This is set as the "user data" that is passed to the matching
- function, and passed back when the callout function is invoked. Any
- value other than zero is used as a return from pcre2test's callout
+ The callout_data modifier can be given an unsigned or a negative num-
+ ber. This is set as the "user data" that is passed to the matching
+ function, and passed back when the callout function is invoked. Any
+ value other than zero is used as a return from pcre2test's callout
function.
Finding all matches in a string
Searching for all possible matches within a subject can be requested by
- the global or altglobal modifier. After finding a match, the matching
- function is called again to search the remainder of the subject. The
- difference between global and altglobal is that the former uses the
- start_offset argument to pcre2_match() or pcre2_dfa_match() to start
- searching at a new point within the entire string (which is what Perl
+ the global or altglobal modifier. After finding a match, the matching
+ function is called again to search the remainder of the subject. The
+ difference between global and altglobal is that the former uses the
+ start_offset argument to pcre2_match() or pcre2_dfa_match() to start
+ searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened subject. This makes a
difference to the matching process if the pattern begins with a lookbe-
hind assertion (including \b or \B).
- If an empty string is matched, the next match is done with the
+ If an empty string is matched, the next match is done with the
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
for another, non-empty, match at the same point in the subject. If this
- match fails, the start offset is advanced, and the normal match is
- retried. This imitates the way Perl handles such cases when using the
- /g modifier or the split() function. Normally, the start offset is
- advanced by one character, but if the newline convention recognizes
- CRLF as a newline, and the current character is CR followed by LF, an
+ match fails, the start offset is advanced, and the normal match is
+ retried. This imitates the way Perl handles such cases when using the
+ /g modifier or the split() function. Normally, the start offset is
+ advanced by one character, but if the newline convention recognizes
+ CRLF as a newline, and the current character is CR followed by LF, an
advance of two characters occurs.
Testing substring extraction functions
- The copy and get modifiers can be used to test the pcre2_sub-
+ The copy and get modifiers can be used to test the pcre2_sub-
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
- given more than once, and each can specify a group name or number, for
+ given more than once, and each can specify a group name or number, for
example:
abcd\=copy=1,copy=3,get=G1
- If the #subject command is used to set default copy and/or get lists,
- these can be unset by specifying a negative number to cancel all num-
+ If the #subject command is used to set default copy and/or get lists,
+ these can be unset by specifying a negative number to cancel all num-
bered groups and an empty name to cancel all named groups.
- The getall modifier tests pcre2_substring_list_get(), which extracts
+ The getall modifier tests pcre2_substring_list_get(), which extracts
all captured substrings.
- If the subject line is successfully matched, the substrings extracted
- by the convenience functions are output with C, G, or L after the
- string number instead of a colon. This is in addition to the normal
- full list. The string length (that is, the return from the extraction
+ If the subject line is successfully matched, the substrings extracted
+ by the convenience functions are output with C, G, or L after the
+ string number instead of a colon. This is in addition to the normal
+ full list. The string length (that is, the return from the extraction
function) is given in parentheses after each substring, followed by the
name when the extraction was by name.
Testing the substitution function
- If the replace modifier is set, the pcre2_substitute() function is
- called instead of one of the matching functions. Note that replacement
- strings cannot contain commas, because a comma signifies the end of a
+ If the replace modifier is set, the pcre2_substitute() function is
+ called instead of one of the matching functions. Note that replacement
+ strings cannot contain commas, because a comma signifies the end of a
modifier. This is not thought to be an issue in a test program.
- Unlike subject strings, pcre2test does not process replacement strings
- for escape sequences. In UTF mode, a replacement string is checked to
- see if it is a valid UTF-8 string. If so, it is correctly converted to
- a UTF string of the appropriate code unit width. If it is not a valid
- UTF-8 string, the individual code units are copied directly. This pro-
+ Unlike subject strings, pcre2test does not process replacement strings
+ for escape sequences. In UTF mode, a replacement string is checked to
+ see if it is a valid UTF-8 string. If so, it is correctly converted to
+ a UTF string of the appropriate code unit width. If it is not a valid
+ UTF-8 string, the individual code units are copied directly. This pro-
vides a means of passing an invalid UTF-8 string for testing purposes.
- The following modifiers set options (in additional to the normal match
+ The following modifiers set options (in additional to the normal match
options) for pcre2_substitute():
global PCRE2_SUBSTITUTE_GLOBAL
@@ -1176,8 +1200,8 @@ SUBJECT MODIFIERS
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
- After a successful substitution, the modified string is output, pre-
- ceded by the number of replacements. This may be zero if there were no
+ After a successful substitution, the modified string is output, pre-
+ ceded by the number of replacements. This may be zero if there were no
matches. Here is a simple example of a substitution test:
/abc/replace=xxx
@@ -1186,12 +1210,12 @@ SUBJECT MODIFIERS
=abc=abc=\=global
2: =xxx=xxx=
- Subject and replacement strings should be kept relatively short (fewer
- than 256 characters) for substitution tests, as fixed-size buffers are
- used. To make it easy to test for buffer overflow, if the replacement
- string starts with a number in square brackets, that number is passed
- to pcre2_substitute() as the size of the output buffer, with the
- replacement string starting at the next character. Here is an example
+ Subject and replacement strings should be kept relatively short (fewer
+ than 256 characters) for substitution tests, as fixed-size buffers are
+ used. To make it easy to test for buffer overflow, if the replacement
+ string starts with a number in square brackets, that number is passed
+ to pcre2_substitute() as the size of the output buffer, with the
+ replacement string starting at the next character. Here is an example
that tests the edge case:
/abc/
@@ -1200,11 +1224,11 @@ SUBJECT MODIFIERS
123abc123\=replace=[9]XYZ
Failed: error -47: no more memory
- The default action of pcre2_substitute() is to return
- PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
- the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
- stitute_overflow_length modifier), pcre2_substitute() continues to go
- through the motions of matching and substituting, in order to compute
+ The default action of pcre2_substitute() is to return
+ PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
+ the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
+ stitute_overflow_length modifier), pcre2_substitute() continues to go
+ through the motions of matching and substituting, in order to compute
the size of buffer that is required. When this happens, pcre2test shows
the required buffer length (which includes space for the trailing zero)
as part of the error message. For example:
@@ -1214,105 +1238,106 @@ SUBJECT MODIFIERS
Failed: error -47: no more memory: 10 code units are needed
A replacement string is ignored with POSIX and DFA matching. Specifying
- partial matching provokes an error return ("bad option value") from
+ partial matching provokes an error return ("bad option value") from
pcre2_substitute().
Setting the JIT stack size
- The jitstack modifier provides a way of setting the maximum stack size
- that is used by the just-in-time optimization code. It is ignored if
+ The jitstack modifier provides a way of setting the maximum stack size
+ that is used by the just-in-time optimization code. It is ignored if
JIT optimization is not being used. The value is a number of kilobytes.
- Providing a stack that is larger than the default 32K is necessary only
- for very complicated patterns.
+ Setting zero reverts to the default of 32K. Providing a stack that is
+ larger than the default is necessary only for very complicated pat-
+ terns. If jitstack is set non-zero on a subject line it overrides any
+ value that was set on the pattern.
Setting heap, match, and depth limits
- The heap_limit, match_limit, and depth_limit modifiers set the appro-
- priate limits in the match context. These values are ignored when the
+ The heap_limit, match_limit, and depth_limit modifiers set the appro-
+ priate limits in the match context. These values are ignored when the
find_limits modifier is specified.
Finding minimum limits
- If the find_limits modifier is present on a subject line, pcre2test
- calls the relevant matching function several times, setting different
- values in the match context via pcre2_set_heap_limit(),
- pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
- minimum values for each parameter that allows the match to complete
+ If the find_limits modifier is present on a subject line, pcre2test
+ calls the relevant matching function several times, setting different
+ values in the match context via pcre2_set_heap_limit(),
+ pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
+ minimum values for each parameter that allows the match to complete
without error.
If JIT is being used, only the match limit is relevant. If DFA matching
is being used, only the depth limit is relevant.
- The match_limit number is a measure of the amount of backtracking that
- takes place, and learning the minimum value can be instructive. For
- most simple matches, the number is quite small, but for patterns with
- very large numbers of matching possibilities, it can become large very
+ The match_limit number is a measure of the amount of backtracking that
+ takes place, and learning the minimum value can be instructive. For
+ most simple matches, the number is quite small, but for patterns with
+ very large numbers of matching possibilities, it can become large very
quickly with increasing length of subject string.
- For non-DFA matching, the minimum depth_limit number is a measure of
+ For non-DFA matching, the minimum depth_limit number is a measure of
how much nested backtracking happens (that is, how deeply the pattern's
- tree is searched). In the case of DFA matching, depth_limit controls
- the depth of recursive calls of the internal function that is used for
+ tree is searched). In the case of DFA matching, depth_limit controls
+ the depth of recursive calls of the internal function that is used for
handling pattern recursion, lookaround assertions, and atomic groups.
Showing MARK names
The mark modifier causes the names from backtracking control verbs that
- are returned from calls to pcre2_match() to be displayed. If a mark is
- returned for a match, non-match, or partial match, pcre2test shows it.
- For a match, it is on a line by itself, tagged with "MK:". Otherwise,
+ are returned from calls to pcre2_match() to be displayed. If a mark is
+ returned for a match, non-match, or partial match, pcre2test shows it.
+ For a match, it is on a line by itself, tagged with "MK:". Otherwise,
it is added to the non-match message.
Showing memory usage
- The memory modifier causes pcre2test to log the sizes of all heap mem-
- ory allocation and freeing calls that occur during a call to
- pcre2_match(). These occur only when a match requires a bigger vector
- than the default for remembering backtracking points. In many cases
- there will be no heap memory used and therefore no additional output.
- No heap memory is allocated during matching with pcre2_dfa_match or
- with JIT, so in those cases the memory modifier never has any effect.
+ The memory modifier causes pcre2test to log the sizes of all heap mem-
+ ory allocation and freeing calls that occur during a call to
+ pcre2_match(). These occur only when a match requires a bigger vector
+ than the default for remembering backtracking points. In many cases
+ there will be no heap memory used and therefore no additional output.
+ No heap memory is allocated during matching with pcre2_dfa_match or
+ with JIT, so in those cases the memory modifier never has any effect.
For this modifier to work, the null_context modifier must not be set on
- both the pattern and the subject, though it can be set on one or the
+ both the pattern and the subject, though it can be set on one or the
other.
Setting a starting offset
- The offset modifier sets an offset in the subject string at which
+ The offset modifier sets an offset in the subject string at which
matching starts. Its value is a number of code units, not characters.
Setting an offset limit
- The offset_limit modifier sets a limit for unanchored matches. If a
+ The offset_limit modifier sets a limit for unanchored matches. If a
match cannot be found starting at or before this offset in the subject,
a "no match" return is given. The data value is a number of code units,
- not characters. When this modifier is used, the use_offset_limit modi-
+ not characters. When this modifier is used, the use_offset_limit modi-
fier must have been set for the pattern; if not, an error is generated.
Setting the size of the output vector
- The ovector modifier applies only to the subject line in which it
- appears, though of course it can also be used to set a default in a
- #subject command. It specifies the number of pairs of offsets that are
+ The ovector modifier applies only to the subject line in which it
+ appears, though of course it can also be used to set a default in a
+ #subject command. It specifies the number of pairs of offsets that are
available for storing matching information. The default is 15.
- A value of zero is useful when testing the POSIX API because it causes
+ A value of zero is useful when testing the POSIX API because it causes
regexec() to be called with a NULL capture vector. When not testing the
- POSIX API, a value of zero is used to cause pcre2_match_data_cre-
- ate_from_pattern() to be called, in order to create a match block of
+ POSIX API, a value of zero is used to cause pcre2_match_data_cre-
+ ate_from_pattern() to be called, in order to create a match block of
exactly the right size for the pattern. (It is not possible to create a
- match block with a zero-length ovector; there is always at least one
+ match block with a zero-length ovector; there is always at least one
pair of offsets.)
Passing the subject as zero-terminated
By default, the subject string is passed to a native API matching func-
tion with its correct length. In order to test the facility for passing
- a zero-terminated string, the zero_terminate modifier is provided. It
- causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
- via the POSIX interface, this modifier has no effect, as there is no
- facility for passing a length.)
+ a zero-terminated string, the zero_terminate modifier is provided. It
+ causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
+ via the POSIX interface, this modifier is ignored, with a warning.
When testing pcre2_substitute(), this modifier also has the effect of
passing the replacement string as zero-terminated.
@@ -1513,8 +1538,8 @@ CALLOUTS
position, which can happen if the callout is in a lookbehind assertion.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
- a result of the /auto_callout pattern modifier. In this case, instead
- of showing the callout number, the offset in the pattern, preceded by a
+ a result of the auto_callout pattern modifier. In this case, instead of
+ showing the callout number, the offset in the pattern, preceded by a
plus, is output. For example:
re> /\d?[A-E]\*/auto_callout
@@ -1662,5 +1687,5 @@ AUTHOR
REVISION
- Last updated: 03 June 2017
+ Last updated: 16 June 2017
Copyright (c) 1997-2017 University of Cambridge.