author | ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> | 2014-09-23 11:35:51 +0000 |
---|---|---|
committer | ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> | 2014-09-23 11:35:51 +0000 |
commit | fd8438eb9b6bec69a456b69a7dece77aadc06a36 (patch) | |
tree | b0f09f3d92934ea3ad0570599c861891cf360362 /doc/html/pcre2test.html | |
parent | cf3d2f48e3a1281a47cd544cfd2457b8342037f9 (diff) | |
download | pcre2-fd8438eb9b6bec69a456b69a7dece77aadc06a36.tar.gz |
Documentation scripts
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@79 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'doc/html/pcre2test.html')
-rw-r--r-- | doc/html/pcre2test.html | 1199 |
1 files changed, 1199 insertions, 0 deletions
diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html new file mode 100644 index 0000000..30b527d --- /dev/null +++ b/doc/html/pcre2test.html @@ -0,0 +1,1199 @@ +<html> +<head> +<title>pcre2test specification</title> +</head> +<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB"> +<h1>pcre2test man page</h1> +<p> +Return to the <a href="index.html">PCRE2 index page</a>. +</p> +<p> +This page is part of the PCRE2 HTML documentation. It was generated +automatically from the original man page. If there is any nonsense in it, +please consult the man page, in case the conversion went wrong. +<br> +<ul> +<li><a name="TOC1" href="#SEC1">SYNOPSIS</a> +<li><a name="TOC2" href="#SEC2">PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a> +<li><a name="TOC3" href="#SEC3">INPUT ENCODING</a> +<li><a name="TOC4" href="#SEC4">COMMAND LINE OPTIONS</a> +<li><a name="TOC5" href="#SEC5">DESCRIPTION</a> +<li><a name="TOC6" href="#SEC6">COMMAND LINES</a> +<li><a name="TOC7" href="#SEC7">MODIFIER SYNTAX</a> +<li><a name="TOC8" href="#SEC8">PATTERN SYNTAX</a> +<li><a name="TOC9" href="#SEC9">SUBJECT LINE SYNTAX</a> +<li><a name="TOC10" href="#SEC10">PATTERN MODIFIERS</a> +<li><a name="TOC11" href="#SEC11">SUBJECT MODIFIERS</a> +<li><a name="TOC12" href="#SEC12">THE ALTERNATIVE MATCHING FUNCTION</a> +<li><a name="TOC13" href="#SEC13">DEFAULT OUTPUT FROM pcre2test</a> +<li><a name="TOC14" href="#SEC14">OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION</a> +<li><a name="TOC15" href="#SEC15">RESTARTING AFTER A PARTIAL MATCH</a> +<li><a name="TOC16" href="#SEC16">CALLOUTS</a> +<li><a name="TOC17" href="#SEC17">NON-PRINTING CHARACTERS</a> +<li><a name="TOC18" href="#SEC18">SEE ALSO</a> +<li><a name="TOC19" href="#SEC19">AUTHOR</a> +<li><a name="TOC20" href="#SEC20">REVISION</a> +</ul> +<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br> +<P> +<b>pcre2test [options] [input file [output file]]</b> +<br> +<br> +<b>pcre2test</b> is a test program for the PCRE2 regular 
expression libraries, +but it can also be used for experimenting with regular expressions. This +document describes the features of the test program; for details of the regular +expressions themselves, see the +<a href="pcre2pattern.html"><b>pcre2pattern</b></a> +documentation. For details of the PCRE2 library function calls and their +options, see the +<a href="pcre2api.html"><b>pcre2api</b></a> +documentation. +</P> +<P> +The input for <b>pcre2test</b> is a sequence of regular expression patterns and +subject strings to be matched. The output shows the result of each match +attempt. Modifiers on the command line, the patterns, and the subject lines +specify PCRE2 function options, control how the subject is processed, and what +output is produced. +</P> +<P> +As the original fairly simple PCRE library evolved, it acquired many different +features, and as a result, the original <b>pcretest</b> program ended up with a +lot of options in a messy, arcane syntax, for testing all the features. The +move to the new PCRE2 API provided an opportunity to re-implement the test +program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there +are still many obscure modifiers, some of which are specifically designed for +use in conjunction with the test script and data files that are distributed as +part of PCRE2. All the modifiers are documented here, some without much +justification, but many of them are unlikely to be of use except when testing +the libraries. +</P> +<br><a name="SEC2" href="#TOC1">PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br> +<P> +Different versions of the PCRE2 library can be built to support character +strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or +all three of these libraries may be simultaneously installed. The +<b>pcre2test</b> program can be used to test all the libraries. However, its own +input and output are always in 8-bit format. 
When testing the 16-bit or 32-bit +libraries, patterns and subject strings are converted to 16- or 32-bit format +before being passed to the library functions. Results are converted back to +8-bit code units for output. +</P> +<P> +In the rest of this document, the names of library functions and structures +are given in generic form, for example, <b>pcre2_compile()</b>. The actual +names used in the libraries have a suffix _8, _16, or _32, as appropriate. +</P> +<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br> +<P> +Input to <b>pcre2test</b> is processed line by line, either by calling the C +library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see +below). In Unix-like environments, <b>fgets()</b> treats any bytes other than +newline as data characters. However, in some Windows environments character 26 +(hex 1A) causes an immediate end of file, and no further data is read. For +maximum portability, therefore, it is safest to avoid non-printing characters +in <b>pcre2test</b> input files. +</P> +<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br> +<P> +<b>-8</b> +If the 8-bit library has been built, this option causes it to be used (this is +the default). If the 8-bit library has not been built, this option causes an +error. +</P> +<P> +<b>-16</b> +If the 16-bit library has been built, this option causes it to be used. If only +the 16-bit library has been built, this is the default. If the 16-bit library +has not been built, this option causes an error. +</P> +<P> +<b>-32</b> +If the 32-bit library has been built, this option causes it to be used. If only +the 32-bit library has been built, this is the default. If the 32-bit library +has not been built, this option causes an error. +</P> +<P> +<b>-b</b> +Behave as if each pattern has the <b>/fullbincode</b> modifier; the full +internal binary form of the pattern is output after compilation. 
+</P> +<P> +<b>-C</b> +Output the version number of the PCRE2 library, and all available information +about the optional features that are included, and then exit with zero exit +code. All other options are ignored. +</P> +<P> +<b>-C</b> <i>option</i> +Output information about a specific build-time option, then exit. This +functionality is intended for use in scripts such as <b>RunTest</b>. The +following options output the value and set the exit code as indicated: +<pre> + ebcdic-nl the code for LF (= NL) in an EBCDIC environment: + 0x15 or 0x25 + 0 if used in an ASCII environment + exit code is always 0 + linksize the configured internal link size (2, 3, or 4) + exit code is set to the link size + newline the default newline setting: + CR, LF, CRLF, ANYCRLF, or ANY + exit code is always 0 + bsr the default setting for what \R matches: + ANYCRLF or ANY + exit code is always 0 +</pre> +The following options output 1 for true or 0 for false, and set the exit code +to the same value: +<pre> + ebcdic compiled for an EBCDIC environment + jit just-in-time support is available + pcre16 the 16-bit library was built + pcre32 the 32-bit library was built + pcre8 the 8-bit library was built + unicode Unicode support is available +</pre> +If an unknown option is given, an error message is output; the exit code is 0. +</P> +<P> +<b>-d</b> +Behave as if each pattern has the <b>debug</b> modifier; the internal +form and information about the compiled pattern are output after compilation; +<b>-d</b> is equivalent to <b>-b -i</b>. +</P> +<P> +<b>-dfa</b> +Behave as if each subject line has the <b>dfa</b> modifier; matching is done +using the <b>pcre2_dfa_match()</b> function instead of the default +<b>pcre2_match()</b>. +</P> +<P> +<b>-help</b> +Output a brief summary of these options and then exit. +</P> +<P> +<b>-i</b> +Behave as if each pattern has the <b>/info</b> modifier; information about the +compiled pattern is given after compilation. 
+</P> +<P> +<b>-jit</b> +Behave as if each pattern line has the <b>jit</b> modifier; after successful +compilation, each pattern is passed to the just-in-time compiler, if available. +</P> +<P> +<b>-pattern</b> <i>modifier-list</i> +Behave as if each pattern line contains the given modifiers. +</P> +<P> +<b>-q</b> +Do not output the version number of <b>pcre2test</b> at the start of execution. +</P> +<P> +<b>-S</b> <i>size</i> +On Unix-like systems, set the size of the run-time stack to <i>size</i> +megabytes. +</P> +<P> +<b>-subject</b> <i>modifier-list</i> +Behave as if each subject line contains the given modifiers. +</P> +<P> +<b>-t</b> +Run each compile and match many times with a timer, and output the resulting +times per compile or match. You can control the number of iterations that are +used for timing by following <b>-t</b> with a number (as a separate item on the +command line). For example, "-t 1000" iterates 1000 times. The default is to +iterate 500,000 times. +</P> +<P> +<b>-tm</b> +This is like <b>-t</b> except that it times only the matching phase, not the +compile phase. +</P> +<P> +<b>-T</b> <b>-TM</b> +These behave like <b>-t</b> and <b>-tm</b>, but in addition, at the end of a run, +the total times for all compiles and matches are output. +</P> +<P> +<b>-version</b> +Output the PCRE2 version number and then exit. +</P> +<br><a name="SEC5" href="#TOC1">DESCRIPTION</a><br> +<P> +If <b>pcre2test</b> is given two filename arguments, it reads from the first and +writes to the second. If it is given only one filename argument, it reads from +that file and writes to stdout. Otherwise, it reads from stdin and writes to +stdout, and prompts for each line of input, using "re>" to prompt for regular +expression patterns, and "data>" to prompt for subject lines. +</P> +<P> +When <b>pcre2test</b> is built, a configuration option can specify that it +should be linked with the <b>libreadline</b> or <b>libedit</b> library. 
When this +is done, if the input is from a terminal, it is read using the <b>readline()</b> +function. This provides line-editing and history facilities. The output from +the <b>-help</b> option states whether or not <b>readline()</b> will be used. +</P> +<P> +The program handles any number of tests, each of which consists of a set of +input lines. Each set starts with a regular expression pattern, followed by any +number of subject lines to be matched against that pattern. In between sets of +test data, command lines that begin with a hash (#) character may appear. This +file format, with some restrictions, can also be processed by the +<b>perltest.pl</b> script that is distributed with PCRE2 as a means of checking +that the behaviour of PCRE2 and Perl is the same. +</P> +<P> +Each subject line is matched separately and independently. If you want to do +multi-line matches, you have to use the \n escape sequence (or \r or \r\n, +etc., depending on the newline setting) in a single line of input to encode the +newline sequences. There is no limit on the length of subject lines; the input +buffer is automatically extended if it is too small. There is a replication +feature that makes it possible to generate long subject lines without having to +supply them explicitly. +</P> +<P> +An empty line or the end of the file signals the end of the subject lines for a +test, at which point a new pattern or command line is expected if there is +still input to be read. +</P> +<br><a name="SEC6" href="#TOC1">COMMAND LINES</a><br> +<P> +In between sets of test data, a line that begins with a hash (#) character is +interpreted as a command line. If the first character is followed by white +space or an exclamation mark, the line is treated as a comment, and ignored. 
+Otherwise, the following commands are recognized: +<pre> + #forbid_utf +</pre> +Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP +options set, which locks out the use of UTF and Unicode property features. This +is a trigger guard that is used in test files to ensure that UTF/Unicode tests +are not accidentally added to files that are used when UTF support is not +included in the library. This effect can also be obtained by the use of +<b>#pattern</b>; the difference is that <b>#forbid_utf</b> cannot be unset, and +the automatic options are not displayed in pattern information, to avoid +cluttering up test output. +<pre> + #pattern <modifier-list> +</pre> +This command sets a default modifier list that applies to all subsequent +patterns. Modifiers on a pattern can change these settings. +<pre> + #perltest +</pre> +The appearance of this line causes all subsequent modifier settings to be +checked for compatibility with the <b>perltest.pl</b> script, which is used to +confirm that Perl gives the same results as PCRE2. Also, apart from comment +lines, none of the other command lines are permitted, because they and many +of the modifiers are specific to <b>pcre2test</b>, and should not be used in +test files that are also processed by <b>perltest.pl</b>. The <b>#perltest</b> +command helps detect tests that are accidentally put in the wrong file. +<pre> + #subject <modifier-list> +</pre> +This command sets a default modifier list that applies to all subsequent +subject lines. Modifiers on a subject line can change these settings. +</P> +<br><a name="SEC7" href="#TOC1">MODIFIER SYNTAX</a><br> +<P> +Modifier lists are used with both pattern and subject lines. Items in a list +are separated by commas and optional white space. Some modifiers may be given +for both patterns and subject lines, whereas others are valid for one or the +other only. 
Each modifier has a long name, for example "anchored", and some of +them must be followed by an equals sign and a value, for example, "offset=12". +Modifiers that do not take values may be preceded by a minus sign to turn off a +previous default setting. +</P> +<P> +A few of the more common modifiers can also be specified as single letters, for +example "i" for "caseless". In documentation, following the Perl convention, +these are written with a slash ("the /i modifier") for clarity. Abbreviated +modifiers must all be concatenated in the first item of a modifier list. If the +first item is not recognized as a long modifier name, it is interpreted as a +sequence of these abbreviations. For example: +<pre> + /abc/ig,newline=cr,jit=3 +</pre> +This is a pattern line whose modifier list starts with two one-letter modifiers +(/i and /g). The lower-case abbreviated modifiers are the same as used in Perl. +</P> +<br><a name="SEC8" href="#TOC1">PATTERN SYNTAX</a><br> +<P> +A pattern line must start with one of the following characters (common symbols, +excluding pattern meta-characters): +<pre> + / ! " ' ` - = _ : ; , % & @ ~ +</pre> +This is interpreted as the pattern's delimiter. A regular expression may be +continued over several input lines, in which case the newline characters are +included within it. It is possible to include the delimiter within the pattern +by escaping it with a backslash, for example +<pre> + /abc\/def/ +</pre> +If you do this, the escape and the delimiter form part of the pattern, but +since the delimiters are all non-alphanumeric, this does not affect its +interpretation. If the terminating delimiter is immediately followed by a +backslash, for example, +<pre> + /abc/\ +</pre> +then a backslash is added to the end of the pattern. 
This is done to provide a +way of testing the error condition that arises if a pattern finishes with a +backslash, because +<pre> + /abc\/ +</pre> +is interpreted as the first line of a pattern that starts with "abc/", causing +pcre2test to read the next line as a continuation of the regular expression. +</P> +<P> +A pattern can be followed by a modifier list (details below). +</P> +<br><a name="SEC9" href="#TOC1">SUBJECT LINE SYNTAX</a><br> +<P> +Before each subject line is passed to <b>pcre2_match()</b> or +<b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the +line is scanned for backslash escapes. The following provide a means of +encoding non-printing characters in a visible way: +<pre> + \a alarm (BEL, \x07) + \b backspace (\x08) + \e escape (\x1b) + \f form feed (\x0c) + \n newline (\x0a) + \r carriage return (\x0d) + \t tab (\x09) + \v vertical tab (\x0b) + \nnn octal character (up to 3 octal digits); always + a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode + \o{dd...} octal character (any number of octal digits) + \xhh hexadecimal byte (up to 2 hex digits) + \x{hh...} hexadecimal character (any number of hex digits) +</pre> +The use of \x{hh...} is not dependent on the use of the utf modifier on +the pattern. It is recognized always. There may be any number of hexadecimal +digits inside the braces; invalid values provoke error messages. +</P> +<P> +Note that \xhh specifies one byte rather than one character in UTF-8 mode; +this makes it possible to construct invalid UTF-8 sequences for testing +purposes. On the other hand, \x{hh} is interpreted as a UTF-8 character in +UTF-8 mode, generating more than one byte if the value is greater than 127. +When testing the 8-bit library not in UTF-8 mode, \x{hh} generates one byte +for values less than 256, and causes an error for greater values. +</P> +<P> +In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. 
This makes it +possible to construct invalid UTF-16 sequences for testing purposes. +</P> +<P> +In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This makes it +possible to construct invalid UTF-32 sequences for testing purposes. +</P> +<P> +There is a special backslash sequence that specifies replication of one or more +characters: +<pre> + \[<characters>]{<count>} +</pre> +This makes it possible to test long strings without having to provide them as +part of the file. For example: +<pre> + \[abc]{4} +</pre> +is converted to "abcabcabcabc". This feature does not support nesting. To +include a closing square bracket in the characters, code it as \x5D. +</P> +<P> +A backslash followed by an equals sign marks the end of the subject string and +the start of a modifier list. For example: +<pre> + abc\=notbol,notempty +</pre> +A backslash followed by any other non-alphanumeric character just escapes that +character. A backslash followed by anything else causes an error. However, if +the very last character in the line is a backslash (and there is no modifier +list), it is ignored. This gives a way of passing an empty line as data, since +a real empty line terminates the data input. +</P> +<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br> +<P> +There are three types of modifier that can appear in pattern lines, two of +which may also be used in a <b>#pattern</b> command. A pattern's modifier list +can add to or override default modifiers that were set by a previous +<b>#pattern</b> command. +</P> +<br><b> +Setting compilation options +</b><br> +<P> +The following modifiers set options for <b>pcre2_compile()</b>. The most common +ones have single-letter abbreviations. See +<a href="pcre2api.html"><b>pcre2api</b></a> +for a description of their effects. 
+<pre> + allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS + alt_bsux set PCRE2_ALT_BSUX + anchored set PCRE2_ANCHORED + auto_callout set PCRE2_AUTO_CALLOUT + /i caseless set PCRE2_CASELESS + dollar_endonly set PCRE2_DOLLAR_ENDONLY + /s dotall set PCRE2_DOTALL + dupnames set PCRE2_DUPNAMES + /x extended set PCRE2_EXTENDED + firstline set PCRE2_FIRSTLINE + match_unset_backref set PCRE2_MATCH_UNSET_BACKREF + /m multiline set PCRE2_MULTILINE + never_ucp set PCRE2_NEVER_UCP + never_utf set PCRE2_NEVER_UTF + no_auto_capture set PCRE2_NO_AUTO_CAPTURE + no_auto_possess set PCRE2_NO_AUTO_POSSESS + no_start_optimize set PCRE2_NO_START_OPTIMIZE + no_utf_check set PCRE2_NO_UTF_CHECK + ucp set PCRE2_UCP + ungreedy set PCRE2_UNGREEDY + utf set PCRE2_UTF +</pre> +As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all +non-printing characters in output strings to be printed using the \x{hh...} +notation. Otherwise, those less than 0x100 are output in hex without the curly +brackets. +</P> +<br><b> +Setting compilation controls +</b><br> +<P> +The following modifiers affect the compilation process or request information +about the pattern: +<pre> + bsr=[anycrlf|unicode] specify \R handling + /B bincode show binary code without lengths + debug same as info,fullbincode + fullbincode show binary code with lengths + /I info show info about compiled pattern + hex pattern is coded in hexadecimal + jit[=<number>] use JIT + locale=<name> use this locale + memory show memory used + newline=<type> set newline type + parens_nest_limit=<n> set maximum parentheses depth + perlcompat lock out non-Perl modifiers + posix use the POSIX API + stackguard=<number> test the stackguard feature + tables=[0|1|2] select internal tables + use_length use the pattern's length +</pre> +The effects of these modifiers are described in the following sections. +FIXME: Give more examples. 
+</P> +<br><b> +Newline and \R handling +</b><br> +<P> +The <b>bsr</b> modifier specifies what \R in a pattern should match. If it is +set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to "unicode", +\R matches any Unicode newline sequence. The default is specified when PCRE2 +is built, with the default default being Unicode. +</P> +<P> +The <b>newline</b> modifier specifies which characters are to be interpreted as +newlines, both in the pattern and (by default) in subject lines. The type must +be one of CR, LF, CRLF, ANYCRLF, or ANY. +</P> +<P> +Both the \R and newline settings can be changed at match time, but if this is +done, JIT matching is disabled. +</P> +<br><b> +Information about a pattern +</b><br> +<P> +The <b>debug</b> modifier is a shorthand for <b>info,fullbincode</b>, requesting +all available information. +</P> +<P> +The <b>bincode</b> modifier causes a representation of the compiled code to be +output after compilation. This information does not contain length and offset +values, which ensures that the same output is generated for different internal +link sizes and different code unit widths. By using <b>bincode</b>, the same +regression tests can be used in different environments. +</P> +<P> +The <b>fullbincode</b> modifier, by contrast, <i>does</i> include length and +offset values. This is used in a few special tests and is also useful for +one-off tests. +</P> +<P> +The <b>info</b> modifier requests information about the compiled pattern +(whether it is anchored, has a fixed first character, and so on). The +information is obtained from the <b>pcre2_pattern_info()</b> function. +</P> +<br><b> +Specifying a pattern in hex +</b><br> +<P> +The <b>hex</b> modifier specifies that the characters of the pattern are to be +interpreted as pairs of hexadecimal digits. White space is permitted between +pairs. 
For example: +<pre> + /ab 32 59/hex +</pre> +This feature is provided as a way of creating patterns that contain binary zero +characters. When <b>hex</b> is set, it implies <b>use_length</b>. +</P> +<br><b> +Using the pattern's length +</b><br> +<P> +By default, <b>pcre2test</b> passes patterns as zero-terminated strings to +<b>pcre2_compile()</b>, giving the length as -1. If <b>use_length</b> is set, the +length of the pattern is passed. This is implied if <b>hex</b> is set. +</P> +<br><b> +JIT compilation +</b><br> +<P> +The <b>/jit</b> modifier may optionally be followed by a number in the range 0 +to 7: +<pre> + 0 disable JIT + 1 normal match only + 2 soft partial match only + 3 normal match and soft partial match + 4 hard partial match only + 6 soft and hard partial match + 7 all three modes +</pre> +If no number is given, 7 is assumed. If JIT compilation is successful, the +compiled JIT code will automatically be used when <b>pcre2_match()</b> is run, +except when incompatible run-time options are specified. For more details, see +the +<a href="pcre2jit.html"><b>pcre2jit</b></a> +documentation. See also the <b>jitstack</b> modifier below for a way of +setting the size of the JIT stack. +</P> +<P> +If the <b>jitverify</b> modifier is specified, the text "(JIT)" is added to the +first output line after a match or non match when JIT-compiled code was +actually used. This modifier can also be set on a subject line. +</P> +<br><b> +Setting a locale +</b><br> +<P> +The <b>/locale</b> modifier must specify the name of a locale, for example: +<pre> + /pattern/locale=fr_FR +</pre> +The given locale is set, <b>pcre2_maketables()</b> is called to build a set of +character tables for the locale, and this is then passed to +<b>pcre2_compile()</b> when compiling the regular expression. The same tables +are used when matching the following subject lines. 
The <b>/locale</b> modifier +applies only to the pattern on which it appears, but can be given in a +<b>#pattern</b> command if a default is needed. Setting a locale and alternate +character tables are mutually exclusive. +</P> +<br><b> +Showing pattern memory +</b><br> +<P> +The <b>/memory</b> modifier causes the size in bytes of the memory block used to +hold the compiled pattern to be output. This does not include the size of the +<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is +subsequently passed to the JIT compiler, the size of the JIT compiled code is +also output. +</P> +<br><b> +Limiting nested parentheses +</b><br> +<P> +The <b>parens_nest_limit</b> modifier sets a limit on the depth of nested +parentheses in a pattern. Breaching the limit causes a compilation error. +</P> +<br><b> +Using the POSIX wrapper API +</b><br> +<P> +The <b>/posix</b> modifier causes <b>pcre2test</b> to call PCRE2 via the POSIX +wrapper API rather than its native API. This supports only the 8-bit library. +When the POSIX API is being used, the following pattern modifiers set options +for the <b>regcomp()</b> function: +<pre> + caseless REG_ICASE + multiline REG_NEWLINE + no_auto_capture REG_NOSUB + dotall REG_DOTALL ) + ungreedy REG_UNGREEDY ) These options are not part of + ucp REG_UCP ) the POSIX standard + utf REG_UTF8 ) +</pre> +The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described +below. All other modifiers cause an error. +</P> +<br><b> +Testing the stack guard feature +</b><br> +<P> +The <b>/stackguard</b> modifier is used to test the use of +<b>pcre2_set_compile_recursion_guard()</b>, a function that is provided to +enable stack availability to be checked during compilation (see the +<a href="pcre2api.html"><b>pcre2api</b></a> +documentation for details). 
If the number specified by the modifier is greater +than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up +callback from <b>pcre2_compile()</b> to a local function. The argument it is +passed is the current nesting parenthesis depth; if this is greater than the +value given by the modifier, non-zero is returned, causing the compilation to +be aborted. +</P> +<br><b> +Using alternative character tables +</b><br> +<P> +The <b>/tables</b> modifier must be followed by a single digit. It causes a +specific set of built-in character tables to be passed to +<b>pcre2_compile()</b>. This is used in the PCRE2 tests to check behaviour with +different character tables. The digit specifies the tables as follows: +<pre> + 0 do not pass any special character tables + 1 the default ASCII tables, as distributed in + pcre2_chartables.c.dist + 2 a set of tables defining ISO 8859 characters +</pre> +In table 2, some characters whose codes are greater than 128 are identified as +letters, digits, spaces, etc. Setting alternate character tables and a locale +are mutually exclusive. +</P> +<br><b> +Setting certain match controls +</b><br> +<P> +The following modifiers are really subject modifiers, and are described below. +However, they may be included in a pattern's modifier list, in which case they +are applied to every subject line that is processed with that pattern. They do +not affect the compilation process. +<pre> + aftertext show text after match + allaftertext show text after captures + allcaptures show all captures + allusedtext show all consulted text + /g global global matching + jitverify verify JIT usage + mark show mark values +</pre> +These modifiers may not appear in a <b>#pattern</b> command. If you want them as +defaults, set them in a <b>#subject</b> command. +</P> +<br><a name="SEC11" href="#TOC1">SUBJECT MODIFIERS</a><br> +<P> +The modifiers that can appear in subject lines and the <b>#subject</b> +command are of two types. 
+</P> +<br><b> +Setting match options +</b><br> +<P> +The following modifiers set options for <b>pcre2_match()</b> or +<b>pcre2_dfa_match()</b>. See +<a href="pcre2api.html"><b>pcre2api</b></a> +for a description of their effects. +<pre> + anchored set PCRE2_ANCHORED + dfa_restart set PCRE2_DFA_RESTART + dfa_shortest set PCRE2_DFA_SHORTEST + no_start_optimize set PCRE2_NO_START_OPTIMIZE + no_utf_check set PCRE2_NO_UTF_CHECK + notbol set PCRE2_NOTBOL + notempty set PCRE2_NOTEMPTY + notempty_atstart set PCRE2_NOTEMPTY_ATSTART + noteol set PCRE2_NOTEOL + partial_hard (or ph) set PCRE2_PARTIAL_HARD + partial_soft (or ps) set PCRE2_PARTIAL_SOFT +</pre> +The partial matching modifiers are provided with abbreviations because they +appear frequently in tests. +</P> +<P> +If the <b>/posix</b> modifier was present on the pattern, causing the POSIX +wrapper API to be used, the only option-setting modifiers that have any effect +are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL, +REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>. +Any other modifiers cause an error. +</P> +<br><b> +Setting match controls +</b><br> +<P> +The following modifiers affect the matching process or request additional +information. Some of them may also be specified on a pattern line (see above), +in which case they apply to every subject line that is matched against that +pattern. 
+<pre> + aftertext show text after match + allaftertext show text after captures + allcaptures show all captures + allusedtext show all consulted text + altglobal alternative global matching + bsr=[anycrlf|unicode] specify \R handling + callout_capture show captures at callout time + callout_data=<n> set a value to pass via callouts + callout_fail=<n>[:<m>] control callout failure + callout_none do not supply a callout function + copy=<number or name> copy captured substring + dfa use <b>pcre2_dfa_match()</b> + find_limits find match and recursion limits + get=<number or name> extract captured substring + getall extract all captured substrings + /g global global matching + jitstack=<n> set size of JIT stack + jitverify verify JIT usage + mark show mark values + match_limit=<n> set a match limit + memory show memory usage + newline=<type> set newline type + offset=<n> set starting offset + ovector=<n> set size of output vector + recursion_limit=<n> set a recursion limit +</pre> +The effects of these modifiers are described in the following sections. +FIXME: Give more examples. +</P> +<br><b> +Newline and \R handling +</b><br> +<P> +These modifiers set the newline and \R processing conventions for the subject +line, overriding any values that were set at compile time (as described above). +JIT matching is disabled if these settings are changed at match time. +</P> +<br><b> +Showing more text +</b><br> +<P> +The <b>aftertext</b> modifier requests that as well as outputting the substring +that matched the entire pattern, <b>pcre2test</b> should in addition output the +remainder of the subject string. This is useful for tests where the subject +contains multiple copies of the same substring. The <b>allaftertext</b> modifier +requests the same action for captured substrings as well as the main matched +substring. In each case the remainder is output on the following line with a +plus character following the capture number. 
+</P> +<P> +The <b>allusedtext</b> modifier requests that all the text that was consulted +during a successful pattern match be shown. This affects the output if there +is a lookbehind at the start of a match, or a lookahead at the end, or if \K +is used in the pattern. Characters that precede or follow the start and end of +the actual match are indicated in the output by '<' or '>' characters +underneath them. Here is an example: +<pre> + /(?<=pqr)abc(?=xyz)/ + 123pqrabcxyz456\=allusedtext + 0: pqrabcxyz + <<< >>> +</pre> +This shows that the matched string is "abc", with the preceding and following +strings "pqr" and "xyz" also consulted during the match. +</P> +<br><b> +Showing the value of all capture groups +</b><br> +<P> +The <b>allcaptures</b> modifier requests that the values of all potential +captured parentheses be output after a match. By default, only those up to the +highest one actually used in the match are output (corresponding to the return +code from <b>pcre2_match()</b>). Groups that did not take part in the match +are output as "<unset>". +</P> +<br><b> +Testing callouts +</b><br> +<P> +A callout function is supplied when <b>pcre2test</b> calls the library matching +functions, unless <b>callout_none</b> is specified. If <b>callout_capture</b> is +set, the current captured groups are output when a callout occurs. +</P> +<P> +The <b>callout_fail</b> modifier can be given one or two numbers. If there is +only one number, 1 is returned instead of 0 when a callout of that number is +reached. If two numbers are given, 1 is returned when callout <n> is reached +for the <m>th time. +</P> +<P> +The <b>callout_data</b> modifier can be given an unsigned or a negative number. +Any value other than zero is used as a return from <b>pcre2test</b>'s callout +function. 
+</P> +<br><b> +Testing substring extraction functions +</b><br> +<P> +The <b>copy</b> and <b>get</b> modifiers can be used to test the +<b>pcre2_substring_copy_xxx()</b> and <b>pcre2_substring_get_xxx()</b> functions. +They can be given more than once, and each can specify a group name or number, +for example: +<pre> + abcd\=copy=1,copy=3,get=G1 +</pre> +If the <b>#subject</b> command is used to set default copy and get lists, these +can be unset by specifying a negative number for numbered groups and an empty +name for named groups. +</P> +<P> +The <b>getall</b> modifier tests <b>pcre2_substring_list_get()</b>, which +extracts all captured substrings. +</P> +<P> +If the subject line is successfully matched, the substrings extracted by the +convenience functions are output with C, G, or L after the string number +instead of a colon. This is in addition to the normal full list. The string +length (that is, the return from the extraction function) is given in +parentheses after each substring. +</P> +<br><b> +Finding all matches in a string +</b><br> +<P> +Searching for all possible matches within a subject can be requested by the +<b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching +function is called again to search the remainder of the subject. The difference +between <b>global</b> and <b>altglobal</b> is that the former uses the +<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> +to start searching at a new point within the entire string (which is what Perl +does), whereas the latter passes over a shortened substring. This makes a +difference to the matching process if the pattern begins with a lookbehind +assertion (including \b or \B). +</P> +<P> +If an empty string is matched, the next match is done with the +PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for +another, non-empty, match at the same point in the subject.
If this match +fails, the start offset is advanced, and the normal match is retried. This +imitates the way Perl handles such cases when using the <b>/g</b> modifier or +the <b>split()</b> function. Normally, the start offset is advanced by one +character, but if the newline convention recognizes CRLF as a newline, and the +current character is CR followed by LF, an advance of two is used. +</P> +<br><b> +Setting the JIT stack size +</b><br> +<P> +The <b>jitstack</b> modifier provides a way of setting the maximum stack size +that is used by the just-in-time optimization code. It is ignored if JIT +optimization is not being used. Providing a stack that is larger than the +default 32K is necessary only for very complicated patterns. +</P> +<br><b> +Setting match and recursion limits +</b><br> +<P> +The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate +limits in the match context. These values are ignored when the +<b>find_limits</b> modifier is specified. +</P> +<br><b> +Finding minimum limits +</b><br> +<P> +If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls +<b>pcre2_match()</b> several times, setting different values in the match +context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b> +until it finds the minimum values for each parameter that allow +<b>pcre2_match()</b> to complete without error. +</P> +<P> +The <i>match_limit</i> number is a measure of the amount of backtracking +that takes place, and learning the minimum value can be instructive. For most +simple matches, the number is quite small, but for patterns with very large +numbers of matching possibilities, it can become large very quickly with +increasing length of subject string. The <i>recursion_limit</i> number is +a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much +heap) memory is needed to complete the match attempt.
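The repeated calls made for find_limits can be pictured as a search for the smallest limit that still lets the match complete. The Python sketch below shows one plausible strategy, a doubling phase followed by bisection; it is an assumption for illustration, not a description of pcre2test's actual code. The callable succeeds is a hypothetical stand-in for running pcre2_match() with a given limit set in the match context.

```python
def find_min_limit(succeeds, start=64):
    """Return the smallest limit for which succeeds(limit) is true.

    Assumes succeeds is monotonic (once a limit is large enough, every
    larger limit also works) and that a limit of 0 never suffices.
    """
    low, high = 0, start          # low is known (or assumed) to fail
    while not succeeds(high):     # doubling phase: find some limit that works
        low, high = high, high * 2
    while high - low > 1:         # bisection phase: narrow down to the minimum
        mid = (low + high) // 2
        if succeeds(mid):
            high = mid
        else:
            low = mid
    return high

# Hypothetical match that needs a limit of at least 1000.
print(find_min_limit(lambda limit: limit >= 1000))
```

The same search would be run once for each parameter, holding the other at a value known to be sufficient.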
+</P> +<br><b> +Showing MARK names +</b><br> +<P> +The <b>mark</b> modifier causes the names from backtracking control verbs that +are returned from calls to <b>pcre2_match()</b> to be displayed. If a mark is +returned for a match, non-match, or partial match, <b>pcre2test</b> shows it. +For a match, it is on a line by itself, tagged with "MK:". Otherwise, it +is added to the non-match message. +</P> +<br><b> +Showing memory usage +</b><br> +<P> +The <b>memory</b> modifier causes <b>pcre2test</b> to log all memory allocation +and freeing calls that occur during a match operation. +</P> +<br><b> +Setting a starting offset +</b><br> +<P> +The <b>offset</b> modifier sets an offset in the subject string at which +matching starts. Its value is a number of code units, not characters. +</P> +<br><b> +Setting the size of the output vector +</b><br> +<P> +The <b>ovector</b> modifier applies only to the subject line in which it +appears, though of course it can also be used to set a default in a +<b>#subject</b> command. It specifies the number of pairs of offsets that are +available for storing matching information. The default is 15. +</P> +<br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br> +<P> +By default, <b>pcre2test</b> uses the standard PCRE2 matching function, +<b>pcre2_match()</b> to match each subject line. PCRE2 also supports an +alternative matching function, <b>pcre2_dfa_match()</b>, which operates in a +different way, and has some restrictions. The differences between the two +functions are described in the +<a href="pcre2matching.html"><b>pcre2matching</b></a> +documentation. +</P> +<P> +If the <b>dfa</b> modifier is set, the alternative matching function is used. +This function finds all possible matches at a given point in the subject. If, +however, the <b>dfa_shortest</b> modifier is set, processing stops after the +first match is found. This is always the shortest possible match. 
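To make "all possible matches at a given point" concrete, the Python sketch below enumerates every end position at which a pattern matches from a fixed start. It uses Python's re module as a stand-in and does not call <b>pcre2_dfa_match()</b>; the helper name all_matches_at and its shortest flag are hypothetical.

```python
import re

def all_matches_at(pattern, subject, start, shortest=False):
    """List every substring beginning at start that matches the whole pattern.

    Results are ordered longest first; with shortest=True only the shortest
    match is returned, loosely mirroring dfa_shortest. Note that fullmatch
    treats each candidate end as the end of the string, so patterns using
    $ or lookahead would behave differently from a real DFA match.
    """
    regex = re.compile(pattern)
    found = [subject[start:end]
             for end in range(start, len(subject) + 1)
             if regex.fullmatch(subject, start, end)]
    if shortest:
        return found[:1]          # the first end position tried is the shortest
    return found[::-1]            # longest match first

print(all_matches_at(r"tang|tangerine|tan", "yellow tangerine", 7))
```

With the pattern and subject used here, the three alternatives all match at offset 7, and the longest one is listed first.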
+</P> +<br><a name="SEC13" href="#TOC1">DEFAULT OUTPUT FROM pcre2test</a><br> +<P> +This section describes the output when the normal matching function, +<b>pcre2_match()</b>, is being used. +</P> +<P> +When a match succeeds, <b>pcre2test</b> outputs the list of captured substrings, +starting with number 0 for the string that matched the whole pattern. +Otherwise, it outputs "No match" when the return is PCRE2_ERROR_NOMATCH, or +"Partial match:" followed by the partially matching substring when the +return is PCRE2_ERROR_PARTIAL. (Note that this is the +entire substring that was inspected during the partial match; it may include +characters before the actual match start if a lookbehind assertion, \K, \b, +or \B was involved.) +</P> +<P> +For any other return, <b>pcre2test</b> outputs the PCRE2 +negative error number and a short descriptive phrase. If the error is a failed +UTF string check, the offset of the start of the failing character and the +reason code are also output. Here is an example of an interactive +<b>pcre2test</b> run. +<pre> + $ pcre2test + PCRE2 version 9.00 2014-05-10 + + re> /^abc(\d+)/ + data> abc123 + 0: abc123 + 1: 123 + data> xyz + No match +</pre> +Unset capturing substrings that are not followed by one that is set are not +returned by <b>pcre2_match()</b>, and are not shown by <b>pcre2test</b>. In the +following example, there are two capturing substrings, but when the first data +line is matched, the second, unset substring is not shown. An "internal" unset +substring is shown as "<unset>", as for the second data line. +<pre> + re> /(a)|(b)/ + data> a + 0: a + 1: a + data> b + 0: b + 1: <unset> + 2: b +</pre> +If the strings contain any non-printing characters, they are output as \xhh +escapes if the value is less than 256 and UTF mode is not set. Otherwise they +are output as \x{hh...} escapes. See below for the definition of non-printing +characters. 
If the <b>/aftertext</b> modifier is set, the output for substring +0 is followed by the rest of the subject string, identified by "0+" like +this: +<pre> + re> /cat/aftertext + data> cataract + 0: cat + 0+ aract +</pre> +If global matching is requested, the results of successive matching attempts +are output in sequence, like this: +<pre> + re> /\Bi(\w\w)/g + data> Mississippi + 0: iss + 1: ss + 0: iss + 1: ss + 0: ipp + 1: pp +</pre> +"No match" is output only if the first match attempt fails. Here is an example +of a failure message (the offset 4 that is specified by the <b>offset</b> modifier is past the end of +the subject string): +<pre> + re> /xyz/ + data> xyz\=offset=4 + Error -24 (bad offset value) +</PRE> +</P> +<P> +Note that whereas patterns can be continued over several lines (a plain ">" +prompt is used for continuations), subject lines may not. However newlines can +be included in a subject by means of the \n escape (or \r, \r\n, etc., +depending on the newline sequence setting). +</P> +<br><a name="SEC14" href="#TOC1">OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION</a><br> +<P> +When the alternative matching function, <b>pcre2_dfa_match()</b>, is used, the +output consists of a list of all the matches that start at the first point in +the subject where there is at least one match. For example: +<pre> + re> /(tang|tangerine|tan)/ + data> yellow tangerine\=dfa + 0: tangerine + 1: tang + 2: tan +</pre> +(Using the normal matching function on this data finds only "tang".) The +longest matching string is always given first (and numbered zero). After a +PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the +partially matching substring. (Note that this is the entire substring that was +inspected during the partial match; it may include characters before the actual +match start if a lookbehind assertion, \K, \b, or \B was involved.) +</P> +<P> +If global matching is requested, the search for further matches resumes +at the end of the longest match.
For example: +<pre> + re> /(tang|tangerine|tan)/g + data> yellow tangerine and tangy sultana\=dfa + 0: tangerine + 1: tang + 2: tan + 0: tang + 1: tan + 0: tan +</pre> +The alternative matching function does not support substring capture, so the +modifiers that are concerned with captured substrings are not relevant. +</P> +<br><a name="SEC15" href="#TOC1">RESTARTING AFTER A PARTIAL MATCH</a><br> +<P> +When the alternative matching function has given the PCRE2_ERROR_PARTIAL +return, indicating that the subject partially matched the pattern, you can +restart the match with additional subject data by means of the +<b>dfa_restart</b> modifier. For example: +<pre> + re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ + data> 23ja\=ps,dfa + Partial match: 23ja + data> n05\=dfa,dfa_restart + 0: n05 +</pre> +For further information about partial matching, see the +<a href="pcre2partial.html"><b>pcre2partial</b></a> +documentation. +</P> +<br><a name="SEC16" href="#TOC1">CALLOUTS</a><br> +<P> +If the pattern contains any callout requests, <b>pcre2test</b>'s callout function +is called during matching. This works with both matching functions. By default, +the called function displays the callout number, the start and current +positions in the text at the callout time, and the next pattern item to be +tested. For example: +<pre> + --->pqrabcdef + 0 ^ ^ \d +</pre> +This output indicates that callout number 0 occurred for a match attempt +starting at the fourth character of the subject string, when the pointer was at +the seventh character, and when the next pattern item was \d. Just +one circumflex is output if the start and current positions are the same. +</P> +<P> +Callouts numbered 255 are assumed to be automatic callouts, inserted as a +result of the <b>/auto_callout</b> pattern modifier. In this case, instead of +showing the callout number, the offset in the pattern, preceded by a plus, is +output.
For example: +<pre> + re> /\d?[A-E]\*/auto_callout + data> E* + --->E* + +0 ^ \d? + +3 ^ [A-E] + +8 ^^ \* + +10 ^ ^ + 0: E* +</pre> +If a pattern contains (*MARK) items, an additional line is output whenever +a change of latest mark is passed to the callout function. For example: +<pre> + re> /a(*MARK:X)bc/auto_callout + data> abc + --->abc + +0 ^ a + +1 ^^ (*MARK:X) + +10 ^^ b + Latest Mark: X + +11 ^ ^ c + +12 ^ ^ + 0: abc +</pre> +The mark changes between matching "a" and "b", but stays the same for the rest +of the match, so nothing more is output. If, as a result of backtracking, the +mark reverts to being unset, the text "<unset>" is output. +</P> +<P> +The callout function in <b>pcre2test</b> returns zero (carry on matching) by +default, but you can use a <b>callout_fail</b> modifier in a subject line (as +described above) to change this and other parameters of the callout. +</P> +<P> +Inserting callouts can be helpful when using <b>pcre2test</b> to check +complicated regular expressions. For further information about callouts, see +the +<a href="pcre2callout.html"><b>pcre2callout</b></a> +documentation. +</P> +<br><a name="SEC17" href="#TOC1">NON-PRINTING CHARACTERS</a><br> +<P> +When <b>pcre2test</b> is outputting text in the compiled version of a pattern, +bytes other than 32-126 are always treated as non-printing characters and are +therefore shown as hex escapes. +</P> +<P> +When <b>pcre2test</b> is outputting text that is a matched part of a subject +string, it behaves in the same way, unless a different locale has been set for +the pattern (using the <b>/locale</b> modifier). In this case, the +<b>isprint()</b> function is used to distinguish printing and non-printing +characters. 
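The default rule (when no locale has been set) can be sketched in Python as follows. The function name show_char and its utf flag are assumptions made for illustration; the choice between \xhh and \x{hh...} follows the description given earlier for output strings, and the exact digit formatting is a guess rather than a statement about pcre2test's output.

```python
def show_char(code, utf=False):
    """Render one character value the way the text describes (illustrative)."""
    if 32 <= code <= 126:
        return chr(code)                  # printable ASCII is shown as-is
    if code < 256 and not utf:
        return f"\\x{code:02x}"           # two-digit hex escape
    return f"\\x{{{code:02x}}}"           # braced escape for larger values or UTF mode

print("".join(show_char(c) for c in b"ab\x01\xff"))
```

A locale-aware variant would replace the 32-126 range test with a call equivalent to <b>isprint()</b>.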
+</P> +<br><a name="SEC18" href="#TOC1">SEE ALSO</a><br> +<P> +<b>pcre2</b>(3), <b>pcre16</b>(3), <b>pcre32</b>(3), <b>pcre2api</b>(3), +<b>pcre2callout</b>(3), +<b>pcre2jit</b>(3), <b>pcre2matching</b>(3), <b>pcre2partial</b>(3), +<b>pcre2pattern</b>(3), <b>pcre2precompile</b>(3). +</P> +<br><a name="SEC19" href="#TOC1">AUTHOR</a><br> +<P> +Philip Hazel +<br> +University Computing Service +<br> +Cambridge CB2 3QH, England. +<br> +</P> +<br><a name="SEC20" href="#TOC1">REVISION</a><br> +<P> +Last updated: 19 August 2014 +<br> +Copyright © 1997-2014 University of Cambridge. +<br> +<p> +Return to the <a href="index.html">PCRE2 index page</a>. +</p> |