author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2012-01-14 11:16:23 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2012-01-14 11:16:23 +0000
commit 57607b5518c705150a68606724fb875c7ba2686f
tree 50d07ccc0c6d9e7698ea7ddef24bc333d7f58d11 /doc/pcretest.txt
parent 36aa4021b0390d7727d5e1b11aac2fc87765792a
Bring HTML docs up to date.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@869 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'doc/pcretest.txt')
-rw-r--r-- doc/pcretest.txt 672
1 files changed, 365 insertions, 307 deletions
diff --git a/doc/pcretest.txt b/doc/pcretest.txt
index 3835f48..383be92 100644
--- a/doc/pcretest.txt
+++ b/doc/pcretest.txt
@@ -14,56 +14,95 @@ SYNOPSIS
expressions. This document describes the features of the test program;
for details of the regular expressions themselves, see the pcrepattern
documentation. For details of the PCRE library function calls and their
- options, see the pcreapi documentation. The input for pcretest is a
- sequence of regular expression patterns and strings to be matched, as
- described below. The output shows the result of each match. Options on
- the command line and the patterns control PCRE options and exactly what
- is output.
+ options, see the pcreapi and pcre16 documentation. The input for
+ pcretest is a sequence of regular expression patterns and strings to be
+ matched, as described below. The output shows the result of each match.
+ Options on the command line and the patterns control PCRE options and
+ exactly what is output.
+
+
+PCRE's 8-BIT and 16-BIT LIBRARIES
+
+ From release 8.30, two separate PCRE libraries can be built. The origi-
+ nal one supports 8-bit character strings, whereas the newer 16-bit
+ library supports character strings encoded in 16-bit units. The
+ pcretest program can be used to test both libraries. However, it is
+ itself still an 8-bit program, reading 8-bit input and writing 8-bit
+ output. When testing the 16-bit library, the patterns and data strings
+ are converted to 16-bit format before being passed to the PCRE library
+ functions. Results are converted to 8-bit for output.
+
+ References to functions and structures of the form pcre[16]_xx below
+ mean "pcre_xx when using the 8-bit library or pcre16_xx when using the
+ 16-bit library".
COMMAND LINE OPTIONS
- -b Behave as if each pattern has the /B (show byte code) modi-
+ -16 If both the 8-bit and the 16-bit libraries have been built,
+ this option causes the 16-bit library to be used. If only the
+ 16-bit library has been built, this is the default (so has no
+ effect). If only the 8-bit library has been built, this
+ option causes an error.
+
+ -b Behave as if each pattern has the /B (show byte code) modi-
fier; the internal form is output after compilation.
-C Output the version number of the PCRE library, and all avail-
- able information about the optional features that are
- included, and then exit.
+ able information about the optional features that are
+ included, and then exit. All other options are ignored.
+
+ -C option Output information about a specific build-time option, then
+ exit. This functionality is intended for use in scripts such
+ as RunTest. The following options output the value indicated:
+
+ linksize the internal link size (2, 3, or 4)
+ newline the default newline setting:
+ CR, LF, CRLF, ANYCRLF, or ANY
+
+ The following options output 1 for true or zero for false:
- -d Behave as if each pattern has the /D (debug) modifier; the
- internal form and information about the compiled pattern is
+ jit just-in-time support is available
+ pcre16 the 16-bit library was built
+ pcre8 the 8-bit library was built
+ ucp Unicode property support is available
+ utf UTF-8 and/or UTF-16 support is available
+
+ -d Behave as if each pattern has the /D (debug) modifier; the
+ internal form and information about the compiled pattern is
output after compilation; -d is equivalent to -b -i.
- -dfa Behave as if each data line contains the \D escape sequence;
+ -dfa Behave as if each data line contains the \D escape sequence;
this causes the alternative matching function,
- pcre_dfa_exec(), to be used instead of the standard
- pcre_exec() function (more detail is given below).
+ pcre[16]_dfa_exec(), to be used instead of the standard
+ pcre[16]_exec() function (more detail is given below).
-help Output a brief summary of these options and then exit.
- -i Behave as if each pattern has the /I modifier; information
+ -i Behave as if each pattern has the /I modifier; information
about the compiled pattern is given after compilation.
- -M Behave as if each data line contains the \M escape sequence;
- this causes PCRE to discover the minimum MATCH_LIMIT and
- MATCH_LIMIT_RECURSION settings by calling pcre_exec() repeat-
- edly with different limits.
+ -M Behave as if each data line contains the \M escape sequence;
+ this causes PCRE to discover the minimum MATCH_LIMIT and
+ MATCH_LIMIT_RECURSION settings by calling pcre[16]_exec()
+ repeatedly with different limits.
- -m Output the size of each compiled pattern after it has been
- compiled. This is equivalent to adding /M to each regular
- expression.
+ -m Output the size of each compiled pattern after it has been
+ compiled. This is equivalent to adding /M to each regular
+ expression. The size is given in bytes for both libraries.
- -o osize Set the number of elements in the output vector that is used
- when calling pcre_exec() or pcre_dfa_exec() to be osize. The
- default value is 45, which is enough for 14 capturing subex-
- pressions for pcre_exec() or 22 different matches for
- pcre_dfa_exec(). The vector size can be changed for individ-
- ual matching calls by including \O in the data line (see
- below).
+ -o osize Set the number of elements in the output vector that is used
+ when calling pcre[16]_exec() or pcre[16]_dfa_exec() to be
+ osize. The default value is 45, which is enough for 14 cap-
+ turing subexpressions for pcre[16]_exec() or 22 different
+ matches for pcre[16]_dfa_exec(). The vector size can be
+ changed for individual matching calls by including \O in the
+ data line (see below).
- -p Behave as if each pattern has the /P modifier; the POSIX
- wrapper API is used to call PCRE. None of the other options
- has any effect when -p is set.
+ -p Behave as if each pattern has the /P modifier; the POSIX
+ wrapper API is used to call PCRE. None of the other options
+ has any effect when -p is set. This option can be used only
+ with the 8-bit library.
-q Do not output the version number of pcretest at the start of
execution.
@@ -73,25 +112,27 @@ COMMAND LINE OPTIONS
-s or -s+ Behave as if each pattern has the /S modifier; in other
words, force each pattern to be studied. If -s+ is used, the
- PCRE_STUDY_JIT_COMPILE flag is passed to pcre_study(), caus-
- ing just-in-time optimization to be set up if it is avail-
- able. If the /I or /D option is present on a pattern
+ PCRE_STUDY_JIT_COMPILE flag is passed to pcre[16]_study(),
+ causing just-in-time optimization to be set up if it is
+ available. If the /I or /D option is present on a pattern
(requesting output about the compiled pattern), information
about the result of studying is not included when studying is
caused only by -s and neither -i nor -d is present on the
command line. This behaviour means that the output from tests
that are run with and without -s should be identical, except
when options that output information about the actual running
- of a match are set. The -M, -t, and -tm options, which give
- information about resources used, are likely to produce dif-
- ferent output with and without -s. Output may also differ if
- the /C option is present on an individual pattern. This uses
- callouts to trace the the matching process, and this may be
- different between studied and non-studied patterns. If the
- pattern contains (*MARK) items there may also be differences,
- for the same reason. The -s command line option can be over-
- ridden for specific patterns that should never be studied
- (see the /S pattern modifier below).
+ of a match are set.
+
+ The -M, -t, and -tm options, which give information about
+ resources used, are likely to produce different output with
+ and without -s. Output may also differ if the /C option is
+ present on an individual pattern. This uses callouts to trace
+ the matching process, and this may be different between
+ studied and non-studied patterns. If the pattern contains
+ (*MARK) items there may also be differences, for the same
+ reason. The -s command line option can be overridden for spe-
+ cific patterns that should never be studied (see the /S pat-
+ tern modifier below).
-t Run each compile, study, and match many times with a timer,
and output resulting time per compile or match (in millisec-
@@ -173,7 +214,7 @@ PATTERN MODIFIERS
and the first modifier, and between the modifiers themselves.
The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE,
- PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre_com-
+ PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre[16]_com-
pile() is called. These four modifier letters have the same effect as
they do in Perl. For example:
@@ -182,8 +223,12 @@ PATTERN MODIFIERS
The following table shows additional modifiers for setting PCRE com-
pile-time options that do not correspond to anything in Perl:
- /8 PCRE_UTF8
- /? PCRE_NO_UTF8_CHECK
+ /8 PCRE_UTF8 ) when using the 8-bit
+ /? PCRE_NO_UTF8_CHECK ) library
+
+ /8 PCRE_UTF16 ) when using the 16-bit
+ /? PCRE_NO_UTF16_CHECK ) library
+
/A PCRE_ANCHORED
/C PCRE_AUTO_CALLOUT
/E PCRE_DOLLAR_ENDONLY
@@ -210,143 +255,147 @@ PATTERN MODIFIERS
/^abc/m<CRLF>
- As well as turning on the PCRE_UTF8 option, the /8 modifier also causes
- any non-printing characters in output strings to be printed using the
- \x{hh...} notation if they are valid UTF-8 sequences. Full details of
- the PCRE options are given in the pcreapi documentation.
+ As well as turning on the PCRE_UTF8/16 option, the /8 modifier causes
+ all non-printing characters in output strings to be printed using the
+ \x{hh...} notation. Otherwise, those less than 0x100 are output in hex
+ without the curly brackets.
+
+ Full details of the PCRE options are given in the pcreapi documenta-
+ tion.
Finding all matches in a string
- Searching for all possible matches within each subject string can be
- requested by the /g or /G modifier. After finding a match, PCRE is
+ Searching for all possible matches within each subject string can be
+ requested by the /g or /G modifier. After finding a match, PCRE is
called again to search the remainder of the subject string. The differ-
ence between /g and /G is that the former uses the startoffset argument
- to pcre_exec() to start searching at a new point within the entire
- string (which is in effect what Perl does), whereas the latter passes
- over a shortened substring. This makes a difference to the matching
+ to pcre[16]_exec() to start searching at a new point within the entire
+ string (which is in effect what Perl does), whereas the latter passes
+ over a shortened substring. This makes a difference to the matching
process if the pattern begins with a lookbehind assertion (including \b
or \B).
- If any call to pcre_exec() in a /g or /G sequence matches an empty
- string, the next call is done with the PCRE_NOTEMPTY_ATSTART and
- PCRE_ANCHORED flags set in order to search for another, non-empty,
- match at the same point. If this second match fails, the start offset
- is advanced, and the normal match is retried. This imitates the way
+ If any call to pcre[16]_exec() in a /g or /G sequence matches an empty
+ string, the next call is done with the PCRE_NOTEMPTY_ATSTART and
+ PCRE_ANCHORED flags set in order to search for another, non-empty,
+ match at the same point. If this second match fails, the start offset
+ is advanced, and the normal match is retried. This imitates the way
Perl handles such cases when using the /g modifier or the split() func-
- tion. Normally, the start offset is advanced by one character, but if
- the newline convention recognizes CRLF as a newline, and the current
+ tion. Normally, the start offset is advanced by one character, but if
+ the newline convention recognizes CRLF as a newline, and the current
character is CR followed by LF, an advance of two is used.
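The advance rule in the paragraph above can be sketched in a few lines. This is an illustrative Python sketch, not pcretest's own source: the function name next_start_offset and its parameters are hypothetical, and it assumes a non-UTF subject in which one character occupies one byte.

```python
def next_start_offset(subject: bytes, offset: int, crlf_is_newline: bool) -> int:
    """Return where the normal match is retried after the anchored,
    non-empty retry at `offset` has failed (sketch under the stated
    assumptions, not pcretest's actual code)."""
    # If the newline convention recognizes CRLF and we are sitting on
    # CR followed by LF, step over both so the pair is never split.
    if crlf_is_newline and subject[offset:offset + 2] == b"\r\n":
        return offset + 2
    # Otherwise advance by a single character (one byte here).
    return offset + 1


print(next_start_offset(b"a\r\nb", 1, True))   # 3
print(next_start_offset(b"a\r\nb", 1, False))  # 2
```

In a UTF mode the single-character step would have to cover a whole multi-unit character; that case is deliberately left out of the sketch.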
Other modifiers
There are yet more modifiers for controlling the way pcretest operates.
- The /+ modifier requests that as well as outputting the substring that
- matched the entire pattern, pcretest should in addition output the
- remainder of the subject string. This is useful for tests where the
- subject contains multiple copies of the same substring. If the + modi-
- fier appears twice, the same action is taken for captured substrings.
- In each case the remainder is output on the following line with a plus
- character following the capture number. Note that this modifier must
+ The /+ modifier requests that as well as outputting the substring that
+ matched the entire pattern, pcretest should in addition output the
+ remainder of the subject string. This is useful for tests where the
+ subject contains multiple copies of the same substring. If the + modi-
+ fier appears twice, the same action is taken for captured substrings.
+ In each case the remainder is output on the following line with a plus
+ character following the capture number. Note that this modifier must
not immediately follow the /S modifier because /S+ has another meaning.
- The /= modifier requests that the values of all potential captured
- parentheses be output after a match by pcre_exec(). By default, only
- those up to the highest one actually used in the match are output (cor-
- responding to the return code from pcre_exec()). Values in the offsets
- vector corresponding to higher numbers should be set to -1, and these
- are output as "<unset>". This modifier gives a way of checking that
- this is happening.
-
- The /B modifier is a debugging feature. It requests that pcretest out-
- put a representation of the compiled byte code after compilation. Nor-
- mally this information contains length and offset values; however, if
- /Z is also present, this data is replaced by spaces. This is a special
- feature for use in the automatic test scripts; it ensures that the same
+ The /= modifier requests that the values of all potential captured
+ parentheses be output after a match. By default, only those up to the
+ highest one actually used in the match are output (corresponding to the
+ return code from pcre[16]_exec()). Values in the offsets vector corre-
+ sponding to higher numbers should be set to -1, and these are output as
+ "<unset>". This modifier gives a way of checking that this is happen-
+ ing.
+
+ The /B modifier is a debugging feature. It requests that pcretest out-
+ put a representation of the compiled code after compilation. Normally
+ this information contains length and offset values; however, if /Z is
+ also present, this data is replaced by spaces. This is a special fea-
+ ture for use in the automatic test scripts; it ensures that the same
output is generated for different internal link sizes.
- The /D modifier is a PCRE debugging feature, and is equivalent to /BI,
+ The /D modifier is a PCRE debugging feature, and is equivalent to /BI,
that is, both the /B and the /I modifiers.
- The /F modifier causes pcretest to flip the byte order of the fields in
- the compiled pattern that contain 2-byte and 4-byte numbers. This
- facility is for testing the feature in PCRE that allows it to execute
- patterns that were compiled on a host with a different endianness. This
- feature is not available when the POSIX interface to PCRE is being
- used, that is, when the /P pattern modifier is specified. See also the
- section about saving and reloading compiled patterns below.
-
- The /I modifier requests that pcretest output information about the
- compiled pattern (whether it is anchored, has a fixed first character,
- and so on). It does this by calling pcre_fullinfo() after compiling a
- pattern. If the pattern is studied, the results of that are also out-
+ The /F modifier causes pcretest to flip the byte order of the 2-byte
+ and 4-byte fields in the compiled pattern. This facility is for testing
+ the feature in PCRE that allows it to execute patterns that were com-
+ piled on a host with a different endianness. This feature is not avail-
+ able when the POSIX interface to PCRE is being used, that is, when the
+ /P pattern modifier is specified. See also the section about saving and
+ reloading compiled patterns below.
+
+ The /I modifier requests that pcretest output information about the
+ compiled pattern (whether it is anchored, has a fixed first character,
+ and so on). It does this by calling pcre[16]_fullinfo() after compiling
+ a pattern. If the pattern is studied, the results of that are also out-
put.
- The /K modifier requests pcretest to show names from backtracking con-
- trol verbs that are returned from calls to pcre_exec(). It causes
- pcretest to create a pcre_extra block if one has not already been cre-
- ated by a call to pcre_study(), and to set the PCRE_EXTRA_MARK flag and
- the mark field within it, every time that pcre_exec() is called. If the
- variable that the mark field points to is non-NULL for a match, non-
- match, or partial match, pcretest prints the string to which it points.
- For a match, this is shown on a line by itself, tagged with "MK:". For
- a non-match it is added to the message.
-
- The /L modifier must be followed directly by the name of a locale, for
+ The /K modifier requests pcretest to show names from backtracking con-
+ trol verbs that are returned from calls to pcre[16]_exec(). It causes
+ pcretest to create a pcre[16]_extra block if one has not already been
+ created by a call to pcre[16]_study(), and to set the PCRE_EXTRA_MARK
+ flag and the mark field within it, every time that pcre[16]_exec() is
+ called. If the variable that the mark field points to is non-NULL for a
+ match, non-match, or partial match, pcretest prints the string to which
+ it points. For a match, this is shown on a line by itself, tagged with
+ "MK:". For a non-match it is added to the message.
+
+ The /L modifier must be followed directly by the name of a locale, for
example,
/pattern/Lfr_FR
For this reason, it must be the last modifier. The given locale is set,
- pcre_maketables() is called to build a set of character tables for the
- locale, and this is then passed to pcre_compile() when compiling the
- regular expression. Without an /L (or /T) modifier, NULL is passed as
- the tables pointer; that is, /L applies only to the expression on which
- it appears.
-
- The /M modifier causes the size of memory block used to hold the com-
- piled pattern to be output. This does not include the size of the pcre
- block; it is just the actual compiled data. If the pattern is success-
- fully studied with the PCRE_STUDY_JIT_COMPILE option, the size of the
- JIT compiled code is also output.
-
- If the /S modifier appears once, it causes pcre_study() to be called
- after the expression has been compiled, and the results used when the
- expression is matched. If /S appears twice, it suppresses studying,
- even if it was requested externally by the -s command line option. This
- makes it possible to specify that certain patterns are always studied,
- and others are never studied, independently of -s. This feature is used
- in the test files in a few cases where the output is different when the
- pattern is studied.
-
- If the /S modifier is immediately followed by a + character, the call
- to pcre_study() is made with the PCRE_STUDY_JIT_COMPILE option,
- requesting just-in-time optimization support if it is available. Note
- that there is also a /+ modifier; it must not be given immediately
- after /S because this will be misinterpreted. If JIT studying is suc-
- cessful, it will automatically be used when pcre_exec() is run, except
- when incompatible run-time options are specified. These include the
- partial matching options; a complete list is given in the pcrejit docu-
- mentation. See also the \J escape sequence below for a way of setting
- the size of the JIT stack.
-
- The /T modifier must be followed by a single digit. It causes a spe-
- cific set of built-in character tables to be passed to pcre_compile().
- It is used in the standard PCRE tests to check behaviour with different
- character tables. The digit specifies the tables as follows:
+ pcre[16]_maketables() is called to build a set of character tables for
+ the locale, and this is then passed to pcre[16]_compile() when compil-
+ ing the regular expression. Without an /L (or /T) modifier, NULL is
+ passed as the tables pointer; that is, /L applies only to the expres-
+ sion on which it appears.
+
+ The /M modifier causes the size in bytes of the memory block used to
+ hold the compiled pattern to be output. This does not include the size
+ of the pcre[16] block; it is just the actual compiled data. If the pat-
+ tern is successfully studied with the PCRE_STUDY_JIT_COMPILE option,
+ the size of the JIT compiled code is also output.
+
+ If the /S modifier appears once, it causes pcre[16]_study() to be
+ called after the expression has been compiled, and the results used
+ when the expression is matched. If /S appears twice, it suppresses
+ studying, even if it was requested externally by the -s command line
+ option. This makes it possible to specify that certain patterns are
+ always studied, and others are never studied, independently of -s. This
+ feature is used in the test files in a few cases where the output is
+ different when the pattern is studied.
+
+ If the /S modifier is immediately followed by a + character, the call
+ to pcre[16]_study() is made with the PCRE_STUDY_JIT_COMPILE option,
+ requesting just-in-time optimization support if it is available. Note
+ that there is also a /+ modifier; it must not be given immediately
+ after /S because this will be misinterpreted. If JIT studying is suc-
+ cessful, it will automatically be used when pcre[16]_exec() is run,
+ except when incompatible run-time options are specified. These include
+ the partial matching options; a complete list is given in the pcrejit
+ documentation. See also the \J escape sequence below for a way of set-
+ ting the size of the JIT stack.
+
+ The /T modifier must be followed by a single digit. It causes a spe-
+ cific set of built-in character tables to be passed to pcre[16]_com-
+ pile(). It is used in the standard PCRE tests to check behaviour with
+ different character tables. The digit specifies the tables as follows:
0 the default ASCII tables, as distributed in
pcre_chartables.c.dist
1 a set of tables defining ISO 8859 characters
- In table 1, some characters whose codes are greater than 128 are iden-
+ In table 1, some characters whose codes are greater than 128 are iden-
tified as letters, digits, spaces, etc.
Using the POSIX wrapper API
- The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
- rather than its native API. When /P is set, the following modifiers set
- options for the regcomp() function:
+ The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
+ rather than its native API. This supports only the 8-bit library. When
+ /P is set, the following modifiers set options for the regcomp() func-
+ tion:
/i REG_ICASE
/m REG_NEWLINE
@@ -362,12 +411,12 @@ PATTERN MODIFIERS
DATA LINES
- Before each data line is passed to pcre_exec(), leading and trailing
- white space is removed, and it is then scanned for \ escapes. Some of
- these are pretty esoteric features, intended for checking out some of
- the more complicated features of PCRE. If you are just testing "ordi-
- nary" regular expressions, you probably don't need any of these. The
- following escapes are recognized:
+ Before each data line is passed to pcre[16]_exec(), leading and trail-
+ ing white space is removed, and it is then scanned for \ escapes. Some
+ of these are pretty esoteric features, intended for checking out some
+ of the more complicated features of PCRE. If you are just testing
+ "ordinary" regular expressions, you probably don't need any of these.
+ The following escapes are recognized:
\a alarm (BEL, \x07)
\b backspace (\x08)
@@ -379,18 +428,17 @@ DATA LINES
\r carriage return (\x0d)
\t tab (\x09)
\v vertical tab (\x0b)
- \nnn octal character (up to 3 octal digits)
- always a byte unless > 255 in UTF-8 mode
+ \nnn octal character (up to 3 octal digits); always
+ a byte unless > 255 in UTF-8 or 16-bit mode
\xhh hexadecimal byte (up to 2 hex digits)
- \x{hh...} hexadecimal character, any number of digits
- in UTF-8 mode
- \A pass the PCRE_ANCHORED option to pcre_exec()
- or pcre_dfa_exec()
- \B pass the PCRE_NOTBOL option to pcre_exec()
- or pcre_dfa_exec()
- \Cdd call pcre_copy_substring() for substring dd
+ \x{hh...} hexadecimal character (any number of hex digits)
+ \A pass the PCRE_ANCHORED option to pcre[16]_exec()
+ or pcre[16]_dfa_exec()
+ \B pass the PCRE_NOTBOL option to pcre[16]_exec()
+ or pcre[16]_dfa_exec()
+ \Cdd call pcre[16]_copy_substring() for substring dd
after a successful match (number less than 32)
- \Cname call pcre_copy_named_substring() for substring
+ \Cname call pcre[16]_copy_named_substring() for substring
"name" after a successful match (name termin-
ated by next non alphanumeric character)
\C+ show the current captured substrings at callout
@@ -402,57 +450,65 @@ DATA LINES
reached for the nth time
\C*n pass the number n (may be negative) as callout
data; this is used as the callout return value
- \D use the pcre_dfa_exec() match function
- \F only shortest match for pcre_dfa_exec()
- \Gdd call pcre_get_substring() for substring dd
+ \D use the pcre[16]_dfa_exec() match function
+ \F only shortest match for pcre[16]_dfa_exec()
+ \Gdd call pcre[16]_get_substring() for substring dd
after a successful match (number less than 32)
- \Gname call pcre_get_named_substring() for substring
+ \Gname call pcre[16]_get_named_substring() for substring
"name" after a successful match (name termin-
ated by next non-alphanumeric character)
\Jdd set up a JIT stack of dd kilobytes maximum (any
number of digits)
- \L call pcre_get_substringlist() after a
+ \L call pcre[16]_get_substringlist() after a
successful match
\M discover the minimum MATCH_LIMIT and
MATCH_LIMIT_RECURSION settings
- \N pass the PCRE_NOTEMPTY option to pcre_exec()
- or pcre_dfa_exec(); if used twice, pass the
+ \N pass the PCRE_NOTEMPTY option to pcre[16]_exec()
+ or pcre[16]_dfa_exec(); if used twice, pass the
PCRE_NOTEMPTY_ATSTART option
\Odd set the size of the output vector passed to
- pcre_exec() to dd (any number of digits)
- \P pass the PCRE_PARTIAL_SOFT option to pcre_exec()
- or pcre_dfa_exec(); if used twice, pass the
+ pcre[16]_exec() to dd (any number of digits)
+ \P pass the PCRE_PARTIAL_SOFT option to pcre[16]_exec()
+ or pcre[16]_dfa_exec(); if used twice, pass the
PCRE_PARTIAL_HARD option
\Qdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd
(any number of digits)
- \R pass the PCRE_DFA_RESTART option to pcre_dfa_exec()
+ \R pass the PCRE_DFA_RESTART option to pcre[16]_dfa_exec()
\S output details of memory get/free calls during matching
- \Y pass the PCRE_NO_START_OPTIMIZE option to pcre_exec()
- or pcre_dfa_exec()
- \Z pass the PCRE_NOTEOL option to pcre_exec()
- or pcre_dfa_exec()
- \? pass the PCRE_NO_UTF8_CHECK option to
- pcre_exec() or pcre_dfa_exec()
+ \Y pass the PCRE_NO_START_OPTIMIZE option to pcre[16]_exec()
+ or pcre[16]_dfa_exec()
+ \Z pass the PCRE_NOTEOL option to pcre[16]_exec()
+ or pcre[16]_dfa_exec()
+ \? pass the PCRE_NO_UTF[8|16]_CHECK option to
+ pcre[16]_exec() or pcre[16]_dfa_exec()
\>dd start the match at offset dd (optional "-"; then
any number of digits); this sets the startoffset
- argument for pcre_exec() or pcre_dfa_exec()
- \<cr> pass the PCRE_NEWLINE_CR option to pcre_exec()
- or pcre_dfa_exec()
- \<lf> pass the PCRE_NEWLINE_LF option to pcre_exec()
- or pcre_dfa_exec()
- \<crlf> pass the PCRE_NEWLINE_CRLF option to pcre_exec()
- or pcre_dfa_exec()
- \<anycrlf> pass the PCRE_NEWLINE_ANYCRLF option to pcre_exec()
- or pcre_dfa_exec()
- \<any> pass the PCRE_NEWLINE_ANY option to pcre_exec()
- or pcre_dfa_exec()
-
- Note that \xhh always specifies one byte, even in UTF-8 mode; this
- makes it possible to construct invalid UTF-8 sequences for testing pur-
- poses. On the other hand, \x{hh} is interpreted as a UTF-8 character in
- UTF-8 mode, generating more than one byte if the value is greater than
- 127. When not in UTF-8 mode, it generates one byte for values less than
- 256, and causes an error for greater values.
+ argument for pcre[16]_exec() or pcre[16]_dfa_exec()
+ \<cr> pass the PCRE_NEWLINE_CR option to pcre[16]_exec()
+ or pcre[16]_dfa_exec()
+ \<lf> pass the PCRE_NEWLINE_LF option to pcre[16]_exec()
+ or pcre[16]_dfa_exec()
+ \<crlf> pass the PCRE_NEWLINE_CRLF option to pcre[16]_exec()
+ or pcre[16]_dfa_exec()
+ \<anycrlf> pass the PCRE_NEWLINE_ANYCRLF option to pcre[16]_exec()
+ or pcre[16]_dfa_exec()
+ \<any> pass the PCRE_NEWLINE_ANY option to pcre[16]_exec()
+ or pcre[16]_dfa_exec()
+
+ The use of \x{hh...} is not dependent on the use of the /8 modifier on
+ the pattern. It is recognized always. There may be any number of hexa-
+ decimal digits inside the braces; invalid values provoke error mes-
+ sages.
+
+ Note that \xhh specifies one byte in UTF-8 mode; this makes it possible
+ to construct invalid UTF-8 sequences for testing purposes. On the other
+ hand, \x{hh} is interpreted as a UTF-8 character in UTF-8 mode, gener-
+ ating more than one byte if the value is greater than 127. When testing
+ the 8-bit library not in UTF-8 mode, \x{hh} generates one byte for val-
+ ues less than 256, and causes an error for greater values.
+
+ In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
+ possible to construct invalid UTF-16 sequences for testing purposes.
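The byte counts described above for \x{hh...} when testing the 8-bit library can be illustrated with a small Python sketch. The function name x_escape_bytes is hypothetical. For UTF-8 mode the sketch follows the original one-to-six-byte UTF-8 scheme; that pcretest behaves exactly this way for every value is an assumption, not something this sketch demonstrates.

```python
def x_escape_bytes(value, utf8_mode):
    """Bytes generated for \\x{hh...} with the 8-bit library (sketch)."""
    if not utf8_mode:
        # Outside UTF-8 mode only values below 256 fit in one byte;
        # anything larger is an error.
        if value > 0xFF:
            raise ValueError("value too large for a single byte")
        return bytes([value])

    # Largest value representable with 0..5 continuation bytes, and
    # the matching lead-byte prefix (assumed original UTF-8 layout).
    limits = (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
    leads = (0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC)
    for extra, limit in enumerate(limits):
        if value <= limit:
            break
    else:
        raise ValueError("value out of range")

    # Peel off six bits at a time, lowest bits first.
    tail = []
    for _ in range(extra):
        tail.append(0x80 | (value & 0x3F))
        value >>= 6
    return bytes([leads[extra] | value] + tail[::-1])


print(x_escape_bytes(0x41, True))   # b'A'
print(x_escape_bytes(0xE9, True))   # b'\xc3\xa9'
print(x_escape_bytes(0xE9, False))  # b'\xe9'
```

The last two calls show the contrast the text draws: the same value above 127 yields two bytes in UTF-8 mode but a single byte otherwise.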
The escapes that specify line ending sequences are literal strings,
exactly as shown. No more than one newline setting should be present in
@@ -468,13 +524,13 @@ DATA LINES
mization is not being used. Providing a stack that is larger than the
default 32K is necessary only for very complicated patterns.
- If \M is present, pcretest calls pcre_exec() several times, with dif-
- ferent values in the match_limit and match_limit_recursion fields of
- the pcre_extra data structure, until it finds the minimum numbers for
- each parameter that allow pcre_exec() to complete without error.
- Because this is testing a specific feature of the normal interpretive
- pcre_exec() execution, the use of any JIT optimization that might have
- been set up by the /S+ qualifier of -s+ option is disabled.
+ If \M is present, pcretest calls pcre[16]_exec() several times, with
+ different values in the match_limit and match_limit_recursion fields of
+ the pcre[16]_extra data structure, until it finds the minimum numbers
+ for each parameter that allow pcre[16]_exec() to complete without
+ error. Because this is testing a specific feature of the normal inter-
+ pretive pcre[16]_exec() execution, the use of any JIT optimization that
+ might have been set up by the /S+ modifier or the -s+ option is disabled.
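The idea of finding a minimum limit by repeated calls can be sketched generically. This Python sketch is not pcretest's actual search: minimum_limit and the completes callback are hypothetical stand-ins for "run the match with this limit and report whether it finished without a limit error", and the starting guess of 64 is an arbitrary assumption. It also assumes that a match which completes at some limit completes at every larger limit, and that some limit does succeed.

```python
def minimum_limit(completes, start=64):
    """Smallest limit >= 1 for which completes(limit) is True (sketch)."""
    lo, hi = 0, start            # lo is always a limit known to be too small
    while not completes(hi):     # grow until the match completes
        lo, hi = hi, hi * 2
    while hi - lo > 1:           # bisect between failing lo and passing hi
        mid = (lo + hi) // 2
        if completes(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Pretend a particular match needs a limit of at least 37.
print(minimum_limit(lambda limit: limit >= 37))  # 37
```

The same search would be run once per parameter, first for match_limit and then for match_limit_recursion.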
The match_limit number is a measure of the amount of backtracking that
takes place, and checking it out can be instructive. For most simple
@@ -487,56 +543,48 @@ DATA LINES
When \O is used, the value specified may be higher or lower than the
size set by the -o command line option (or defaulted to 45); \O applies
- only to the call of pcre_exec() for the line in which it appears.
+ only to the call of pcre[16]_exec() for the line in which it appears.
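The default output vector size of 45 and the capacities quoted for it (14 capturing subexpressions, or 22 matches for the alternative function) follow from simple arithmetic. The sketch below is a hedged illustration in Python with hypothetical function names. It relies on the layout described in the pcreapi documentation: the standard matching function uses only the first two-thirds of the vector for offset pairs, with pair 0 holding the whole match, whereas the alternative function stores one pair per match.

```python
def exec_capture_capacity(osize):
    """Capturing subexpressions that fit for the standard function."""
    pairs = osize // 3           # two-thirds of the vector, two ints per pair
    return max(pairs - 1, 0)     # pair 0 is the overall match


def dfa_match_capacity(osize):
    """Distinct matches that fit for the alternative (DFA) function."""
    return osize // 2            # one offset pair per match


print(exec_capture_capacity(45))  # 14
print(dfa_match_capacity(45))     # 22
```

A value given with \O on a data line would feed into the same arithmetic for that one call.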
If the /P modifier was present on the pattern, causing the POSIX wrap-
per API to be used, the only option-setting sequences that have any
effect are \B, \N, and \Z, causing REG_NOTBOL, REG_NOTEMPTY, and
REG_NOTEOL, respectively, to be passed to regexec().
- The use of \x{hh...} to represent UTF-8 characters is not dependent on
- the use of the /8 modifier on the pattern. It is recognized always.
- There may be any number of hexadecimal digits inside the braces. The
- result is from one to six bytes, encoded according to the original
- UTF-8 rules of RFC 2279. This allows for values in the range 0 to
- 0x7FFFFFFF. Note that not all of those are valid Unicode code points,
- or indeed valid UTF-8 characters according to the later rules in RFC
- 3629.
-
THE ALTERNATIVE MATCHING FUNCTION
- By default, pcretest uses the standard PCRE matching function,
- pcre_exec() to match each data line. From release 6.0, PCRE supports an
- alternative matching function, pcre_dfa_test(), which operates in a
- different way, and has some restrictions. The differences between the
- two functions are described in the pcrematching documentation.
+ By default, pcretest uses the standard PCRE matching function,
+ pcre[16]_exec(), to match each data line. PCRE also supports an
+ alternative matching function, pcre[16]_dfa_exec(), which operates in a
+ different way, and has some restrictions. The differences between the
+ two functions are described in the pcrematching documentation.
- If a data line contains the \D escape sequence, or if the command line
- contains the -dfa option, the alternative matching function is called.
+ If a data line contains the \D escape sequence, or if the command line
+ contains the -dfa option, the alternative matching function is used.
This function finds all possible matches at a given point. If, however,
- the \F escape sequence is present in the data line, it stops after the
+ the \F escape sequence is present in the data line, it stops after the
first match is found. This is always the shortest possible match.
DEFAULT OUTPUT FROM PCRETEST
- This section describes the output when the normal matching function,
- pcre_exec(), is being used.
+ This section describes the output when the normal matching function,
+ pcre[16]_exec(), is being used.
When a match succeeds, pcretest outputs the list of captured substrings
- that pcre_exec() returns, starting with number 0 for the string that
- matched the whole pattern. Otherwise, it outputs "No match" when the
- return is PCRE_ERROR_NOMATCH, and "Partial match:" followed by the par-
- tially matching substring when pcre_exec() returns PCRE_ERROR_PARTIAL.
- (Note that this is the entire substring that was inspected during the
- partial match; it may include characters before the actual match start
- if a lookbehind assertion, \K, \b, or \B was involved.) For any other
- return, pcretest outputs the PCRE negative error number and a short
- descriptive phrase. If the error is a failed UTF-8 string check, the
- byte offset of the start of the failing character and the reason code
- are also output, provided that the size of the output vector is at
- least two. Here is an example of an interactive pcretest run.
+ that pcre[16]_exec() returns, starting with number 0 for the string
+ that matched the whole pattern. Otherwise, it outputs "No match" when
+ the return is PCRE_ERROR_NOMATCH, and "Partial match:" followed by the
+ partially matching substring when pcre[16]_exec() returns
+ PCRE_ERROR_PARTIAL. (Note that this is the entire substring that was
+ inspected during the partial match; it may include characters before
+ the actual match start if a lookbehind assertion, \K, \b, or \B was
+ involved.) For any other return, pcretest outputs the PCRE negative
+ error number and a short descriptive phrase. If the error is a failed
+ UTF string check, the offset of the start of the failing character and
+ the reason code are also output, provided that the size of the output
+ vector is at least two. Here is an example of an interactive pcretest
+ run.
$ pcretest
PCRE version 8.13 2011-04-30
@@ -549,10 +597,10 @@ DEFAULT OUTPUT FROM PCRETEST
No match
Unset capturing substrings that are not followed by one that is set are
- not returned by pcre_exec(), and are not shown by pcretest. In the fol-
- lowing example, there are two capturing substrings, but when the first
- data line is matched, the second, unset substring is not shown. An
- "internal" unset substring is shown as "<unset>", as for the second
+ not returned by pcre[16]_exec(), and are not shown by pcretest. In the
+ following example, there are two capturing substrings, but when the
+ first data line is matched, the second, unset substring is not shown.
+ An "internal" unset substring is shown as "<unset>", as for the second
data line.
re> /(a)|(b)/
@@ -565,11 +613,11 @@ DEFAULT OUTPUT FROM PCRETEST
2: b
If the strings contain any non-printing characters, they are output as
- \0x escapes, or as \x{...} escapes if the /8 modifier was present on
- the pattern. See below for the definition of non-printing characters.
- If the pattern has the /+ modifier, the output for substring 0 is fol-
- lowed by the the rest of the subject string, identified by "0+" like
- this:
+ \xhh escapes if the value is less than 256 and UTF mode is not set.
+ Otherwise they are output as \x{hh...} escapes. See below for the
+ definition of non-printing characters. If the pattern has the /+
+ modifier, the output for substring 0 is followed by the rest of the
+ subject string, identified by "0+" like this:
re> /cat/+
data> cataract
@@ -611,10 +659,11 @@ DEFAULT OUTPUT FROM PCRETEST
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
- When the alternative matching function, pcre_dfa_exec(), is used (by
- means of the \D escape sequence or the -dfa command line option), the
- output consists of a list of all the matches that start at the first
- point in the subject where there is at least one match. For example:
+ When the alternative matching function, pcre[16]_dfa_exec(), is used
+ (by means of the \D escape sequence or the -dfa command line option),
+ the output consists of a list of all the matches that start at the
+ first point in the subject where there is at least one match. For exam-
+ ple:
re> /(tang|tangerine|tan)/
data> yellow tangerine\D
@@ -622,11 +671,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tang
2: tan
- (Using the normal matching function on this data finds only "tang".)
- The longest matching string is always given first (and numbered zero).
+ (Using the normal matching function on this data finds only "tang".)
+ The longest matching string is always given first (and numbered zero).
After a PCRE_ERROR_PARTIAL return, the output is "Partial match:", fol-
- lowed by the partially matching substring. (Note that this is the
- entire substring that was inspected during the partial match; it may
+ lowed by the partially matching substring. (Note that this is the
+ entire substring that was inspected during the partial match; it may
include characters before the actual match start if a lookbehind asser-
tion, \K, \b, or \B was involved.)
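The ordering of the alternative function's output can be illustrated with a small Python sketch. It is a simplification rather than a model of the DFA algorithm: it handles only a list of literal alternatives, and the function name matches_at_first_point is hypothetical. It finds the first position in the subject where at least one alternative matches and returns every match starting there, longest first.

```python
def matches_at_first_point(alternatives, subject):
    """Return all literal alternatives matching at the first matching
    position in subject, longest first (illustrative helper only)."""
    for start in range(len(subject)):
        found = {alt for alt in alternatives
                 if subject.startswith(alt, start)}
        if found:
            # Longest match first, as in the numbered output list.
            return sorted(found, key=len, reverse=True)
    return []


if __name__ == "__main__":
    results = matches_at_first_point(["tang", "tangerine", "tan"],
                                     "yellow tangerine")
    for number, text in enumerate(results):
        print(number, text)
```

For the pattern and data line of the example above, the returned list is "tangerine", "tang", "tan", in that order.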
@@ -642,16 +691,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tan
0: tan
- Since the matching function does not support substring capture, the
- escape sequences that are concerned with captured substrings are not
+ Since the matching function does not support substring capture, the
+ escape sequences that are concerned with captured substrings are not
relevant.
RESTARTING AFTER A PARTIAL MATCH
When the alternative matching function has given the PCRE_ERROR_PARTIAL
- return, indicating that the subject partially matched the pattern, you
- can restart the match with additional subject data by means of the \R
+ return, indicating that the subject partially matched the pattern, you
+ can restart the match with additional subject data by means of the \R
escape sequence. For example:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -660,30 +709,30 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\R\D
0: n05
- For further information about partial matching, see the pcrepartial
+ For further information about partial matching, see the pcrepartial
documentation.
CALLOUTS
- If the pattern contains any callout requests, pcretest's callout func-
- tion is called during matching. This works with both matching func-
+ If the pattern contains any callout requests, pcretest's callout func-
+ tion is called during matching. This works with both matching func-
tions. By default, the called function displays the callout number, the
- start and current positions in the text at the callout time, and the
- next pattern item to be tested. For example, the output
+ start and current positions in the text at the callout time, and the
+ next pattern item to be tested. For example:
--->pqrabcdef
0 ^ ^ \d
- indicates that callout number 0 occurred for a match attempt starting
- at the fourth character of the subject string, when the pointer was at
- the seventh character of the data, and when the next pattern item was
- \d. Just one circumflex is output if the start and current positions
- are the same.
+ This output indicates that callout number 0 occurred for a match
+ attempt starting at the fourth character of the subject string, when
+ the pointer was at the seventh character of the data, and when the next
+ pattern item was \d. Just one circumflex is output if the start and
+ current positions are the same.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
- a result of the /C pattern modifier. In this case, instead of showing
- the callout number, the offset in the pattern, preceded by a plus, is
+ a result of the /C pattern modifier. In this case, instead of showing
+ the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
re> /\d?[A-E]\*/C
@@ -696,7 +745,7 @@ CALLOUTS
0: E*
If a pattern contains (*MARK) items, an additional line is output when-
- ever a change of latest mark is passed to the callout function. For
+ ever a change of latest mark is passed to the callout function. For
example:
re> /a(*MARK:X)bc/C
@@ -710,59 +759,59 @@ CALLOUTS
+12 ^ ^
0: abc
- The mark changes between matching "a" and "b", but stays the same for
- the rest of the match, so nothing more is output. If, as a result of
- backtracking, the mark reverts to being unset, the text "<unset>" is
+ The mark changes between matching "a" and "b", but stays the same for
+ the rest of the match, so nothing more is output. If, as a result of
+ backtracking, the mark reverts to being unset, the text "<unset>" is
output.
- The callout function in pcretest returns zero (carry on matching) by
- default, but you can use a \C item in a data line (as described above)
+ The callout function in pcretest returns zero (carry on matching) by
+ default, but you can use a \C item in a data line (as described above)
to change this and other parameters of the callout.
- Inserting callouts can be helpful when using pcretest to check compli-
- cated regular expressions. For further information about callouts, see
+ Inserting callouts can be helpful when using pcretest to check compli-
+ cated regular expressions. For further information about callouts, see
the pcrecallout documentation.
NON-PRINTING CHARACTERS
- When pcretest is outputting text in the compiled version of a pattern,
- bytes other than 32-126 are always treated as non-printing characters
+ When pcretest is outputting text in the compiled version of a pattern,
+ bytes other than 32-126 are always treated as non-printing characters
and are therefore shown as hex escapes.
- When pcretest is outputting text that is a matched part of a subject
- string, it behaves in the same way, unless a different locale has been
- set for the pattern (using the /L modifier). In this case, the
+ When pcretest is outputting text that is a matched part of a subject
+ string, it behaves in the same way, unless a different locale has been
+ set for the pattern (using the /L modifier). In this case, the
isprint() function is used to distinguish printing and non-printing
characters.
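As a rough illustration of the default rule (no /L locale), the following Python sketch renders a sequence of character values in the way described: values 32-126 are shown as themselves, other values below 256 become \xhh escapes when UTF mode is not set, and everything else becomes a \x{hh...} escape. The function name show, the lower-case hex digits, and the minimum width of two digits are assumptions for illustration and are not taken from pcretest itself.

```python
def show(values, utf=False):
    """Render character values with hex escapes for non-printing ones.

    Hypothetical helper; ignores the /L (locale) case entirely.
    """
    out = []
    for value in values:
        if 32 <= value <= 126:
            out.append(chr(value))             # printing character
        elif value < 256 and not utf:
            out.append(f"\\x{value:02x}")      # \xhh form
        else:
            out.append(f"\\x{{{value:02x}}}")  # \x{hh...} form
    return "".join(out)
```

With these assumptions, the value 10 (newline) is rendered as \x0a outside UTF mode and as \x{0a} in UTF mode.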
SAVING AND RELOADING COMPILED PATTERNS
- The facilities described in this section are not available when the
- POSIX interface to PCRE is being used, that is, when the /P pattern
+ The facilities described in this section are not available when the
+ POSIX interface to PCRE is being used, that is, when the /P pattern
modifier is specified.
When the POSIX interface is not in use, you can cause pcretest to write
- a compiled pattern to a file, by following the modifiers with > and a
+ a compiled pattern to a file, by following the modifiers with > and a
file name. For example:
/pattern/im >/some/file
- See the pcreprecompile documentation for a discussion about saving and
- re-using compiled patterns. Note that if the pattern was successfully
+ See the pcreprecompile documentation for a discussion about saving and
+ re-using compiled patterns. Note that if the pattern was successfully
studied with JIT optimization, the JIT data cannot be saved.
- The data that is written is binary. The first eight bytes are the
- length of the compiled pattern data followed by the length of the
- optional study data, each written as four bytes in big-endian order
- (most significant byte first). If there is no study data (either the
+ The data that is written is binary. The first eight bytes are the
+ length of the compiled pattern data followed by the length of the
+ optional study data, each written as four bytes in big-endian order
+ (most significant byte first). If there is no study data (either the
pattern was not studied, or studying did not return any data), the sec-
- ond length is zero. The lengths are followed by an exact copy of the
- compiled pattern. If there is additional study data, this (excluding
- any JIT data) follows immediately after the compiled pattern. After
+ ond length is zero. The lengths are followed by an exact copy of the
+ compiled pattern. If there is additional study data, this (excluding
+ any JIT data) follows immediately after the compiled pattern. After
writing the file, pcretest expects to read a new pattern.
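The layout just described (two four-byte big-endian lengths, then the compiled pattern, then any study data) can be read back with a few lines of Python. This is a minimal sketch based only on that description; the function name split_saved_pattern is hypothetical, and the sketch does not interpret the compiled pattern or the study data in any way.

```python
import struct


def split_saved_pattern(blob: bytes):
    """Split a saved-pattern file image into (pattern, study) byte strings.

    Hypothetical helper following the layout described in the text.
    """
    if len(blob) < 8:
        raise ValueError("file too short for the two length fields")

    # Two unsigned 32-bit lengths, most significant byte first.
    pattern_len, study_len = struct.unpack(">II", blob[:8])

    pattern_end = 8 + pattern_len
    study_end = pattern_end + study_len
    if len(blob) < study_end:
        raise ValueError("file is shorter than its header claims")

    # The study part is empty when the second length is zero.
    return blob[8:pattern_end], blob[pattern_end:study_end]
```

A file image could be obtained with open(name, "rb").read() and passed to the function directly.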
- A saved pattern can be reloaded into pcretest by specifying < and a
+ A saved pattern can be reloaded into pcretest by specifying < and a
file name instead of a pattern. The name of the file must not contain a
< character, as otherwise pcretest will interpret the line as a pattern
delimited by < characters. For example:
@@ -771,15 +820,24 @@ SAVING AND RELOADING COMPILED PATTERNS
Compiled pattern loaded from /some/file
No study data
- If the pattern was previously studied with the JIT optimization, the
- JIT information cannot be saved and restored, and so is lost. When the
- pattern has been loaded, pcretest proceeds to read data lines in the
+ If the pattern was previously studied with the JIT optimization, the
+ JIT information cannot be saved and restored, and so is lost. When the
+ pattern has been loaded, pcretest proceeds to read data lines in the
usual way.
- You can copy a file written by pcretest to a different host and reload
- it there, even if the new host has opposite endianness to the one on
- which the pattern was compiled. For example, you can compile on an i86
- machine and run on a SPARC machine.
+ You can copy a file written by pcretest to a different host and reload
+ it there, even if the new host has opposite endianness to the one on
+ which the pattern was compiled. For example, you can compile on an i86
+ machine and run on a SPARC machine. When a pattern is reloaded on a
+ host with different endianness, the confirmation message is changed to:
+
+ Compiled pattern (byte-inverted) loaded from /some/file
+
+ The test suite contains some saved pre-compiled patterns with different
+ endianness. These are reloaded using "<!" instead of just "<". This
+ suppresses the "(byte-inverted)" text so that the output is the same on
+ all hosts. It also forces debugging output once the pattern has been
+ reloaded.
File names for saving and reloading can be absolute or relative, but
note that the shell facility of expanding a file name that starts with
@@ -797,8 +855,8 @@ SAVING AND RELOADING COMPILED PATTERNS
SEE ALSO
- pcre(3), pcreapi(3), pcrecallout(3), pcrejit, pcrematching(3), pcrepar-
- tial(d), pcrepattern(3), pcreprecompile(3).
+ pcre(3), pcre16(3), pcreapi(3), pcrecallout(3), pcrejit(3),
+ pcrematching(3), pcrepartial(3), pcrepattern(3), pcreprecompile(3).
AUTHOR
@@ -810,5 +868,5 @@ AUTHOR
REVISION
- Last updated: 02 December 2011
- Copyright (c) 1997-2011 University of Cambridge.
+ Last updated: 13 January 2012
+ Copyright (c) 1997-2012 University of Cambridge.