author | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2011-08-02 11:00:40 +0000 |
---|---|---|
committer | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2011-08-02 11:00:40 +0000 |
commit | 9c65843dde6af3b331acdf8518a6020df32f45af (patch) | |
tree | f4938ee9a3d4ca4b7282f86370a5a39875a3a562 /doc/pcretest.txt | |
parent | 2c1db477501a36945e05bc50a1d563c96c4e13f4 (diff) | |
Documentation and general text tidies in preparation for test release.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@654 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'doc/pcretest.txt')
-rw-r--r-- | doc/pcretest.txt | 335 |
1 files changed, 202 insertions, 133 deletions
diff --git a/doc/pcretest.txt b/doc/pcretest.txt index 7f67d6f..a7c42fa 100644 --- a/doc/pcretest.txt +++ b/doc/pcretest.txt @@ -7,26 +7,30 @@ NAME SYNOPSIS - pcretest [options] [source] [destination] + pcretest [options] [input file [output file]] pcretest was written as a test program for the PCRE regular expression library itself, but it can also be used for experimenting with regular expressions. This document describes the features of the test program; for details of the regular expressions themselves, see the pcrepattern documentation. For details of the PCRE library function calls and their - options, see the pcreapi documentation. + options, see the pcreapi documentation. The input for pcretest is a + sequence of regular expression patterns and strings to be matched, as + described below. The output shows the result of each match. Options on + the command line and the patterns control PCRE options and exactly what + is output. -OPTIONS +COMMAND LINE OPTIONS - -b Behave as if each regex has the /B (show bytecode) modifier; - the internal form is output after compilation. + -b Behave as if each pattern has the /B (show byte code) modi- + fier; the internal form is output after compilation. -C Output the version number of the PCRE library, and all avail- able information about the optional features that are included, and then exit. - -d Behave as if each regex has the /D (debug) modifier; the + -d Behave as if each pattern has the /D (debug) modifier; the internal form and information about the compiled pattern is output after compilation; -d is equivalent to -b -i. @@ -37,7 +41,7 @@ OPTIONS -help Output a brief summary these options and then exit. - -i Behave as if each regex has the /I modifier; information + -i Behave as if each pattern has the /I modifier; information about the compiled pattern is given after compilation. 
-M Behave as if each data line contains the \M escape sequence; @@ -47,33 +51,52 @@ OPTIONS -m Output the size of each compiled pattern after it has been compiled. This is equivalent to adding /M to each regular - expression. For compatibility with earlier versions of - pcretest, -s is a synonym for -m. - - -o osize Set the number of elements in the output vector that is used - when calling pcre_exec() or pcre_dfa_exec() to be osize. The - default value is 45, which is enough for 14 capturing subex- - pressions for pcre_exec() or 22 different matches for - pcre_dfa_exec(). The vector size can be changed for individ- - ual matching calls by including \O in the data line (see + expression. + + -o osize Set the number of elements in the output vector that is used + when calling pcre_exec() or pcre_dfa_exec() to be osize. The + default value is 45, which is enough for 14 capturing subex- + pressions for pcre_exec() or 22 different matches for + pcre_dfa_exec(). The vector size can be changed for individ- + ual matching calls by including \O in the data line (see below). - -p Behave as if each regex has the /P modifier; the POSIX wrap- - per API is used to call PCRE. None of the other options has - any effect when -p is set. + -p Behave as if each pattern has the /P modifier; the POSIX + wrapper API is used to call PCRE. None of the other options + has any effect when -p is set. - -q Do not output the version number of pcretest at the start of + -q Do not output the version number of pcretest at the start of execution. - -S size On Unix-like systems, set the size of the runtime stack to + -S size On Unix-like systems, set the size of the run-time stack to size megabytes. - -t Run each compile, study, and match many times with a timer, - and output resulting time per compile or match (in millisec- - onds). Do not set -m with -t, because you will then get the - size output a zillion times, and the timing will be dis- - torted. 
You can control the number of iterations that are - used for timing by following -t with a number (as a separate + -s Behave as if each pattern has the /S modifier; in other + words, force each pattern to be studied. If the /I or /D + option is present on a pattern (requesting output about the + compiled pattern), information about the result of studying + is not included when studying is caused only by -s and nei- + ther -i nor -d is present on the command line. This behaviour + means that the output from tests that are run with and with- + out -s should be identical, except when options that output + information about the actual running of a match are set. The + -M, -t, and -tm options, which give information about + resources used, are likely to produce different output with + and without -s. Output may also differ if the /C option is + present on an individual pattern. This uses callouts to trace + the the matching process, and this may be different between + studied and non-studied patterns. If the pattern contains + (*MARK) items there may also be differences, for the same + reason. The -s command line option can be overridden for spe- + cific patterns that should never be studied (see the /S + option below). + + -t Run each compile, study, and match many times with a timer, + and output resulting time per compile or match (in millisec- + onds). Do not set -m with -t, because you will then get the + size output a zillion times, and the timing will be dis- + torted. You can control the number of iterations that are + used for timing by following -t with a number (as a separate item on the command line). For example, "-t 1000" would iter- ate 1000 times. The default is to iterate 500000 times. @@ -83,78 +106,78 @@ OPTIONS DESCRIPTION - If pcretest is given two filename arguments, it reads from the first + If pcretest is given two filename arguments, it reads from the first and writes to the second. 
If it is given only one filename argument, it - reads from that file and writes to stdout. Otherwise, it reads from - stdin and writes to stdout, and prompts for each line of input, using + reads from that file and writes to stdout. Otherwise, it reads from + stdin and writes to stdout, and prompts for each line of input, using "re>" to prompt for regular expressions, and "data>" to prompt for data lines. - When pcretest is built, a configuration option can specify that it - should be linked with the libreadline library. When this is done, if + When pcretest is built, a configuration option can specify that it + should be linked with the libreadline library. When this is done, if the input is from a terminal, it is read using the readline() function. - This provides line-editing and history facilities. The output from the + This provides line-editing and history facilities. The output from the -help option states whether or not readline() will be used. The program handles any number of sets of input on a single input file. - Each set starts with a regular expression, and continues with any num- + Each set starts with a regular expression, and continues with any num- ber of data lines to be matched against the pattern. - Each data line is matched separately and independently. If you want to + Each data line is matched separately and independently. If you want to do multi-line matches, you have to use the \n escape sequence (or \r or \r\n, etc., depending on the newline setting) in a single line of input - to encode the newline sequences. There is no limit on the length of - data lines; the input buffer is automatically extended if it is too + to encode the newline sequences. There is no limit on the length of + data lines; the input buffer is automatically extended if it is too small. - An empty line signals the end of the data lines, at which point a new - regular expression is read. 
The regular expressions are given enclosed + An empty line signals the end of the data lines, at which point a new + regular expression is read. The regular expressions are given enclosed in any non-alphanumeric delimiters other than backslash, for example: /(a|bc)x+yz/ - White space before the initial delimiter is ignored. A regular expres- - sion may be continued over several input lines, in which case the new- - line characters are included within it. It is possible to include the + White space before the initial delimiter is ignored. A regular expres- + sion may be continued over several input lines, in which case the new- + line characters are included within it. It is possible to include the delimiter within the pattern by escaping it, for example /abc\/def/ - If you do so, the escape and the delimiter form part of the pattern, - but since delimiters are always non-alphanumeric, this does not affect - its interpretation. If the terminating delimiter is immediately fol- + If you do so, the escape and the delimiter form part of the pattern, + but since delimiters are always non-alphanumeric, this does not affect + its interpretation. If the terminating delimiter is immediately fol- lowed by a backslash, for example, /abc/\ - then a backslash is added to the end of the pattern. This is done to - provide a way of testing the error condition that arises if a pattern + then a backslash is added to the end of the pattern. This is done to + provide a way of testing the error condition that arises if a pattern finishes with a backslash, because /abc\/ - is interpreted as the first line of a pattern that starts with "abc/", + is interpreted as the first line of a pattern that starts with "abc/", causing pcretest to read the next line as a continuation of the regular expression. PATTERN MODIFIERS - A pattern may be followed by any number of modifiers, which are mostly - single characters. 
Following Perl usage, these are referred to below - as, for example, "the /i modifier", even though the delimiter of the - pattern need not always be a slash, and no slash is used when writing - modifiers. Whitespace may appear between the final pattern delimiter + A pattern may be followed by any number of modifiers, which are mostly + single characters. Following Perl usage, these are referred to below + as, for example, "the /i modifier", even though the delimiter of the + pattern need not always be a slash, and no slash is used when writing + modifiers. White space may appear between the final pattern delimiter and the first modifier, and between the modifiers themselves. The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE, - PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre_com- - pile() is called. These four modifier letters have the same effect as + PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre_com- + pile() is called. These four modifier letters have the same effect as they do in Perl. For example: /caseless/i - The following table shows additional modifiers for setting PCRE com- + The following table shows additional modifiers for setting PCRE com- pile-time options that do not correspond to anything in Perl: /8 PCRE_UTF8 @@ -178,48 +201,59 @@ PATTERN MODIFIERS /<bsr_anycrlf> PCRE_BSR_ANYCRLF /<bsr_unicode> PCRE_BSR_UNICODE - The modifiers that are enclosed in angle brackets are literal strings - as shown, including the angle brackets, but the letters can be in - either case. This example sets multiline matching with CRLF as the line - ending sequence: + The modifiers that are enclosed in angle brackets are literal strings + as shown, including the angle brackets, but the letters within can be + in either case. 
This example sets multiline matching with CRLF as the + line ending sequence: - /^abc/m<crlf> + /^abc/m<CRLF> As well as turning on the PCRE_UTF8 option, the /8 modifier also causes - any non-printing characters in output strings to be printed using the - \x{hh...} notation if they are valid UTF-8 sequences. Full details of + any non-printing characters in output strings to be printed using the + \x{hh...} notation if they are valid UTF-8 sequences. Full details of the PCRE options are given in the pcreapi documentation. Finding all matches in a string - Searching for all possible matches within each subject string can be - requested by the /g or /G modifier. After finding a match, PCRE is + Searching for all possible matches within each subject string can be + requested by the /g or /G modifier. After finding a match, PCRE is called again to search the remainder of the subject string. The differ- ence between /g and /G is that the former uses the startoffset argument - to pcre_exec() to start searching at a new point within the entire - string (which is in effect what Perl does), whereas the latter passes - over a shortened substring. This makes a difference to the matching + to pcre_exec() to start searching at a new point within the entire + string (which is in effect what Perl does), whereas the latter passes + over a shortened substring. This makes a difference to the matching process if the pattern begins with a lookbehind assertion (including \b or \B). - If any call to pcre_exec() in a /g or /G sequence matches an empty - string, the next call is done with the PCRE_NOTEMPTY_ATSTART and - PCRE_ANCHORED flags set in order to search for another, non-empty, - match at the same point. If this second match fails, the start offset - is advanced, and the normal match is retried. 
This imitates the way + If any call to pcre_exec() in a /g or /G sequence matches an empty + string, the next call is done with the PCRE_NOTEMPTY_ATSTART and + PCRE_ANCHORED flags set in order to search for another, non-empty, + match at the same point. If this second match fails, the start offset + is advanced, and the normal match is retried. This imitates the way Perl handles such cases when using the /g modifier or the split() func- - tion. Normally, the start offset is advanced by one character, but if - the newline convention recognizes CRLF as a newline, and the current + tion. Normally, the start offset is advanced by one character, but if + the newline convention recognizes CRLF as a newline, and the current character is CR followed by LF, an advance of two is used. Other modifiers There are yet more modifiers for controlling the way pcretest operates. - The /+ modifier requests that as well as outputting the substring that - matched the entire pattern, pcretest should in addition output the - remainder of the subject string. This is useful for tests where the - subject contains multiple copies of the same substring. + The /+ modifier requests that as well as outputting the substring that + matched the entire pattern, pcretest should in addition output the + remainder of the subject string. This is useful for tests where the + subject contains multiple copies of the same substring. If the + modi- + fier appears twice, the same action is taken for captured substrings. + In each case the remainder is output on the following line with a plus + character following the capture number. + + The /= modifier requests that the values of all potential captured + parentheses be output after a match by pcre_exec(). By default, only + those up to the highest one actually used in the match are output (cor- + responding to the return code from pcre_exec()). 
Values in the offsets + vector corresponding to higher numbers should be set to -1, and these + are output as "<unset>". This modifier gives a way of checking that + this is happening. The /B modifier is a debugging feature. It requests that pcretest out- put a representation of the compiled byte code after compilation. Nor- @@ -270,8 +304,14 @@ PATTERN MODIFIERS The /M modifier causes the size of memory block used to hold the com- piled pattern to be output. - The /S modifier causes pcre_study() to be called after the expression - has been compiled, and the results used when the expression is matched. + If the /S modifier appears once, it causes pcre_study() to be called + after the expression has been compiled, and the results used when the + expression is matched. If /S appears twice, it suppresses studying, + even if it was requested externally by the -s command line option. This + makes it possible to specify that certain patterns are always studied, + and others are never studied, independently of -s. This feature is used + in the test files in a few cases where the output is different when the + pattern is studied. The /T modifier must be followed by a single digit. It causes a spe- cific set of built-in character tables to be passed to pcre_compile(). @@ -306,7 +346,7 @@ PATTERN MODIFIERS DATA LINES Before each data line is passed to pcre_exec(), leading and trailing - whitespace is removed, and it is then scanned for \ escapes. Some of + white space is removed, and it is then scanned for \ escapes. Some of these are pretty esoteric features, intended for checking out some of the more complicated features of PCRE. If you are just testing "ordi- nary" regular expressions, you probably don't need any of these. 
The @@ -315,7 +355,7 @@ DATA LINES \a alarm (BEL, \x07) \b backspace (\x08) \e escape (\x27) - \f formfeed (\x0c) + \f form feed (\x0c) \n newline (\x0a) \qdd set the PCRE_MATCH_LIMIT limit to dd (any number of digits) @@ -463,11 +503,14 @@ DEFAULT OUTPUT FROM PCRETEST (Note that this is the entire substring that was inspected during the partial match; it may include characters before the actual match start if a lookbehind assertion, \K, \b, or \B was involved.) For any other - returns, it outputs the PCRE negative error number. Here is an example - of an interactive pcretest run. + return, pcretest outputs the PCRE negative error number and a short + descriptive phrase. If the error is a failed UTF-8 string check, the + byte offset of the start of the failing character and the reason code + are also output, provided that the size of the output vector is at + least two. Here is an example of an interactive pcretest run. $ pcretest - PCRE version 7.0 30-Nov-2006 + PCRE version 8.13 2011-04-30 re> /^abc(\d+)/ data> abc123 @@ -476,12 +519,12 @@ DEFAULT OUTPUT FROM PCRETEST data> xyz No match - Note that unset capturing substrings that are not followed by one that - is set are not returned by pcre_exec(), and are not shown by pcretest. - In the following example, there are two capturing substrings, but when - the first data line is matched, the second, unset substring is not - shown. An "internal" unset substring is shown as "<unset>", as for the - second data line. + Unset capturing substrings that are not followed by one that is set are + not returned by pcre_exec(), and are not shown by pcretest. In the fol- + lowing example, there are two capturing substrings, but when the first + data line is matched, the second, unset substring is not shown. An + "internal" unset substring is shown as "<unset>", as for the second + data line. 
re> /(a)|(b)/ data> a @@ -492,11 +535,11 @@ DEFAULT OUTPUT FROM PCRETEST 1: <unset> 2: b - If the strings contain any non-printing characters, they are output as - \0x escapes, or as \x{...} escapes if the /8 modifier was present on - the pattern. See below for the definition of non-printing characters. - If the pattern has the /+ modifier, the output for substring 0 is fol- - lowed by the the rest of the subject string, identified by "0+" like + If the strings contain any non-printing characters, they are output as + \0x escapes, or as \x{...} escapes if the /8 modifier was present on + the pattern. See below for the definition of non-printing characters. + If the pattern has the /+ modifier, the output for substring 0 is fol- + lowed by the the rest of the subject string, identified by "0+" like this: re> /cat/+ @@ -504,7 +547,7 @@ DEFAULT OUTPUT FROM PCRETEST 0: cat 0+ aract - If the pattern has the /g or /G modifier, the results of successive + If the pattern has the /g or /G modifier, the results of successive matching attempts are output in sequence, like this: re> /\Bi(\w\w)/g @@ -516,26 +559,32 @@ DEFAULT OUTPUT FROM PCRETEST 0: ipp 1: pp - "No match" is output only if the first match attempt fails. + "No match" is output only if the first match attempt fails. Here is an + example of a failure message (the offset 4 that is specified by \>4 is + past the end of the subject string): - If any of the sequences \C, \G, or \L are present in a data line that - is successfully matched, the substrings extracted by the convenience + re> /xyz/ + data> xyz\>4 + Error -24 (bad offset value) + + If any of the sequences \C, \G, or \L are present in a data line that + is successfully matched, the substrings extracted by the convenience functions are output with C, G, or L after the string number instead of a colon. This is in addition to the normal full list. 
The string length - (that is, the return from the extraction function) is given in paren- + (that is, the return from the extraction function) is given in paren- theses after each string for \C and \G. Note that whereas patterns can be continued over several lines (a plain ">" prompt is used for continuations), data lines may not. However new- - lines can be included in data by means of the \n escape (or \r, \r\n, + lines can be included in data by means of the \n escape (or \r, \r\n, etc., depending on the newline sequence setting). OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION - When the alternative matching function, pcre_dfa_exec(), is used (by - means of the \D escape sequence or the -dfa command line option), the - output consists of a list of all the matches that start at the first + When the alternative matching function, pcre_dfa_exec(), is used (by + means of the \D escape sequence or the -dfa command line option), the + output consists of a list of all the matches that start at the first point in the subject where there is at least one match. For example: re> /(tang|tangerine|tan)/ @@ -544,11 +593,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION 1: tang 2: tan - (Using the normal matching function on this data finds only "tang".) - The longest matching string is always given first (and numbered zero). + (Using the normal matching function on this data finds only "tang".) + The longest matching string is always given first (and numbered zero). After a PCRE_ERROR_PARTIAL return, the output is "Partial match:", fol- - lowed by the partially matching substring. (Note that this is the - entire substring that was inspected during the partial match; it may + lowed by the partially matching substring. (Note that this is the + entire substring that was inspected during the partial match; it may include characters before the actual match start if a lookbehind asser- tion, \K, \b, or \B was involved.) 
@@ -564,16 +613,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION 1: tan 0: tan - Since the matching function does not support substring capture, the - escape sequences that are concerned with captured substrings are not + Since the matching function does not support substring capture, the + escape sequences that are concerned with captured substrings are not relevant. RESTARTING AFTER A PARTIAL MATCH When the alternative matching function has given the PCRE_ERROR_PARTIAL - return, indicating that the subject partially matched the pattern, you - can restart the match with additional subject data by means of the \R + return, indicating that the subject partially matched the pattern, you + can restart the match with additional subject data by means of the \R escape sequence. For example: re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ @@ -582,30 +631,30 @@ RESTARTING AFTER A PARTIAL MATCH data> n05\R\D 0: n05 - For further information about partial matching, see the pcrepartial + For further information about partial matching, see the pcrepartial documentation. CALLOUTS - If the pattern contains any callout requests, pcretest's callout func- - tion is called during matching. This works with both matching func- + If the pattern contains any callout requests, pcretest's callout func- + tion is called during matching. This works with both matching func- tions. By default, the called function displays the callout number, the - start and current positions in the text at the callout time, and the + start and current positions in the text at the callout time, and the next pattern item to be tested. For example, the output --->pqrabcdef 0 ^ ^ \d - indicates that callout number 0 occurred for a match attempt starting - at the fourth character of the subject string, when the pointer was at - the seventh character of the data, and when the next pattern item was - \d. 
Just one circumflex is output if the start and current positions + indicates that callout number 0 occurred for a match attempt starting + at the fourth character of the subject string, when the pointer was at + the seventh character of the data, and when the next pattern item was + \d. Just one circumflex is output if the start and current positions are the same. Callouts numbered 255 are assumed to be automatic callouts, inserted as - a result of the /C pattern modifier. In this case, instead of showing - the callout number, the offset in the pattern, preceded by a plus, is + a result of the /C pattern modifier. In this case, instead of showing + the callout number, the offset in the pattern, preceded by a plus, is output. For example: re> /\d?[A-E]\*/C @@ -617,9 +666,29 @@ CALLOUTS +10 ^ ^ 0: E* + If a pattern contains (*MARK) items, an additional line is output when- + ever a change of latest mark is passed to the callout function. For + example: + + re> /a(*MARK:X)bc/C + data> abc + --->abc + +0 ^ a + +1 ^^ (*MARK:X) + +10 ^^ b + Latest Mark: X + +11 ^ ^ c + +12 ^ ^ + 0: abc + + The mark changes between matching "a" and "b", but stays the same for + the rest of the match, so nothing more is output. If, as a result of + backtracking, the mark reverts to being unset, the text "<unset>" is + output. + The callout function in pcretest returns zero (carry on matching) by default, but you can use a \C item in a data line (as described above) - to change this. + to change this and other parameters of the callout. Inserting callouts can be helpful when using pcretest to check compli- cated regular expressions. For further information about callouts, see @@ -641,8 +710,8 @@ NON-PRINTING CHARACTERS SAVING AND RELOADING COMPILED PATTERNS The facilities described in this section are not available when the - POSIX inteface to PCRE is being used, that is, when the /P pattern mod- - ifier is specified. 
+ POSIX interface to PCRE is being used, that is, when the /P pattern + modifier is specified. When the POSIX interface is not in use, you can cause pcretest to write a compiled pattern to a file, by following the modifiers with > and a @@ -663,13 +732,13 @@ SAVING AND RELOADING COMPILED PATTERNS diately after the compiled pattern. After writing the file, pcretest expects to read a new pattern. - A saved pattern can be reloaded into pcretest by specifing < and a file - name instead of a pattern. The name of the file must not contain a < - character, as otherwise pcretest will interpret the line as a pattern + A saved pattern can be reloaded into pcretest by specifying < and a + file name instead of a pattern. The name of the file must not contain a + < character, as otherwise pcretest will interpret the line as a pattern delimited by < characters. For example: re> </some/file - Compiled regex loaded from /some/file + Compiled pattern loaded from /some/file No study data When the pattern has been loaded, pcretest proceeds to read data lines @@ -709,5 +778,5 @@ AUTHOR REVISION - Last updated: 21 November 2010 - Copyright (c) 1997-2010 University of Cambridge. + Last updated: 01 August 2011 + Copyright (c) 1997-2011 University of Cambridge. |
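
Note on the -o text in the diff above: the figures it quotes (a default osize of 45 giving room for 14 capturing subexpressions with pcre_exec() or 22 matches with pcre_dfa_exec()) follow from how each function uses the output vector. The sketch below only reproduces that arithmetic. It assumes the layout described in the pcreapi documentation rather than anything in this diff: pcre_exec() returns offset pairs in the first two-thirds of the vector and keeps the last third as workspace, while pcre_dfa_exec() fills the whole vector with offset pairs. The function names are hypothetical and are not part of PCRE or pcretest.

```python
def exec_capture_groups(osize: int) -> int:
    """Capturing subexpressions that fit in a pcre_exec() vector of osize ints.

    Assumption (from pcreapi, not from this diff): only the first two-thirds
    of the vector holds offset pairs, so osize // 3 pairs are available, and
    pair 0 is taken by the overall match.
    """
    pairs = osize // 3
    return max(pairs - 1, 0)


def dfa_match_slots(osize: int) -> int:
    """Matches that fit in a pcre_dfa_exec() vector of osize ints.

    Assumption (from pcreapi): every match takes one (start, end) pair and
    the whole vector is usable, so osize // 2 matches can be returned.
    """
    return osize // 2


if __name__ == "__main__":
    default_osize = 45  # default stated in the -o description
    print(exec_capture_groups(default_osize))  # 14
    print(dfa_match_slots(default_osize))      # 22
```

With the default of 45 this gives 14 and 22, matching the documented values; the same two functions can be used to pick an osize (or a \O value on a data line) large enough for a pattern with more groups.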