author | nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2007-02-24 21:40:30 +0000 |
---|---|---|
committer | nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2007-02-24 21:40:30 +0000 |
commit | 568064fe47fac0c7caffc684e6ea227ad8127b70 (patch) | |
tree | 8891ffb119dd71e1d00083046876bdb6abbf28f9 /doc/pcretest.txt | |
parent | 4af6fcff808e079ca1aa09104d6146baa932af47 (diff) | |
Load pcre-4.5 into code/trunk.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@73 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'doc/pcretest.txt')
-rw-r--r-- | doc/pcretest.txt | 632 |
1 files changed, 301 insertions, 331 deletions
diff --git a/doc/pcretest.txt b/doc/pcretest.txt index 4fa9ca4..0e9cd13 100644 --- a/doc/pcretest.txt +++ b/doc/pcretest.txt @@ -1,387 +1,357 @@ -NAME - pcretest - a program for testing Perl-compatible regular - expressions. +PCRETEST(1) PCRETEST(1) + +NAME + pcretest - a program for testing Perl-compatible regular expressions. + SYNOPSIS - pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source] [des- - tination] + pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source] [destination] - pcretest was written as a test program for the PCRE regular - expression library itself, but it can also be used for - experimenting with regular expressions. This document - describes the features of the test program; for details of - the regular expressions themselves, see the pcrepattern - documentation. For details of PCRE and its options, see the - pcreapi documentation. + pcretest was written as a test program for the PCRE regular expression + library itself, but it can also be used for experimenting with regular + expressions. This document describes the features of the test program; + for details of the regular expressions themselves, see the pcrepattern + documentation. For details of PCRE and its options, see the pcreapi + documentation. OPTIONS - -C Output the version number of the PCRE library, and - all available information about the optional - features that are included, and then exit. + -C Output the version number of the PCRE library, and all avail- + able information about the optional features that are + included, and then exit. - -d Behave as if each regex had the /D modifier (see - below); the internal form is output after compila- - tion. + -d Behave as if each regex had the /D modifier (see below); the + internal form is output after compilation. - -i Behave as if each regex had the /I modifier; - information about the compiled pattern is given - after compilation. 
+ -i Behave as if each regex had the /I modifier; information + about the compiled pattern is given after compilation. - -m Output the size of each compiled pattern after it - has been compiled. This is equivalent to adding /M - to each regular expression. For compatibility with - earlier versions of pcretest, -s is a synonym for - -m. + -m Output the size of each compiled pattern after it has been + compiled. This is equivalent to adding /M to each regular + expression. For compatibility with earlier versions of + pcretest, -s is a synonym for -m. - -o osize Set the number of elements in the output vector - that is used when calling PCRE to be osize. The - default value is 45, which is enough for 14 cap- - turing subexpressions. The vector size can be - changed for individual matching calls by including - \O in the data line (see below). + -o osize Set the number of elements in the output vector that is used + when calling PCRE to be osize. The default value is 45, which + is enough for 14 capturing subexpressions. The vector size + can be changed for individual matching calls by including \O + in the data line (see below). - -p Behave as if each regex has /P modifier; the POSIX - wrapper API is used to call PCRE. None of the - other options has any effect when -p is set. + -p Behave as if each regex has /P modifier; the POSIX wrapper + API is used to call PCRE. None of the other options has any + effect when -p is set. - -t Run each compile, study, and match many times with - a timer, and output resulting time per compile or - match (in milliseconds). Do not set -t with -m, - because you will then get the size output 20000 - times and the timing will be distorted. + -t Run each compile, study, and match many times with a timer, + and output resulting time per compile or match (in millisec- + onds). Do not set -t with -m, because you will then get the + size output 20000 times and the timing will be distorted. 
DESCRIPTION - If pcretest is given two filename arguments, it reads from - the first and writes to the second. If it is given only one - filename argument, it reads from that file and writes to - stdout. Otherwise, it reads from stdin and writes to stdout, - and prompts for each line of input, using "re>" to prompt - for regular expressions, and "data>" to prompt for data - lines. + If pcretest is given two filename arguments, it reads from the first + and writes to the second. If it is given only one filename argument, it + reads from that file and writes to stdout. Otherwise, it reads from + stdin and writes to stdout, and prompts for each line of input, using + "re>" to prompt for regular expressions, and "data>" to prompt for data + lines. - The program handles any number of sets of input on a single - input file. Each set starts with a regular expression, and - continues with any number of data lines to be matched - against the pattern. + The program handles any number of sets of input on a single input file. + Each set starts with a regular expression, and continues with any num- + ber of data lines to be matched against the pattern. - Each line is matched separately and independently. If you - want to do multiple-line matches, you have to use the \n - escape sequence in a single line of input to encode the new- - line characters. The maximum length of data line is 30,000 - characters. + Each line is matched separately and independently. If you want to do + multiple-line matches, you have to use the \n escape sequence in a sin- + gle line of input to encode the newline characters. The maximum length + of data line is 30,000 characters. - An empty line signals the end of the data lines, at which - point a new regular expression is read. The regular expres- - sions are given enclosed in any non-alphameric delimiters - other than backslash, for example + An empty line signals the end of the data lines, at which point a new + regular expression is read. 
The regular expressions are given enclosed + in any non-alphameric delimiters other than backslash, for example - /(a|bc)x+yz/ + /(a|bc)x+yz/ - White space before the initial delimiter is ignored. A regu- - lar expression may be continued over several input lines, in - which case the newline characters are included within it. It - is possible to include the delimiter within the pattern by - escaping it, for example + White space before the initial delimiter is ignored. A regular expres- + sion may be continued over several input lines, in which case the new- + line characters are included within it. It is possible to include the + delimiter within the pattern by escaping it, for example - /abc\/def/ + /abc\/def/ - If you do so, the escape and the delimiter form part of the - pattern, but since delimiters are always non-alphameric, - this does not affect its interpretation. If the terminating - delimiter is immediately followed by a backslash, for exam- - ple, + If you do so, the escape and the delimiter form part of the pattern, + but since delimiters are always non-alphameric, this does not affect + its interpretation. If the terminating delimiter is immediately fol- + lowed by a backslash, for example, - /abc/\ + /abc/\ - then a backslash is added to the end of the pattern. This is - done to provide a way of testing the error condition that - arises if a pattern finishes with a backslash, because + then a backslash is added to the end of the pattern. This is done to + provide a way of testing the error condition that arises if a pattern + finishes with a backslash, because - /abc\/ + /abc\/ - is interpreted as the first line of a pattern that starts - with "abc/", causing pcretest to read the next line as a - continuation of the regular expression. + is interpreted as the first line of a pattern that starts with "abc/", + causing pcretest to read the next line as a continuation of the regular + expression. 
PATTERN MODIFIERS - The pattern may be followed by i, m, s, or x to set the - PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED - options, respectively. For example: - - /caseless/i - - These modifier letters have the same effect as they do in - Perl. There are others that set PCRE options that do not - correspond to anything in Perl: /A, /E, /N, /U, and /X set - PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, PCRE_NO_AUTO_CAPTURE, - PCRE_UNGREEDY, and PCRE_EXTRA respectively. - - Searching for all possible matches within each subject - string can be requested by the /g or /G modifier. After - finding a match, PCRE is called again to search the - remainder of the subject string. The difference between /g - and /G is that the former uses the startoffset argument to - pcre_exec() to start searching at a new point within the - entire string (which is in effect what Perl does), whereas - the latter passes over a shortened substring. This makes a - difference to the matching process if the pattern begins - with a lookbehind assertion (including \b or \B). - - If any call to pcre_exec() in a /g or /G sequence matches an - empty string, the next call is done with the PCRE_NOTEMPTY - and PCRE_ANCHORED flags set in order to search for another, - non-empty, match at the same point. If this second match - fails, the start offset is advanced by one, and the normal - match is retried. This imitates the way Perl handles such - cases when using the /g modifier or the split() function. - - There are a number of other modifiers for controlling the - way pcretest operates. - - The /+ modifier requests that as well as outputting the sub- - string that matched the entire pattern, pcretest should in - addition output the remainder of the subject string. This is - useful for tests where the subject contains multiple copies - of the same substring. 
- - The /L modifier must be followed directly by the name of a - locale, for example, - - /pattern/Lfr - - For this reason, it must be the last modifier letter. The - given locale is set, pcre_maketables() is called to build a - set of character tables for the locale, and this is then - passed to pcre_compile() when compiling the regular expres- - sion. Without an /L modifier, NULL is passed as the tables - pointer; that is, /L applies only to the expression on which - it appears. - - The /I modifier requests that pcretest output information - about the compiled expression (whether it is anchored, has a - fixed first character, and so on). It does this by calling - pcre_fullinfo() after compiling an expression, and output- - ting the information it gets back. If the pattern is stu- - died, the results of that are also output. - - The /D modifier is a PCRE debugging feature, which also - assumes /I. It causes the internal form of compiled regular - expressions to be output after compilation. If the pattern - was studied, the information returned is also output. - - The /S modifier causes pcre_study() to be called after the - expression has been compiled, and the results used when the - expression is matched. - - The /M modifier causes the size of memory block used to hold - the compiled pattern to be output. - - The /P modifier causes pcretest to call PCRE via the POSIX - wrapper API rather than its native API. When this is done, - all other modifiers except /i, /m, and /+ are ignored. - REG_ICASE is set if /i is present, and REG_NEWLINE is set if - /m is present. The wrapper functions force - PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless - REG_NEWLINE is set. - - The /8 modifier causes pcretest to call PCRE with the - PCRE_UTF8 option set. This turns on support for UTF-8 char- - acter handling in PCRE, provided that it was compiled with - this support enabled. 
This modifier also causes any non- - printing characters in output strings to be printed using - the \x{hh...} notation if they are valid UTF-8 sequences. - - If the /? modifier is used with /8, it causes pcretest to - call pcre_compile() with the PCRE_NO_UTF8_CHECK option, to - suppress the checking of the string for UTF-8 validity. + The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS, + PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. + For example: + + /caseless/i + + These modifier letters have the same effect as they do in Perl. There + are others that set PCRE options that do not correspond to anything in + Perl: /A, /E, /N, /U, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, + PCRE_NO_AUTO_CAPTURE, PCRE_UNGREEDY, and PCRE_EXTRA respectively. + + Searching for all possible matches within each subject string can be + requested by the /g or /G modifier. After finding a match, PCRE is + called again to search the remainder of the subject string. The differ- + ence between /g and /G is that the former uses the startoffset argument + to pcre_exec() to start searching at a new point within the entire + string (which is in effect what Perl does), whereas the latter passes + over a shortened substring. This makes a difference to the matching + process if the pattern begins with a lookbehind assertion (including \b + or \B). + + If any call to pcre_exec() in a /g or /G sequence matches an empty + string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED + flags set in order to search for another, non-empty, match at the same + point. If this second match fails, the start offset is advanced by + one, and the normal match is retried. This imitates the way Perl han- + dles such cases when using the /g modifier or the split() function. + + There are a number of other modifiers for controlling the way pcretest + operates. 
+ + The /+ modifier requests that as well as outputting the substring that + matched the entire pattern, pcretest should in addition output the + remainder of the subject string. This is useful for tests where the + subject contains multiple copies of the same substring. + + The /L modifier must be followed directly by the name of a locale, for + example, + + /pattern/Lfr + + For this reason, it must be the last modifier letter. The given locale + is set, pcre_maketables() is called to build a set of character tables + for the locale, and this is then passed to pcre_compile() when compil- + ing the regular expression. Without an /L modifier, NULL is passed as + the tables pointer; that is, /L applies only to the expression on which + it appears. + + The /I modifier requests that pcretest output information about the + compiled expression (whether it is anchored, has a fixed first charac- + ter, and so on). It does this by calling pcre_fullinfo() after compil- + ing an expression, and outputting the information it gets back. If the + pattern is studied, the results of that are also output. + + The /D modifier is a PCRE debugging feature, which also assumes /I. It + causes the internal form of compiled regular expressions to be output + after compilation. If the pattern was studied, the information returned + is also output. + + The /S modifier causes pcre_study() to be called after the expression + has been compiled, and the results used when the expression is matched. + + The /M modifier causes the size of memory block used to hold the com- + piled pattern to be output. + + The /P modifier causes pcretest to call PCRE via the POSIX wrapper API + rather than its native API. When this is done, all other modifiers + except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present, + and REG_NEWLINE is set if /m is present. The wrapper functions force + PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set. 
+ + The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option + set. This turns on support for UTF-8 character handling in PCRE, pro- + vided that it was compiled with this support enabled. This modifier + also causes any non-printing characters in output strings to be printed + using the \x{hh...} notation if they are valid UTF-8 sequences. + + If the /? modifier is used with /8, it causes pcretest to call + pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the + checking of the string for UTF-8 validity. CALLOUTS - If the pattern contains any callout requests, pcretest's - callout function will be called. By default, it displays the - callout number, and the start and current positions in the - text at the callout time. For example, the output + If the pattern contains any callout requests, pcretest's callout func- + tion will be called. By default, it displays the callout number, and + the start and current positions in the text at the callout time. For + example, the output - --->pqrabcdef - 0 ^ ^ + --->pqrabcdef + 0 ^ ^ - indicates that callout number 0 occurred for a match attempt - starting at the fourth character of the subject string, when - the pointer was at the seventh character. The callout func- - tion returns zero (carry on matching) by default. + indicates that callout number 0 occurred for a match attempt starting + at the fourth character of the subject string, when the pointer was at + the seventh character. The callout function returns zero (carry on + matching) by default. - Inserting callouts may be helpful when using pcretest to - check complicated regular expressions. For further informa- - tion about callouts, see the pcrecallout documentation. + Inserting callouts may be helpful when using pcretest to check compli- + cated regular expressions. For further information about callouts, see + the pcrecallout documentation. 
- For testing the PCRE library, additional control of callout - behaviour is available via escape sequences in the data, as - described in the following section. In particular, it is - possible to pass in a number as callout data (the default is - zero). If the callout function receives a non-zero number, - it returns that value instead of zero. + For testing the PCRE library, additional control of callout behaviour + is available via escape sequences in the data, as described in the fol- + lowing section. In particular, it is possible to pass in a number as + callout data (the default is zero). If the callout function receives a + non-zero number, it returns that value instead of zero. DATA LINES - Before each data line is passed to pcre_exec(), leading and - trailing whitespace is removed, and it is then scanned for \ - escapes. Some of these are pretty esoteric features, - intended for checking out some of the more complicated - features of PCRE. If you are just testing "ordinary" regular - expressions, you probably don't need any of these. 
The fol- - lowing escapes are recognized: - - \a alarm (= BEL) - \b backspace - \e escape - \f formfeed - \n newline - \r carriage return - \t tab - \v vertical tab - \nnn octal character (up to 3 octal digits) - \xhh hexadecimal character (up to 2 hex digits) - \x{hh...} hexadecimal character, any number of digits - in UTF-8 mode - \A pass the PCRE_ANCHORED option to pcre_exec() - \B pass the PCRE_NOTBOL option to pcre_exec() - \Cdd call pcre_copy_substring() for substring dd - after a successful match (any decimal number - less than 32) - \Cname call pcre_copy_named_substring() for substring - - "name" after a successful match (name termin- - ated by next non alphanumeric character) - \C+ show the current captured substrings at callout - time - \C- do not supply a callout function - \C!n return 1 instead of 0 when callout number n is - reached - \C!n!m return 1 instead of 0 when callout number n is - reached for the nth time - \C*n pass the number n (may be negative) as callout - data - \Gdd call pcre_get_substring() for substring dd - after a successful match (any decimal number - less than 32) - \Gname call pcre_get_named_substring() for substring - "name" after a successful match (name termin- - ated by next non-alphanumeric character) - \L call pcre_get_substringlist() after a - successful match - \M discover the minimum MATCH_LIMIT setting - \N pass the PCRE_NOTEMPTY option to pcre_exec() - \Odd set the size of the output vector passed to - pcre_exec() to dd (any number of decimal - digits) - \Z pass the PCRE_NOTEOL option to pcre_exec() - \? pass the PCRE_NO_UTF8_CHECK option to - pcre_exec() - - If \M is present, pcretest calls pcre_exec() several times, - with different values in the match_limit field of the - pcre_extra data structure, until it finds the minimum number - that is needed for pcre_exec() to complete. This number is a - measure of the amount of recursion and backtracking that - takes place, and checking it out can be instructive. 
For - most simple matches, the number is quite small, but for pat- - terns with very large numbers of matching possibilities, it - can become large very quickly with increasing length of sub- - ject string. - - When \O is used, it may be higher or lower than the size set - by the -O option (or defaulted to 45); \O applies only to - the call of pcre_exec() for the line in which it appears. - - A backslash followed by anything else just escapes the any- - thing else. If the very last character is a backslash, it is - ignored. This gives a way of passing an empty line as data, - since a real empty line terminates the data input. - - If /P was present on the regex, causing the POSIX wrapper - API to be used, only B, and Z have any effect, causing - REG_NOTBOL and REG_NOTEOL to be passed to regexec() respec- - tively. - The use of \x{hh...} to represent UTF-8 characters is not - dependent on the use of the /8 modifier on the pattern. It - is recognized always. There may be any number of hexadecimal - digits inside the braces. The result is from one to six - bytes, encoded according to the UTF-8 rules. + Before each data line is passed to pcre_exec(), leading and trailing + whitespace is removed, and it is then scanned for \ escapes. Some of + these are pretty esoteric features, intended for checking out some of + the more complicated features of PCRE. If you are just testing "ordi- + nary" regular expressions, you probably don't need any of these. 
The + following escapes are recognized: + + \a alarm (= BEL) + \b backspace + \e escape + \f formfeed + \n newline + \r carriage return + \t tab + \v vertical tab + \nnn octal character (up to 3 octal digits) + \xhh hexadecimal character (up to 2 hex digits) + \x{hh...} hexadecimal character, any number of digits + in UTF-8 mode + \A pass the PCRE_ANCHORED option to pcre_exec() + \B pass the PCRE_NOTBOL option to pcre_exec() + \Cdd call pcre_copy_substring() for substring dd + after a successful match (any decimal number + less than 32) + \Cname call pcre_copy_named_substring() for substring + "name" after a successful match (name termin- + ated by next non alphanumeric character) + \C+ show the current captured substrings at callout + time + \C- do not supply a callout function + \C!n return 1 instead of 0 when callout number n is + reached + \C!n!m return 1 instead of 0 when callout number n is + reached for the nth time + \C*n pass the number n (may be negative) as callout + data + \Gdd call pcre_get_substring() for substring dd + after a successful match (any decimal number + less than 32) + \Gname call pcre_get_named_substring() for substring + "name" after a successful match (name termin- + ated by next non-alphanumeric character) + \L call pcre_get_substringlist() after a + successful match + \M discover the minimum MATCH_LIMIT setting + \N pass the PCRE_NOTEMPTY option to pcre_exec() + \Odd set the size of the output vector passed to + pcre_exec() to dd (any number of decimal + digits) + \S output details of memory get/free calls during matching + \Z pass the PCRE_NOTEOL option to pcre_exec() + \? pass the PCRE_NO_UTF8_CHECK option to + pcre_exec() + + If \M is present, pcretest calls pcre_exec() several times, with dif- + ferent values in the match_limit field of the pcre_extra data struc- + ture, until it finds the minimum number that is needed for pcre_exec() + to complete. 
This number is a measure of the amount of recursion and + backtracking that takes place, and checking it out can be instructive. + For most simple matches, the number is quite small, but for patterns + with very large numbers of matching possibilities, it can become large + very quickly with increasing length of subject string. + + When \O is used, it may be higher or lower than the size set by the -O + option (or defaulted to 45); \O applies only to the call of pcre_exec() + for the line in which it appears. + + A backslash followed by anything else just escapes the anything else. + If the very last character is a backslash, it is ignored. This gives a + way of passing an empty line as data, since a real empty line termi- + nates the data input. + + If /P was present on the regex, causing the POSIX wrapper API to be + used, only 0 causing REG_NOTBOL and REG_NOTEOL to be passed to + regexec() respectively. + + The use of \x{hh...} to represent UTF-8 characters is not dependent on + the use of the /8 modifier on the pattern. It is recognized always. + There may be any number of hexadecimal digits inside the braces. The + result is from one to six bytes, encoded according to the UTF-8 rules. OUTPUT FROM PCRETEST - When a match succeeds, pcretest outputs the list of captured - substrings that pcre_exec() returns, starting with number 0 - for the string that matched the whole pattern. Here is an - example of an interactive pcretest run. - - $ pcretest - PCRE version 4.00 08-Jan-2003 - - re> /^abc(\d+)/ - data> abc123 - 0: abc123 - 1: 123 - data> xyz - No match - - If the strings contain any non-printing characters, they are - output as \0x escapes, or as \x{...} escapes if the /8 - modifier was present on the pattern. 
If the pattern has the - /+ modifier, then the output for substring 0 is followed by - the the rest of the subject string, identified by "0+" like - this: - - re> /cat/+ - data> cataract - 0: cat - 0+ aract - - If the pattern has the /g or /G modifier, the results of - successive matching attempts are output in sequence, like - this: - - re> /\Bi(\w\w)/g - data> Mississippi - 0: iss - 1: ss - 0: iss - 1: ss - 0: ipp - 1: pp - - "No match" is output only if the first match attempt fails. - - If any of the sequences \C, \G, or \L are present in a data - line that is successfully matched, the substrings extracted - by the convenience functions are output with C, G, or L - after the string number instead of a colon. This is in addi- - tion to the normal full list. The string length (that is, - the return from the extraction function) is given in - parentheses after each string for \C and \G. - - Note that while patterns can be continued over several lines - (a plain ">" prompt is used for continuations), data lines - may not. However newlines can be included in data by means - of the \n escape. + When a match succeeds, pcretest outputs the list of captured substrings + that pcre_exec() returns, starting with number 0 for the string that + matched the whole pattern. Here is an example of an interactive + pcretest run. + + $ pcretest + PCRE version 4.00 08-Jan-2003 + + re> /^abc(\d+)/ + data> abc123 + 0: abc123 + 1: 123 + data> xyz + No match + + If the strings contain any non-printing characters, they are output as + \0x escapes, or as \x{...} escapes if the /8 modifier was present on + the pattern. 
If the pattern has the /+ modifier, then the output for + substring 0 is followed by the the rest of the subject string, identi- + fied by "0+" like this: + + re> /cat/+ + data> cataract + 0: cat + 0+ aract + + If the pattern has the /g or /G modifier, the results of successive + matching attempts are output in sequence, like this: + + re> /\Bi(\w\w)/g + data> Mississippi + 0: iss + 1: ss + 0: iss + 1: ss + 0: ipp + 1: pp + + "No match" is output only if the first match attempt fails. + + If any of the sequences \C, \G, or \L are present in a data line that + is successfully matched, the substrings extracted by the convenience + functions are output with C, G, or L after the string number instead of + a colon. This is in addition to the normal full list. The string length + (that is, the return from the extraction function) is given in paren- + theses after each string for \C and \G. + + Note that while patterns can be continued over several lines (a plain + ">" prompt is used for continuations), data lines may not. However new- + lines can be included in data by means of the \n escape. AUTHOR - Philip Hazel <ph10@cam.ac.uk> - University Computing Service, - Cambridge CB2 3QG, England. + Philip Hazel <ph10@cam.ac.uk> + University Computing Service, + Cambridge CB2 3QG, England. -Last updated: 20 August 2003 +Last updated: 09 December 2003 Copyright (c) 1997-2003 University of Cambridge. |
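Note on the \x{hh...} escape described under DATA LINES in the diff above: the text says the result is "from one to six bytes, encoded according to the UTF-8 rules". The following is a minimal Python sketch of that original one-to-six-byte scheme, for illustration only. It is not pcretest's own C code, and the function name encode_utf8_legacy is hypothetical. It assumes the pre-2003 UTF-8 definition, which allowed values up to 0x7FFFFFFF.

```python
def encode_utf8_legacy(code_point: int) -> bytes:
    """Encode a value with the original 1-to-6-byte UTF-8 scheme.

    Illustrative sketch only; the name is hypothetical and this is not
    taken from the pcretest source.
    """
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("value out of range for 6-byte UTF-8")
    if code_point < 0x80:
        # Single byte: plain ASCII.
        return bytes([code_point])

    # (exclusive upper limit, total bytes, lead-byte marker bits)
    ranges = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, marker in ranges:
        if code_point < limit:
            break

    # Build continuation bytes from the low end, six bits at a time.
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # Whatever bits remain go into the lead byte.
    out.append(marker | code_point)
    return bytes(reversed(out))


if __name__ == "__main__":
    for value in (0x41, 0xE9, 0x20AC, 0x10348, 0x200000, 0x7FFFFFFF):
        print(f"\\x{{{value:x}}} -> {encode_utf8_legacy(value).hex(' ')}")
```

Values up to 0x10FFFF encode the same way as in modern UTF-8; the five- and six-byte forms exist only in the older definition that the man page refers to.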