author nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2007-02-24 21:39:05 +0000
committer nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2007-02-24 21:39:05 +0000
commit 8413b86222848f277386e72706ca548a37dbc6ca
tree aa68b52aa527385811d5e4af091c59609cc8fa03
parent 4864ac99ba4c4395fd8dc157ec734e228c780eb4
Load pcre-2.06 into code/trunk.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@35 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'README')
-rw-r--r-- README 143
1 files changed, 101 insertions, 42 deletions
diff --git a/README b/README
index 2db0070..190e75f 100644
--- a/README
+++ b/README
@@ -16,8 +16,23 @@ README file for PCRE (Perl-compatible regular expressions)
* possible to pass over a pointer to character tables built in the current *
* locale by pcre_maketables(). To use the default tables, this new argument *
* should be passed as NULL. *
+* *
+* IMPORTANT FOR THOSE UPGRADING FROM VERSION 2.05 *
+* *
+* Yet another (and again I hope this really is the last) change has been made *
+* to the API for the pcre_exec() function. An additional argument has been *
+* added to make it possible to start the match other than at the start of the *
+* subject string. This is important if there are lookbehinds. The new man *
+* page has the details, but if you just want to convert existing programs, *
+* all you need to do is to stick in a new fifth argument to pcre_exec(), with *
+* a value of zero. For example, change *
+* *
+* pcre_exec(pattern, extra, subject, length, options, ovec, ovecsize) *
+* to *
+* pcre_exec(pattern, extra, subject, length, 0, options, ovec, ovecsize) *
*******************************************************************************
+
The distribution should contain the following files:
ChangeLog log of changes to the code
@@ -45,7 +60,7 @@ The distribution should contain the following files:
testinput2 test data for error messages and non-Perl things
testinput3 test data, compatible with Perl 5.005
testinput4 test data for locale-specific tests
- testoutput1 test results corresponding to testinput
+ testoutput1 test results corresponding to testinput1
testoutput2 test results corresponding to testinput2
testoutput3 test results corresponding to testinput3
testoutput4 test results corresponding to testinput4
@@ -112,19 +127,20 @@ Character tables
PCRE uses four tables for manipulating and identifying characters. The final
argument of the pcre_compile() function is a pointer to a block of memory
-containing the concatenated tables. A call to pcre_maketables() is used to
-generate a set of tables in the current locale. However, if the final argument
-is passed as NULL, a set of default tables that is built into the binary is
-used.
+containing the concatenated tables. A call to pcre_maketables() can be used to
+generate a set of tables in the current locale. If the final argument for
+pcre_compile() is passed as NULL, a set of default tables that is built into
+the binary is used.
The source file called chartables.c contains the default set of tables. This is
not supplied in the distribution, but is built by the program dftables
(compiled from dftables.c), which uses the ANSI C character handling functions
such as isalnum(), isalpha(), isupper(), islower(), etc. to build the table
-sources. This means that the default C locale set your system will control the
-contents of the tables. You can change the default tables by editing
-chartables.c and then re-building PCRE. If you do this, you should probably
-also edit Makefile to ensure that the file doesn't ever get re-generated.
+sources. This means that the default C locale which is set for your system will
+control the contents of these default tables. You can change the default tables
+by editing chartables.c and then re-building PCRE. If you do this, you should
+probably also edit Makefile to ensure that the file doesn't ever get
+re-generated.
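
As a rough illustration of the first two tables described in this section
(lower casing and case flipping), here is a minimal Python sketch that builds
two 256-byte lookup tables. It is not the dftables program and does not
reproduce the layout of chartables.c. The name build_case_tables is
hypothetical, and Python's bytes methods handle ASCII letters only, which
roughly corresponds to generating the tables in the "C" locale.

```python
def build_case_tables():
    """Return (lower, flip): two 256-byte lookup tables indexed by byte value.

    Assumption of this sketch: only ASCII letters have case, as in the
    "C" locale. Every other byte maps to itself.
    """
    all_bytes = bytes(range(256))
    lower = all_bytes.lower()     # A-Z map to a-z, other bytes are unchanged
    flip = all_bytes.swapcase()   # A-Z <-> a-z, other bytes are unchanged
    return lower, flip


lower, flip = build_case_tables()
print(len(lower), len(flip))
print(chr(lower[ord("A")]), chr(flip[ord("a")]), chr(flip[ord("Z")]))
```

A lookup is then a single index operation, for example lower[ord("Q")], which
is the reason for storing the mappings as tables.
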
The first two 256-byte tables provide lower casing and case flipping functions,
respectively. The next table consists of three 32-byte bit maps which identify
@@ -178,9 +194,9 @@ example,
/abc/\
-then a backslash is added to the end of the pattern. This provides a way of
-testing the error condition that arises if a pattern finishes with a backslash,
-because
+then a backslash is added to the end of the pattern. This is done to provide a
+way of testing the error condition that arises if a pattern finishes with a
+backslash, because
/abc\/
@@ -188,42 +204,63 @@ is interpreted as the first line of a pattern that starts with "abc/", causing
pcretest to read the next line as a continuation of the regular expression.
The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
-PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. These
-options have the same effect as they do in Perl.
+PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For
+example:
+
+ /caseless/i
+
+These modifier letters have the same effect as they do in Perl. There are
+others which set PCRE options that do not correspond to anything in Perl: /A,
+/E, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.
+
+Searching for all possible matches within each subject string can be requested
+by the /g or /G modifier. The /g modifier behaves similarly to the way it does
+in Perl. After finding a match, PCRE is called again to search the remainder of
+the subject string. The difference between /g and /G is that the former uses
+the start_offset argument to pcre_exec() to start searching at a new point
+within the entire string, whereas the latter passes over a shortened substring.
+This makes a difference to the matching process if the pattern begins with a
+lookbehind assertion (including \b or \B).
-There are also some upper case options that do not match Perl options: /A, /E,
-and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.
+There are a number of other modifiers for controlling the way pcretest
+operates.
-The /L option must be followed directly by the name of a locale, for example,
+The /+ modifier requests that, as well as outputting the substring that matched
+the entire pattern, pcretest should also output the remainder of the
+subject string. This is useful for tests where the subject contains multiple
+copies of the same substring.
+
+The /L modifier must be followed directly by the name of a locale, for example,
/pattern/Lfr
-For this reason, it must be the last option letter. The given locale is set,
+For this reason, it must be the last modifier letter. The given locale is set,
pcre_maketables() is called to build a set of character tables for the locale,
and this is then passed to pcre_compile() when compiling the regular
-expression. Without an /L option, NULL is passed as the tables pointer; that
+expression. Without an /L modifier, NULL is passed as the tables pointer; that
is, /L applies only to the expression on which it appears.
-The /I option requests that pcretest output information about the compiled
+The /I modifier requests that pcretest output information about the compiled
expression (whether it is anchored, has a fixed first character, and so on). It
does this by calling pcre_info() after compiling an expression, and outputting
the information it gets back. If the pattern is studied, the results of that
are also output.
-The /D option is a PCRE debugging feature, which also assumes /I. It causes the
-internal form of compiled regular expressions to be output after compilation.
+The /D modifier is a PCRE debugging feature, which also assumes /I. It causes
+the internal form of compiled regular expressions to be output after
+compilation.
-The /S option causes pcre_study() to be called after the expression has been
+The /S modifier causes pcre_study() to be called after the expression has been
compiled, and the results used when the expression is matched.
-The /M option causes information about the size of memory block used to hold
+The /M modifier causes information about the size of the memory block used to
+hold the compiled pattern to be output.
-Finally, the /P option causes pcretest to call PCRE via the POSIX wrapper API
-rather than its native API. When this is done, all other options except /i and
-/m are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is set if /m
-is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and
-PCRE_DOTALL unless REG_NEWLINE is set.
+Finally, the /P modifier causes pcretest to call PCRE via the POSIX wrapper API
+rather than its native API. When this is done, all other modifiers except /i,
+/m, and /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is
+set if /m is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always,
+and PCRE_DOTALL unless REG_NEWLINE is set.
Before each data line is passed to pcre_exec(), leading and trailing whitespace
is removed, and it is then scanned for \ escapes. The following are recognized:
@@ -263,16 +300,38 @@ pcre_exec() returns, starting with number 0 for the string that matched the
whole pattern. Here is an example of an interactive pcretest run.
$ pcretest
- Testing Perl-Compatible Regular Expressions
- PCRE version 0.90 08-Sep-1997
+ PCRE version 2.06 08-Jun-1999
re> /^abc(\d+)/
data> abc123
- 0: abc123
- 1: 123
+ 0: abc123
+ 1: 123
data> xyz
No match
+If the strings contain any non-printing characters, they are output as \0x
+escapes. If the pattern has the /+ modifier, then the output for substring 0 is
+followed by the rest of the subject string, identified by "0+" like this:
+
+ re> /cat/+
+ data> cataract
+ 0: cat
+ 0+ aract
+
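The idea behind the "0+" line can be sketched in a few lines of Python. This
is only an analogy: Python's re module stands in for PCRE, and the helper name
match_with_remainder is hypothetical. It returns the matched substring
together with whatever follows it in the subject.

```python
import re


def match_with_remainder(pattern, subject):
    """Return (matched text, rest of subject after the match), or None."""
    m = re.search(pattern, subject)
    if m is None:
        return None
    return m.group(0), subject[m.end():]


result = match_with_remainder(r"cat", "cataract")
if result is None:
    print("No match")
else:
    matched, rest = result
    print(" 0: " + matched)   # the substring that matched the whole pattern
    print(" 0+ " + rest)      # the remainder of the subject string
```

For the subject "cataract" this prints "cat" on the first line and "aract" on
the second, mirroring the pcretest output shown above.
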
+If the pattern has the /g or /G modifier, the results of successive matching
+attempts are output in sequence, like this:
+
+ re> /\Bi(\w\w)/g
+ data> Mississippi
+ 0: iss
+ 1: ss
+ 0: iss
+ 1: ss
+ 0: ipp
+ 1: pp
+
+"No match" is output only if the first match attempt fails.
+
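The difference between searching from a new offset within the entire string
(/g) and searching a shortened substring (/G) can be illustrated with a small
Python sketch. This is an analogy under stated assumptions, not pcretest's
implementation: Python's re module stands in for PCRE, the pos argument of
Pattern.search() plays the role of the start_offset argument to pcre_exec(),
and both function names are hypothetical. The handling of empty matches is a
choice made for this sketch only.

```python
import re


def find_all_with_offset(pattern, subject):
    """Like /g: keep the whole subject and move the starting offset."""
    regex = re.compile(pattern)
    results, pos = [], 0
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        results.append(m.group(0))
        # Step past an empty match so that the loop always advances.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return results


def find_all_with_substring(pattern, subject):
    """Like /G: cut off everything up to the match end and search the rest."""
    regex = re.compile(pattern)
    results, rest = [], subject
    while True:
        m = regex.search(rest)
        if m is None:
            break
        results.append(m.group(0))
        # Step past an empty match so that the loop always advances.
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(rest):
            break
        rest = rest[cut:]
    return results


print(find_all_with_offset(r"\Bi(\w\w)", "Mississippi"))     # ['iss', 'iss', 'ipp']
print(find_all_with_substring(r"\Bi(\w\w)", "Mississippi"))  # ['iss', 'ipp']
```

With the offset approach, \B can still see the "s" that precedes the second
"iss", so that match is found. With the substring approach, the second "iss"
sits at the very start of the shortened string "issippi", where \B fails, so
in Python that match is skipped.
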
If any of \C, \G, or \L are present in a data line that is successfully
matched, the substrings extracted by the convenience functions are output with
C, G, or L after the string number instead of a colon. This is in addition to
@@ -313,21 +372,21 @@ The perltest program
The perltest program tests Perl's regular expressions; it has the same
specification as pcretest, and so can be given identical input, except that
-input patterns can be followed only by Perl's lower case options. The contents
-of testinput1 and testinput3 meet this condition.
+input patterns can be followed only by Perl's lower case modifiers. The
+contents of testinput1 and testinput3 meet this condition.
The data lines are processed as Perl strings, so if they contain $ or @
characters, these have to be escaped. For this reason, all such characters in
-the testinput file are escaped so that it can be used for perltest as well as
-for pcretest, and the special upper case options such as /A that pcretest
-recognizes are not used in this file. The output should be identical, apart
-from the initial identifying banner.
+testinput1 and testinput3 are escaped so that they can be used for perltest as
+well as for pcretest, and the special upper case modifiers such as /A that
+pcretest recognizes are not used in these files. The output should be
+identical, apart from the initial identifying banner.
-The testinput2 and testinput4 files are not suitable for feeding to Perltest,
-since they do make use of the special upper case options and escapes that
+The testinput2 and testinput4 files are not suitable for feeding to perltest,
+since they do make use of the special upper case modifiers and escapes that
pcretest uses to test some features of PCRE. The first of these files also
contains malformed regular expressions, in order to check that PCRE diagnoses
them correctly.
Philip Hazel <ph10@cam.ac.uk>
-April 1999
+June 1999