Diffstat (limited to 'doc/pcre.txt')
-rw-r--r-- doc/pcre.txt 216
1 files changed, 182 insertions, 34 deletions
diff --git a/doc/pcre.txt b/doc/pcre.txt
index b8106e4..29cc490 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -28,6 +28,10 @@ SYNOPSIS
int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr);
+ void pcre_free_substring(const char *stringptr);
+
+ void pcre_free_substring_list(const char **stringptr);
+
const unsigned char *pcre_maketables(void);
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
@@ -48,9 +52,12 @@ DESCRIPTION
The PCRE library is a set of functions that implement regu-
lar expression pattern matching using the same syntax and
semantics as Perl 5, with just a few differences (see
+
below). The current implementation corresponds to Perl
- 5.005, with some additional features from the Perl develop-
- ment release.
+ 5.005, with some additional features from later versions.
+ This includes some experimental, incomplete support for
+ UTF-8 encoded strings. Details of exactly what is and what
+ is not supported are given below.
PCRE has its own native API, which is described in this
document. There is also a set of wrapper functions that
@@ -67,13 +74,18 @@ DESCRIPTION
releases.
The functions pcre_compile(), pcre_study(), and pcre_exec()
- are used for compiling and matching regular expressions,
- while pcre_copy_substring(), pcre_get_substring(), and
- pcre_get_substring_list() are convenience functions for
+ are used for compiling and matching regular expressions.
+
+ The functions pcre_copy_substring(), pcre_get_substring(),
+ and pcre_get_substring_list() are convenience functions for
extracting captured substrings from a matched subject
- string. The function pcre_maketables() is used (optionally)
- to build a set of character tables in the current locale for
- passing to pcre_compile().
+ string; pcre_free_substring() and pcre_free_substring_list()
+ are also provided, to free the memory used for extracted
+ strings.
+
+ The function pcre_maketables() is used (optionally) to build
+ a set of character tables in the current locale for passing
+ to pcre_compile().
The function pcre_fullinfo() is used to find out information
about a compiled pattern; pcre_info() is an obsolete version
@@ -92,10 +104,19 @@ DESCRIPTION
MULTI-THREADING
- The PCRE functions can be used in multi-threading applica-
- tions, with the proviso that the memory management functions
- pointed to by pcre_malloc and pcre_free are shared by all
- threads.
+ The PCRE functions can be used in multi-threading
+
+
+
+
+
+SunOS 5.8 Last change: 2
+
+
+
+ applications, with the proviso that the memory management
+ functions pointed to by pcre_malloc and pcre_free are shared
+ by all threads.
The compiled form of a regular expression is not altered
during matching, so the same compiled pattern can safely be
@@ -103,7 +124,6 @@ MULTI-THREADING
-
COMPILING A PATTERN
The function pcre_compile() is called to compile a pattern
into an internal form. The pattern is a C string terminated
@@ -235,12 +255,23 @@ COMPILING A PATTERN
followed by "?". It is not compatible with Perl. It can also
be set by a (?U) option setting within the pattern.
+ PCRE_UTF8
+
+ This option causes PCRE to regard both the pattern and the
+ subject as strings of UTF-8 characters instead of just byte
+ strings. However, it is available only if PCRE has been
+ built to include UTF-8 support. If not, the use of this
+ option provokes an error. Support for UTF-8 is new, experi-
+ mental, and incomplete. Details of exactly what it entails
+ are given below.
+
STUDYING A PATTERN
When a pattern is going to be used several times, it is
worth spending more time analyzing it in order to speed up
the time taken for matching. The function pcre_study() takes
+
a pointer to a compiled pattern as its first argument, and
returns a pointer to a pcre_extra block (another void
typedef) containing additional information about the pat-
@@ -344,9 +375,9 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_BACKREFMAX
- Return the number of the highest back reference in the pat-
- tern. The fourth argument should point to an int variable.
- Zero is returned if there are no back references.
+ Return the number of the highest back reference in the
+ pattern. The fourth argument should point to an int vari-
+ able. Zero is returned if there are no back references.
PCRE_INFO_FIRSTCHAR
@@ -605,6 +636,15 @@ MATCHING A PATTERN
EXTRACTING CAPTURED SUBSTRINGS
Captured substrings can be accessed directly by using the
+
+
+
+
+
+SunOS 5.8 Last change: 12
+
+
+
offsets returned by pcre_exec() in ovector. For convenience,
the functions pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() are provided for extracting
@@ -631,7 +671,7 @@ EXTRACTING CAPTURED SUBSTRINGS
the entire pattern, while higher values extract the captured
substrings. For pcre_copy_substring(), the string is placed
in buffer, whose length is given by buffersize, while for
- pcre_get_substring() a new block of store is obtained via
+ pcre_get_substring() a new block of memory is obtained via
pcre_malloc, and its address is returned via stringptr. The
yield of the function is the length of the string, not
including the terminating zero, or one of
@@ -665,6 +705,16 @@ EXTRACTING CAPTURED SUBSTRINGS
inspecting the appropriate offset in ovector, which is nega-
tive for unset substrings.
+ The two convenience functions pcre_free_substring() and
+ pcre_free_substring_list() can be used to free the memory
+ returned by a previous call of pcre_get_substring() or
+ pcre_get_substring_list(), respectively. They do nothing
+ more than call the function pointed to by pcre_free, which
+ of course could be called directly from a C program. How-
+ ever, PCRE is used in some situations where it is linked via
+ a special interface to another programming language which
+ cannot use pcre_free directly; it is for these cases that
+ the functions are provided.
@@ -733,6 +783,7 @@ DIFFERENCES FROM PERL
(?p{code}) constructions. However, there is some experimen-
tal support for recursive patterns using the non-Perl item
(?R).
+
8. There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
when part of a pattern is repeated. For example, matching
@@ -785,11 +836,17 @@ REGULAR EXPRESSION DETAILS
The syntax and semantics of the regular expressions sup-
ported by PCRE are described below. Regular expressions are
also described in the Perl documentation and in a number of
-
other books, some of which have copious examples. Jeffrey
Friedl's "Mastering Regular Expressions", published by
- O'Reilly (ISBN 1-56592-257), covers them in great detail.
+ O'Reilly (ISBN 1-56592-257), covers them in great detail.
+
The description here is intended as reference documentation.
+ The basic operation of PCRE is on strings of bytes. However,
+ there are the beginnings of some support for UTF-8 character
+ strings. To use this support you must configure PCRE to
+ include it, and then call pcre_compile() with the PCRE_UTF8
+ option. How this affects the pattern matching is described
+ in the final section of this document.
A regular expression is a pattern that is matched against a
subject string from left to right. Most characters stand for
@@ -1004,6 +1061,7 @@ CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the
circumflex character is an assertion which is true only if
the current matching point is at the start of the subject
+
string. If the startoffset argument of pcre_exec() is non-
zero, circumflex can never match. Inside a character class,
circumflex has an entirely different meaning (see below).
@@ -1056,6 +1114,7 @@ FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing char-
acter, but not (by default) newline. If the PCRE_DOTALL
+
option is set, dots match newlines as well. The handling of
dot is entirely independent of the handling of circumflex
and dollar, the only relationship being that they both
@@ -1517,18 +1576,19 @@ BACK REFERENCES
A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, (a\1) never matches. However, such references can
- be useful inside repeated subpatterns. For example, the
- pattern
+ be useful inside repeated subpatterns. For example, the pat-
+ tern
(a|b\1)+
- matches any number of "a"s and also "aba", "ababaa" etc. At
+ matches any number of "a"s and also "aba", "ababbaa" etc. At
each iteration of the subpattern, the back reference matches
- the character string corresponding to the previous itera-
- tion. In order for this to work, the pattern must be such
- that the first iteration does not need to match the back
- reference. This can be done using alternation, as in the
- example above, or by a quantifier with a minimum of zero.
+ the character string corresponding to the previous
+ iteration. In order for this to work, the pattern must be
+ such that the first iteration does not need to match the
+ back reference. This can be done using alternation, as in
+ the example above, or by a quantifier with a minimum of
+ zero.
@@ -1681,9 +1741,9 @@ ONCE-ONLY SUBPATTERNS
This kind of parenthesis "locks up" the part of the pattern
it contains once it has matched, and a failure further into
- the pattern is prevented from backtracking into it. Back-
- tracking past it to previous items, however, works as nor-
- mal.
+ the pattern is prevented from backtracking into it.
+ Backtracking past it to previous items, however, works as
+ normal.
An alternative description is that a subpattern of this type
matches the string of characters that an identical stan-
@@ -1941,9 +2001,9 @@ PERFORMANCE
repeat can match 0, 1, 2, 3, or 4 times, and for each of
those cases other than 0, the + repeats can match different
numbers of times.) When the remainder of the pattern is such
- that the entire match is going to fail, PCRE has in princi-
- ple to try every possible variation, and this can take an
- extremely long time.
+ that the entire match is going to fail, PCRE has in
+ principle to try every possible variation, and this can take
+ an extremely long time.
An optimization catches some of the more simple cases such
as
@@ -1966,6 +2026,93 @@ PERFORMANCE
+UTF-8 SUPPORT
+ Starting at release 3.3, PCRE has some support for character
+ strings encoded in the UTF-8 format. This is incomplete, and
+ is regarded as experimental. In order to use it, you must
+ configure PCRE to include UTF-8 support in the code, and, in
+ addition, you must call pcre_compile() with the PCRE_UTF8
+ option flag. When you do this, both the pattern and any sub-
+ ject strings that are matched against it are treated as
+ UTF-8 strings instead of just strings of bytes, but only in
+ the cases that are mentioned below.
+
+ If you compile PCRE with UTF-8 support, but do not use it at
+ run time, the library will be a bit bigger, but the addi-
+ tional run time overhead is limited to testing the PCRE_UTF8
+ flag in several places, so should not be very large.
+
+ PCRE assumes that the strings it is given contain valid
+ UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
+ you pass invalid UTF-8 strings to PCRE, the results are
+ undefined.
+
+ Running with PCRE_UTF8 set causes these changes in the way
+ PCRE works:
+
+ 1. In a pattern, the escape sequence \x{...}, where the con-
+ tents of the braces is a string of hexadecimal digits, is
+ interpreted as a UTF-8 character whose code number is the
+ given hexadecimal number, for example: \x{1234}. This
+ inserts from one to six literal bytes into the pattern,
+ using the UTF-8 encoding. If a non-hexadecimal digit appears
+ between the braces, the item is not recognized.
+
+ 2. The original hexadecimal escape sequence, \xhh, generates
+ a two-byte UTF-8 character if its value is greater than 127.
+
+ 3. Repeat quantifiers are NOT correctly handled if they fol-
+ low a multibyte character. For example, \x{100}* and \xc3+
+ do not work. If you want to repeat such characters, you must
+ enclose them in non-capturing parentheses, for example
+ (?:\x{100}), at present.
+
+ 4. The dot metacharacter matches one UTF-8 character instead
+ of a single byte.
+
+ 5. Unlike literal UTF-8 characters, the dot metacharacter
+ followed by a repeat quantifier does operate correctly on
+ UTF-8 characters instead of single bytes.
+
+ 6. Although the \x{...} escape is permitted in a character
+ class, characters whose values are greater than 255 cannot
+ be included in a class.
+
+ 7. A class is matched against a UTF-8 character instead of
+ just a single byte, but it can match only characters whose
+ values are less than 256. Characters with greater values
+ always fail to match a class.
+
+ 8. Repeated classes work correctly on multiple characters.
+
+ 9. Classes containing just a single character whose value is
+ greater than 127 (but less than 256), for example, [\x80] or
+ [^\x{93}], do not work because these are optimized into sin-
+ gle byte matches. In the first case, of course, the class
+ brackets are just redundant.
+
+ 10. Lookbehind assertions move backwards in the subject by a
+ fixed number of characters instead of a fixed number of
+ bytes. Simple cases have been tested to work correctly, but
+ there may be hidden gotchas herein.
+
+ 11. The character types such as \d and \w do not work
+ correctly with UTF-8 characters. They continue to test a
+ single byte.
+
+ 12. Anything not explicitly mentioned here continues to work
+ in bytes rather than in characters.
+
+ The following UTF-8 features of Perl 5.6 are not imple-
+ mented:
+
+ 1. The escape sequence \C to match a single byte.
+
+ 2. The use of Unicode tables and properties and escapes \p,
+ \P, and \X.
+
+
+
AUTHOR
Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
@@ -1973,5 +2120,6 @@ AUTHOR
Cambridge CB2 3QG, England.
Phone: +44 1223 334714
- Last updated: 27 January 2000
+ Last updated: 28 August 2000,
+ the 250th anniversary of the death of J.S. Bach.
Copyright (c) 1997-2000 University of Cambridge.