Diffstat (limited to 'doc/pcre.txt')
-rw-r--r-- doc/pcre.txt 58
1 files changed, 47 insertions, 11 deletions
diff --git a/doc/pcre.txt b/doc/pcre.txt
index 1ec5f2c..ad6f3b2 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -118,10 +118,19 @@ UTF-8 SUPPORT
The following comments apply when PCRE is running in UTF-8
mode:
- 1. PCRE assumes that the strings it is given contain valid
- UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
- you pass invalid UTF-8 strings to PCRE, the results are
- undefined.
+ 1. When you set the PCRE_UTF8 flag, the strings passed as
+ patterns and subjects are checked for validity on entry to
+ the relevant functions. If an invalid UTF-8 string is
+ passed, an error return is given. In some situations, you
+ may already know that your strings are valid, and therefore
+ want to skip these checks in order to improve performance.
+ If you set the PCRE_NO_UTF8_CHECK flag at compile time or at
+ run time, PCRE assumes that the pattern or subject it is
+ given (respectively) contains only valid UTF-8 codes. In
+ this case, it does not diagnose an invalid UTF-8 string. If
+ you pass an invalid UTF-8 string to PCRE when
+ PCRE_NO_UTF8_CHECK is set, the results are undefined. Your
+ program may crash.
2. In a pattern, the escape sequence \x{...}, where the con-
tents of the braces is a string of hexadecimal digits, is
@@ -164,7 +173,7 @@ AUTHOR
Cambridge CB2 3QG, England.
Phone: +44 1223 334714
-Last updated: 04 February 2003
+Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------
@@ -654,6 +663,20 @@ COMPILING A PATTERN
option changes the behaviour of PCRE are given in the sec-
tion on UTF-8 support in the main pcre page.
+ PCRE_NO_UTF8_CHECK
+
+ When PCRE_UTF8 is set, the validity of the pattern as a
+ UTF-8 string is automatically checked. If an invalid UTF-8
+ sequence of bytes is found, pcre_compile() returns an error.
+ If you already know that your pattern is valid, and you want
+ to skip this check for performance reasons, you can set the
+ PCRE_NO_UTF8_CHECK option. When it is set, the effect of
+ passing an invalid UTF-8 string as a pattern is undefined.
+ It may cause your program to crash. Note that there is a
+ similar option for suppressing the checking of subject
+ strings passed to pcre_exec().
+
+
STUDYING A PATTERN
@@ -747,7 +770,6 @@ INFORMATION ABOUT A PATTERN
compiled pattern. It replaces the obsolete pcre_info() func-
tion, which is nevertheless retained for backwards compatibil-
ity (and is documented below).
-
The first argument for pcre_fullinfo() is a pointer to the
compiled pattern. The second argument is the result of
pcre_study(), or NULL if the pattern was not studied. The
@@ -1014,6 +1036,16 @@ MATCHING A PATTERN
turned out to be anchored by virtue of its contents, it can-
not be made unanchored at matching time.
+ When PCRE_UTF8 was set at compile time, the validity of the
+ subject as a UTF-8 string is automatically checked. If an
+ invalid UTF-8 sequence of bytes is found, pcre_exec()
+ returns the error PCRE_ERROR_BADUTF8. If you already know
+ that your subject is valid, and you want to skip this check
+ for performance reasons, you can set the PCRE_NO_UTF8_CHECK
+ option when calling pcre_exec(). When this option is set,
+ the effect of passing an invalid UTF-8 string as a subject
+ is undefined. It may cause your program to crash.
+
There are also three further options that can be set only at
matching time:
@@ -1103,7 +1135,6 @@ MATCHING A PATTERN
used for a fragment of a pattern that picks out a substring.
PCRE supports several other kinds of parenthesized subpat-
tern that do not cause substrings to be captured.
-
Captured substrings are returned to the caller via a vector
of integer offsets whose address is passed in ovector. The
number of elements in the vector is passed in ovecsize. The
@@ -1219,6 +1250,11 @@ MATCHING A PATTERN
distinctive error code. See the pcrecallout documentation
for details.
+ PCRE_ERROR_BADUTF8 (-10)
+
+ A string that contains an invalid UTF-8 byte sequence was
+ passed as a subject.
+
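Reviewer note: the hunks above describe a validity check that
PCRE_NO_UTF8_CHECK lets the caller skip. As a rough illustration of
what such a check involves, here is a minimal sketch in C. The name
utf8_is_valid() is a hypothetical helper, not a PCRE function, and it
follows the RFC 3629 rules (at most four bytes per character, no
overlong forms, no surrogates). PCRE's own check may accept a
different set of sequences, so treat this as an assumption-laden
example rather than a description of the library's code.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns 1 if the first len
   bytes of s are well-formed UTF-8 under RFC 3629 rules, else 0. */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest value allowed for this length */
        size_t extra;       /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80) {     /* single-byte (ASCII) character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < extra)
            return 0;       /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return 0;       /* overlong encoding */
        if (cp > 0x10FFFF)
            return 0;       /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;       /* UTF-16 surrogate, not a character */

        i += extra + 1;
    }
    return 1;
}
```

A caller that has validated its data once with a check of this kind
could then reasonably set PCRE_NO_UTF8_CHECK on repeated calls to
pcre_exec() over the same subject, which is the situation the new
text has in mind.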
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
@@ -1255,7 +1291,6 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
returned zero, indicating that it ran out of space in ovec-
tor, the value passed as stringcount should be the size of
the vector divided by three.
-
The functions pcre_copy_substring() and pcre_get_substring()
extract a single substring, whose number is given as string-
number. A value of zero extracts the substring that matched
@@ -1352,7 +1387,7 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
succeeds, they then call pcre_copy_substring() or
pcre_get_substring(), as appropriate.
-Last updated: 03 February 2003
+Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------
@@ -1420,8 +1455,9 @@ PCRE CALLOUTS
The current_position field contains the offset within the
subject of the current match pointer.
- The capture_top field contains the number of the highest
- captured substring so far.
+ The capture_top field contains one more than the number of
+ the highest numbered captured substring so far. If no sub-
+ strings have been captured, the value of capture_top is one.
The capture_last field contains the number of the most
recently captured substring.
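
Reviewer note: the corrected wording for capture_top fixes an
off-by-one, so a loop over the substrings captured so far runs from 1
up to, but not including, capture_top. The sketch below shows that
bound in C. struct callout_view is a hypothetical stand-in holding
only the fields used here, not the real callout block, and the code
assumes the offset-pair layout that pcre_exec() uses for ovector, with
a negative start offset marking a group that has not been set.

```c
/* Hypothetical stand-in for the callout block fields used below. */
struct callout_view {
    const int *offset_vector;  /* pairs of start/end offsets */
    int capture_top;           /* one more than highest capture number */
};

/* Number of the highest captured substring so far, or 0 if none. */
static int highest_capture(const struct callout_view *cb)
{
    return cb->capture_top - 1;
}

/* Count the capturing groups in 1..capture_top-1 that are set. */
static int count_set_captures(const struct callout_view *cb)
{
    int i;
    int n = 0;

    for (i = 1; i < cb->capture_top; i++) {
        if (cb->offset_vector[2 * i] >= 0)
            n++;               /* group i has a start offset */
    }
    return n;
}
```

With no substrings captured, capture_top is one, so highest_capture()
gives zero and the loop body never runs.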