Diffstat (limited to 'doc/pcre.txt')
-rw-r--r-- | doc/pcre.txt | 58 |
1 files changed, 47 insertions, 11 deletions
diff --git a/doc/pcre.txt b/doc/pcre.txt
index 1ec5f2c..ad6f3b2 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -118,10 +118,19 @@ UTF-8 SUPPORT
  The following comments apply when PCRE is running in UTF-8
  mode:
 
- 1. PCRE assumes that the strings it is given contain valid
- UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
- you pass invalid UTF-8 strings to PCRE, the results are
- undefined.
+ 1. When you set the PCRE_UTF8 flag, the strings passed as
+ patterns and subjects are checked for validity on entry to
+ the relevant functions. If an invalid UTF-8 string is
+ passed, an error return is given. In some situations, you
+ may already know that your strings are valid, and therefore
+ want to skip these checks in order to improve performance.
+ If you set the PCRE_NO_UTF8_CHECK flag at compile time or at
+ run time, PCRE assumes that the pattern or subject it is
+ given (respectively) contains only valid UTF-8 codes. In
+ this case, it does not diagnose an invalid UTF-8 string. If
+ you pass an invalid UTF-8 string to PCRE when
+ PCRE_NO_UTF8_CHECK is set, the results are undefined. Your
+ program may crash.
 
  2. In a pattern, the escape sequence \x{...}, where the con-
  tents of the braces is a string of hexadecimal digits, is
@@ -164,7 +173,7 @@ AUTHOR
  Cambridge CB2 3QG, England.
  Phone: +44 1223 334714
 
-Last updated: 04 February 2003
+Last updated: 20 August 2003
 Copyright (c) 1997-2003 University of Cambridge.
 -----------------------------------------------------------------------------
 
@@ -654,6 +663,20 @@ COMPILING A PATTERN
  option changes the behaviour of PCRE are given in the sec-
  tion on UTF-8 support in the main pcre page.
 
+ PCRE_NO_UTF8_CHECK
+
+ When PCRE_UTF8 is set, the validity of the pattern as a
+ UTF-8 string is automatically checked. If an invalid UTF-8
+ sequence of bytes is found, pcre_compile() returns an error.
+ If you already know that your pattern is valid, and you want
+ to skip this check for performance reasons, you can set the
+ PCRE_NO_UTF8_CHECK option. When it is set, the effect of
+ passing an invalid UTF-8 string as a pattern is undefined.
+ It may cause your program to crash. Note that there is a
+ similar option for suppressing the checking of subject
+ strings passed to pcre_exec().
+
+
 
 STUDYING A PATTERN
 
@@ -747,7 +770,6 @@ INFORMATION ABOUT A PATTERN
  compiled pattern. It replaces the obsolete pcre_info() func-
  tion, which is nevertheless retained for backwards compabil-
  ity (and is documented below).
-
  The first argument for pcre_fullinfo() is a pointer to the
  compiled pattern. The second argument is the result of
  pcre_study(), or NULL if the pattern was not studied. The
@@ -1014,6 +1036,16 @@ MATCHING A PATTERN
  turned out to be anchored by virtue of its contents, it can-
  not be made unachored at matching time.
 
+ When PCRE_UTF8 was set at compile time, the validity of the
+ subject as a UTF-8 string is automatically checked. If an
+ invalid UTF-8 sequence of bytes is found, pcre_exec()
+ returns the error PCRE_ERROR_BADUTF8. If you already know
+ that your subject is valid, and you want to skip this check
+ for performance reasons, you can set the PCRE_NO_UTF8_CHECK
+ option when calling pcre_exec(). When this option is set,
+ the effect of passing an invalid UTF-8 string as a subject
+ is undefined. It may cause your program to crash.
+
  There are also three further options that can be set only at
  matching time:
 
@@ -1103,7 +1135,6 @@ MATCHING A PATTERN
  used for a fragment of a pattern that picks out a substring.
  PCRE supports several other kinds of parenthesized subpat-
  tern that do not cause substrings to be captured.
-
  Captured substrings are returned to the caller via a vector
  of integer offsets whose address is passed in ovector. The
  number of elements in the vector is passed in ovecsize. The
@@ -1219,6 +1250,11 @@ MATCHING A PATTERN
  distinctive error code. See the pcrecallout documentation
  for details.
 
+ PCRE_ERROR_BADUTF8 (-10)
+
+ A string that contains an invalid UTF-8 byte sequence was
+ passed as a subject.
+
 
 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
 
@@ -1255,7 +1291,6 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
  returned zero, indicating that it ran out of space in ovec-
  tor, the value passed as stringcount should be the size of
  the vector divided by three.
-
  The functions pcre_copy_substring() and pcre_get_substring()
  extract a single substring, whose number is given as string-
  number. A value of zero extracts the substring that matched
@@ -1352,7 +1387,7 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
  succeeds, they then call pcre_copy_substring() or
  pcre_get_substring(), as appropriate.
 
-Last updated: 03 February 2003
+Last updated: 20 August 2003
 Copyright (c) 1997-2003 University of Cambridge.
 -----------------------------------------------------------------------------
 
@@ -1420,8 +1455,9 @@ PCRE CALLOUTS
  The current_position field contains the offset within the
  subject of the current match pointer.
 
- The capture_top field contains the number of the highest
- captured substring so far.
+ The capture_top field contains one more than the number of
+ the highest numbered captured substring so far. If no sub-
+ strings have been captured, the value of capture_top is one.
 
  The capture_last field contains the number of the most
  recently captured substring.
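
The PCRE_NO_UTF8_CHECK text added by this diff says the flag is only safe when the caller already knows the string is valid UTF-8. One way a caller might establish that is to validate each string once, up front, and skip PCRE's check afterwards. The sketch below is an illustration only: is_valid_utf8() is a hypothetical helper, not part of PCRE, and it follows the RFC 3629 rules (at most 4 bytes per character, no overlong forms, no surrogates), which may not match the check PCRE itself performs.

```c
#include <stddef.h>

/*
 * Hypothetical helper, NOT part of the PCRE API.
 * Returns 1 if the first len bytes of s are well-formed UTF-8
 * according to RFC 3629, and 0 otherwise.
 */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;   /* decoded code point */
        size_t extra;       /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80) {     /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        if (extra > len - i - 1)
            return 0;       /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        /* Reject overlong encodings, surrogates, and out-of-range values. */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) ||
            cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += extra + 1;
    }
    return 1;
}
```

Under these assumptions, a program that matches the same subject many times could call the helper once and pass PCRE_NO_UTF8_CHECK to pcre_exec() only for subjects it accepted. For any other subject it would leave the flag unset, so that an invalid string produces PCRE_ERROR_BADUTF8 rather than undefined behaviour.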