Diffstat (limited to 'doc/pcrepattern.3')
-rw-r--r-- | doc/pcrepattern.3 | 34 |
1 files changed, 29 insertions, 5 deletions
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 250686a..8f2b960 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -971,11 +971,14 @@ that signifies the end of a line.
 .rs
 .sp
 Outside a character class, the escape sequence \eC matches any one byte, both
-in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
+in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
 characters. The feature is provided in Perl in order to match individual bytes
-in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the
-rest of the string may start with a malformed UTF-8 character. For this reason,
-the \eC escape sequence is best avoided.
+in UTF-8 mode, but it is unclear how it can usefully be used. Because \eC
+breaks up characters into individual bytes, matching one byte with \eC in UTF-8
+mode means that the rest of the string may start with a malformed UTF-8
+character. This has undefined results, because PCRE assumes that it is dealing
+with valid UTF-8 strings (and by default it checks this at the start of
+processing unless the PCRE_NO_UTF8_CHECK option is used).
 .P
 PCRE does not allow \eC to appear in lookbehind assertions
 .\" HTML <a href="#lookbehind">
@@ -984,6 +987,27 @@ PCRE does not allow \eC to appear in lookbehind assertions
 .\"
 because in UTF-8 mode this would make it impossible to calculate the length of
 the lookbehind.
+.P
+In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
+way of using it that avoids the problem of malformed UTF-8 characters is to
+use a lookahead to check the length of the next character, as in this pattern
+(ignore white space and line breaks):
+.sp
+  (?| (?=[\ex00-\ex7f])(\eC) |
+      (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
+      (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
+      (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
+.sp
+A group that starts with (?| resets the capturing parentheses numbers in each
+alternative (see
+.\" HTML <a href="#dupsubpatternnumber">
+.\" </a>
+"Duplicate Subpattern Numbers"
+.\"
+below). The assertions at the start of each branch check the next UTF-8
+character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
+character's individual bytes are then captured by the appropriate number of
+groups.
 .
 .
 .\" HTML <a name="characterclass"></a>
@@ -2830,6 +2854,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 09 October 2011
+Last updated: 19 October 2011
 Copyright (c) 1997-2011 University of Cambridge.
 .fi
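
Note on the added pattern: the four lookahead branches map code point ranges to UTF-8 encoded lengths of 1, 2, 3, and 4 bytes. The sketch below restates that mapping in Python as a rough illustration only. It does not use PCRE (Python's re module has neither \C nor the (?| branch-reset group), and the names BRANCHES, utf8_length_by_branch and split_bytes are hypothetical helpers invented for this note. The ranges are copied from the character classes in the pattern; the last one runs to 0x1fffff, which is beyond the Unicode limit of 0x10ffff that Python enforces.

```python
# Illustrative only: mirrors the four branches of the pattern in the diff.
# Each tuple is (lowest code point, highest code point, UTF-8 byte count),
# taken from the character classes [\x00-\x7f], [\x80-\x{7ff}],
# [\x{800}-\x{ffff}] and [\x{10000}-\x{1fffff}].
BRANCHES = [
    (0x00, 0x7F, 1),
    (0x80, 0x7FF, 2),
    (0x800, 0xFFFF, 3),
    (0x10000, 0x1FFFFF, 4),
]


def utf8_length_by_branch(ch):
    """Return the byte count of the branch whose range contains ch."""
    code_point = ord(ch)
    for low, high, nbytes in BRANCHES:
        if low <= code_point <= high:
            return nbytes
    raise ValueError("code point outside all four branches")


def split_bytes(ch):
    """Split one character into its UTF-8 bytes, one per would-be capture group."""
    encoded = ch.encode("utf-8")
    nbytes = utf8_length_by_branch(ch)
    # Sanity check: the range table must agree with the real encoder.
    if nbytes != len(encoded):
        raise AssertionError("branch table disagrees with UTF-8 encoder")
    return [encoded[i:i + 1] for i in range(nbytes)]
```

For example, split_bytes("A") yields a single byte, while a character in the third range such as the euro sign (U+20AC) yields three, matching the three (\eC) groups of the third branch. Surrogate code points (U+D800 to U+DFFF) fall inside the third range but cannot be encoded by Python's UTF-8 codec, so this sketch does not handle them.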