Diffstat (limited to 'doc/pcrepattern.3')
-rw-r--r-- doc/pcrepattern.3 | 34
1 files changed, 29 insertions, 5 deletions
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 250686a..8f2b960 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -971,11 +971,14 @@ that signifies the end of a line.
.rs
.sp
Outside a character class, the escape sequence \eC matches any one byte, both
-in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
+in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
characters. The feature is provided in Perl in order to match individual bytes
-in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the
-rest of the string may start with a malformed UTF-8 character. For this reason,
-the \eC escape sequence is best avoided.
+in UTF-8 mode, but it is unclear how it can usefully be used. Because \eC
+breaks up characters into individual bytes, matching one byte with \eC in UTF-8
+mode means that the rest of the string may start with a malformed UTF-8
+character. This has undefined results, because PCRE assumes that it is dealing
+with valid UTF-8 strings (and by default it checks this at the start of
+processing unless the PCRE_NO_UTF8_CHECK option is used).
.P
PCRE does not allow \eC to appear in lookbehind assertions
.\" HTML <a href="#lookbehind">
@@ -984,6 +987,27 @@ PCRE does not allow \eC to appear in lookbehind assertions
.\"
because in UTF-8 mode this would make it impossible to calculate the length of
the lookbehind.
+.P
+In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
+way of using it that avoids the problem of malformed UTF-8 characters is to
+use a lookahead to check the length of the next character, as in this pattern
+(ignore white space and line breaks):
+.sp
+ (?| (?=[\ex00-\ex7f])(\eC) |
+ (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
+ (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
+ (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
+.sp
+A group that starts with (?| resets the capturing parentheses numbers in each
+alternative (see
+.\" HTML <a href="#dupsubpatternnumber">
+.\" </a>
+"Duplicate Subpattern Numbers"
+.\"
+below). The assertions at the start of each branch check the next UTF-8
+character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
+character's individual bytes are then captured by the appropriate number of
+groups.
.
.
.\" HTML <a name="characterclass"></a>
@@ -2830,6 +2854,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 09 October 2011
+Last updated: 19 October 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi