1 files changed, 29 insertions, 5 deletions
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 250686a..8f2b960 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -971,11 +971,14 @@ that signifies the end of a line.
 .rs
 .sp
 Outside a character class, the escape sequence \eC matches any one byte, both
-in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
+in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
 characters. The feature is provided in Perl in order to match individual bytes
-in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the
-rest of the string may start with a malformed UTF-8 character. For this reason,
-the \eC escape sequence is best avoided.
+in UTF-8 mode, but it is unclear how it can be put to good use. Because \eC
+breaks up characters into individual bytes, matching one byte with \eC in UTF-8
+mode means that the rest of the string may start with a malformed UTF-8
+character. This has undefined results, because PCRE assumes that it is dealing
+with valid UTF-8 strings (and by default it checks this at the start of
+processing unless the PCRE_NO_UTF8_CHECK option is used).
 .P
 PCRE does not allow \eC to appear in lookbehind assertions
 .\" HTML <a href="#lookbehind">
@@ -984,6 +987,27 @@ PCRE does not allow \eC to appear in lookbehind assertions
 .\"
 because in UTF-8 mode this would make it impossible to calculate the length of
 the lookbehind.
+.P
+In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
+way of using it that avoids the problem of malformed UTF-8 characters is to
+use a lookahead to check the length of the next character, as in this pattern
+(ignore white space and line breaks):
+.sp
+  (?| (?=[\ex00-\ex7f])(\eC) |
+      (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
+      (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
+      (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
+.sp
+A group that starts with (?| resets the capturing parentheses numbers in each
+alternative (see
+.\" HTML <a href="#dupsubpatternnumber">
+.\" </a>
+"Duplicate Subpattern Numbers"
+.\"
+below). The assertions at the start of each branch check the next UTF-8
+character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
+character's individual bytes are then captured by the appropriate number of
+groups.
 .
 .
 .\" HTML <a name="characterclass"></a>
@@ -2830,6 +2854,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 09 October 2011
+Last updated: 19 October 2011
 Copyright (c) 1997-2011 University of Cambridge.
 .fi