author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2011-10-19 17:37:29 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2011-10-19 17:37:29 +0000
commit 1b47bd07348a218b1120dcfb84b50f4e966994db
tree 510361158dbdd6e6217207a2abe58e01dd01d8a6
parent 6f209d5f91b2eaaedefbbd9093f992e13cdf2d98
Add more about \C to documentation.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@737 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- doc/pcrejit.3 | 4
-rw-r--r-- doc/pcrepattern.3 | 34
-rw-r--r-- doc/pcreunicode.3 | 17
3 files changed, 42 insertions, 13 deletions
diff --git a/doc/pcrejit.3 b/doc/pcrejit.3
index 1ff3c98..d3f5bf0 100644
--- a/doc/pcrejit.3
+++ b/doc/pcrejit.3
@@ -95,7 +95,7 @@ supported.
.P
The unsupported pattern items are:
.sp
- \eC match a single byte, even in UTF-8 mode
+ \eC match a single byte; not supported in UTF-8 mode
(?Cn) callouts
(?(<name>)... conditional test on setting of a named subpattern
(?(R)... conditional test on whole pattern recursion
@@ -267,6 +267,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 05 October 2011
+Last updated: 19 October 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 250686a..8f2b960 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -971,11 +971,14 @@ that signifies the end of a line.
.rs
.sp
Outside a character class, the escape sequence \eC matches any one byte, both
-in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
+in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
characters. The feature is provided in Perl in order to match individual bytes
-in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the
-rest of the string may start with a malformed UTF-8 character. For this reason,
-the \eC escape sequence is best avoided.
+in UTF-8 mode, but it is unclear how it can usefully be used. Because \eC
+breaks up characters into individual bytes, matching one byte with \eC in UTF-8
+mode means that the rest of the string may start with a malformed UTF-8
+character. This has undefined results, because PCRE assumes that it is dealing
+with valid UTF-8 strings (and by default it checks this at the start of
+processing unless the PCRE_NO_UTF8_CHECK option is used).
.P
PCRE does not allow \eC to appear in lookbehind assertions
.\" HTML <a href="#lookbehind">
@@ -984,6 +987,27 @@ PCRE does not allow \eC to appear in lookbehind assertions
.\"
because in UTF-8 mode this would make it impossible to calculate the length of
the lookbehind.
+.P
+In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
+way of using it that avoids the problem of malformed UTF-8 characters is to
+use a lookahead to check the length of the next character, as in this pattern
+(ignore white space and line breaks):
+.sp
+ (?| (?=[\ex00-\ex7f])(\eC) |
+ (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
+ (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
+ (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
+.sp
+A group that starts with (?| resets the capturing parentheses numbers in each
+alternative (see
+.\" HTML <a href="#dupsubpatternnumber">
+.\" </a>
+"Duplicate Subpattern Numbers"
+.\"
+below). The assertions at the start of each branch check the next UTF-8
+character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
+character's individual bytes are then captured by the appropriate number of
+groups.
.
.
.\" HTML <a name="characterclass"></a>
@@ -2830,6 +2854,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 09 October 2011
+Last updated: 19 October 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/doc/pcreunicode.3 b/doc/pcreunicode.3
index 0aba50a..b805a64 100644
--- a/doc/pcreunicode.3
+++ b/doc/pcreunicode.3
@@ -100,11 +100,16 @@ bytes, for example: \ex{100}{3}.
4. The dot metacharacter matches one UTF-8 character instead of a single byte.
.P
5. The escape sequence \eC can be used to match a single byte in UTF-8 mode,
-but its use can lead to some strange effects. This facility is not available in
-the alternative matching function, \fBpcre_dfa_exec()\fP, nor is it supported
-by the JIT optimization of \fBpcre_exec()\fP. If JIT optimization is requested
-for a pattern that contains \eC, it will not succeed, and so the matching will
-be carried out by the normal interpretive function.
+but its use can lead to some strange effects because it breaks up multibyte
+characters (see the description of \eC in the
+.\" HREF
+\fBpcrepattern\fP
+.\"
+documentation). The use of \eC is not supported in the alternative matching
+function \fBpcre_dfa_exec()\fP, nor is it supported in UTF-8 mode by the JIT
+optimization of \fBpcre_exec()\fP. If JIT optimization is requested for a UTF-8
+pattern that contains \eC, it will not succeed, and so the matching will be
+carried out by the normal interpretive function.
.P
6. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly
test characters of any code value, but, by default, the characters that PCRE
@@ -158,6 +163,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 06 September 2011
+Last updated: 19 October 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
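
The lookahead pattern added to pcrepattern.3 relies on the code point ranges that UTF-8 encodes in 1, 2, 3, or 4 bytes. The following is a minimal sketch of that same range check in Python, not PCRE code: Python's re module has no \C and no (?| group, so the helper names utf8_length and split_bytes are hypothetical and exist only to illustrate the ranges. The upper bound 0x1FFFFF mirrors the pattern in the patch; Python strings themselves stop at 0x10FFFF.

```python
from typing import List


def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for code_point.

    The four ranges are the ones tested by the lookaheads in the patch.
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point outside the ranges used in the pattern")


def split_bytes(text: str) -> List[List[int]]:
    """Group the UTF-8 bytes of text character by character.

    This imitates what the capture groups in each branch of the pattern
    collect: one list of byte values per character. Lone surrogates are
    not handled, because str.encode rejects them.
    """
    data = text.encode("utf-8")
    groups = []
    pos = 0
    for ch in text:
        n = utf8_length(ord(ch))
        groups.append(list(data[pos:pos + n]))
        pos += n
    return groups
```

Calling split_bytes on a string that mixes ASCII and multibyte characters gives one byte list per character, each with the length predicted by the matching range, which is the property that keeps the rest of the subject string aligned on a character boundary.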