author: Karl Williamson <public@khwilliamson.com> 2013-02-19 15:13:19 -0700
committer: Karl Williamson <public@khwilliamson.com> 2013-08-29 09:55:52 -0600
commit: e7214ce8dd2816e52abdfe522e7ff5adc81ba23e
tree: b0aae3d2fe1253cb4d4269fd6be1d6186b7ff9f4 /utf8.h
parent: 069727664af50f1d767b5928bad25bcb51f0644c
Use real illegal UTF-8 byte
The code here was wrong in assuming that \xFF is not legal in UTF-8 encoded strings. It currently doesn't work due to a bug, but that may eventually be fixed: [perl #116867]. The comments are also wrong that all bytes are legal in UTF-EBCDIC.

It turns out that in well-formed UTF-8, the bytes C0 and C1 never appear (C2, C3, and C4 as well in UTF-EBCDIC), as they would be the start byte of an illegal overlong sequence. This creates a #define for an illegal byte using one of the real illegal ones, and changes the code to use that. No test is included due to #116867.
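The claim about C0 and C1 can be checked with a little arithmetic. The following is a minimal sketch for ASCII platforms only; it is not code from the Perl source, and the helper names decode_two_byte and start_byte_only_overlong are hypothetical. It does not cover the UTF-EBCDIC case, where the byte goes through I8_TO_NATIVE_UTF8.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: decode a two-byte UTF-8 sequence on an ASCII
 * platform without validating it.  The start byte contributes its low
 * 5 bits, the continuation byte its low 6 bits. */
static uint32_t decode_two_byte(uint8_t start, uint8_t cont)
{
    return ((uint32_t)(start & 0x1F) << 6) | (uint32_t)(cont & 0x3F);
}

/* Hypothetical helper: true if every continuation byte (0x80..0xBF)
 * paired with 'start' decodes to a code point below 0x80.  Such code
 * points already have a one-byte encoding, so every two-byte sequence
 * beginning with 'start' is overlong and the byte cannot appear in
 * well-formed UTF-8. */
static bool start_byte_only_overlong(uint8_t start)
{
    for (unsigned cont = 0x80; cont <= 0xBF; cont++) {
        if (decode_two_byte(start, (uint8_t)cont) >= 0x80)
            return false;
    }
    return true;
}
```

With these definitions, start_byte_only_overlong(0xC0) and start_byte_only_overlong(0xC1) both hold, because the largest value either byte can introduce is 0x7F. 0xC2 fails the check because C2 80 decodes to 0x80, which has no shorter encoding.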
Diffstat (limited to 'utf8.h')
-rw-r--r-- utf8.h 4
1 files changed, 4 insertions, 0 deletions
diff --git a/utf8.h b/utf8.h
index bbbefdef70..b76f098fe4 100644
--- a/utf8.h
+++ b/utf8.h
@@ -349,6 +349,10 @@ Perl's extended UTF-8 means we can have start bytes up to FF.
#define UTF8_EIGHT_BIT_HI(c) UTF8_TWO_BYTE_HI((U8)(c))
#define UTF8_EIGHT_BIT_LO(c) UTF8_TWO_BYTE_LO((U8)(c))
+/* This is illegal in any well-formed UTF-8 in both EBCDIC and ASCII
+ * as it is only in overlongs. */
+#define ILLEGAL_UTF8_BYTE I8_TO_NATIVE_UTF8(0xC1)
+
/*
* 'UTF' is whether or not p is encoded in UTF8. The names 'foo_lazy_if' stem
* from an earlier version of these macros in which they didn't call the