Use real illegal UTF-8 byte

The code here wrongly assumed that \xFF is not legal in UTF-8 encoded strings. Under Perl's extended UTF-8 it is a legal start byte, although it currently doesn't work due to a bug that may eventually be fixed: [perl #116867]. The comments were also wrong in claiming that all bytes are legal in UTF-EBCDIC.

It turns out that in well-formed UTF-8 the bytes C0 and C1 never appear (nor do C2, C3, and C4 in UTF-EBCDIC), as each could only be the start byte of an illegal overlong sequence. This commit creates a #define for an illegal byte using one of the truly illegal ones, and changes the code to use it. No test is included due to #116867.
author: Karl Williamson <public@khwilliamson.com> 2013-02-19 15:13:19 -0700
committer: Karl Williamson <public@khwilliamson.com> 2013-08-29 09:55:52 -0600
commit: e7214ce8dd2816e52abdfe522e7ff5adc81ba23e (patch)
tree: b0aae3d2fe1253cb4d4269fd6be1d6186b7ff9f4 /utf8.h
parent: 069727664af50f1d767b5928bad25bcb51f0644c (diff)
download: perl-e7214ce8dd2816e52abdfe522e7ff5adc81ba23e.tar.gz
1 files changed, 4 insertions, 0 deletions
diff --git a/utf8.h b/utf8.h
index bbbefdef70..b76f098fe4 100644
--- a/utf8.h
+++ b/utf8.h
@@ -349,6 +349,10 @@ Perl's extended UTF-8 means we can have start bytes up to FF.
 #define UTF8_EIGHT_BIT_HI(c)	UTF8_TWO_BYTE_HI((U8)(c))
 #define UTF8_EIGHT_BIT_LO(c)	UTF8_TWO_BYTE_LO((U8)(c))
 
+/* This is illegal in any well-formed UTF-8 in both EBCDIC and ASCII
+ * as it is only in overlongs. */
+#define ILLEGAL_UTF8_BYTE   I8_TO_NATIVE_UTF8(0xC1)
+
 /*
  * 'UTF' is whether or not p is encoded in UTF8.  The names 'foo_lazy_if' stem
  * from an earlier version of these macros in which they didn't call the