author | Karl Williamson <public@khwilliamson.com> | 2013-02-19 15:13:19 -0700 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2013-08-29 09:55:52 -0600 |
commit | e7214ce8dd2816e52abdfe522e7ff5adc81ba23e (patch) | |
tree | b0aae3d2fe1253cb4d4269fd6be1d6186b7ff9f4 /utf8.h | |
parent | 069727664af50f1d767b5928bad25bcb51f0644c (diff) | |
download | perl-e7214ce8dd2816e52abdfe522e7ff5adc81ba23e.tar.gz |
Use real illegal UTF-8 byte
The code here wrongly assumed that \xFF is not legal in UTF-8 encoded
strings. \xFF currently doesn't work there because of a bug, but that may
eventually be fixed: [perl #116867]. The comments are also wrong in
claiming that all bytes are legal in UTF-EBCDIC.
It turns out that in well-formed UTF-8, the bytes C0 and C1 never appear
(C2, C3, and C4 as well in UTF-EBCDIC), as they would be the start byte
of an illegal overlong sequence.
This creates a #define for an illegal byte using one of the real illegal
ones, and changes the code to use that.
No test is included due to #116867.
Diffstat (limited to 'utf8.h')
-rw-r--r-- | utf8.h | 4 |
1 files changed, 4 insertions, 0 deletions
@@ -349,6 +349,10 @@ Perl's extended UTF-8 means we can have start bytes up to FF.
 #define UTF8_EIGHT_BIT_HI(c) UTF8_TWO_BYTE_HI((U8)(c))
 #define UTF8_EIGHT_BIT_LO(c) UTF8_TWO_BYTE_LO((U8)(c))
 
+/* This is illegal in any well-formed UTF-8 in both EBCDIC and ASCII
+ * as it is only in overlongs. */
+#define ILLEGAL_UTF8_BYTE I8_TO_NATIVE_UTF8(0xC1)
+
 /*
  * 'UTF' is whether or not p is encoded in UTF8. The names 'foo_lazy_if' stem
  * from an earlier version of these macros in which they didn't call the