path: root/utfebcdic.h
author    Karl Williamson <khw@cpan.org>    2017-06-27 14:46:26 -0600
committer Karl Williamson <khw@cpan.org>    2017-07-12 21:14:25 -0600
commit    57ff5f598ddf7ce8834832a15ba1a4628b5932c4 (patch)
tree      f39ae0ce8116b6ee8a13b1014a562f4b350aa3a4 /utfebcdic.h
parent    d044b7a780a1f1916e96ed7d255bb0b7dad54713 (diff)
download  perl-57ff5f598ddf7ce8834832a15ba1a4628b5932c4.tar.gz
utf8n_to_uvchr() Properly test for extended UTF-8
It somehow dawned on me that the code is incorrect for warning/disallowing very high code points. What is really wanted in the API is to catch UTF-8 that is not necessarily portable. There are several classes of this, but I'm referring here to just the code points that are above the Unicode-defined maximum of 0x10FFFF. These can be considered non-portable, and there is a mechanism in the API to warn/disallow these.

However, an earlier standard defined UTF-8 to handle code points up to 2**31-1. Anything above that is using an extension to UTF-8 that has never been officially recognized. Perl does use such an extension, and the API is supposed to have a different mechanism to warn/disallow on this.

Thus there are two classes of warning/disallowing for above-Unicode code points: one for things that have some non-Unicode official recognition, and the other for things that have never had official recognition.

UTF-EBCDIC differs somewhat in this, and since Perl 5.24 we have had a Perl extension that allows it to handle any code point that fits in a 64-bit word. This kicks in at code points above 2**30-1, a different number from where extended UTF-8 kicks in on ASCII platforms.

Things are also complicated by the fact that the API has provisions for accepting the overlong UTF-8 malformation: it is possible to use extended UTF-8 to represent code points smaller than 31-bit ones.

Until this commit, the extended warning/disallowing was based on the resultant code point, and only when that code point did not fit into 31 bits. But what is really wanted is to know whether extended UTF-8 was used to represent a code point, no matter how large the resultant code point is. This differs from the previous definition, but only for EBCDIC platforms, or when the overlong malformation was also present, so it does not affect very many real-world cases. This commit fixes that. It turns out that it is easier to tell if something is using extended UTF-8: one just looks at the first byte of a sequence.

The trailing part of the warning message that gets raised is slightly changed to be clearer. It's not significant enough to affect perldiag.
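The detection trick the last paragraph describes can be shown concretely. Below is a minimal, hypothetical C sketch of the ASCII-platform case (is_perl_extended_utf8 is an illustrative name, not Perl's actual macro): the pre-Unicode UTF-8 standard assigned start bytes only up through 0xFD (beginning a 6-byte sequence, reaching 2**31-1), so a sequence whose first byte is 0xFE or 0xFF can only be Perl's extension. Because only the start byte is examined, the test also fires when an overlong extended sequence encodes a small code point, which is exactly the behavior this commit wants.

#include <stdio.h>

/*
 * Hypothetical sketch, not Perl's actual macro.  Original (pre-Unicode)
 * UTF-8 never assigned a meaning to the start bytes 0xFE and 0xFF, so
 * any sequence beginning with one of them must be using Perl's
 * extension.  Testing the start byte instead of the decoded value means
 * the answer is the same even for overlong extended sequences that
 * decode to small code points.
 */
static int is_perl_extended_utf8(const unsigned char *s)
{
    return *s >= 0xFE;
}

int main(void)
{
    /* 0xFE begins an extended (7-byte) sequence; 0xF4 begins an
     * ordinary 4-byte sequence (U+10FFFF and below). */
    unsigned char extended[] = { 0xFE, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };
    unsigned char normal[]   = { 0xF4, 0x8F, 0xBF, 0xBF };

    printf("%d\n", is_perl_extended_utf8(extended)); /* prints 1 */
    printf("%d\n", is_perl_extended_utf8(normal));   /* prints 0 */
    return 0;
}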
Diffstat (limited to 'utfebcdic.h')
-rw-r--r-- utfebcdic.h | 2 ++
1 file changed, 2 insertions(+), 0 deletions(-)
diff --git a/utfebcdic.h b/utfebcdic.h
index 0f81d1ffee..c2f0788cc4 100644
--- a/utfebcdic.h
+++ b/utfebcdic.h
@@ -511,6 +511,8 @@ explicitly forbidden, and the shortest possible encoding should always be used
  * has this start byte (expressed in I8) as the maximum */
 #define _IS_UTF8_CHAR_HIGHEST_START_BYTE 0xF9
 
+#define UNICODE_IS_PERL_EXTENDED(uv) UNLIKELY((UV) (uv) > 0x3FFFFFFF)
+
 /*
  * ex: set ts=8 sts=4 sw=4 et:
  */
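For context, the added constant 0x3FFFFFFF is 2**30-1, the highest code point representable in UTF-EBCDIC without Perl's extension, matching the commit message above (the corresponding boundary on ASCII platforms is 2**31-1). Below is a self-contained sketch of how the new macro classifies values, with plain-C stand-ins for the UV type and the UNLIKELY hint that perl.h normally supplies:

#include <stdio.h>

/* Stand-ins for the perl.h definitions the real header relies on. */
typedef unsigned long long UV;
#define UNLIKELY(x) (x)   /* the real one is a branch-prediction hint */

/* The line added by this commit (EBCDIC variant): code points above
 * 2**30-1 cannot be represented in UTF-EBCDIC without Perl's extension. */
#define UNICODE_IS_PERL_EXTENDED(uv) UNLIKELY((UV) (uv) > 0x3FFFFFFF)

int main(void)
{
    printf("%d\n", UNICODE_IS_PERL_EXTENDED(0x10FFFFu));   /* 0: Unicode maximum */
    printf("%d\n", UNICODE_IS_PERL_EXTENDED(0x3FFFFFFFu)); /* 0: 2**30-1, last representable */
    printf("%d\n", UNICODE_IS_PERL_EXTENDED(0x40000000u)); /* 1: requires the extension */
    return 0;
}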