| author | Karl Williamson <khw@cpan.org> | 2017-06-27 14:46:26 -0600 |
|---|---|---|
| committer | Karl Williamson <khw@cpan.org> | 2017-07-12 21:14:25 -0600 |
| commit | 57ff5f598ddf7ce8834832a15ba1a4628b5932c4 (patch) | |
| tree | f39ae0ce8116b6ee8a13b1014a562f4b350aa3a4 /utfebcdic.h | |
| parent | d044b7a780a1f1916e96ed7d255bb0b7dad54713 (diff) | |
| download | perl-57ff5f598ddf7ce8834832a15ba1a4628b5932c4.tar.gz | |
utf8n_to_uvchr() Properly test for extended UTF-8
It somehow dawned on me that the code is incorrect for
warning/disallowing very high code points. What is really wanted in the
API is to catch UTF-8 that is not necessarily portable. There are
several classes of this, but I'm referring here to just the code points
that are above the Unicode-defined maximum of 0x10FFFF. These can be
considered non-portable, and there is a mechanism in the API to
warn/disallow these.
However, an earlier standard defined UTF-8 to handle code points up to
2**31-1. Anything above that is using an extension to UTF-8 that has
never been officially recognized. Perl does use such an extension, and
the API is supposed to have a different mechanism to warn/disallow on
this.
Thus there are two classes of warning/disallowing for above-Unicode code
points: one for code points that have had some official, albeit
non-Unicode, recognition, and the other for those that have never had any
official recognition.
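To make the two classes concrete, here is a minimal sketch (the helper
names are hypothetical, not Perl's API; the thresholds are the 0x10FFFF
and 2**31-1 figures from the text above):

```c
#include <stdbool.h>

typedef unsigned long long UV;  /* stand-in for Perl's UV type */

/* Above the Unicode maximum of 0x10FFFF, but still within the 31-bit
 * range that the earlier UTF-8 standard officially covered. */
static bool is_above_unicode(UV cp) { return cp > 0x10FFFFULL; }

/* Above 2**31-1: on ASCII platforms, only reachable through the
 * never-officially-recognized extension to UTF-8. */
static bool is_unofficial_extension(UV cp) { return cp > 0x7FFFFFFFULL; }
```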
UTF-EBCDIC differs somewhat in this, and since Perl 5.24, we have had a
Perl extension that allows it to handle any code point that fits in a
64-bit word. This kicks in at code points above 2**30-1, a lower
threshold than the one at which extended UTF-8 kicks in on ASCII
platforms.
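A small sketch of that platform split (the macro name is hypothetical;
the `EBCDIC` guard and both thresholds follow the description above):

```c
/* Hypothetical illustration: the last code point representable without
 * resorting to a Perl-specific extension differs by platform. */
#ifdef EBCDIC
#  define LAST_NON_EXTENDED_CP  0x3FFFFFFF  /* 2**30 - 1 */
#else
#  define LAST_NON_EXTENDED_CP  0x7FFFFFFF  /* 2**31 - 1 */
#endif
```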
Things are also complicated by the fact that the API has provisions for
accepting the overlong UTF-8 malformation. By using overlongs, it is
possible for extended UTF-8 to represent code points small enough to fit
in 31 bits.
Until this commit, the extended warning/disallowing was based on the
resultant code point, and triggered only when that code point did not fit
into 31 bits.
But what is really wanted is to know whether extended UTF-8 was used to
represent a code point, no matter how large the resultant code point is. This
differs from the previous definition, but only for EBCDIC platforms, or
when the overlong malformation was also present. So it does not affect
very many real-world cases.
This commit fixes that. It turns out that it is actually easier to detect
that extended UTF-8 was used than to check the resultant code point: one
just looks at the first byte of the sequence.
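To see why the first-byte test catches cases the code-point test misses,
consider a standalone sketch (not Perl's decoder; it assumes the
ASCII-platform convention that start bytes 0xFE and 0xFF lie outside
standard UTF-8): an overlong spelling of code point 0 that uses an
extended start byte decodes to a tiny value, so a test on the resultant
code point passes it, while the first byte gives it away.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Overlong: code point 0 spelled with a 7-byte extended form.
     * The 0xFE start byte contributes no payload bits; each of the
     * six continuation bytes contributes its low 6 bits. */
    const uint8_t s[] = { 0xFE, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };
    uint64_t cp = 0;
    for (int i = 1; i < 7; i++)
        cp = (cp << 6) | (uint64_t)(s[i] & 0x3F);

    printf("decoded code point: 0x%llx\n", (unsigned long long) cp); /* 0x0 */
    printf("code-point test (> 0x7FFFFFFF): %s\n",
           cp > 0x7FFFFFFFULL ? "extended" : "not extended");   /* misses it */
    printf("first-byte test (>= 0xFE):      %s\n",
           s[0] >= 0xFE ? "extended" : "not extended");         /* catches it */
    return 0;
}
```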
The trailing part of the warning message that gets raised is slightly
changed to be clearer. It's not significant enough to affect perldiag.
Diffstat (limited to 'utfebcdic.h')
| -rw-r--r-- | utfebcdic.h | 2 |

1 file changed, 2 insertions(+), 0 deletions(-)
```diff
diff --git a/utfebcdic.h b/utfebcdic.h
index 0f81d1ffee..c2f0788cc4 100644
--- a/utfebcdic.h
+++ b/utfebcdic.h
@@ -511,6 +511,8 @@ explicitly forbidden, and the shortest possible encoding should always be used
  * has this start byte (expressed in I8) as the maximum */
 #define _IS_UTF8_CHAR_HIGHEST_START_BYTE 0xF9
 
+#define UNICODE_IS_PERL_EXTENDED(uv) UNLIKELY((UV) (uv) > 0x3FFFFFFF)
+
 /*
  * ex: set ts=8 sts=4 sw=4 et:
  */
```
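For context, a minimal standalone check of the new macro's threshold (the
`UNLIKELY` and `UV` stand-ins and the driver are ours, not Perl's; only
the macro and its 0x3FFFFFFF threshold come from the diff):

```c
#include <stdio.h>

typedef unsigned long long UV;   /* stand-in for Perl's UV          */
#define UNLIKELY(x) (x)          /* stand-in for Perl's branch hint */
#define UNICODE_IS_PERL_EXTENDED(uv) UNLIKELY((UV) (uv) > 0x3FFFFFFF)

int main(void) {
    /* 0x3FFFFFFF (2**30-1) is the last code point UTF-EBCDIC covers
     * without the Perl extension; one past it trips the macro. */
    printf("%d\n", (int) UNICODE_IS_PERL_EXTENDED(0x3FFFFFFFULL)); /* 0 */
    printf("%d\n", (int) UNICODE_IS_PERL_EXTENDED(0x40000000ULL)); /* 1 */
    return 0;
}
```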