author | Karl Williamson <khw@cpan.org> | 2015-11-18 21:28:14 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2015-11-25 15:48:17 -0700 |
commit | c0236afee0c5845d3823612c5cd34eccc4d29321 (patch) | |
tree | 61f9694bcd4579ccd2196c74ec7be8c69cda1f0f /regen/charset_translations.pl | |
parent | b67fd2c557cdf9bdc899813a5e4f2dee22e4f63e (diff) | |
download | perl-c0236afee0c5845d3823612c5cd34eccc4d29321.tar.gz |
Extend UTF-EBCDIC to handle up to 2**64-1
This uses for UTF-EBCDIC essentially the same mechanism that Perl
already uses for UTF-8 on ASCII platforms to extend it beyond what might
be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it
adds a bunch more bytes to the character than it otherwise would,
bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle
any code point that fits in a 64-bit word.
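
The choice of 14 bytes can be checked with a little arithmetic on how many payload bits each I8 length carries. The following is only a sketch: `i8_payload_bits` is a hypothetical helper, not something in this commit, and it assumes the start-byte masks computed by `_UTF_START_MASK` in regen/charset_translations.pl together with 5 payload bits per continuation byte.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the commit): payload bits carried by an
# I8 sequence of $len bytes, for $len of 2 or more.
sub i8_payload_bits {
    my $len = shift;

    # The start byte has 7 - $len payload bits for lengths 2..6 and none
    # from 7 bytes upward (matches 0x1F >> ($len - 2) in _UTF_START_MASK).
    my $start_bits = $len >= 7 ? 0 : 7 - $len;

    # Every continuation byte contributes 5 payload bits.
    return $start_bits + 5 * ($len - 1);
}

printf "%2d bytes: %2d payload bits\n", $_, i8_payload_bits($_)
    for 2 .. 7, 13, 14;
```

Under these assumptions a 7-byte sequence holds 30 bits, a 13-byte sequence would hold only 60, and 14 bytes hold 65, which is the smallest length that covers a 64-bit word.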
The downside of this is that this extension is not compatible with
previous perls for the range 2**30 up through the previous max,
2**31 - 1. A simple program could be written to convert files that were
written out using an older perl so that they can be read with newer
perls, and the perldelta says we will do this should anyone ask.
However, I strongly suspect that the number of such files in existence
is zero, as people in EBCDIC land don't seem to use Unicode much, and
these are very large code points, which are associated with a
portability warning every time they are output in some way.
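
As a rough idea of what the core of such a conversion program might look like, the sketch below decodes one old-style sequence into its code point. It is an illustration under stated assumptions, not a tested tool: `old_ff_i8_to_cp` is a hypothetical name, and it works on I8 byte values, so a real converter would first have to map the file's code-page-specific UTF-EBCDIC bytes back to I8 and afterwards re-encode the result in the new 14-byte form.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a pre-commit 7-byte I8 sequence whose start
# byte is 0xFF.  In the old scheme that start byte was 0xFE plus a single
# payload bit, so 0xFF always means bit 30 of the code point is set.
sub old_ff_i8_to_cp {
    my @bytes = @_;
    die "expected 7 bytes starting with 0xFF\n"
        unless @bytes == 7 && $bytes[0] == 0xFF;

    my $cp = 1;    # the one payload bit carried by the old 0xFF start byte
    for my $byte (@bytes[1 .. 6]) {
        # I8 continuation bytes have the form 101x xxxx
        die sprintf("0x%02X is not an I8 continuation byte\n", $byte)
            unless ($byte & 0xE0) == 0xA0;
        $cp = ($cp << 5) | ($byte & 0x1F);
    }
    return $cp;
}

printf "U+%X\n", old_ff_i8_to_cp(0xFF, (0xA0) x 6);
printf "U+%X\n", old_ff_i8_to_cp(0xFF, (0xBF) x 6);
```

Because the same 0xFF start byte now introduces a 14-byte sequence, such a tool cannot detect the old form on its own; it has to be told that the input was written by an older perl.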
This extension brings UTF-EBCDIC to parity with UTF-8, so that both can
cover a 64-bit word. It allows some removal of special cases for EBCDIC
in core code and core tests. And it is a necessary step to handle Perl
6's NFG, which I'd like eventually to bring to Perl 5.
This commit causes two implementations of a macro in utf8.h and
utfebcdic.h to become the same, and both are moved to a single one in
the portion of utf8.h common to both.
To illustrate, the I8 for U+3FFFFFFF (2**30-1) is
"\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8
for the next code point, U+40000000 is now
"\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0",
and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0".
The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is
"\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas
before this commit it was unrepresentable.
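
The post-commit byte strings quoted above can be reproduced with a short stand-alone encoder. This is a sketch modelled on the start-byte mask and mark expressions in the changed regen/charset_translations.pl, not the file's actual code: `i8_len`, `i8_bytes` and `i8_hex` are hypothetical names, the 1-byte and 2-byte thresholds come from the standard I8 layout rather than from this diff, and the last call assumes a perl built with 64-bit integers. The result is I8 only; real UTF-EBCDIC output needs the additional per-code-page byte mapping.

```perl
use strict;
use warnings;

my $MAXBYTES = 14;    # same value as UTF_EBCDIC_MAXBYTES in this commit

# Number of I8 bytes needed for a code point (hypothetical helper).
sub i8_len {
    my $cp = shift;
    return $cp < 0xA0       ? 1
         : $cp < 0x400      ? 2
         : $cp < 0x4000     ? 3
         : $cp < 0x40000    ? 4
         : $cp < 0x400000   ? 5
         : $cp < 0x4000000  ? 6
         : $cp < 0x40000000 ? 7
         :                    $MAXBYTES;
}

# Encode a code point as a list of I8 byte values.
sub i8_bytes {
    my $cp  = shift;
    my $len = i8_len($cp);
    return ($cp) if $len == 1;

    my @bytes;
    for (1 .. $len - 1) {                 # continuation bytes, low bits first
        unshift @bytes, 0xA0 | ($cp & 0x1F);
        $cp >>= 5;
    }
    # Same expressions as the post-commit _UTF_START_MASK / _UTF_START_MARK
    my $mask = $len >= 7 ? 0x00 : (0x1F >> ($len - 2));
    my $mark = $len >  7 ? 0xFF : (0xFF & (0xFE << (7 - $len)));
    unshift @bytes, $mark | ($cp & $mask);
    return @bytes;
}

# Format the bytes the way the commit message writes them.
sub i8_hex {
    my $cp = shift;
    return join '', map { sprintf '\\x%02X', $_ } i8_bytes($cp);
}

print i8_hex(0x3FFFFFFF), "\n";   # 7 bytes, start byte 0xFE
print i8_hex(0x40000000), "\n";   # 14 bytes, start byte 0xFF
print i8_hex(~0), "\n";           # 2**64-1 on a 64-bit perl
```

Using `~0` instead of a 64-bit hex literal avoids the portability warning that such literals trigger under `use warnings`.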
Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that
it was moving something that hadn't been needed on EBCDIC until the
"next commit". That statement turned out to be wrong, overtaken by
events. This now is the commit it was referring to.
commit I prematurely
pushed that
Diffstat (limited to 'regen/charset_translations.pl')
-rw-r--r-- | regen/charset_translations.pl | 13 |
1 files changed, 10 insertions, 3 deletions
diff --git a/regen/charset_translations.pl b/regen/charset_translations.pl
index 9696560e8f..b37c3cdd6a 100644
--- a/regen/charset_translations.pl
+++ b/regen/charset_translations.pl
@@ -2,6 +2,10 @@
 use strict;
 use warnings;

+# WARNING: This must be kept in sync with the UTF8_MAXBYTES value in
+# utfebcdic.h
+$CHARSET_TRANSLATIONS::UTF_EBCDIC_MAXBYTES = 14;
+
 # Utilities for various character set issues. Currently handles ASCII and
 # EBCDIC only. It is trivial to add support for new EBCDIC code pages (unless
 # they have identical variant character signatures as existing ones, and there
@@ -234,12 +238,13 @@ sub get_I8_2_utf($) {

 sub _UTF_START_MASK($) { # Internal
     my $len = shift;
-    return ((($len) >= 6) ? 0x01 : (0x1F >> (($len)-2)));
+    return (($len >= 7) ? 0x00 : (0x1F >> ($len - 2)));
 }

 sub _UTF_START_MARK($) { # Internal
-    return (0xFF & (0xFE << (7-(shift))));
+    my $len = shift;
+    return (($len > 7) ? 0xFF : (0xFF & (0xFE << (7- $len))));
 }

 sub cp_2_utfbytes($$) {
@@ -269,7 +274,9 @@ sub cp_2_utfbytes($$) {
              $ucp < 0x4000 ? 3 :
              $ucp < 0x40000 ? 4 :
              $ucp < 0x400000 ? 5 :
-             $ucp < 0x4000000 ? 6 : 7;
+             $ucp < 0x4000000 ? 6 :
+             $ucp < 0x40000000? 7 :
+             $CHARSET_TRANSLATIONS::UTF_EBCDIC_MAXBYTES;

     my @str;
     for (1 .. $len - 1) {