summaryrefslogtreecommitdiff
path: root/regcharclass.h
diff options
context:
space:
mode:
authorKarl Williamson <khw@cpan.org>2015-11-18 21:28:14 -0700
committerKarl Williamson <khw@cpan.org>2015-11-25 15:48:17 -0700
commitc0236afee0c5845d3823612c5cd34eccc4d29321 (patch)
tree61f9694bcd4579ccd2196c74ec7be8c69cda1f0f /regcharclass.h
parentb67fd2c557cdf9bdc899813a5e4f2dee22e4f63e (diff)
downloadperl-c0236afee0c5845d3823612c5cd34eccc4d29321.tar.gz
Extend UTF-EBCDIC to handle up to 2**64-1
This uses for UTF-EBCDIC essentially the same mechanism that Perl already uses for UTF-8 on ASCII platforms to extend it beyond what might be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it adds a bunch more bytes to the character than it otherwise would, bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle any code point that fits in a 64 bit word. The downside of this is that this extension is not compatible with previous perls for the range 2**30 up through the previous max, 2**30 - 1. A simple program could be written to convert files that were written out using an older perl so that they can be read with newer perls, and the perldelta says we will do this should anyone ask. However, I strongly suspect that the number of such files in existence is zero, as people in EBCDIC land don't seem to use Unicode much, and these are very large code points, which are associated with a portability warning every time they are output in some way. This extension brings UTF-EBCDIC to parity with UTF-8, so that both can cover a 64-bit word. It allows some removal of special cases for EBCDIC in core code and core tests. And it is a necessary step to handle Perl 6's NFG, which I'd like eventually to bring to Perl 5. This commit causes two implementations of a macro in utf8.h and utfebcdic.h to become the same, and both are moved to a single one in the portion of utf8.h common to both. To illustrate, the I8 for U+3FFFFFFF (2**30-1) is "\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8 for the next code point, U+40000000 is now "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0", and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0". The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is "\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas before this commit it was unrepresentable. Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that it was moving something that hadn't been needed on EBCDIC until the "next commit". That statement turned out to be wrong, overtaken by events. This now is the commit it was referring to. commit I prematurely pushed that
Diffstat (limited to 'regcharclass.h')
-rw-r--r--regcharclass.h2
1 files changed, 1 insertions, 1 deletions
diff --git a/regcharclass.h b/regcharclass.h
index b947bf290d..54a5011c88 100644
--- a/regcharclass.h
+++ b/regcharclass.h
@@ -2516,7 +2516,7 @@
* a18d502bad39d527ac5586d7bc93e29f565859e3bcc24ada627eff606d6f5fed lib/unicore/extracted/DNumValues.txt
* 602994a2249dfd84ae106940eb48450e3e6f1a69d489274f2618861a86f5d8e0 lib/unicore/mktables
* 462c9aaa608fb2014cd9649af1c5c009485c60b9c8b15b89401fdc10cf6161c6 lib/unicore/version
- * c6884f4d629f04d1316f3476cb1050b6a1b98ca30c903262955d4eae337c6b1e regen/charset_translations.pl
+ * 996abda3c0fbc2bfd575092af09e3b9b0331e624eb2e969a268457f8fd31ecbb regen/charset_translations.pl
* d9c04ac46bdd81bb3e26519f2b8eb6242cb12337205add3f7cf092b0c58dccc4 regen/regcharclass.pl
* 393f8d882713a3ba227351ad0f00ea4839fda74fcf77dcd1cdf31519925adba5 regen/regcharclass_multi_char_folds.pl
* ex: set ro: */