author | Karl Williamson <khw@cpan.org> | 2014-04-21 21:02:44 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2014-05-30 10:00:51 -0600 |
commit | 9723729112a6aa3b219923c48ab1a294906b4819 (patch) | |
tree | 3c479587423b2042e066aad314c21bf644790df1 /utfebcdic.h | |
parent | ff7ecfc39c52fa2a01fe82fb9c109c148766026a (diff) | |
utfebcdic.h: Comment changes only
Clarifications and typo fix.
Diffstat (limited to 'utfebcdic.h')
-rw-r--r-- | utfebcdic.h | 71 |
1 files changed, 45 insertions, 26 deletions
diff --git a/utfebcdic.h b/utfebcdic.h
index 54a3d2696e..edd7a1f084 100644
--- a/utfebcdic.h
+++ b/utfebcdic.h
@@ -7,15 +7,15 @@
  * License or the Artistic License, as specified in the README file.
  *
  * Macros to implement UTF-EBCDIC as perl's internal encoding
- * Taken from version 7.1 of Unicode Technical Report #16:
+ * Adapted from version 7.1 of Unicode Technical Report #16:
  * http://www.unicode.org/unicode/reports/tr16
  *
  * To summarize, the way it works is:
  * To convert an EBCDIC character to UTF-EBCDIC:
  * 1) convert to Unicode. The table in this file that does this for
- * EBCDIC bytes is PL_e2a (with inverse PLa2e). The 'a' stands for
- * ASCIIish, meaning latin1.
- * 2) convert that to a utf8-like string called I8 (I stands for
+ * EBCDIC bytes is PL_e2a (with inverse PL_a2e). The 'a' stands for
+ * ASCII platform, meaning latin1.
+ * 2) convert that to a utf8-like string called I8 ('I' stands for
  * intermediate) with variant characters occupying multiple bytes. This
  * step is similar to the utf8-creating step from Unicode, but the details
  * are different. This transformation is called UTF8-Mod. There is a
@@ -29,20 +29,21 @@
  * trailing 0 for the very largest possible allocation
  * in I8, far beyond the current Unicode standard's
  * max, as shown in the comment later in this file.)
- * 3) Use the table published in tr16 to convert each byte from step 2 into
- * final UTF-EBCDIC. That table is reproduced in this file as PL_utf2e,
- * and its inverse is PL_e2utf. They are constructed so that all EBCDIC
- * invariants remain invariant, but no others do. For example, the
- * ordinal value of 'A' is 193 in EBCDIC, and also is 193 in UTF-EBCDIC.
- * Step 1) converts it to 65, Step 2 leaves it at 65, and Step 3 converts
- * it back to 193. As an example of how a variant character works, take
- * LATIN SMALL LETTER Y WITH DIAERESIS, which is typically 0xDF in
- * EBCDIC. Step 1 converts it to the Unicode value, 0xFF. Step 2
- * converts that to two bytes = 11000111 10111111 = C7 BF, and Step 3
- * converts those to 0x8B 0x73. The table is constructed so that the
- * first byte of the final form of a variant will always have its upper
- * bit set (at least in the encodings that Perl recognizes, and probably
- * all). But note that the upper bit of some invariants is also 1.
+ * 3) Use the algorithm in tr16 to convert each byte from step 2 into
+ * final UTF-EBCDIC. This is done by table lookup from a table
+ * constructed from the algorithm, reproduced in this file as
+ * PL_utf2e, with its inverse being PL_e2utf. They are constructed so that
+ * all EBCDIC invariants remain invariant, but no others do, and the first
+ * byte of a variant will always have its upper bit set. But note that
+ * the upper bit of some invariants is also 1.
+ *
+ * For example, the ordinal value of 'A' is 193 in EBCDIC, and also is 193 in
+ * UTF-EBCDIC. Step 1) converts it to 65, Step 2 leaves it at 65, and Step 3
+ * converts it back to 193. As an example of how a variant character works,
+ * take LATIN SMALL LETTER Y WITH DIAERESIS, which is typically 0xDF in
+ * EBCDIC. Step 1 converts it to the Unicode value, 0xFF. Step 2 converts
+ * that to two bytes = 11000111 10111111 = C7 BF, and Step 3 converts those to
+ * 0x8B 0x73.
  *
  * If you're starting from Unicode, skip step 1. For UTF-EBCDIC to straight
  * EBCDIC, reverse the steps.
@@ -56,20 +57,38 @@
  * The purpose of Step 3 is to make the encoding be invariant for the chosen
  * characters. This messes up the convenient patterns found in step 2, so
  * generally, one has to undo step 3 into a temporary to use them. However,
- * a "shadow", or parallel table, PL_utf8skip, has been constructed so that for
- * each byte, it says how long the sequence is if that byte were to begin it
+ * one "shadow", or parallel table, PL_utf8skip, has been constructed that
+ * doesn't require undoing things. It is such that for each byte, it says
+ * how long the sequence is if that (UTF-EBCDIC) byte were to begin it
+ *
+ * There are actually 3 slightly different UTF-EBCDIC encodings in
+ * this file, one for each of the code pages recognized by Perl. That
+ * means that there are actually three different sets of tables, one for each
+ * code page. (If Perl is compiled on platforms using another EBCDIC code
+ * page, it may not compile, or Perl may silently mistake it for one of the
+ * three.)
  *
- * There are actually 3 slightly different UTF-EBCDIC encodings in this file,
- * one for each of the code pages recognized by Perl. That means that there
- * are actually three different sets of tables, one for each code page. (If
- * Perl is compiled on platforms using another EBCDIC code page, it may not
- * compile, or Perl may silently mistake it for one of the three.)
+ * Note that tr16 actually only specifies one version of UTF-EBCDIC, based on
+ * the 1047 encoding, and which is supposed to be used for all code pages.
+ * But this doesn't work. To illustrate the problem, consider the '^' character.
+ * On a 037 code page it is the single byte 176, whereas under 1047 UTF-EBCDIC
+ * it is the single byte 95. If Perl implemented tr16 exactly, it would mean
+ * that changing a string containing '^' to UTF-EBCDIC would change that '^'
+ * from 176 to 95 (and vice-versa), violating the rule that ASCII-range
+ * characters are the same in UTF-8 or not. Much code in Perl assumes this
+ * rule. See for example
+ * http://grokbase.com/t/perl/mvs/025xf0yhmn/utf-ebcdic-for-posix-bc-malformed-utf-8-character
+ * What Perl does is create a version of UTF-EBCDIC suited to each code page;
+ * the one for the 1047 code page is identical to what's specified in tr16.
+ * This complicates interchanging files between computers using different code
+ * pages. Best is to convert to I8 before sending them, as the I8
+ * representation is the same no matter what the underlying code page is.
  *
  * EBCDIC characters above 0xFF are the same as Unicode in Perl's
  * implementation of all 3 encodings, so for those Step 1 is trivial.
  *
  * (Note that the entries for invariant characters are necessarily the same in
- * PL_e2a and PLe2f, and the same for their inverses.)
+ * PL_e2a and PL_e2utf; likewise for their inverses.)
  *
  * UTF-EBCDIC strings are the same length or longer than UTF-8 representations
  * of the same string. The maximum code point representable as 2 bytes in
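Illustration (not part of the commit): Step 2 of the comment above, the Unicode to I8 (UTF8-Mod) conversion, can be sketched in C as follows. The function name i8_encode is hypothetical and does not exist in utfebcdic.h. The byte layout used here (continuation bytes of the form 101xxxxx carrying 5 bits each, start bytes 110yyyyy, 1110xxxx, 11110www, 111110vv) is an assumption taken from the UTF8-Mod description in tr16; it agrees with the comment's worked example, where 0xFF becomes C7 BF. Steps 1 and 3 are table lookups (PL_e2a, PL_utf2e) that depend on the code page and are not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utfebcdic.h: encode one code point as
 * I8 (UTF8-Mod).  Writes 1..5 bytes to out and returns how many, or
 * returns 0 if cp would need more than 5 bytes.  out must have room for
 * at least 5 bytes. */
static size_t i8_encode(uint32_t cp, uint8_t *out)
{
    size_t len, i;
    uint8_t lead;   /* marker bits of the start byte */

    assert(out != NULL);

    if (cp < 0xA0) {                /* ASCII and C1 controls: one byte */
        out[0] = (uint8_t)cp;
        return 1;
    }

    if      (cp < 0x400)    { len = 2; lead = 0xC0; }  /* 110yyyyy */
    else if (cp < 0x4000)   { len = 3; lead = 0xE0; }  /* 1110xxxx */
    else if (cp < 0x40000)  { len = 4; lead = 0xF0; }  /* 11110www */
    else if (cp < 0x400000) { len = 5; lead = 0xF8; }  /* 111110vv */
    else return 0;                  /* longer forms omitted in this sketch */

    /* Fill continuation bytes from the end, 5 payload bits at a time. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (uint8_t)(0xA0 | (cp & 0x1F));        /* 101xxxxx */
        cp >>= 5;
    }
    out[0] = (uint8_t)(lead | cp);  /* remaining high bits go in the start byte */
    return len;
}
```

Calling i8_encode(0xFF, buf) fills buf with C7 BF, matching the LATIN SMALL LETTER Y WITH DIAERESIS example in the comment; an invariant such as 'A' (65) stays a single byte. Because I8 does not depend on the code page, a sketch like this is the same on all three of Perl's EBCDIC variants, which is why the comment recommends I8 for interchange.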