author | Jarkko Hietaniemi <jhi@iki.fi> | 2000-08-17 14:44:02 +0000 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2000-08-17 14:44:02 +0000 |
commit | d396a55899b7bce58ef6008d9af7a500b5175b4a (patch) | |
tree | 92bb4fc9fea98748bcd8bc310e3b9dd4fd5f54a0 /pod/perlebcdic.pod | |
parent | 10c102662dfb8c226a9c3524f047501223fa8409 (diff) | |
download | perl-d396a55899b7bce58ef6008d9af7a500b5175b4a.tar.gz |
Add perlebcdic from Peter Prymmer, regen toc.
p4raw-id: //depot/perl@6676
Diffstat (limited to 'pod/perlebcdic.pod')
-rw-r--r-- | pod/perlebcdic.pod | 1001 |
1 files changed, 1001 insertions, 0 deletions
diff --git a/pod/perlebcdic.pod b/pod/perlebcdic.pod
new file mode 100644
index 0000000000..f27a8dea2e
--- /dev/null
+++ b/pod/perlebcdic.pod
@@ -0,0 +1,1001 @@
+=head1 NAME
+
+perlebcdic - Considerations for running Perl on EBCDIC platforms
+
+=head1 DESCRIPTION
+
+An exploration of some of the issues facing Perl programmers
+on EBCDIC based computers.  We do not cover localization,
+internationalization, or multi byte character set issues (yet).
+
+Portions that are still incomplete are marked with XXX.
+
+=head1 COMMON CHARACTER CODE SETS
+
+=head2 ASCII
+
+The American Standard Code for Information Interchange is a set of
+integers running from 0 to 127 (decimal) that imply character
+interpretation by the display and other system(s) of computers.
+The range 0..127 is covered by setting the bits in a 7-bit binary
+number, hence the set is sometimes referred to as "7-bit ASCII".
+ASCII was described by the American National Standards Institute
+document ANSI X3.4-1986.  It was also described by ISO 646:1991
+(with localization for currency symbols).  The full ASCII set is
+given in the table below as the first 128 elements.  Languages that
+can be written adequately with the characters in ASCII include
+English, Hawaiian, Indonesian, Swahili and some Native American
+languages.
+
+=head2 ISO 8859
+
+The ISO 8859-$n are a collection of character code sets from the
+International Organization for Standardization (ISO) each of which
+adds characters to the ASCII set that are typically found in European
+languages many of which are based on the Roman, or Latin, alphabet.
+
+=head2 Latin 1 (ISO 8859-1)
+
+A particular 8-bit extension to ASCII that includes grave and acute
+accented Latin characters.  Languages that can employ ISO 8859-1
+include all the languages covered by ASCII as well as Afrikaans,
+Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian,
+Portuguese, Spanish, and Swedish.  Dutch is covered albeit without
+the ij ligature.
French is covered too but without the oe ligature. +German can use ISO 8859-1 but must do so without German-style +quotation marks. This set is based on Western European extensions +to ASCII and is commonly encountered in world wide web work. +In IBM character code set identification terminology ISO 8859-1 is +known as CCSID 819 (or sometimes 0819 or even 00819). + +=head2 EBCDIC + +Extended Binary Coded Decimal Interchange Code. The EBCDIC acronym +refers to a large collection of slightly different single and +multi byte coded character sets that are different from ASCII or +ISO 8859-1 and typically run on host computers. The +EBCDIC encodings derive from Hollerith punched card encodings. +The layout on the cards was such that high bits were set for the +upper and lower case alphabet characters [a-z] and [A-Z], but there +were gaps within each latin alphabet range. + +=head2 13 variant characters + +XXX. + +EBCDIC character sets may be known by character code set identification +numbers (CCSID numbers) or code page numbers. + +=head2 0037 + +Character code set ID 0037 is a mapping of the ASCII plus Latin-1 +characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used +on the OS/400 operating system that runs on AS/400 computers. +CCSID 37 differs from ISO 8859-1 in 237 places, in other words +they agree on only 19 code point values. + +=head2 1047 + +Character code set ID 1047 is also a mapping of the ASCII plus +Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 1047 is +used under Unix System Services for OS/390, and OpenEdition for VM/ESA. +CCSID 1047 differs from CCSID 0037 in eight places. + +=head2 POSIX-BC + +The EBCDIC code page in use on Siemens' BS2000 system is distinct from +1047 and 0037. It is identified below as the POSIX-BC set. + +=head1 SINGLE OCTET TABLES + +The following tables list the ASCII and Latin 1 ordered sets including +the subsets: C0 controls (0..31), ASCII graphics (32..7e), delete (7f), +C1 controls (80..9f), and Latin-1 (a.k.a. 
ISO 8859-1) (a0..ff). In the +table non-printing control character names as well as the Latin 1 +extensions to ASCII have been labelled with character names roughly +corresponding to I<The Unicode Standard, Version 2.0> albeit with +substitutions such as s/LATIN// and s/VULGAR// in all cases, +s/CAPITAL LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/ +in some other cases. The "names" of the C1 control set +(128..159 in ISO 8859-1) are somewhat arbitrary. The differences +between the 0037 and 1047 sets are flagged with ***. The differences +between the 1047 and POSIX-BC sets are flagged with ###. +All ord() numbers listed are decimal. If you would rather see this +table listing octal values then run the table (that is, the pod +version of this document since this recipe may not work with +a pod2XXX translation to another format) through: + +=over 4 + +=item recipe 0 + +=back + + perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ + -e '{printf("%s%-9o%-9o%-9o%-9o\n",$1,$2,$3,$4,$5)}' perlebcdic.pod + +If you would rather see this table listing hexadecimal values then +run the table through: + +=over 4 + +=item recipe 1 + +=back + + perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ + -e '{printf("%s%-9X%-9X%-9X%-9X\n",$1,$2,$3,$4,$5)}' perlebcdic.pod + + + 8859-1 + chr 0819 0037 1047 POSIX-BC + ---------------------------------------------------------------- + <NULL> 0 0 0 0 + <START OF HEADING> 1 1 1 1 + <START OF TEXT> 2 2 2 2 + <END OF TEXT> 3 3 3 3 + <END OF TRANSMISSION> 4 55 55 55 + <ENQUIRY> 5 45 45 45 + <ACKNOWLEDGE> 6 46 46 46 + <BELL> 7 47 47 47 + <BACKSPACE> 8 22 22 22 + <HORIZONTAL TABULATION> 9 5 5 5 + <LINE FEED> 10 37 21 21 *** + <VERTICAL TABULATION> 11 11 11 11 + <FORM FEED> 12 12 12 12 + <CARRIAGE RETURN> 13 13 13 13 + <SHIFT OUT> 14 14 14 14 + <SHIFT IN> 15 15 15 15 + <DATA LINK ESCAPE> 16 16 16 16 + <DEVICE CONTROL ONE> 17 17 17 17 + <DEVICE CONTROL TWO> 18 18 18 18 + <DEVICE CONTROL THREE> 19 19 19 19 + <DEVICE CONTROL FOUR> 20 
60 60 60 + <NEGATIVE ACKNOWLEDGE> 21 61 61 61 + <SYNCHRONOUS IDLE> 22 50 50 50 + <END OF TRANSMISSION BLOCK> 23 38 38 38 + <CANCEL> 24 24 24 24 + <END OF MEDIUM> 25 25 25 25 + <SUBSTITUTE> 26 63 63 63 + <ESCAPE> 27 39 39 39 + <FILE SEPARATOR> 28 28 28 28 + <GROUP SEPARATOR> 29 29 29 29 + <RECORD SEPARATOR> 30 30 30 30 + <UNIT SEPARATOR> 31 31 31 31 + <SPACE> 32 64 64 64 + ! 33 90 90 90 + " 34 127 127 127 + # 35 123 123 123 + $ 36 91 91 91 + % 37 108 108 108 + & 38 80 80 80 + ' 39 125 125 125 + ( 40 77 77 77 + ) 41 93 93 93 + * 42 92 92 92 + + 43 78 78 78 + , 44 107 107 107 + - 45 96 96 96 + . 46 75 75 75 + / 47 97 97 97 + 0 48 240 240 240 + 1 49 241 241 241 + 2 50 242 242 242 + 3 51 243 243 243 + 4 52 244 244 244 + 5 53 245 245 245 + 6 54 246 246 246 + 7 55 247 247 247 + 8 56 248 248 248 + 9 57 249 249 249 + : 58 122 122 122 + ; 59 94 94 94 + < 60 76 76 76 + = 61 126 126 126 + > 62 110 110 110 + ? 63 111 111 111 + @ 64 124 124 124 + A 65 193 193 193 + B 66 194 194 194 + C 67 195 195 195 + D 68 196 196 196 + E 69 197 197 197 + F 70 198 198 198 + G 71 199 199 199 + H 72 200 200 200 + I 73 201 201 201 + J 74 209 209 209 + K 75 210 210 210 + L 76 211 211 211 + M 77 212 212 212 + N 78 213 213 213 + O 79 214 214 214 + P 80 215 215 215 + Q 81 216 216 216 + R 82 217 217 217 + S 83 226 226 226 + T 84 227 227 227 + U 85 228 228 228 + V 86 229 229 229 + W 87 230 230 230 + X 88 231 231 231 + Y 89 232 232 232 + Z 90 233 233 233 + [ 91 186 173 187 *** ### + \ 92 224 224 188 ### + ] 93 187 189 189 *** + ^ 94 176 95 106 *** ### + _ 95 109 109 109 + ` 96 121 121 74 ### + a 97 129 129 129 + b 98 130 130 130 + c 99 131 131 131 + d 100 132 132 132 + e 101 133 133 133 + f 102 134 134 134 + g 103 135 135 135 + h 104 136 136 136 + i 105 137 137 137 + j 106 145 145 145 + k 107 146 146 146 + l 108 147 147 147 + m 109 148 148 148 + n 110 149 149 149 + o 111 150 150 150 + p 112 151 151 151 + q 113 152 152 152 + r 114 153 153 153 + s 115 162 162 162 + t 116 163 163 163 + u 117 164 164 164 + v 
118 165 165 165 + w 119 166 166 166 + x 120 167 167 167 + y 121 168 168 168 + z 122 169 169 169 + { 123 192 192 251 ### + | 124 79 79 79 + } 125 208 208 253 ### + ~ 126 161 161 255 ### + <DELETE> 127 7 7 7 + <C1 0> 128 32 32 32 + <C1 1> 129 33 33 33 + <C1 2> 130 34 34 34 + <C1 3> 131 35 35 35 + <C1 4> 132 36 36 36 + <C1 5> 133 21 37 37 *** + <C1 6> 134 6 6 6 + <C1 7> 135 23 23 23 + <C1 8> 136 40 40 40 + <C1 9> 137 41 41 41 + <C1 10> 138 42 42 42 + <C1 11> 139 43 43 43 + <C1 12> 140 44 44 44 + <C1 13> 141 9 9 9 + <C1 14> 142 10 10 10 + <C1 15> 143 27 27 27 + <C1 16> 144 48 48 48 + <C1 17> 145 49 49 49 + <C1 18> 146 26 26 26 + <C1 19> 147 51 51 51 + <C1 20> 148 52 52 52 + <C1 21> 149 53 53 53 + <C1 22> 150 54 54 54 + <C1 23> 151 8 8 8 + <C1 24> 152 56 56 56 + <C1 25> 153 57 57 57 + <C1 26> 154 58 58 58 + <C1 27> 155 59 59 59 + <C1 28> 156 4 4 4 + <C1 29> 157 20 20 20 + <C1 30> 158 62 62 62 + <C1 31> 159 255 255 95 ### + <NON-BREAKING SPACE> 160 65 65 65 + <INVERTED EXCLAMATION MARK> 161 170 170 170 + <CENT SIGN> 162 74 74 176 ### + <POUND SIGN> 163 177 177 177 + <CURRENCY SIGN> 164 159 159 159 + <YEN SIGN> 165 178 178 178 + <BROKEN BAR> 166 106 106 208 ### + <SECTION SIGN> 167 181 181 181 + <DIAERESIS> 168 189 187 121 *** ### + <COPYRIGHT SIGN> 169 180 180 180 + <FEMININE ORDINAL INDICATOR> 170 154 154 154 + <LEFT POINTING GUILLEMET> 171 138 138 138 + <NOT SIGN> 172 95 176 186 *** ### + <SOFT HYPHEN> 173 202 202 202 + <REGISTERED TRADE MARK SIGN> 174 175 175 175 + <MACRON> 175 188 188 161 ### + <DEGREE SIGN> 176 144 144 144 + <PLUS-OR-MINUS SIGN> 177 143 143 143 + <SUPERSCRIPT TWO> 178 234 234 234 + <SUPERSCRIPT THREE> 179 250 250 250 + <ACUTE ACCENT> 180 190 190 190 + <MICRO SIGN> 181 160 160 160 + <PARAGRAPH SIGN> 182 182 182 182 + <MIDDLE DOT> 183 179 179 179 + <CEDILLA> 184 157 157 157 + <SUPERSCRIPT ONE> 185 218 218 218 + <MASC. 
ORDINAL INDICATOR> 186 155 155 155 + <RIGHT POINTING GUILLEMET> 187 139 139 139 + <FRACTION ONE QUARTER> 188 183 183 183 + <FRACTION ONE HALF> 189 184 184 184 + <FRACTION THREE QUARTERS> 190 185 185 185 + <INVERTED QUESTION MARK> 191 171 171 171 + <A WITH GRAVE> 192 100 100 100 + <A WITH ACUTE> 193 101 101 101 + <A WITH CIRCUMFLEX> 194 98 98 98 + <A WITH TILDE> 195 102 102 102 + <A WITH DIAERESIS> 196 99 99 99 + <A WITH RING ABOVE> 197 103 103 103 + <CAPITAL LIGATURE AE> 198 158 158 158 + <C WITH CEDILLA> 199 104 104 104 + <E WITH GRAVE> 200 116 116 116 + <E WITH ACUTE> 201 113 113 113 + <E WITH CIRCUMFLEX> 202 114 114 114 + <E WITH DIAERESIS> 203 115 115 115 + <I WITH GRAVE> 204 120 120 120 + <I WITH ACUTE> 205 117 117 117 + <I WITH CIRCUMFLEX> 206 118 118 118 + <I WITH DIAERESIS> 207 119 119 119 + <CAPITAL LETTER ETH> 208 172 172 172 + <N WITH TILDE> 209 105 105 105 + <O WITH GRAVE> 210 237 237 237 + <O WITH ACUTE> 211 238 238 238 + <O WITH CIRCUMFLEX> 212 235 235 235 + <O WITH TILDE> 213 239 239 239 + <O WITH DIAERESIS> 214 236 236 236 + <MULTIPLICATION SIGN> 215 191 191 191 + <O WITH STROKE> 216 128 128 128 + <U WITH GRAVE> 217 253 253 224 ### + <U WITH ACUTE> 218 254 254 254 + <U WITH CIRCUMFLEX> 219 251 251 221 ### + <U WITH DIAERESIS> 220 252 252 252 + <Y WITH ACUTE> 221 173 186 173 *** ### + <CAPITAL LETTER THORN> 222 174 174 174 + <SMALL LETTER SHARP S> 223 89 89 89 + <a WITH GRAVE> 224 68 68 68 + <a WITH ACUTE> 225 69 69 69 + <a WITH CIRCUMFLEX> 226 66 66 66 + <a WITH TILDE> 227 70 70 70 + <a WITH DIAERESIS> 228 67 67 67 + <a WITH RING ABOVE> 229 71 71 71 + <SMALL LIGATURE ae> 230 156 156 156 + <c WITH CEDILLA> 231 72 72 72 + <e WITH GRAVE> 232 84 84 84 + <e WITH ACUTE> 233 81 81 81 + <e WITH CIRCUMFLEX> 234 82 82 82 + <e WITH DIAERESIS> 235 83 83 83 + <i WITH GRAVE> 236 88 88 88 + <i WITH ACUTE> 237 85 85 85 + <i WITH CIRCUMFLEX> 238 86 86 86 + <i WITH DIAERESIS> 239 87 87 87 + <SMALL LETTER eth> 240 140 140 140 + <n WITH TILDE> 241 73 73 73 + <o WITH 
GRAVE> 242 205 205 205 + <o WITH ACUTE> 243 206 206 206 + <o WITH CIRCUMFLEX> 244 203 203 203 + <o WITH TILDE> 245 207 207 207 + <o WITH DIAERESIS> 246 204 204 204 + <DIVISION SIGN> 247 225 225 225 + <o WITH STROKE> 248 112 112 112 + <u WITH GRAVE> 249 221 221 192 ### + <u WITH ACUTE> 250 222 222 222 + <u WITH CIRCUMFLEX> 251 219 219 219 + <u WITH DIAERESIS> 252 220 220 220 + <y WITH ACUTE> 253 141 141 141 + <SMALL LETTER thorn> 254 142 142 142 + <y WITH DIAERESIS> 255 223 223 223 + +If you would rather see the above table in CCSID 0037 order rather than +ASCII + Latin-1 order then run the table through: + +=over 4 + +=item recipe 2 + +=back + + perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\s{1,3}/)'\ + -e '{push(@l,$_)}' \ + -e 'END{print map{$_->[0]}' \ + -e ' sort{$a->[1] <=> $b->[1]}' \ + -e ' map{[$_,substr($_,42,3)]}@l;}' perlebcdic.pod + +If you would rather see it in CCSID 1047 order then change the digit +42 in the last line to 51, like this: + +=over 4 + +=item recipe 3 + +=back + + perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\s{1,3}/)'\ + -e '{push(@l,$_)}' \ + -e 'END{print map{$_->[0]}' \ + -e ' sort{$a->[1] <=> $b->[1]}' \ + -e ' map{[$_,substr($_,51,3)]}@l;}' perlebcdic.pod + +If you would rather see it in POSIX-BC order then change the digit +51 in the last line to 60, like this: + +=over 4 + +=item recipe 4 + +=back + + perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\s{1,3}/)'\ + -e '{push(@l,$_)}' \ + -e 'END{print map{$_->[0]}' \ + -e ' sort{$a->[1] <=> $b->[1]}' \ + -e ' map{[$_,substr($_,60,3)]}@l;}' perlebcdic.pod + + +=head1 IDENTIFYING CHARACTER CODE SETS + +To determine the character set you are running under from perl one +could use the return value of ord() or chr() to test one or more +character values. For example: + + $is_ascii = "A" eq chr(65); + $is_ebcdic = "A" eq chr(193); + +"\t" is a <HORIZONTAL TABULATION>. 
So that:
+
+    $is_ascii  = ord("\t") == 9;
+    $is_ebcdic = ord("\t") == 5;
+
+To distinguish EBCDIC code pages try looking at one or more of
+the characters that differ between them.  For example:
+
+    $is_ebcdic_37   = "\n" eq chr(37);
+    $is_ebcdic_1047 = "\n" eq chr(21);
+
+Or better still choose a character that is uniquely encoded in any
+of the code sets, e.g.:
+
+    $is_ascii           = ord('[') == 91;
+    $is_ebcdic_37       = ord('[') == 186;
+    $is_ebcdic_1047     = ord('[') == 173;
+    $is_ebcdic_POSIX_BC = ord('[') == 187;
+
+However, it would be unwise to write tests such as:
+
+    $is_ascii = "\r" ne chr(13);  #  WRONG
+    $is_ascii = "\n" ne chr(10);  #  ILL ADVISED
+
+Obviously the first of these will fail to distinguish most ASCII machines
+from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC machine since "\r" eq
+chr(13) under all of those coded character sets.  But note too that
+because "\n" is chr(13) and "\r" is chr(10) on the Macintosh (which is an
+ASCII machine) the second C<$is_ascii> test will lead to trouble there.
+
+To determine whether or not perl was built under an EBCDIC
+code page you can use the Config module like so:
+
+    use Config;
+    $is_ebcdic = $Config{ebcdic} eq 'define';
+
+=head1 CONVERSIONS
+
+In order to convert a string of characters from one character set to
+another a simple list of numbers, such as in the right columns in the
+above table, along with perl's tr/// operator is all that is needed.
+The data in the table are in ASCII order hence the EBCDIC columns
+provide easy to use ASCII to EBCDIC operations that are also easily
+reversed.
+
+For example, to convert code page 037 to ISO 8859-1 list the 8859-1 codes in
+CCSID 0037 order (see recipe 2), in octal, and use them in tr/// like so:
+
+    $cp_037 =
+    '\000\001\002\003\234\011\206\177\227\215\216\013\014\015\016\017' .
+    '\020\021\022\023\235\205\010\207\030\031\222\217\034\035\036\037' .
+    '\200\201\202\203\204\012\027\033\210\211\212\213\214\005\006\007' .
+    '\220\221\026\223\224\225\226\004\230\231\232\233\024\025\236\032' .
+    '\040\240\342\344\340\341\343\345\347\361\242\056\074\050\053\174' .
+    '\046\351\352\353\350\355\356\357\354\337\041\044\052\051\073\254' .
+    '\055\057\302\304\300\301\303\305\307\321\246\054\045\137\076\077' .
+    '\370\311\312\313\310\315\316\317\314\140\072\043\100\047\075\042' .
+    '\330\141\142\143\144\145\146\147\150\151\253\273\360\375\376\261' .
+    '\260\152\153\154\155\156\157\160\161\162\252\272\346\270\306\244' .
+    '\265\176\163\164\165\166\167\170\171\172\241\277\320\335\336\256' .
+    '\136\243\245\267\251\247\266\274\275\276\133\135\257\250\264\327' .
+    '\173\101\102\103\104\105\106\107\110\111\255\364\366\362\363\365' .
+    '\175\112\113\114\115\116\117\120\121\122\271\373\374\371\372\377' .
+    '\134\367\123\124\125\126\127\130\131\132\262\324\326\322\323\325' .
+    '\060\061\062\063\064\065\066\067\070\071\263\333\334\331\332\237' ;
+
+    my $ascii_string = $ebcdic_string;
+    eval '$ascii_string =~ tr/\000-\377/' . $cp_037 . '/'; # tr/// cannot interpolate
+
+To convert from ISO 8859-1 back to code page 037 just reverse the order
+of the tr/// arguments like so:
+
+    my $ebcdic_string = $ascii_string;
+    eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/';
+
+XPG4 interoperability often implies the presence of an I<iconv> utility
+available from the shell or from the C library.  Consult your system's
+documentation for information on iconv.
+
+On OS/390 see the iconv(1) man page.  One way to invoke the iconv
+shell utility from within perl would be to:
+
+    $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1`;
+
+or the inverse map:
+
+    $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`;
+
+XXX iconv under qsh on OS/400?
+XXX iconv on VM?
+XXX iconv on BS2k?
+
+For other perl based conversion options see the Convert::* modules on CPAN.
+
+=head1 OPERATOR DIFFERENCES
+
+The C<..> range operator treats certain character ranges with
+care on EBCDIC machines.
For example the following array +will have twenty six elements on either an EBCDIC machine +or an ASCII machine: + + @alphabet = ('A'..'Z'); # $#alphabet == 25 + +The bitwise operators such as & ^ | may return different results +when operating on string or character data in a perl program running +on an EBCDIC machine than when run on an ASCII machine. Here is +an example adapted from the one in L<perlop>: + + # EBCDIC-based examples + print "j p \n" ^ " a h"; # prints "JAPH\n" + print "JA" | " ph\n"; # prints "japh\n" + print "JAPH\nJunk" & "\277\277\277\277\277"; # prints "japh\n"; + print 'p N$' ^ " E<H\n"; # prints "Perl\n"; + +An interesting property of the 32 C0 control characters +in the ASCII table is that they can "literally" be constructed +as control characters in perl, e.g. (chr(0) eq "\c@"), +(chr(1) eq "\cA"), and so on. Perl on EBCDIC machines has been +ported to take "\c@" -> chr(0) and "\cA" -> chr(1) as well, but the +thirty three characters that result depend on which code page you are +using. The table below uses the character names from the previous table +but with substitions such as s/START OF/S.O./; s/END OF /E.O./; +s/TRANSMISSION/TRANS./; s/TABULATION/TAB./; s/VERTICAL/VERT./; +s/HORIZONTAL/HORIZ./; s/DEVICE CONTROL/D.C./; s/SEPARATOR/SEP./; +s/NEGATIVE ACKNOWLEDGE/NEG. ACK./;. The POSIX-BC and 1047 sets are +identical throughout this range and differ from the 0037 set at only +one spot (21 decimal). Note that "\c\\" maps to two characters +not one. + + chr ord 8859-1 0037 1047 && POSIX-BC + ------------------------------------------------------------------------ + "\c?" 127 <DELETE> " " ***>< + "\c@" 0 <NULL> <NULL> <NULL> ***>< + "\cA" 1 <S.O. HEADING> <S.O. HEADING> <S.O. HEADING> + "\cB" 2 <S.O. TEXT> <S.O. TEXT> <S.O. TEXT> + "\cC" 3 <E.O. TEXT> <E.O. TEXT> <E.O. TEXT> + "\cD" 4 <E.O. TRANS.> <C1 28> <C1 28> + "\cE" 5 <ENQUIRY> <HORIZ. TAB.> <HORIZ. 
TAB.> + "\cF" 6 <ACKNOWLEDGE> <C1 6> <C1 6> + "\cG" 7 <BELL> <DELETE> <DELETE> + "\cH" 8 <BACKSPACE> <C1 23> <C1 23> + "\cI" 9 <HORIZ. TAB.> <C1 13> <C1 13> + "\cJ" 10 <LINE FEED> <C1 14> <C1 14> + "\cK" 11 <VERT. TAB.> <VERT. TAB.> <VERT. TAB.> + "\cL" 12 <FORM FEED> <FORM FEED> <FORM FEED> + "\cM" 13 <CARRIAGE RETURN> <CARRIAGE RETURN> <CARRIAGE RETURN> + "\cN" 14 <SHIFT OUT> <SHIFT OUT> <SHIFT OUT> + "\cO" 15 <SHIFT IN> <SHIFT IN> <SHIFT IN> + "\cP" 16 <DATA LINK ESCAPE> <DATA LINK ESCAPE> <DATA LINK ESCAPE> + "\cQ" 17 <D.C. ONE> <D.C. ONE> <D.C. ONE> + "\cR" 18 <D.C. TWO> <D.C. TWO> <D.C. TWO> + "\cS" 19 <D.C. THREE> <D.C. THREE> <D.C. THREE> + "\cT" 20 <D.C. FOUR> <C1 29> <C1 29> + "\cU" 21 <NEG. ACK.> <C1 5> <LINE FEED> *** + "\cV" 22 <SYNCHRONOUS IDLE> <BACKSPACE> <BACKSPACE> + "\cW" 23 <E.O. TRANS. BLOCK> <C1 7> <C1 7> + "\cX" 24 <CANCEL> <CANCEL> <CANCEL> + "\cY" 25 <E.O. MEDIUM> <E.O. MEDIUM> <E.O. MEDIUM> + "\cZ" 26 <SUBSTITUTE> <C1 18> <C1 18> + "\c[" 27 <ESCAPE> <C1 15> <C1 15> + "\c\\" 28 <FILE SEP.>\ <FILE SEP.>\ <FILE SEP.>\ + "\c]" 29 <GROUP SEP.> <GROUP SEP.> <GROUP SEP.> + "\c^" 30 <RECORD SEP.> <RECORD SEP.> <RECORD SEP.> ***>< + "\c_" 31 <UNIT SEP.> <UNIT SEP.> <UNIT SEP.> ***>< + + +=head1 FUNCTION DIFFERENCES + +=over 8 + +=item chr() + +chr() must be given an EBCDIC code number argument to yield a desired +character return value on an EBCDIC machine. For example: + + $CAPITAL_LETTER_A = chr(193); + +=item ord() + +ord() will return EBCDIC code number values on an EBCDIC machine. +For example: + + $the_number_193 = ord("A"); + +=item pack() + +The c and C templates for pack() are dependent upon character set +encoding. 
Examples of usage on EBCDIC include: + + $foo = pack("CCCC",193,194,195,196); + # $foo eq "ABCD" + $foo = pack("C4",193,194,195,196); + # same thing + + $foo = pack("ccxxcc",193,194,195,196); + # $foo eq "AB\0\0CD" + +=item print() + +One must be careful with scalars and strings that are passed to +print that contain ASCII encodings. One common place +for this to occur is in the output of the MIME type header for +CGI script writing. For example, many perl programming guides +recommend something similar to: + + print "Content-type:\ttext/html\015\012\015\012"; + # this may be wrong on EBCDIC + +Under the IBM OS/390 USS Web Server for example you should instead +write that as: + + print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et alia + +That is because the translation from EBCDIC to ASCII is done +by the web server in this case (such code will not be appropriate for +the Macintosh however). Consult your web server's documentation for +further details. + +=item printf() + +The formats that can convert characters to numbers and vice versa +will be different from their ASCII counterparts when executed +on an EBCDIC machine. Examples include: + + printf("%c%c%c",193,194,195); # prints ABC + +=item sort() + +EBCDIC sort results may differ from ASCII sort results especially for +mixed case strings. This is discussed in more detail below. + +=item sprintf() + +See the discussion of printf() above. An example of the use +of sprintf would be: + + $CAPITAL_LETTER_A = sprintf("%c",193); + +=item unpack() + +See the discussion of pack() above. + +=back + +=head1 REGULAR EXPRESSION DIFFERENCES + +As of perl 5.005_03 the letter range regular expression such as +[A-Z] and [a-z] have been especially coded to not pick up gap +characters. For example characters such as <o WITH CIRCUMFLEX> +that lie between I and J would not be matched by C</[H-K]/>. 
+If you do want to match such characters in a single octet +regular expression try matching the hex or octal code such +as C</\313/> on EBCDIC or C</\364/> on ASCII machines to +have your regular expression match <o WITH CIRCUMFLEX>. + +Another place to be wary of is the inappropriate use of hex or +octal constants in regular expressions. Consider the following +set of subs: + + sub is_c0 { + my $char = substr(shift,0,1); + $char =~ /[\000-\037]/; + } + + sub is_print_ascii { + my $char = substr(shift,0,1); + $char =~ /[\040-\176]/; + } + + sub is_delete { + my $char = substr(shift,0,1); + $char eq "\177"; + } + + sub is_c1 { + my $char = substr(shift,0,1); + $char =~ /[\200-\237]/; + } + + sub is_latin_1 { + my $char = substr(shift,0,1); + $char =~ /[\240-\377]/; + } + +The above would be adequate if the concern was only with numeric codepoints. +However, we may actually be concerned with characters rather than codepoints +and on an EBCDIC machine would like for constructs such as +C<if (is_print_ascii("A")) {print "A is a printable character\n";}> to print +out the expected message. 
One way to represent the above collection +of character classification subs that is capable of working across the +four coded character sets discussed in this document is as follows: + + sub Is_c0 { + my $char = substr(shift,0,1); + if (ord('^')==94) { # ascii + return $char =~ /[\000-\037]/; + } + if (ord('^')==176) { # 37 + return $char =~ /[\000-\003\067\055-\057\026\005\045\013-\023\074\075\062\046\030\031\077\047\034-\037]/; + } + if (ord('^')==95 || ord('^')==106) { # 1047 || posix-bc + return $char =~ /[\000-\003\067\055-\057\026\005\025\013-\023\074\075\062\046\030\031\077\047\034-\037]/; + } + } + + sub Is_print_ascii { + my $char = substr(shift,0,1); + $char =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/; + } + + sub Is_delete { + my $char = substr(shift,0,1); + if (ord('^')==94) { # ascii + return $char eq "\177"; + } + else { # ebcdic + return $char eq "\007"; + } + } + + sub Is_c1 { + my $char = substr(shift,0,1); + if (ord('^')==94) { # ascii + return $char =~ /[\200-\237]/; + } + if (ord('^')==176) { # 37 + return $char =~ /[\040-\044\025\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\040\024\076\377]/; + } + if (ord('^')==95) { # 1047 + return $char =~ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\040\024\076\377]/; + } + if (ord('^')==106) { # posix-bc + return $char =~ + /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\040\024\076\137]/; + } + } + + sub Is_latin_1 { + my $char = substr(shift,0,1); + if (ord('^')==94) { # ascii + return $char =~ /[\240-\377]/; + } + if (ord('^')==176) { # 37 + return $char =~ + 
/[\101\252\112\261\237\262\152\265\275\264\232\212\137\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/; + } + if (ord('^')==95) { # 1047 + return $char =~ + /[\101\252\112\261\237\262\152\265\273\264\232\212\260\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\272\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/; + } + if (ord('^')==106) { # posix-bc + return $char =~ + /[\101\252\260\261\237\262\320\265\171\264\232\212\272\312\257\241\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\340\376\335\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\300\336\333\334\215\216\337]/; + } + } + +Note however that only the C<Is_ascii_print()> sub is really independent +of coded character set. Another way to write C<Is_latin_1()> would be +to use the characters in the range explicitly: + + sub Is_latin_1 { + my $char = substr(shift,0,1); + $char =~ /[ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/; + } + +Although that form may run into trouble in network transit (due to the +presence of 8 bit characters) or on non ISO-Latin character sets. + + +=head1 SOCKETS + +Most socket programming assumes ASCII character encodings in network +byte order. Exceptions can include CGI script writing under a +host web server where the server may take care of translation for you. 
+Most host web servers convert EBCDIC data to ISO-8859-1 or Unicode on
+output.
+
+=head1 SORTING
+
+One big difference between ASCII based character sets and EBCDIC ones
+is the relative positions of upper and lower case letters and the
+letters compared to the digits.  If sorted on an ASCII based machine the
+two letter abbreviation for a physician comes before the two letter
+abbreviation for drive, that is:
+
+    @sorted = sort(qw(Dr. dr.));  # @sorted holds qw(Dr. dr.) on ASCII,
+                                  # qw(dr. Dr.) on EBCDIC
+
+The property of lower case before uppercase letters in EBCDIC is
+even carried to the Latin 1 EBCDIC pages such as 0037 and 1047.
+An example would be that <E WITH DIAERESIS> (203) comes before
+<e WITH DIAERESIS> (235) on an ASCII machine, but the latter (83)
+comes before the former (115) on an EBCDIC machine.  (Astute readers will
+note that the upper case version of <SMALL LETTER SHARP S> is
+simply "SS" and that the upper case version of <y WITH DIAERESIS>
+is not in the 0..255 range but it is at U+0178 in Unicode).
+
+The sort order will cause differences between results obtained on
+ASCII machines versus EBCDIC machines.  What follows are some suggestions
+on how to deal with these differences.
+
+=head2 Ignore ASCII vs EBCDIC sort differences.
+
+This is the least computationally expensive strategy.  It may require
+some user education.
+
+=head2 MONOCASE then sort data.
+
+In order to minimize the expense of monocasing mixed text try to
+C<tr///> towards the character set case most employed within the data.
+If the data are primarily UPPERCASE non Latin 1 then apply tr/a-z/A-Z/
+then sort().  If the data are primarily lowercase non Latin 1 then
+apply tr/A-Z/a-z/ before sorting.  If the data are primarily UPPERCASE
+and include Latin-1 characters then apply: tr/a-z/A-Z/;
+XXX
+
+This strategy does not preserve the case of the data and may not be
+acceptable.
+
+=head2 Convert, sort data, then reconvert.
+ +This is the most expensive proposition that does not employ a network +connection. + +=head2 Perform sorting on one type of machine only. + +This strategy can employ a network connection. As such +it would be computationally expensive. + +=head1 URL ENCODING and DECODING + +Note that some URLs have hexadecimal ASCII codepoints in them in an +attempt to overcome character limitation issues. For example the +tilde character is not on every keyboard hence a URL of the form: + + http://www.pvhp.com/~pvhp/ + +may also be expressed as either of: + + http://www.pvhp.com/%7Epvhp/ + + http://www.pvhp.com/%7epvhp/ + +where 7E is the hexadecimal ASCII codepoint for '~'. Here is an example +of decoding such a URL under CCSID 1047: + + $url = 'http://www.pvhp.com/%7Epvhp/'; + # this array assumes code page 1047 + my @a2e_1047 = ( + 0, 1, 2, 3, 55, 45, 46, 47, 22, 5, 21, 11, 12, 13, 14, 15, + 16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31, + 64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97, + 240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111, + 124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214, + 215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109, + 121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150, + 151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161, 7, + 32, 33, 34, 35, 36, 37, 6, 23, 40, 41, 42, 43, 44, 9, 10, 27, + 48, 49, 26, 51, 52, 53, 54, 8, 56, 57, 58, 59, 4, 20, 62,255, + 65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188, + 144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171, + 100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119, + 172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89, + 68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87, + 140, 73,205,206,203,207,204,225,112,221,222,219,220,141,142,223 + ); + $url =~ s/%([0-9a-fA-F]{2})/pack("c",$a2e_1047[hex($1)])/ge; + +=head1 I18N AND L10N + 
+Internationalization(I18N) and localization(L10N) are supported at least +in principle even on EBCDIC machines. The details are system dependent +and discussed under the L<perlebcdic/OS ISSUES> section below. + +=head1 MULTI OCTET CHARACTER SETS + +Double byte EBCDIC code pages (?) XXX. + +UTF-8, UTF-EBCDIC, (?) XXX. + +=head1 OS ISSUES + +There may be a few system dependent issues +of concern to EBCDIC Perl programmers. + +=head2 OS/400 + +=over 8 + +=item IFS access + +XXX. + +=back + +=head2 OS/390 + +=over 8 + +=item dataset access + +For sequential data set access try: + + my @ds_records = `cat //DSNAME`; + +or: + + my @ds_records = `cat //'HLQ.DSNAME'`; + +See also the OS390::Stdio module on CPAN. + +=item locales + +On OS/390 see L<locale> for information on locales. The L10N files +are in F</usr/nls/locale>. $Config{d_setlocale} is 'define' on OS/390. + +=back + +=head2 VM/ESA? + +XXX. + +=head2 POSIX-BC? + +XXX. + +=head1 REFERENCES + +http://anubis.dkuug.dk/i18n/charmaps + +L<perllocale>. + +http://www.unicode.org/ + +http://www.unicode.org/unicode/reports/tr16/ + +B<The Unicode Standard Version 2.0> The Unicode Consortium, +ISBN 0-201-48345-9, Addison Wesley Developers Press, July 1996. + +B<CDRA: IBM - Character Data Representation Architecture - +Reference and Registry>, IBM SC09-2190-00, December 1996. + +"Demystifying Character Sets", Andrea Vine, Multilingual Computing +& Technology, B<#26 Vol. 10 Issue 4>, August/September 1999; +ISSN 1523-0309; Multilingual Computing Inc. Sandpoint ID, USA. + +=head1 AUTHOR + +Peter Prymmer E<lt>pvhp@best.comE<gt> wrote this in 1999 and 2000 +with CCSID 0819 and 0037 help from Chris Leach and +Andre' Pirard E<lt>A.Pirard@ulg.ac.beE<gt> as well as POSIX-BC +help from Thomas Dorner E<lt>Thomas.Dorner@start.deE<gt>. +Thanks also to Philip Newton and Vickie Cooper. Trademarks, registered +trademarks, service marks and registered service marks used in this +document are the property of their respective owners. + + |
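The code page test from the IDENTIFYING CHARACTER CODE SETS section of the patch could be wrapped in a small helper. What follows is a minimal sketch and not part of the commit: the names `%charset_for_bracket` and `charset_name` are hypothetical, and the four `ord('[')` values are simply the ones listed in the single octet table above.

```perl
use strict;
use warnings;

# Hypothetical lookup, not part of the patch: maps the ord('[') value
# to a coded character set, using the values from the single octet table.
my %charset_for_bracket = (
    91  => 'ISO 8859-1 (ASCII)',
    186 => 'CCSID 0037',
    173 => 'CCSID 1047',
    187 => 'POSIX-BC',
);

# Return the coded character set name for a given ord('[') value.
# With no argument, test the platform the script is running on.
sub charset_name {
    my $code = @_ ? $_[0] : ord('[');
    return exists $charset_for_bracket{$code}
        ? $charset_for_bracket{$code}
        : 'unknown';
}

print "Running under: ", charset_name(), "\n";
```

Passing the code explicitly, as in `charset_name(173)`, lets the lookup be exercised on a machine that does not use the code page in question. Any value outside the four listed ones is reported as unknown rather than guessed.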