author Jarkko Hietaniemi <jhi@iki.fi> 2000-08-17 14:44:02 +0000
committer Jarkko Hietaniemi <jhi@iki.fi> 2000-08-17 14:44:02 +0000
commit d396a55899b7bce58ef6008d9af7a500b5175b4a
tree 92bb4fc9fea98748bcd8bc310e3b9dd4fd5f54a0 /pod/perlebcdic.pod
parent 10c102662dfb8c226a9c3524f047501223fa8409
Add perlebcdic from Peter Prymmer, regen toc.
p4raw-id: //depot/perl@6676
Diffstat (limited to 'pod/perlebcdic.pod')
-rw-r--r-- pod/perlebcdic.pod 1001
1 files changed, 1001 insertions, 0 deletions
diff --git a/pod/perlebcdic.pod b/pod/perlebcdic.pod
new file mode 100644
index 0000000000..f27a8dea2e
--- /dev/null
+++ b/pod/perlebcdic.pod
@@ -0,0 +1,1001 @@
+=head1 NAME
+
+perlebcdic - Considerations for running Perl on EBCDIC platforms
+
+=head1 DESCRIPTION
+
+An exploration of some of the issues facing Perl programmers
+on EBCDIC based computers. We do not cover localization,
+internationalization, or multi byte character set issues (yet).
+
+Portions that are still incomplete are marked with XXX.
+
+=head1 COMMON CHARACTER CODE SETS
+
+=head2 ASCII
+
+The American Standard Code for Information Interchange is a set of
+integers running from 0 to 127 (decimal) that imply character
+interpretation by the display and other system(s) of computers.
+The range 0..127 can be covered by setting the bits in a 7-bit binary
+number, hence the set is sometimes referred to as "7-bit ASCII".
+ASCII was described by the American National Standards Institute
+document ANSI X3.4-1986. It was also described by ISO 646:1991
+(with localization for currency symbols). The full ASCII set is
+given in the table below as the first 128 elements. Languages that
+can be written adequately with the characters in ASCII include
+English, Hawaiian, Indonesian, Swahili and some Native American
+languages.
+
+=head2 ISO 8859
+
+The ISO 8859-$n are a collection of character code sets from the
+International Organization for Standardization (ISO) each of which
+adds characters to the ASCII set that are typically found in European
+languages, many of which are based on the Roman, or Latin, alphabet.
+
+=head2 Latin 1 (ISO 8859-1)
+
+A particular 8-bit extension to ASCII that includes grave and acute
+accented Latin characters. Languages that can employ ISO 8859-1
+include all the languages covered by ASCII as well as Afrikaans,
+Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian,
+Portuguese, Spanish, and Swedish. Dutch is covered albeit without
+the ij ligature. French is covered too but without the oe ligature.
+German can use ISO 8859-1 but must do so without German-style
+quotation marks. This set is based on Western European extensions
+to ASCII and is commonly encountered in world wide web work.
+In IBM character code set identification terminology ISO 8859-1 is
+known as CCSID 819 (or sometimes 0819 or even 00819).
+
+=head2 EBCDIC
+
+Extended Binary Coded Decimal Interchange Code. The EBCDIC acronym
+refers to a large collection of slightly different single and
+multi byte coded character sets that are different from ASCII or
+ISO 8859-1 and are typically used on host computers. The
+EBCDIC encodings derive from Hollerith punched card encodings.
+The layout on the cards was such that high bits were set for the
+upper and lower case alphabet characters [a-z] and [A-Z], but there
+were gaps within each Latin alphabet range.
+
+=head2 13 variant characters
+
+XXX.
+
+EBCDIC character sets may be known by character code set identification
+numbers (CCSID numbers) or code page numbers.
+
+=head2 0037
+
+Character code set ID 0037 is a mapping of the ASCII plus Latin-1
+characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used
+on the OS/400 operating system that runs on AS/400 computers.
+CCSID 0037 differs from ISO 8859-1 in 236 places; in other words
+they agree on only 20 code point values.
+
+=head2 1047
+
+Character code set ID 1047 is also a mapping of the ASCII plus
+Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 1047 is
+used under Unix System Services for OS/390, and OpenEdition for VM/ESA.
+CCSID 1047 differs from CCSID 0037 in eight places.
+
+=head2 POSIX-BC
+
+The EBCDIC code page in use on Siemens' BS2000 system is distinct from
+1047 and 0037. It is identified below as the POSIX-BC set.
+
+=head1 SINGLE OCTET TABLES
+
+The following tables list the ASCII and Latin 1 ordered sets including
+the subsets: C0 controls (0x00..0x1f), ASCII graphics (0x20..0x7e),
+delete (0x7f), C1 controls (0x80..0x9f), and Latin-1 (a.k.a. ISO 8859-1)
+(0xa0..0xff). In the
+table non-printing control character names as well as the Latin 1
+extensions to ASCII have been labelled with character names roughly
+corresponding to I<The Unicode Standard, Version 2.0> albeit with
+substitutions such as s/LATIN// and s/VULGAR// in all cases,
+s/CAPITAL LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/
+in some other cases. The "names" of the C1 control set
+(128..159 in ISO 8859-1) are somewhat arbitrary. The differences
+between the 0037 and 1047 sets are flagged with ***. The differences
+between the 1047 and POSIX-BC sets are flagged with ###.
+All ord() numbers listed are decimal. If you would rather see this
+table listing octal values then run the table (that is, the pod
+version of this document since this recipe may not work with
+a pod2XXX translation to another format) through:
+
+=over 4
+
+=item recipe 0
+
+=back
+
+ perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
+ -e '{printf("%s%-9o%-9o%-9o%-9o\n",$1,$2,$3,$4,$5)}' perlebcdic.pod
+
+If you would rather see this table listing hexadecimal values then
+run the table through:
+
+=over 4
+
+=item recipe 1
+
+=back
+
+ perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
+ -e '{printf("%s%-9X%-9X%-9X%-9X\n",$1,$2,$3,$4,$5)}' perlebcdic.pod
+
+
+ 8859-1
+ chr 0819 0037 1047 POSIX-BC
+ ----------------------------------------------------------------
+ <NULL> 0 0 0 0
+ <START OF HEADING> 1 1 1 1
+ <START OF TEXT> 2 2 2 2
+ <END OF TEXT> 3 3 3 3
+ <END OF TRANSMISSION> 4 55 55 55
+ <ENQUIRY> 5 45 45 45
+ <ACKNOWLEDGE> 6 46 46 46
+ <BELL> 7 47 47 47
+ <BACKSPACE> 8 22 22 22
+ <HORIZONTAL TABULATION> 9 5 5 5
+ <LINE FEED> 10 37 21 21 ***
+ <VERTICAL TABULATION> 11 11 11 11
+ <FORM FEED> 12 12 12 12
+ <CARRIAGE RETURN> 13 13 13 13
+ <SHIFT OUT> 14 14 14 14
+ <SHIFT IN> 15 15 15 15
+ <DATA LINK ESCAPE> 16 16 16 16
+ <DEVICE CONTROL ONE> 17 17 17 17
+ <DEVICE CONTROL TWO> 18 18 18 18
+ <DEVICE CONTROL THREE> 19 19 19 19
+ <DEVICE CONTROL FOUR> 20 60 60 60
+ <NEGATIVE ACKNOWLEDGE> 21 61 61 61
+ <SYNCHRONOUS IDLE> 22 50 50 50
+ <END OF TRANSMISSION BLOCK> 23 38 38 38
+ <CANCEL> 24 24 24 24
+ <END OF MEDIUM> 25 25 25 25
+ <SUBSTITUTE> 26 63 63 63
+ <ESCAPE> 27 39 39 39
+ <FILE SEPARATOR> 28 28 28 28
+ <GROUP SEPARATOR> 29 29 29 29
+ <RECORD SEPARATOR> 30 30 30 30
+ <UNIT SEPARATOR> 31 31 31 31
+ <SPACE> 32 64 64 64
+ ! 33 90 90 90
+ " 34 127 127 127
+ # 35 123 123 123
+ $ 36 91 91 91
+ % 37 108 108 108
+ & 38 80 80 80
+ ' 39 125 125 125
+ ( 40 77 77 77
+ ) 41 93 93 93
+ * 42 92 92 92
+ + 43 78 78 78
+ , 44 107 107 107
+ - 45 96 96 96
+ . 46 75 75 75
+ / 47 97 97 97
+ 0 48 240 240 240
+ 1 49 241 241 241
+ 2 50 242 242 242
+ 3 51 243 243 243
+ 4 52 244 244 244
+ 5 53 245 245 245
+ 6 54 246 246 246
+ 7 55 247 247 247
+ 8 56 248 248 248
+ 9 57 249 249 249
+ : 58 122 122 122
+ ; 59 94 94 94
+ < 60 76 76 76
+ = 61 126 126 126
+ > 62 110 110 110
+ ? 63 111 111 111
+ @ 64 124 124 124
+ A 65 193 193 193
+ B 66 194 194 194
+ C 67 195 195 195
+ D 68 196 196 196
+ E 69 197 197 197
+ F 70 198 198 198
+ G 71 199 199 199
+ H 72 200 200 200
+ I 73 201 201 201
+ J 74 209 209 209
+ K 75 210 210 210
+ L 76 211 211 211
+ M 77 212 212 212
+ N 78 213 213 213
+ O 79 214 214 214
+ P 80 215 215 215
+ Q 81 216 216 216
+ R 82 217 217 217
+ S 83 226 226 226
+ T 84 227 227 227
+ U 85 228 228 228
+ V 86 229 229 229
+ W 87 230 230 230
+ X 88 231 231 231
+ Y 89 232 232 232
+ Z 90 233 233 233
+ [ 91 186 173 187 *** ###
+ \ 92 224 224 188 ###
+ ] 93 187 189 189 ***
+ ^ 94 176 95 106 *** ###
+ _ 95 109 109 109
+ ` 96 121 121 74 ###
+ a 97 129 129 129
+ b 98 130 130 130
+ c 99 131 131 131
+ d 100 132 132 132
+ e 101 133 133 133
+ f 102 134 134 134
+ g 103 135 135 135
+ h 104 136 136 136
+ i 105 137 137 137
+ j 106 145 145 145
+ k 107 146 146 146
+ l 108 147 147 147
+ m 109 148 148 148
+ n 110 149 149 149
+ o 111 150 150 150
+ p 112 151 151 151
+ q 113 152 152 152
+ r 114 153 153 153
+ s 115 162 162 162
+ t 116 163 163 163
+ u 117 164 164 164
+ v 118 165 165 165
+ w 119 166 166 166
+ x 120 167 167 167
+ y 121 168 168 168
+ z 122 169 169 169
+ { 123 192 192 251 ###
+ | 124 79 79 79
+ } 125 208 208 253 ###
+ ~ 126 161 161 255 ###
+ <DELETE> 127 7 7 7
+ <C1 0> 128 32 32 32
+ <C1 1> 129 33 33 33
+ <C1 2> 130 34 34 34
+ <C1 3> 131 35 35 35
+ <C1 4> 132 36 36 36
+ <C1 5> 133 21 37 37 ***
+ <C1 6> 134 6 6 6
+ <C1 7> 135 23 23 23
+ <C1 8> 136 40 40 40
+ <C1 9> 137 41 41 41
+ <C1 10> 138 42 42 42
+ <C1 11> 139 43 43 43
+ <C1 12> 140 44 44 44
+ <C1 13> 141 9 9 9
+ <C1 14> 142 10 10 10
+ <C1 15> 143 27 27 27
+ <C1 16> 144 48 48 48
+ <C1 17> 145 49 49 49
+ <C1 18> 146 26 26 26
+ <C1 19> 147 51 51 51
+ <C1 20> 148 52 52 52
+ <C1 21> 149 53 53 53
+ <C1 22> 150 54 54 54
+ <C1 23> 151 8 8 8
+ <C1 24> 152 56 56 56
+ <C1 25> 153 57 57 57
+ <C1 26> 154 58 58 58
+ <C1 27> 155 59 59 59
+ <C1 28> 156 4 4 4
+ <C1 29> 157 20 20 20
+ <C1 30> 158 62 62 62
+ <C1 31> 159 255 255 95 ###
+ <NON-BREAKING SPACE> 160 65 65 65
+ <INVERTED EXCLAMATION MARK> 161 170 170 170
+ <CENT SIGN> 162 74 74 176 ###
+ <POUND SIGN> 163 177 177 177
+ <CURRENCY SIGN> 164 159 159 159
+ <YEN SIGN> 165 178 178 178
+ <BROKEN BAR> 166 106 106 208 ###
+ <SECTION SIGN> 167 181 181 181
+ <DIAERESIS> 168 189 187 121 *** ###
+ <COPYRIGHT SIGN> 169 180 180 180
+ <FEMININE ORDINAL INDICATOR> 170 154 154 154
+ <LEFT POINTING GUILLEMET> 171 138 138 138
+ <NOT SIGN> 172 95 176 186 *** ###
+ <SOFT HYPHEN> 173 202 202 202
+ <REGISTERED TRADE MARK SIGN> 174 175 175 175
+ <MACRON> 175 188 188 161 ###
+ <DEGREE SIGN> 176 144 144 144
+ <PLUS-OR-MINUS SIGN> 177 143 143 143
+ <SUPERSCRIPT TWO> 178 234 234 234
+ <SUPERSCRIPT THREE> 179 250 250 250
+ <ACUTE ACCENT> 180 190 190 190
+ <MICRO SIGN> 181 160 160 160
+ <PARAGRAPH SIGN> 182 182 182 182
+ <MIDDLE DOT> 183 179 179 179
+ <CEDILLA> 184 157 157 157
+ <SUPERSCRIPT ONE> 185 218 218 218
+ <MASC. ORDINAL INDICATOR> 186 155 155 155
+ <RIGHT POINTING GUILLEMET> 187 139 139 139
+ <FRACTION ONE QUARTER> 188 183 183 183
+ <FRACTION ONE HALF> 189 184 184 184
+ <FRACTION THREE QUARTERS> 190 185 185 185
+ <INVERTED QUESTION MARK> 191 171 171 171
+ <A WITH GRAVE> 192 100 100 100
+ <A WITH ACUTE> 193 101 101 101
+ <A WITH CIRCUMFLEX> 194 98 98 98
+ <A WITH TILDE> 195 102 102 102
+ <A WITH DIAERESIS> 196 99 99 99
+ <A WITH RING ABOVE> 197 103 103 103
+ <CAPITAL LIGATURE AE> 198 158 158 158
+ <C WITH CEDILLA> 199 104 104 104
+ <E WITH GRAVE> 200 116 116 116
+ <E WITH ACUTE> 201 113 113 113
+ <E WITH CIRCUMFLEX> 202 114 114 114
+ <E WITH DIAERESIS> 203 115 115 115
+ <I WITH GRAVE> 204 120 120 120
+ <I WITH ACUTE> 205 117 117 117
+ <I WITH CIRCUMFLEX> 206 118 118 118
+ <I WITH DIAERESIS> 207 119 119 119
+ <CAPITAL LETTER ETH> 208 172 172 172
+ <N WITH TILDE> 209 105 105 105
+ <O WITH GRAVE> 210 237 237 237
+ <O WITH ACUTE> 211 238 238 238
+ <O WITH CIRCUMFLEX> 212 235 235 235
+ <O WITH TILDE> 213 239 239 239
+ <O WITH DIAERESIS> 214 236 236 236
+ <MULTIPLICATION SIGN> 215 191 191 191
+ <O WITH STROKE> 216 128 128 128
+ <U WITH GRAVE> 217 253 253 224 ###
+ <U WITH ACUTE> 218 254 254 254
+ <U WITH CIRCUMFLEX> 219 251 251 221 ###
+ <U WITH DIAERESIS> 220 252 252 252
+ <Y WITH ACUTE> 221 173 186 173 *** ###
+ <CAPITAL LETTER THORN> 222 174 174 174
+ <SMALL LETTER SHARP S> 223 89 89 89
+ <a WITH GRAVE> 224 68 68 68
+ <a WITH ACUTE> 225 69 69 69
+ <a WITH CIRCUMFLEX> 226 66 66 66
+ <a WITH TILDE> 227 70 70 70
+ <a WITH DIAERESIS> 228 67 67 67
+ <a WITH RING ABOVE> 229 71 71 71
+ <SMALL LIGATURE ae> 230 156 156 156
+ <c WITH CEDILLA> 231 72 72 72
+ <e WITH GRAVE> 232 84 84 84
+ <e WITH ACUTE> 233 81 81 81
+ <e WITH CIRCUMFLEX> 234 82 82 82
+ <e WITH DIAERESIS> 235 83 83 83
+ <i WITH GRAVE> 236 88 88 88
+ <i WITH ACUTE> 237 85 85 85
+ <i WITH CIRCUMFLEX> 238 86 86 86
+ <i WITH DIAERESIS> 239 87 87 87
+ <SMALL LETTER eth> 240 140 140 140
+ <n WITH TILDE> 241 73 73 73
+ <o WITH GRAVE> 242 205 205 205
+ <o WITH ACUTE> 243 206 206 206
+ <o WITH CIRCUMFLEX> 244 203 203 203
+ <o WITH TILDE> 245 207 207 207
+ <o WITH DIAERESIS> 246 204 204 204
+ <DIVISION SIGN> 247 225 225 225
+ <o WITH STROKE> 248 112 112 112
+ <u WITH GRAVE> 249 221 221 192 ###
+ <u WITH ACUTE> 250 222 222 222
+ <u WITH CIRCUMFLEX> 251 219 219 219
+ <u WITH DIAERESIS> 252 220 220 220
+ <y WITH ACUTE> 253 141 141 141
+ <SMALL LETTER thorn> 254 142 142 142
+ <y WITH DIAERESIS> 255 223 223 223
+
+If you would rather see the above table in CCSID 0037 order rather than
+ASCII + Latin-1 order then run the table through:
+
+=over 4
+
+=item recipe 2
+
+=back
+
+ perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
+ -e '{push(@l,$_)}' \
+ -e 'END{print map{$_->[0]}' \
+ -e ' sort{$a->[1] <=> $b->[1]}' \
+ -e ' map{[$_,substr($_,42,3)]}@l;}' perlebcdic.pod
+
+If you would rather see it in CCSID 1047 order then change the number
+42 in the last line to 51, like this:
+
+=over 4
+
+=item recipe 3
+
+=back
+
+ perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
+ -e '{push(@l,$_)}' \
+ -e 'END{print map{$_->[0]}' \
+ -e ' sort{$a->[1] <=> $b->[1]}' \
+ -e ' map{[$_,substr($_,51,3)]}@l;}' perlebcdic.pod
+
+If you would rather see it in POSIX-BC order then change the number
+51 in the last line to 60, like this:
+
+=over 4
+
+=item recipe 4
+
+=back
+
+ perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
+ -e '{push(@l,$_)}' \
+ -e 'END{print map{$_->[0]}' \
+ -e ' sort{$a->[1] <=> $b->[1]}' \
+ -e ' map{[$_,substr($_,60,3)]}@l;}' perlebcdic.pod
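+
+The table can also be used to check how many code points differ between
+two of its columns. The following is only a sketch: the sub name
+count_differences() and the three sample rows are illustrative, and it
+assumes each row ends with the four decimal columns (optionally followed
+by the *** and ### flags) separated by whitespace.
+

```perl
use strict;
use warnings;

# Three sample rows copied from the table above; a real run would read
# every table row from perlebcdic.pod instead.
my @rows = (
    '<LINE FEED> 10 37 21 21 ***',
    '[ 91 186 173 187 *** ###',
    'A 65 193 193 193',
);

# Count the rows whose codes differ between two columns.
# Column indexes: 0 = 8859-1, 1 = 0037, 2 = 1047, 3 = POSIX-BC.
sub count_differences {
    my ($col_a, $col_b, @lines) = @_;
    my $count = 0;
    for my $line (@lines) {
        # Take the last four integers on the line, ignoring trailing flags.
        next unless $line =~ /(\d+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+[*#]+)*\s*$/;
        my @codes = ($1, $2, $3, $4);
        $count++ if $codes[$col_a] != $codes[$col_b];
    }
    return $count;
}

print count_differences(1, 2, @rows), "\n";    # 0037 versus 1047
```

+Fed the complete table, the same sub should report the number of rows
+flagged with *** when asked to compare columns 1 and 2.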
+
+
+=head1 IDENTIFYING CHARACTER CODE SETS
+
+To determine the character set you are running under from perl one
+could use the return value of ord() or chr() to test one or more
+character values. For example:
+
+ $is_ascii = "A" eq chr(65);
+ $is_ebcdic = "A" eq chr(193);
+
+Also, "\t" is a <HORIZONTAL TABULATION> character so that:
+
+ $is_ascii = ord("\t") == 9;
+ $is_ebcdic = ord("\t") == 5;
+
+To distinguish EBCDIC code pages try looking at one or more of
+the characters that differ between them. For example:
+
+ $is_ebcdic_37 = "\n" eq chr(37);
+ $is_ebcdic_1047 = "\n" eq chr(21);
+
+Or better still choose a character that is uniquely encoded in each
+of the code sets, e.g.:
+
+ $is_ascii = ord('[') == 91;
+ $is_ebcdic_37 = ord('[') == 186;
+ $is_ebcdic_1047 = ord('[') == 173;
+ $is_ebcdic_POSIX_BC = ord('[') == 187;
+
+However, it would be unwise to write tests such as:
+
+ $is_ascii = "\r" ne chr(13); # WRONG
+ $is_ascii = "\n" ne chr(10); # ILL ADVISED
+
+Obviously the first of these will fail to distinguish most ASCII machines
+from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC machine since "\r" eq
+chr(13) under all of those coded character sets. But note too that
+because "\n" is chr(13) and "\r" is chr(10) on the Macintosh (which is an
+ASCII machine) the second C<$is_ascii> test will lead to trouble there.
+
+To determine whether or not perl was built under an EBCDIC
+code page you can use the Config module like so:
+
+ use Config;
+ $is_ebcdic = $Config{ebcdic} eq 'define';
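+
+The ord('[') tests shown earlier can be gathered into one lookup. This
+is a minimal sketch: the sub name guess_code_set() and the labels it
+returns are made up for illustration, and only the four code sets
+covered by this document are recognized. Note that the 'ASCII' label
+also covers ISO 8859-1 machines, since they agree on '['.
+

```perl
use strict;
use warnings;

# ord('[') for each code set, taken from the table in this document.
my %code_set_for = (
     91 => 'ASCII',
    186 => 'CCSID 0037',
    173 => 'CCSID 1047',
    187 => 'POSIX-BC',
);

# With no argument, inspect the running perl; an explicit code number
# may be passed so that the lookup can be exercised on any platform.
sub guess_code_set {
    my $code = @_ ? shift : ord('[');
    return exists $code_set_for{$code} ? $code_set_for{$code} : 'unknown';
}

print "This perl appears to use: ", guess_code_set(), "\n";
```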
+
+=head1 CONVERSIONS
+
+In order to convert a string of characters from one character set to
+another a simple list of numbers, such as in the right columns in the
+above table, along with perl's tr/// operator is all that is needed.
+The data in the table are in ASCII order hence the EBCDIC columns
+provide easy to use ASCII to EBCDIC operations that are also easily
+reversed.
+
+For example, the string below lists the ISO 8859-1 code points (in octal)
+in CCSID 0037 order; it is the first numeric column from the output of
+recipe 0 after the table has been reordered as in recipe 2. Since tr///
+does not interpolate variables the operator is built with eval. To
+convert code page 037 data to ISO 8859-1 use it like so:
+
+ $cp_037 =
+ '\000\001\002\003\234\011\206\177\227\215\216\013\014\015\016\017' .
+ '\020\021\022\023\235\205\010\207\030\031\222\217\034\035\036\037' .
+ '\200\201\202\203\204\012\027\033\210\211\212\213\214\005\006\007' .
+ '\220\221\026\223\224\225\226\004\230\231\232\233\024\025\236\032' .
+ '\040\240\342\344\340\341\343\345\347\361\242\056\074\050\053\174' .
+ '\046\351\352\353\350\355\356\357\354\337\041\044\052\051\073\254' .
+ '\055\057\302\304\300\301\303\305\307\321\246\054\045\137\076\077' .
+ '\370\311\312\313\310\315\316\317\314\140\072\043\100\047\075\042' .
+ '\330\141\142\143\144\145\146\147\150\151\253\273\360\375\376\261' .
+ '\260\152\153\154\155\156\157\160\161\162\252\272\346\270\306\244' .
+ '\265\176\163\164\165\166\167\170\171\172\241\277\320\335\336\256' .
+ '\136\243\245\267\251\247\266\274\275\276\133\135\257\250\264\327' .
+ '\173\101\102\103\104\105\106\107\110\111\255\364\366\362\363\365' .
+ '\175\112\113\114\115\116\117\120\121\122\271\373\374\371\372\377' .
+ '\134\367\123\124\125\126\127\130\131\132\262\324\326\322\323\325' .
+ '\060\061\062\063\064\065\066\067\070\071\263\333\334\331\332\237' ;
+
+ my $ascii_string = $ebcdic_string;
+ eval '$ascii_string =~ tr/\000-\377/' . $cp_037 . '/';
+
+To convert from ISO 8859-1 to code page 037 just reverse the order of
+the tr/// arguments like so:
+
+ my $ebcdic_string = $ascii_string;
+ eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/';
+
+XPG4 interoperability often implies the presence of an I<iconv> utility
+available from the shell or from the C library. Consult your system's
+documentation for information on iconv.
+
+On OS/390 see the iconv(1) man page. One way to invoke the iconv
+shell utility from within perl would be:
+
+ $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1`;
+
+or the inverse map:
+
+ $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`;
+
+XXX iconv under qsh on OS/400?
+XXX iconv on VM?
+XXX iconv on BS2k?
+
+For other perl based conversion options see the Convert::* modules on CPAN.
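+
+Because tr/// builds its translation table at compile time and does not
+interpolate variables, a list of characters held in a scalar has to
+reach it through eval. One possible way to pay that cost only once is
+to compile a translator sub. This is a sketch: make_translator() is an
+illustrative name, and it assumes the two lists contain no "/"
+characters and come from a trusted source, since they are evaluated as
+perl code.
+

```perl
use strict;
use warnings;

# Compile a tr/// for the given search and replacement lists and return
# it wrapped in a sub that translates a copy of its argument.
sub make_translator {
    my ($from, $to) = @_;
    my $code = eval "sub { my \$s = shift; \$s =~ tr/$from/$to/; return \$s }";
    die "could not compile tr///: $@" unless $code;
    return $code;
}

# A toy mapping; a code page conversion would instead pass '\000-\377'
# and a 256 entry list of octal escapes.
my $upcase_abc = make_translator('a-c', 'A-C');
print $upcase_abc->("abcabc"), "\n";
```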
+
+=head1 OPERATOR DIFFERENCES
+
+The C<..> range operator treats certain character ranges with
+care on EBCDIC machines. For example the following array
+will have twenty six elements on either an EBCDIC machine
+or an ASCII machine:
+
+ @alphabet = ('A'..'Z'); # $#alphabet == 25
+
+The bitwise operators such as & ^ | may return different results
+when operating on string or character data in a perl program running
+on an EBCDIC machine than when run on an ASCII machine. Here is
+an example adapted from the one in L<perlop>:
+
+ # EBCDIC-based examples
+ print "j p \n" ^ " a h"; # prints "JAPH\n"
+ print "JA" | " ph\n"; # prints "japh\n"
+ print "JAPH\nJunk" & "\277\277\277\277\277"; # prints "japh\n";
+ print 'p N$' ^ " E<H\n"; # prints "Perl\n";
+
+An interesting property of the 32 C0 control characters
+in the ASCII table is that they can "literally" be constructed
+as control characters in perl, e.g. (chr(0) eq "\c@"),
+(chr(1) eq "\cA"), and so on. Perl on EBCDIC machines has been
+ported to take "\c@" -> chr(0) and "\cA" -> chr(1) as well, but the
+thirty three characters that result depend on which code page you are
+using. The table below uses the character names from the previous table
+but with substitutions such as s/START OF/S.O./; s/END OF /E.O./;
+s/TRANSMISSION/TRANS./; s/TABULATION/TAB./; s/VERTICAL/VERT./;
+s/HORIZONTAL/HORIZ./; s/DEVICE CONTROL/D.C./; s/SEPARATOR/SEP./;
+s/NEGATIVE ACKNOWLEDGE/NEG. ACK./;. The POSIX-BC and 1047 sets are
+identical throughout this range and differ from the 0037 set at only
+one spot (21 decimal). Note that "\c\\" maps to two characters
+not one.
+
+ chr ord 8859-1 0037 1047 && POSIX-BC
+ ------------------------------------------------------------------------
+ "\c?" 127 <DELETE> " " ***><
+ "\c@" 0 <NULL> <NULL> <NULL> ***><
+ "\cA" 1 <S.O. HEADING> <S.O. HEADING> <S.O. HEADING>
+ "\cB" 2 <S.O. TEXT> <S.O. TEXT> <S.O. TEXT>
+ "\cC" 3 <E.O. TEXT> <E.O. TEXT> <E.O. TEXT>
+ "\cD" 4 <E.O. TRANS.> <C1 28> <C1 28>
+ "\cE" 5 <ENQUIRY> <HORIZ. TAB.> <HORIZ. TAB.>
+ "\cF" 6 <ACKNOWLEDGE> <C1 6> <C1 6>
+ "\cG" 7 <BELL> <DELETE> <DELETE>
+ "\cH" 8 <BACKSPACE> <C1 23> <C1 23>
+ "\cI" 9 <HORIZ. TAB.> <C1 13> <C1 13>
+ "\cJ" 10 <LINE FEED> <C1 14> <C1 14>
+ "\cK" 11 <VERT. TAB.> <VERT. TAB.> <VERT. TAB.>
+ "\cL" 12 <FORM FEED> <FORM FEED> <FORM FEED>
+ "\cM" 13 <CARRIAGE RETURN> <CARRIAGE RETURN> <CARRIAGE RETURN>
+ "\cN" 14 <SHIFT OUT> <SHIFT OUT> <SHIFT OUT>
+ "\cO" 15 <SHIFT IN> <SHIFT IN> <SHIFT IN>
+ "\cP" 16 <DATA LINK ESCAPE> <DATA LINK ESCAPE> <DATA LINK ESCAPE>
+ "\cQ" 17 <D.C. ONE> <D.C. ONE> <D.C. ONE>
+ "\cR" 18 <D.C. TWO> <D.C. TWO> <D.C. TWO>
+ "\cS" 19 <D.C. THREE> <D.C. THREE> <D.C. THREE>
+ "\cT" 20 <D.C. FOUR> <C1 29> <C1 29>
+ "\cU" 21 <NEG. ACK.> <C1 5> <LINE FEED> ***
+ "\cV" 22 <SYNCHRONOUS IDLE> <BACKSPACE> <BACKSPACE>
+ "\cW" 23 <E.O. TRANS. BLOCK> <C1 7> <C1 7>
+ "\cX" 24 <CANCEL> <CANCEL> <CANCEL>
+ "\cY" 25 <E.O. MEDIUM> <E.O. MEDIUM> <E.O. MEDIUM>
+ "\cZ" 26 <SUBSTITUTE> <C1 18> <C1 18>
+ "\c[" 27 <ESCAPE> <C1 15> <C1 15>
+ "\c\\" 28 <FILE SEP.>\ <FILE SEP.>\ <FILE SEP.>\
+ "\c]" 29 <GROUP SEP.> <GROUP SEP.> <GROUP SEP.>
+ "\c^" 30 <RECORD SEP.> <RECORD SEP.> <RECORD SEP.> ***><
+ "\c_" 31 <UNIT SEP.> <UNIT SEP.> <UNIT SEP.> ***><
+
+
+=head1 FUNCTION DIFFERENCES
+
+=over 8
+
+=item chr()
+
+chr() must be given an EBCDIC code number argument to yield a desired
+character return value on an EBCDIC machine. For example:
+
+ $CAPITAL_LETTER_A = chr(193);
+
+=item ord()
+
+ord() will return EBCDIC code number values on an EBCDIC machine.
+For example:
+
+ $the_number_193 = ord("A");
+
+=item pack()
+
+The c and C templates for pack() are dependent upon character set
+encoding. Examples of usage on EBCDIC include:
+
+ $foo = pack("CCCC",193,194,195,196);
+ # $foo eq "ABCD"
+ $foo = pack("C4",193,194,195,196);
+ # same thing
+
+ $foo = pack("ccxxcc",193,194,195,196);
+ # $foo eq "AB\0\0CD"
+
+=item print()
+
+One must be careful with scalars and strings that are passed to
+print that contain ASCII encodings. One common place
+for this to occur is in the output of the MIME type header for
+CGI script writing. For example, many perl programming guides
+recommend something similar to:
+
+ print "Content-type:\ttext/html\015\012\015\012";
+ # this may be wrong on EBCDIC
+
+Under the IBM OS/390 USS Web Server for example you should instead
+write that as:
+
+ print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et alia
+
+That is because the translation from EBCDIC to ASCII is done
+by the web server in this case (such code will not be appropriate for
+the Macintosh however). Consult your web server's documentation for
+further details.
+
+=item printf()
+
+The formats that can convert characters to numbers and vice versa
+will be different from their ASCII counterparts when executed
+on an EBCDIC machine. Examples include:
+
+ printf("%c%c%c",193,194,195); # prints ABC
+
+=item sort()
+
+EBCDIC sort results may differ from ASCII sort results especially for
+mixed case strings. This is discussed in more detail below.
+
+=item sprintf()
+
+See the discussion of printf() above. An example of the use
+of sprintf would be:
+
+ $CAPITAL_LETTER_A = sprintf("%c",193);
+
+=item unpack()
+
+See the discussion of pack() above.
+
+=back
+
+=head1 REGULAR EXPRESSION DIFFERENCES
+
+As of perl 5.005_03 letter range regular expressions such as
+[A-Z] and [a-z] have been especially coded to not pick up gap
+characters. For example characters such as <o WITH CIRCUMFLEX>
+that lie between I and J would not be matched by C</[H-K]/>.
+If you do want to match such characters in a single octet
+regular expression try matching the hex or octal code such
+as C</\313/> on EBCDIC or C</\364/> on ASCII machines to
+have your regular expression match <o WITH CIRCUMFLEX>.
+
+Another place to be wary of is the inappropriate use of hex or
+octal constants in regular expressions. Consider the following
+set of subs:
+
+ sub is_c0 {
+ my $char = substr(shift,0,1);
+ $char =~ /[\000-\037]/;
+ }
+
+ sub is_print_ascii {
+ my $char = substr(shift,0,1);
+ $char =~ /[\040-\176]/;
+ }
+
+ sub is_delete {
+ my $char = substr(shift,0,1);
+ $char eq "\177";
+ }
+
+ sub is_c1 {
+ my $char = substr(shift,0,1);
+ $char =~ /[\200-\237]/;
+ }
+
+ sub is_latin_1 {
+ my $char = substr(shift,0,1);
+ $char =~ /[\240-\377]/;
+ }
+
+The above would be adequate if the concern was only with numeric codepoints.
+However, we may actually be concerned with characters rather than codepoints
+and on an EBCDIC machine would like for constructs such as
+C<if (is_print_ascii("A")) {print "A is a printable character\n";}> to print
+out the expected message. One way to represent the above collection
+of character classification subs that is capable of working across the
+four coded character sets discussed in this document is as follows:
+
+ sub Is_c0 {
+ my $char = substr(shift,0,1);
+ if (ord('^')==94) { # ascii
+ return $char =~ /[\000-\037]/;
+ }
+ if (ord('^')==176) { # 37
+ return $char =~ /[\000-\003\067\055-\057\026\005\045\013-\023\074\075\062\046\030\031\077\047\034-\037]/;
+ }
+ if (ord('^')==95 || ord('^')==106) { # 1047 || posix-bc
+ return $char =~ /[\000-\003\067\055-\057\026\005\025\013-\023\074\075\062\046\030\031\077\047\034-\037]/;
+ }
+ }
+
+ sub Is_print_ascii {
+ my $char = substr(shift,0,1);
+ $char =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/;
+ }
+
+ sub Is_delete {
+ my $char = substr(shift,0,1);
+ if (ord('^')==94) { # ascii
+ return $char eq "\177";
+ }
+ else { # ebcdic
+ return $char eq "\007";
+ }
+ }
+
+ sub Is_c1 {
+ my $char = substr(shift,0,1);
+ if (ord('^')==94) { # ascii
+ return $char =~ /[\200-\237]/;
+ }
+ if (ord('^')==176) { # 37
+ return $char =~ /[\040-\044\025\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
+ }
+ if (ord('^')==95) { # 1047
+ return $char =~ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
+ }
+ if (ord('^')==106) { # posix-bc
+ return $char =~
+ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\137]/;
+ }
+ }
+
+ sub Is_latin_1 {
+ my $char = substr(shift,0,1);
+ if (ord('^')==94) { # ascii
+ return $char =~ /[\240-\377]/;
+ }
+ if (ord('^')==176) { # 37
+ return $char =~
+ /[\101\252\112\261\237\262\152\265\275\264\232\212\137\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/;
+ }
+ if (ord('^')==95) { # 1047
+ return $char =~
+ /[\101\252\112\261\237\262\152\265\273\264\232\212\260\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\272\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/;
+ }
+ if (ord('^')==106) { # posix-bc
+ return $char =~
+ /[\101\252\260\261\237\262\320\265\171\264\232\212\272\312\257\241\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\340\376\335\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\300\336\333\334\215\216\337]/;
+ }
+ }
+
+Note however that only the C<Is_print_ascii()> sub is really independent
+of coded character set. Another way to write C<Is_latin_1()> would be
+to use the characters in the range explicitly:
+
+ sub Is_latin_1 {
+ my $char = substr(shift,0,1);
+ $char =~ /[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
+ }
+
+That form may, however, run into trouble in network transit (due to the
+presence of 8 bit characters) or on non ISO-Latin character sets.
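+
+Yet another possibility is to classify by table lookup: translate the
+native code number to its ISO 8859-1 equivalent and compare numerically.
+The sketch below makes several assumptions: latin1_class() and
+@native_to_latin1 are illustrative names, the argument is a single octet
+character, and the identity table shown is only correct on an ASCII or
+ISO 8859-1 platform. An EBCDIC build would need to supply a 256 entry
+table for its own code page.
+

```perl
use strict;
use warnings;

# Identity table: native code number == ISO 8859-1 code number.
# ASSUMPTION: valid only where perl runs on ASCII / ISO 8859-1.
my @native_to_latin1 = (0 .. 255);

# Return the name of the ISO 8859-1 subset that the first character of
# the argument belongs to, given a native to Latin-1 table reference.
sub latin1_class {
    my ($char, $table) = @_;
    my $code = $table->[ ord(substr($char, 0, 1)) ];
    return 'c0'          if $code <= 31;
    return 'print_ascii' if $code <= 126;
    return 'delete'      if $code == 127;
    return 'c1'          if $code <= 159;
    return 'latin_1';
}

print latin1_class("A", \@native_to_latin1), "\n";
```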
+
+
+=head1 SOCKETS
+
+Most socket programming assumes ASCII character encodings in network
+byte order. Exceptions can include CGI script writing under a
+host web server where the server may take care of translation for you.
+Most host web servers convert EBCDIC data to ISO-8859-1 or Unicode on
+output.
+
+=head1 SORTING
+
+One big difference between ASCII based character sets and EBCDIC ones
+is the relative positions of upper and lower case letters and of the
+letters compared to the digits. If sorted on an ASCII based machine the
+two letter abbreviation for a physician comes before the two letter
+abbreviation for drive, that is:
+
+ @sorted = sort(qw(Dr. dr.)); # @sorted holds qw(Dr. dr.) on ASCII,
+ # qw(dr. Dr.) on EBCDIC
+
+The property of lower case before uppercase letters in EBCDIC is
+even carried to the Latin 1 EBCDIC pages such as 0037 and 1047.
+An example would be that <E WITH DIAERESIS> (203) comes before
+<e WITH DIAERESIS> (235) on an ASCII machine, but the latter (83)
+comes before the former (115) on an EBCDIC machine. (Astute readers will
+note that the upper case version of <SMALL LETTER SHARP S> is
+simply "SS" and that the upper case version of <y WITH DIAERESIS>
+is not in the 0..255 range but is at U+0178 in Unicode).
+
+The sort order will cause differences between results obtained on
+ASCII machines versus EBCDIC machines. What follows are some suggestions
+on how to deal with these differences.
+
+=head2 Ignore ASCII vs EBCDIC sort differences.
+
+This is the least computationally expensive strategy. It may require
+some user education.
+
+=head2 MONOCASE then sort data.
+
+In order to minimize the expense of monocasing mixed case text try to
+C<tr///> towards the character set case most employed within the data.
+If the data are primarily UPPERCASE non Latin 1 then apply tr/a-z/A-Z/
+then sort(). If the data are primarily lowercase non Latin 1 then
+apply tr/A-Z/a-z/ before sorting. If the data are primarily UPPERCASE
+and include Latin-1 characters then apply: tr/a-z/A-Z/;
+XXX
+
+This strategy does not preserve the case of the data and may not be
+acceptable.
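+
+Where losing the original case is not acceptable, one variation is to
+monocase only the comparison keys and return the untouched strings.
+This is a sketch, with sort_ignoring_case() as an illustrative name. It
+relies on lc(), so Latin-1 letters are only folded if the locale in
+effect handles them, and strings that differ only in case are treated
+as equal by the comparison.
+

```perl
use strict;
use warnings;

# Sort on lower cased keys but return the original strings.
sub sort_ignoring_case {
    my @keyed = map { [ lc($_), $_ ] } @_;
    return map { $_->[1] } sort { $a->[0] cmp $b->[0] } @keyed;
}

my @sorted = sort_ignoring_case(qw(cherry Banana apple));
print "@sorted\n";    # apple Banana cherry
```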
+
+=head2 Convert, sort data, then reconvert.
+
+This is the most expensive proposition that does not employ a network
+connection.
+
+=head2 Perform sorting on one type of machine only.
+
+This strategy can employ a network connection. As such
+it would be computationally expensive.
+
+=head1 URL ENCODING and DECODING
+
+Note that some URLs have hexadecimal ASCII codepoints in them in an
+attempt to overcome character limitation issues. For example the
+tilde character is not on every keyboard hence a URL of the form:
+
+ http://www.pvhp.com/~pvhp/
+
+may also be expressed as either of:
+
+ http://www.pvhp.com/%7Epvhp/
+
+ http://www.pvhp.com/%7epvhp/
+
+where 7E is the hexadecimal ASCII codepoint for '~'. Here is an example
+of decoding such a URL under CCSID 1047:
+
+ $url = 'http://www.pvhp.com/%7Epvhp/';
+ # this array assumes code page 1047
+ my @a2e_1047 = (
+ 0, 1, 2, 3, 55, 45, 46, 47, 22, 5, 21, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31,
+ 64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97,
+ 240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111,
+ 124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214,
+ 215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109,
+ 121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150,
+ 151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161, 7,
+ 32, 33, 34, 35, 36, 37, 6, 23, 40, 41, 42, 43, 44, 9, 10, 27,
+ 48, 49, 26, 51, 52, 53, 54, 8, 56, 57, 58, 59, 4, 20, 62,255,
+ 65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188,
+ 144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171,
+ 100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119,
+ 172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89,
+ 68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87,
+ 140, 73,205,206,203,207,204,225,112,221,222,219,220,141,142,223
+ );
+ $url =~ s/%([0-9a-fA-F]{2})/pack("C",$a2e_1047[hex($1)])/ge;
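+
+Encoding goes the other way and needs the inverse table, which can be
+derived from the array above instead of being typed in by hand. The
+following sketch works on code numbers so that it can be tried on any
+platform; the names @e2a_1047 and escape_code_1047() are illustrative.
+

```perl
use strict;
use warnings;

# ASCII to CCSID 1047 table, copied from the decoding example above.
my @a2e_1047 = (
      0,  1,  2,  3, 55, 45, 46, 47, 22,  5, 21, 11, 12, 13, 14, 15,
     16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31,
     64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97,
    240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111,
    124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214,
    215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109,
    121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150,
    151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161,  7,
     32, 33, 34, 35, 36, 37,  6, 23, 40, 41, 42, 43, 44,  9, 10, 27,
     48, 49, 26, 51, 52, 53, 54,  8, 56, 57, 58, 59,  4, 20, 62,255,
     65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188,
    144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171,
    100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119,
    172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89,
     68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87,
    140, 73,205,206,203,207,204,225,112,221,222,219,220,141,142,223
);

# Invert the table: index by CCSID 1047 code, yield the ASCII code.
my @e2a_1047;
$e2a_1047[ $a2e_1047[$_] ] = $_ for 0 .. 255;

# Return the %XX escape for a character given its CCSID 1047 code number.
sub escape_code_1047 {
    my $ebcdic_code = shift;
    return sprintf("%%%02X", $e2a_1047[$ebcdic_code]);
}

print escape_code_1047(161), "\n";    # 161 is '~' under CCSID 1047: %7E
```

+On a CCSID 1047 machine the code number would normally come from
+ord() applied to the character being escaped.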
+
+=head1 I18N AND L10N
+
+Internationalization (I18N) and localization (L10N) are supported at least
+in principle even on EBCDIC machines. The details are system dependent
+and discussed under the L<perlebcdic/OS ISSUES> section below.
+
+=head1 MULTI OCTET CHARACTER SETS
+
+Double byte EBCDIC code pages (?) XXX.
+
+UTF-8, UTF-EBCDIC, (?) XXX.
+
+=head1 OS ISSUES
+
+There may be a few system dependent issues
+of concern to EBCDIC Perl programmers.
+
+=head2 OS/400
+
+=over 8
+
+=item IFS access
+
+XXX.
+
+=back
+
+=head2 OS/390
+
+=over 8
+
+=item dataset access
+
+For sequential data set access try:
+
+ my @ds_records = `cat //DSNAME`;
+
+or:
+
+ my @ds_records = `cat //'HLQ.DSNAME'`;
+
+See also the OS390::Stdio module on CPAN.
+
+=item locales
+
+On OS/390 see L<locale> for information on locales. The L10N files
+are in F</usr/nls/locale>. $Config{d_setlocale} is 'define' on OS/390.
+
+=back
+
+=head2 VM/ESA?
+
+XXX.
+
+=head2 POSIX-BC?
+
+XXX.
+
+=head1 REFERENCES
+
+http://anubis.dkuug.dk/i18n/charmaps
+
+L<perllocale>.
+
+http://www.unicode.org/
+
+http://www.unicode.org/unicode/reports/tr16/
+
+B<The Unicode Standard Version 2.0> The Unicode Consortium,
+ISBN 0-201-48345-9, Addison Wesley Developers Press, July 1996.
+
+B<CDRA: IBM - Character Data Representation Architecture -
+Reference and Registry>, IBM SC09-2190-00, December 1996.
+
+"Demystifying Character Sets", Andrea Vine, Multilingual Computing
+& Technology, B<#26 Vol. 10 Issue 4>, August/September 1999;
+ISSN 1523-0309; Multilingual Computing Inc. Sandpoint ID, USA.
+
+=head1 AUTHOR
+
+Peter Prymmer E<lt>pvhp@best.comE<gt> wrote this in 1999 and 2000
+with CCSID 0819 and 0037 help from Chris Leach and
+Andre' Pirard E<lt>A.Pirard@ulg.ac.beE<gt> as well as POSIX-BC
+help from Thomas Dorner E<lt>Thomas.Dorner@start.deE<gt>.
+Thanks also to Philip Newton and Vickie Cooper. Trademarks, registered
+trademarks, service marks and registered service marks used in this
+document are the property of their respective owners.
+
+