Add perlebcdic from Peter Prymmer, regen toc.

p4raw-id: //depot/perl@6676
author: Jarkko Hietaniemi <jhi@iki.fi> 2000-08-17 14:44:02 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2000-08-17 14:44:02 +0000
commit: d396a55899b7bce58ef6008d9af7a500b5175b4a (patch)
tree: 92bb4fc9fea98748bcd8bc310e3b9dd4fd5f54a0 /pod/perlebcdic.pod
parent: 10c102662dfb8c226a9c3524f047501223fa8409 (diff)
download: perl-d396a55899b7bce58ef6008d9af7a500b5175b4a.tar.gz
1 files changed, 1001 insertions, 0 deletions
diff --git a/pod/perlebcdic.pod b/pod/perlebcdic.pod
new file mode 100644
index 0000000000..f27a8dea2e
--- /dev/null
+++ b/pod/perlebcdic.pod
@@ -0,0 +1,1001 @@
+=head1 NAME
+
+perlebcdic - Considerations for running Perl on EBCDIC platforms
+
+=head1 DESCRIPTION
+
+An exploration of some of the issues facing Perl programmers
+on EBCDIC-based computers.  We do not cover localization, 
+internationalization, or multibyte character set issues (yet).
+
+Portions that are still incomplete are marked with XXX.
+
+=head1 COMMON CHARACTER CODE SETS
+
+=head2 ASCII
+
+The American Standard Code for Information Interchange is a set of
+integers running from 0 to 127 (decimal) that displays and other
+computer systems interpret as characters.  The range 0..127 can be
+covered by the bits of a 7-bit binary number, hence the set is
+sometimes referred to as "7-bit ASCII".  ASCII was described by the
+American National Standards Institute document ANSI X3.4-1986.  It
+was also described by ISO 646:1991 (with localization for currency
+symbols).  The full ASCII set is given in the table below as the
+first 128 elements.  Languages that can be written adequately with
+the characters in ASCII include English, Hawaiian, Indonesian,
+Swahili and some Native American languages.
+
+=head2 ISO 8859
+
+The ISO 8859-$n are a collection of character code sets from the 
+International Organization for Standardization (ISO), each of which 
+adds characters to the ASCII set that are typically found in European 
+languages, many of which are based on the Roman, or Latin, alphabet.
+
+=head2 Latin 1 (ISO 8859-1)
+
+A particular 8-bit extension to ASCII that includes grave and acute 
+accented Latin characters.  Languages that can employ ISO 8859-1 
+include all the languages covered by ASCII as well as Afrikaans, 
+Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian, 
+Portuguese, Spanish, and Swedish.  Dutch is covered albeit without 
+the ij ligature.  French is covered too but without the oe ligature. 
+German can use ISO 8859-1 but must do so without German-style
+quotation marks.  This set is based on Western European extensions 
+to ASCII and is commonly encountered in world wide web work.
+In IBM character code set identification terminology ISO 8859-1 is
+known as CCSID 819 (or sometimes 0819 or even 00819).
+
+=head2 EBCDIC
+
+Extended Binary Coded Decimal Interchange Code.  The EBCDIC acronym 
+refers to a large collection of slightly different single and
+multibyte coded character sets that are different from ASCII or 
+ISO 8859-1 and are typically used on host computers.  The
+EBCDIC encodings derive from Hollerith punched card encodings.
+The layout on the cards was such that high bits were set for the
+upper and lower case alphabet characters [a-z] and [A-Z], but there
+were gaps within each Latin alphabet range.
+
+=head2 13 variant characters
+
+XXX.
+
+EBCDIC character sets may be known by character code set identification
+numbers (CCSID numbers) or code page numbers.
+
+=head2 0037
+
+Character code set ID 0037 is a mapping of the ASCII plus Latin-1 
+characters (i.e. ISO 8859-1) to an EBCDIC set.  0037 is used 
+on the OS/400 operating system that runs on AS/400 computers.
+CCSID 0037 differs from ISO 8859-1 in 236 places; in other words
+they agree on only 20 code point values.
+
+=head2 1047
+
+Character code set ID 1047 is also a mapping of the ASCII plus 
+Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set.  1047 is 
+used under Unix System Services for OS/390, and OpenEdition for VM/ESA. 
+CCSID 1047 differs from CCSID 0037 in eight places.
+
+=head2 POSIX-BC
+
+The EBCDIC code page in use on Siemens' BS2000 system is distinct from
+1047 and 0037.  It is identified below as the POSIX-BC set.
+
+=head1 SINGLE OCTET TABLES
+
+The following tables list the ASCII and Latin 1 ordered sets including
+the subsets: C0 controls (0x00..0x1f), ASCII graphics (0x20..0x7e),
+delete (0x7f), C1 controls (0x80..0x9f), and Latin-1 (a.k.a. ISO 8859-1)
+(0xa0..0xff).  In the table, non-printing control character names as
+well as the Latin 1 
+extensions to ASCII have been labelled with character names roughly 
+corresponding to I<The Unicode Standard, Version 2.0> albeit with 
+substitutions such as s/LATIN// and s/VULGAR// in all cases, 
+s/CAPITAL LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/ 
+in some other cases.  The "names" of the C1 control set 
+(128..159 in ISO 8859-1) are somewhat arbitrary.  The differences 
+between the 0037 and 1047 sets are flagged with ***.  The differences 
+between the 1047 and POSIX-BC sets are flagged with ###.  
+All ord() numbers listed are decimal.  If you would rather see this 
+table listing octal values then run the table (that is, the pod 
+version of this document since this recipe may not work with 
+a pod2XXX translation to another format) through:
+
+=over 4
+
+=item recipe 0
+
+=back
+
+    perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
+     -e '{printf("%s%-9o%-9o%-9o%-9o\n",$1,$2,$3,$4,$5)}' perlebcdic.pod
+
+If you would rather see this table listing hexadecimal values then
+run the table through:
+
+=over 4
+
+=item recipe 1
+
+=back
+
+    perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
+     -e '{printf("%s%-9X%-9X%-9X%-9X\n",$1,$2,$3,$4,$5)}' perlebcdic.pod
+
+
+                                 8859-1
+    chr                          0819     0037     1047     POSIX-BC
+    ----------------------------------------------------------------
+    <NULL>                       0        0        0        0 
+    <START OF HEADING>           1        1        1        1
+    <START OF TEXT>              2        2        2        2
+    <END OF TEXT>                3        3        3        3
+    <END OF TRANSMISSION>        4        55       55       55
+    <ENQUIRY>                    5        45       45       45
+    <ACKNOWLEDGE>                6        46       46       46
+    <BELL>                       7        47       47       47
+    <BACKSPACE>                  8        22       22       22
+    <HORIZONTAL TABULATION>      9        5        5        5
+    <LINE FEED>                  10       37       21       21  ***
+    <VERTICAL TABULATION>        11       11       11       11
+    <FORM FEED>                  12       12       12       12
+    <CARRIAGE RETURN>            13       13       13       13
+    <SHIFT OUT>                  14       14       14       14
+    <SHIFT IN>                   15       15       15       15
+    <DATA LINK ESCAPE>           16       16       16       16
+    <DEVICE CONTROL ONE>         17       17       17       17
+    <DEVICE CONTROL TWO>         18       18       18       18
+    <DEVICE CONTROL THREE>       19       19       19       19
+    <DEVICE CONTROL FOUR>        20       60       60       60
+    <NEGATIVE ACKNOWLEDGE>       21       61       61       61
+    <SYNCHRONOUS IDLE>           22       50       50       50
+    <END OF TRANSMISSION BLOCK>  23       38       38       38
+    <CANCEL>                     24       24       24       24
+    <END OF MEDIUM>              25       25       25       25
+    <SUBSTITUTE>                 26       63       63       63
+    <ESCAPE>                     27       39       39       39
+    <FILE SEPARATOR>             28       28       28       28
+    <GROUP SEPARATOR>            29       29       29       29
+    <RECORD SEPARATOR>           30       30       30       30
+    <UNIT SEPARATOR>             31       31       31       31
+    <SPACE>                      32       64       64       64
+    !                            33       90       90       90
+    "                            34       127      127      127
+    #                            35       123      123      123
+    $                            36       91       91       91
+    %                            37       108      108      108
+    &                            38       80       80       80
+    '                            39       125      125      125
+    (                            40       77       77       77
+    )                            41       93       93       93
+    *                            42       92       92       92
+    +                            43       78       78       78
+    ,                            44       107      107      107
+    -                            45       96       96       96
+    .                            46       75       75       75
+    /                            47       97       97       97
+    0                            48       240      240      240
+    1                            49       241      241      241
+    2                            50       242      242      242
+    3                            51       243      243      243
+    4                            52       244      244      244
+    5                            53       245      245      245
+    6                            54       246      246      246
+    7                            55       247      247      247
+    8                            56       248      248      248
+    9                            57       249      249      249
+    :                            58       122      122      122
+    ;                            59       94       94       94
+    <                            60       76       76       76
+    =                            61       126      126      126
+    >                            62       110      110      110
+    ?                            63       111      111      111
+    @                            64       124      124      124
+    A                            65       193      193      193
+    B                            66       194      194      194
+    C                            67       195      195      195
+    D                            68       196      196      196
+    E                            69       197      197      197
+    F                            70       198      198      198
+    G                            71       199      199      199
+    H                            72       200      200      200
+    I                            73       201      201      201
+    J                            74       209      209      209
+    K                            75       210      210      210
+    L                            76       211      211      211
+    M                            77       212      212      212
+    N                            78       213      213      213
+    O                            79       214      214      214
+    P                            80       215      215      215
+    Q                            81       216      216      216
+    R                            82       217      217      217
+    S                            83       226      226      226
+    T                            84       227      227      227
+    U                            85       228      228      228
+    V                            86       229      229      229
+    W                            87       230      230      230
+    X                            88       231      231      231
+    Y                            89       232      232      232
+    Z                            90       233      233      233
+    [                            91       186      173      187 *** ###
+    \                            92       224      224      188 ### 
+    ]                            93       187      189      189 ***
+    ^                            94       176      95       106 *** ###
+    _                            95       109      109      109
+    `                            96       121      121      74  ###
+    a                            97       129      129      129
+    b                            98       130      130      130
+    c                            99       131      131      131
+    d                            100      132      132      132
+    e                            101      133      133      133
+    f                            102      134      134      134
+    g                            103      135      135      135
+    h                            104      136      136      136
+    i                            105      137      137      137
+    j                            106      145      145      145
+    k                            107      146      146      146
+    l                            108      147      147      147
+    m                            109      148      148      148
+    n                            110      149      149      149
+    o                            111      150      150      150
+    p                            112      151      151      151
+    q                            113      152      152      152
+    r                            114      153      153      153
+    s                            115      162      162      162
+    t                            116      163      163      163
+    u                            117      164      164      164
+    v                            118      165      165      165
+    w                            119      166      166      166
+    x                            120      167      167      167
+    y                            121      168      168      168
+    z                            122      169      169      169
+    {                            123      192      192      251 ###
+    |                            124      79       79       79
+    }                            125      208      208      253 ###
+    ~                            126      161      161      255 ###
+    <DELETE>                     127      7        7        7
+    <C1 0>                       128      32       32       32
+    <C1 1>                       129      33       33       33
+    <C1 2>                       130      34       34       34
+    <C1 3>                       131      35       35       35
+    <C1 4>                       132      36       36       36
+    <C1 5>                       133      21       37       37  ***
+    <C1 6>                       134      6        6        6
+    <C1 7>                       135      23       23       23
+    <C1 8>                       136      40       40       40
+    <C1 9>                       137      41       41       41
+    <C1 10>                      138      42       42       42
+    <C1 11>                      139      43       43       43
+    <C1 12>                      140      44       44       44
+    <C1 13>                      141      9        9        9
+    <C1 14>                      142      10       10       10
+    <C1 15>                      143      27       27       27
+    <C1 16>                      144      48       48       48
+    <C1 17>                      145      49       49       49
+    <C1 18>                      146      26       26       26
+    <C1 19>                      147      51       51       51
+    <C1 20>                      148      52       52       52
+    <C1 21>                      149      53       53       53
+    <C1 22>                      150      54       54       54
+    <C1 23>                      151      8        8        8
+    <C1 24>                      152      56       56       56
+    <C1 25>                      153      57       57       57
+    <C1 26>                      154      58       58       58
+    <C1 27>                      155      59       59       59
+    <C1 28>                      156      4        4        4
+    <C1 29>                      157      20       20       20
+    <C1 30>                      158      62       62       62
+    <C1 31>                      159      255      255      95  ###
+    <NON-BREAKING SPACE>         160      65       65       65
+    <INVERTED EXCLAMATION MARK>  161      170      170      170
+    <CENT SIGN>                  162      74       74       176 ###
+    <POUND SIGN>                 163      177      177      177
+    <CURRENCY SIGN>              164      159      159      159
+    <YEN SIGN>                   165      178      178      178
+    <BROKEN BAR>                 166      106      106      208 ###
+    <SECTION SIGN>               167      181      181      181
+    <DIAERESIS>                  168      189      187      121 *** ###
+    <COPYRIGHT SIGN>             169      180      180      180
+    <FEMININE ORDINAL INDICATOR> 170      154      154      154
+    <LEFT POINTING GUILLEMET>    171      138      138      138
+    <NOT SIGN>                   172      95       176      186 *** ###       
+    <SOFT HYPHEN>                173      202      202      202
+    <REGISTERED TRADE MARK SIGN> 174      175      175      175
+    <MACRON>                     175      188      188      161 ###
+    <DEGREE SIGN>                176      144      144      144
+    <PLUS-OR-MINUS SIGN>         177      143      143      143
+    <SUPERSCRIPT TWO>            178      234      234      234
+    <SUPERSCRIPT THREE>          179      250      250      250
+    <ACUTE ACCENT>               180      190      190      190
+    <MICRO SIGN>                 181      160      160      160
+    <PARAGRAPH SIGN>             182      182      182      182
+    <MIDDLE DOT>                 183      179      179      179
+    <CEDILLA>                    184      157      157      157
+    <SUPERSCRIPT ONE>            185      218      218      218
+    <MASC. ORDINAL INDICATOR>    186      155      155      155
+    <RIGHT POINTING GUILLEMET>   187      139      139      139
+    <FRACTION ONE QUARTER>       188      183      183      183
+    <FRACTION ONE HALF>          189      184      184      184
+    <FRACTION THREE QUARTERS>    190      185      185      185
+    <INVERTED QUESTION MARK>     191      171      171      171
+    <A WITH GRAVE>               192      100      100      100
+    <A WITH ACUTE>               193      101      101      101
+    <A WITH CIRCUMFLEX>          194      98       98       98
+    <A WITH TILDE>               195      102      102      102
+    <A WITH DIAERESIS>           196      99       99       99
+    <A WITH RING ABOVE>          197      103      103      103
+    <CAPITAL LIGATURE AE>        198      158      158      158
+    <C WITH CEDILLA>             199      104      104      104
+    <E WITH GRAVE>               200      116      116      116
+    <E WITH ACUTE>               201      113      113      113
+    <E WITH CIRCUMFLEX>          202      114      114      114
+    <E WITH DIAERESIS>           203      115      115      115
+    <I WITH GRAVE>               204      120      120      120
+    <I WITH ACUTE>               205      117      117      117
+    <I WITH CIRCUMFLEX>          206      118      118      118
+    <I WITH DIAERESIS>           207      119      119      119
+    <CAPITAL LETTER ETH>         208      172      172      172
+    <N WITH TILDE>               209      105      105      105
+    <O WITH GRAVE>               210      237      237      237
+    <O WITH ACUTE>               211      238      238      238
+    <O WITH CIRCUMFLEX>          212      235      235      235
+    <O WITH TILDE>               213      239      239      239
+    <O WITH DIAERESIS>           214      236      236      236
+    <MULTIPLICATION SIGN>        215      191      191      191
+    <O WITH STROKE>              216      128      128      128
+    <U WITH GRAVE>               217      253      253      224 ###
+    <U WITH ACUTE>               218      254      254      254
+    <U WITH CIRCUMFLEX>          219      251      251      221 ###
+    <U WITH DIAERESIS>           220      252      252      252
+    <Y WITH ACUTE>               221      173      186      173 *** ###
+    <CAPITAL LETTER THORN>       222      174      174      174
+    <SMALL LETTER SHARP S>       223      89       89       89
+    <a WITH GRAVE>               224      68       68       68
+    <a WITH ACUTE>               225      69       69       69
+    <a WITH CIRCUMFLEX>          226      66       66       66
+    <a WITH TILDE>               227      70       70       70
+    <a WITH DIAERESIS>           228      67       67       67
+    <a WITH RING ABOVE>          229      71       71       71
+    <SMALL LIGATURE ae>          230      156      156      156
+    <c WITH CEDILLA>             231      72       72       72
+    <e WITH GRAVE>               232      84       84       84
+    <e WITH ACUTE>               233      81       81       81
+    <e WITH CIRCUMFLEX>          234      82       82       82
+    <e WITH DIAERESIS>           235      83       83       83
+    <i WITH GRAVE>               236      88       88       88
+    <i WITH ACUTE>               237      85       85       85
+    <i WITH CIRCUMFLEX>          238      86       86       86
+    <i WITH DIAERESIS>           239      87       87       87
+    <SMALL LETTER eth>           240      140      140      140
+    <n WITH TILDE>               241      73       73       73
+    <o WITH GRAVE>               242      205      205      205
+    <o WITH ACUTE>               243      206      206      206
+    <o WITH CIRCUMFLEX>          244      203      203      203
+    <o WITH TILDE>               245      207      207      207
+    <o WITH DIAERESIS>           246      204      204      204
+    <DIVISION SIGN>              247      225      225      225
+    <o WITH STROKE>              248      112      112      112
+    <u WITH GRAVE>               249      221      221      192 ###
+    <u WITH ACUTE>               250      222      222      222
+    <u WITH CIRCUMFLEX>          251      219      219      219
+    <u WITH DIAERESIS>           252      220      220      220
+    <y WITH ACUTE>               253      141      141      141
+    <SMALL LETTER thorn>         254      142      142      142
+    <y WITH DIAERESIS>           255      223      223      223
+
+If you would rather see the above table in CCSID 0037 order rather than
+ASCII + Latin-1 order then run the table through:
+
+=over 4
+
+=item recipe 2
+
+=back
+
+    perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
+     -e '{push(@l,$_)}' \
+     -e 'END{print map{$_->[0]}' \
+     -e '          sort{$a->[1] <=> $b->[1]}' \
+     -e '          map{[$_,substr($_,42,3)]}@l;}' perlebcdic.pod
+
+If you would rather see it in CCSID 1047 order then change the number
+42 in the last line to 51, like this:
+
+=over 4
+
+=item recipe 3
+
+=back
+
+    perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
+     -e '{push(@l,$_)}' \
+     -e 'END{print map{$_->[0]}' \
+     -e '          sort{$a->[1] <=> $b->[1]}' \
+     -e '          map{[$_,substr($_,51,3)]}@l;}' perlebcdic.pod
+
+If you would rather see it in POSIX-BC order then change the number
+51 in the last line to 60, like this:
+
+=over 4
+
+=item recipe 4
+
+=back
+
+    perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
+     -e '{push(@l,$_)}' \
+     -e 'END{print map{$_->[0]}' \
+     -e '          sort{$a->[1] <=> $b->[1]}' \
+     -e '          map{[$_,substr($_,60,3)]}@l;}' perlebcdic.pod
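+
+The sorting recipes above differ only in which column supplies the sort
+key.  As a rough sketch, the same idea can be written as one subroutine
+that captures the four numeric columns with a regular expression instead
+of fixed substr() offsets.  The name sort_rows and its column numbering
+are hypothetical, chosen only for this illustration:
+
+    # Hypothetical helper: sort table rows by one numeric column.
+    # $col: 0 = 8859-1, 1 = 0037, 2 = 1047, 3 = POSIX-BC
+    sub sort_rows {
+        my ($col, @lines) = @_;
+        return map  { $_->[0] }
+               sort { $a->[1] <=> $b->[1] }
+               map  {
+                   my @n = /^.{33}(\d{1,3})\s+(\d{1,3})\s+(\d{1,3})\s+(\d{1,3})/;
+                   @n ? [$_, $n[$col]] : ();   # skip lines that are not rows
+               } @lines;
+    }
+
+    # e.g. print sort_rows(1, <$fh>) for CCSID 0037 order, where $fh
+    # is a filehandle opened on the pod version of this document.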
+
+
+=head1 IDENTIFYING CHARACTER CODE SETS
+
+To determine the character set you are running under from perl one 
+could use the return value of ord() or chr() to test one or more 
+character values.  For example:
+
+    $is_ascii  = "A" eq chr(65);
+    $is_ebcdic = "A" eq chr(193);
+
+Also, "\t" is a <HORIZONTAL TABULATION> character so that:
+
+    $is_ascii  = ord("\t") == 9;
+    $is_ebcdic = ord("\t") == 5;
+
+To distinguish EBCDIC code pages try looking at one or more of
+the characters that differ between them.  For example:
+
+    $is_ebcdic_37   = "\n" eq chr(37);
+    $is_ebcdic_1047 = "\n" eq chr(21);
+
+Or better still choose a character that is uniquely encoded in any
+of the code sets, e.g.:
+
+    $is_ascii           = ord('[') == 91;
+    $is_ebcdic_37       = ord('[') == 186;
+    $is_ebcdic_1047     = ord('[') == 173;
+    $is_ebcdic_POSIX_BC = ord('[') == 187;
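+
+The four tests above can be folded into a single lookup keyed on
+ord('[').  This is only a sketch: the subroutine name native_code_set
+and the strings it returns are hypothetical, and any code set not
+listed in the table above is reported as unknown:
+
+    # Hypothetical helper, not part of perl: name the native code set.
+    sub native_code_set {
+        my %name_for = (
+             91 => 'ASCII',
+            186 => 'EBCDIC 0037',
+            173 => 'EBCDIC 1047',
+            187 => 'EBCDIC POSIX-BC',
+        );
+        my $name = $name_for{ ord('[') };
+        return defined $name ? $name : 'unknown';
+    }
+
+    print native_code_set(), "\n";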
+
+However, it would be unwise to write tests such as:
+
+    $is_ascii = "\r" ne chr(13);  #  WRONG
+    $is_ascii = "\n" ne chr(10);  #  ILL ADVISED
+
+Obviously the first of these will fail to distinguish most ASCII machines
+from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC machine since "\r" eq 
+chr(13) under all of those coded character sets.  But note too that 
+because "\n" is chr(13) and "\r" is chr(10) on the Macintosh (which is an 
+ASCII machine) the second C<$is_ascii> test will lead to trouble there.
+
+To determine whether or not perl was built under an EBCDIC 
+code page you can use the Config module like so:
+
+    use Config;
+    $is_ebcdic = $Config{ebcdic} eq 'define';
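+
+If the ebcdic entry might be undefined in your build's Config, one
+possible variation (an assumption for illustration, not a documented
+idiom) is to test that it is defined and fall back on an ord() check:
+
+    use Config;
+
+    # Trust Config when it says 'define'; otherwise fall back to ord().
+    my $is_ebcdic = (defined $Config{ebcdic} && $Config{ebcdic} eq 'define')
+                 || ord('A') == 193;
+    print $is_ebcdic ? "EBCDIC\n" : "not EBCDIC\n";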
+
+=head1 CONVERSIONS
+
+In order to convert a string of characters from one character set to 
+another a simple list of numbers, such as in the right columns in the
+above table, along with perl's tr/// operator is all that is needed.  
+The data in the table are in ASCII order hence the EBCDIC columns 
+provide easy to use ASCII to EBCDIC operations that are also easily 
+reversed.
+
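+For example, to convert between ISO 8859-1 and code page 037, list the
+8859-1 code points in octal in code page 037 order (the first numeric
+column of recipe 2's output, rewritten in octal with leading backslashes)
+and use that list as one side of a tr///, like so: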
+
+    $cp_037 = 
+    '\000\001\002\003\234\011\206\177\227\215\216\013\014\015\016\017' .
+    '\020\021\022\023\235\205\010\207\030\031\222\217\034\035\036\037' .
+    '\200\201\202\203\204\012\027\033\210\211\212\213\214\005\006\007' .
+    '\220\221\026\223\224\225\226\004\230\231\232\233\024\025\236\032' .
+    '\040\240\342\344\340\341\343\345\347\361\242\056\074\050\053\174' .
+    '\046\351\352\353\350\355\356\357\354\337\041\044\052\051\073\254' .
+    '\055\057\302\304\300\301\303\305\307\321\246\054\045\137\076\077' .
+    '\370\311\312\313\310\315\316\317\314\140\072\043\100\047\075\042' .
+    '\330\141\142\143\144\145\146\147\150\151\253\273\360\375\376\261' .
+    '\260\152\153\154\155\156\157\160\161\162\252\272\346\270\306\244' .
+    '\265\176\163\164\165\166\167\170\171\172\241\277\320\335\336\256' .
+    '\136\243\245\267\251\247\266\274\275\276\133\135\257\250\264\327' .
+    '\173\101\102\103\104\105\106\107\110\111\255\364\366\362\363\365' .
+    '\175\112\113\114\115\116\117\120\121\122\271\373\374\371\372\377' .
+    '\134\367\123\124\125\126\127\130\131\132\262\324\326\322\323\325' .
+    '\060\061\062\063\064\065\066\067\070\071\263\333\334\331\332\237' ;
+
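+    # tr/// does not interpolate variables, so build the operator with eval.
+    # $cp_037 lists 8859-1 code points in 037 order, so it is the search
+    # list when converting to 037.
+    my $ebcdic_string = $ascii_string;
+    eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/';
+
+To convert from EBCDIC to ASCII just reverse the order of the tr/// 
+arguments like so:
+
+    my $ascii_string = $ebcdic_string;
+    eval '$ascii_string =~ tr/\000-\377/' . $cp_037 . '/';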
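+
+A conversion can also be written as an explicit table lookup, which may
+be easier to follow than a 256 character tr/// list.  The sketch below
+assumes it runs on an ASCII machine and fills in only a few rows of the
+0037 column from the table above.  The names %to_037 and
+ascii_to_037_subset are hypothetical, as is the choice to map every
+other character to <SUBSTITUTE> (63 in CCSID 0037):
+
+    # Assumes an ASCII machine; only a subset of the table is filled in.
+    my %to_037 = (32 => 64);               # <SPACE>
+    @to_037{48 .. 57} = (240 .. 249);      # 0-9
+    @to_037{65 .. 73} = (193 .. 201);      # A-I
+    @to_037{74 .. 82} = (209 .. 217);      # J-R
+    @to_037{83 .. 90} = (226 .. 233);      # S-Z
+
+    # Hypothetical helper: unmapped characters become <SUBSTITUTE> (63).
+    sub ascii_to_037_subset {
+        my ($string) = @_;
+        return join '', map {
+            my $code = $to_037{ ord($_) };
+            chr(defined $code ? $code : 63);
+        } split //, $string;
+    }
+
+    my $converted = ascii_to_037_subset("PERL 5");
+    print join(' ', map { ord($_) } split //, $converted), "\n";
+    # on an ASCII machine: 215 197 217 211 64 245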
+
+XPG4 interoperability often implies the presence of an I<iconv> utility
+available from the shell or from the C library.  Consult your system's
+documentation for information on iconv.
+
+On OS/390 see the iconv(1) man page.  One way to invoke the iconv 
+shell utility from within perl would be to:
+
+    $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1`;
+
+or the inverse map:
+
+    $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`;
+
+XXX iconv under qsh on OS/400?
+XXX iconv on VM?
+XXX iconv on BS2k? 
+
+For other perl based conversion options see the Convert::* modules on CPAN.
+
+=head1 OPERATOR DIFFERENCES
+
+The C<..> range operator treats certain character ranges with 
+care on EBCDIC machines.  For example the following array
+will have twenty-six elements on either an EBCDIC machine
+or an ASCII machine:
+
+    @alphabet = ('A'..'Z');   #  $#alphabet == 25
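+
+By contrast, a numeric range over the code points is not portable,
+because of the gaps in the EBCDIC alphabet ranges (for example
+ord('I') is 201 but ord('J') is 209 in the table above).  A small
+sketch of the difference, with hypothetical variable names:
+
+    my @portable = ('A' .. 'Z');
+    my @fragile  = map { chr($_) } (ord('A') .. ord('Z'));
+
+    # 26 and 26 on ASCII; 26 and 41 on the EBCDIC sets tabled above,
+    # where ord('A') is 193 and ord('Z') is 233.
+    print scalar(@portable), " ", scalar(@fragile), "\n";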
+
+The bitwise operators such as & ^ | may return different results
+when operating on string or character data in a perl program running 
+on an EBCDIC machine than when run on an ASCII machine.  Here is
+an example adapted from the one in L<perlop>:
+
+    # EBCDIC-based examples
+    print "j p \n" ^ " a h";                      # prints "JAPH\n"
+    print "JA" | "  ph\n";                        # prints "japh\n" 
+    print "JAPH\nJunk" & "\277\277\277\277\277";  # prints "japh\n";
+    print 'p N$' ^ " E<H\n";                      # prints "Perl\n";
+
+An interesting property of the 32 C0 control characters
+in the ASCII table is that they can "literally" be constructed
+as control characters in perl, e.g. (chr(0) eq "\c@"), 
+(chr(1) eq "\cA"), and so on.  Perl on EBCDIC machines has been 
+ported to take "\c@" -> chr(0) and "\cA" -> chr(1) as well, but the
+thirty-three characters that result depend on which code page you are
+using.  The table below uses the character names from the previous table 
+but with substitutions such as s/START OF/S.O./; s/END OF /E.O./; 
+s/TRANSMISSION/TRANS./; s/TABULATION/TAB./; s/VERTICAL/VERT./; 
+s/HORIZONTAL/HORIZ./; s/DEVICE CONTROL/D.C./; s/SEPARATOR/SEP./; 
+s/NEGATIVE ACKNOWLEDGE/NEG. ACK./;.  The POSIX-BC and 1047 sets are
+identical throughout this range and differ from the 0037 set at only 
+one spot (21 decimal).   Note that "\c\\" maps to two characters
+not one.
+
+    chr   ord  8859-1               0037                1047 && POSIX-BC     
+    ------------------------------------------------------------------------
+    "\c?" 127  <DELETE>             "                   "              ***><
+    "\c@"   0  <NULL>               <NULL>              <NULL>         ***><
+    "\cA"   1  <S.O. HEADING>       <S.O. HEADING>      <S.O. HEADING> 
+    "\cB"   2  <S.O. TEXT>          <S.O. TEXT>         <S.O. TEXT>
+    "\cC"   3  <E.O. TEXT>          <E.O. TEXT>         <E.O. TEXT>
+    "\cD"   4  <E.O. TRANS.>        <C1 28>             <C1 28> 
+    "\cE"   5  <ENQUIRY>            <HORIZ. TAB.>       <HORIZ. TAB.>    
+    "\cF"   6  <ACKNOWLEDGE>        <C1 6>              <C1 6>   
+    "\cG"   7  <BELL>               <DELETE>            <DELETE>   
+    "\cH"   8  <BACKSPACE>          <C1 23>             <C1 23>
+    "\cI"   9  <HORIZ. TAB.>        <C1 13>             <C1 13>
+    "\cJ"  10  <LINE FEED>          <C1 14>             <C1 14>
+    "\cK"  11  <VERT. TAB.>         <VERT. TAB.>        <VERT. TAB.>
+    "\cL"  12  <FORM FEED>          <FORM FEED>         <FORM FEED>    
+    "\cM"  13  <CARRIAGE RETURN>    <CARRIAGE RETURN>   <CARRIAGE RETURN> 
+    "\cN"  14  <SHIFT OUT>          <SHIFT OUT>         <SHIFT OUT>
+    "\cO"  15  <SHIFT IN>           <SHIFT IN>          <SHIFT IN>
+    "\cP"  16  <DATA LINK ESCAPE>   <DATA LINK ESCAPE>  <DATA LINK ESCAPE> 
+    "\cQ"  17  <D.C. ONE>           <D.C. ONE>          <D.C. ONE>
+    "\cR"  18  <D.C. TWO>           <D.C. TWO>          <D.C. TWO>
+    "\cS"  19  <D.C. THREE>         <D.C. THREE>        <D.C. THREE> 
+    "\cT"  20  <D.C. FOUR>          <C1 29>             <C1 29> 
+    "\cU"  21  <NEG. ACK.>          <C1 5>              <LINE FEED>    ***
+    "\cV"  22  <SYNCHRONOUS IDLE>   <BACKSPACE>         <BACKSPACE>
+    "\cW"  23  <E.O. TRANS. BLOCK>  <C1 7>              <C1 7>
+    "\cX"  24  <CANCEL>             <CANCEL>            <CANCEL>
+    "\cY"  25  <E.O. MEDIUM>        <E.O. MEDIUM>       <E.O. MEDIUM>
+    "\cZ"  26  <SUBSTITUTE>         <C1 18>             <C1 18>
+    "\c["  27  <ESCAPE>             <C1 15>             <C1 15>
+    "\c\\" 28  <FILE SEP.>\         <FILE SEP.>\        <FILE SEP.>\
+    "\c]"  29  <GROUP SEP.>         <GROUP SEP.>        <GROUP SEP.>
+    "\c^"  30  <RECORD SEP.>        <RECORD SEP.>       <RECORD SEP.>  ***><
+    "\c_"  31  <UNIT SEP.>          <UNIT SEP.>         <UNIT SEP.>    ***><
+
+
+=head1 FUNCTION DIFFERENCES
+
+=over 8
+
+=item chr()
+
+chr() must be given an EBCDIC code number argument to yield a desired 
+character return value on an EBCDIC machine.  For example:
+
+    $CAPITAL_LETTER_A = chr(193);
+
+=item ord()
+
+ord() will return EBCDIC code number values on an EBCDIC machine.
+For example:
+
+    $the_number_193 = ord("A");
+
+=item pack()
+
+The c and C templates for pack() are dependent upon character set 
+encoding.  Examples of usage on EBCDIC include:
+
+    $foo = pack("CCCC",193,194,195,196);
+    # $foo eq "ABCD"
+    $foo = pack("C4",193,194,195,196);
+    # same thing
+
+    # C (unsigned char) is used because these codes exceed 127
+    $foo = pack("CCxxCC",193,194,195,196);
+    # $foo eq "AB\0\0CD"
+
+=item print()
+
+Be careful with scalars and strings passed to print() that contain
+ASCII encodings.  A common place for this to occur is the MIME type
+header emitted by CGI scripts.  For example, many Perl programming
+guides recommend something similar to:
+
+    print "Content-type:\ttext/html\015\012\015\012"; 
+    # this may be wrong on EBCDIC
+
+Under the IBM OS/390 USS Web Server for example you should instead
+write that as:
+
+    print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et alia
+
+That is because the translation from EBCDIC to ASCII is done
+by the web server in this case (such code will not be appropriate for
+the Macintosh however).  Consult your web server's documentation for 
+further details.
+
+=item printf()
+
+The formats that can convert characters to numbers and vice versa
+will be different from their ASCII counterparts when executed
+on an EBCDIC machine.  Examples include:
+
+    printf("%c%c%c",193,194,195);  # prints ABC
+
+=item sort()
+
+EBCDIC sort results may differ from ASCII sort results especially for 
+mixed case strings.  This is discussed in more detail below.
+
+=item sprintf()
+
+See the discussion of printf() above.  An example of the use
+of sprintf would be:
+
+    $CAPITAL_LETTER_A = sprintf("%c",193);
+
+=item unpack()
+
+See the discussion of pack() above.
+
+=back
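+
+The code numbers hard-coded in the examples above are specific to EBCDIC.
+As a minimal sketch (the variable names are only illustrative), code that
+has to run on both kinds of machine can often let perl supply the native
+numbers by calling ord() on character literals:
+
+    # Derive code numbers from literals instead of hard-coding them.
+    my $code_of_A = ord("A");          # 193 on EBCDIC, 65 on ASCII
+    my $capital_A = chr($code_of_A);   # "A" on either kind of machine
+    my $abcd = pack("C4", map { ord($_) } qw(A B C D));
+    my $abc  = sprintf("%c%c%c", map { ord($_) } qw(A B C));
+    print "$capital_A $abcd $abc\n";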
+
+=head1 REGULAR EXPRESSION DIFFERENCES
+
+As of perl 5.005_03 letter range regular expressions such as 
+[A-Z] and [a-z] have been specially coded to not pick up gap 
+characters.  For example a character such as <o WITH CIRCUMFLEX>, 
+which lies between I and J, would not be matched by C</[H-K]/>.  
+If you do want to match such characters in a single octet 
+regular expression try matching the hex or octal code, such 
+as C</\313/> on EBCDIC or C</\364/> on ASCII machines, to 
+have your regular expression match <o WITH CIRCUMFLEX>.
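+
+A small sketch of both points, assuming (as stated above) that 
+<o WITH CIRCUMFLEX> is octal 313 on the EBCDIC code pages and octal 364 
+on ASCII based ones; the variable names are only illustrative:
+
+    # Pick the native code for <o WITH CIRCUMFLEX>.
+    my $is_ebcdic = ord("A") == 193;
+    my $o_circ    = $is_ebcdic ? "\313" : "\364";
+    my $by_code   = $is_ebcdic ? qr/\313/ : qr/\364/;
+
+    print "the range skips the gap character\n" unless $o_circ =~ /[H-K]/;
+    print "the range matches J\n"               if "J" =~ /[H-K]/;
+    print "the octal code matches\n"            if $o_circ =~ $by_code;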
+
+Another place to be wary of is the inappropriate use of hex or
+octal constants in regular expressions.  Consider the following
+set of subs:
+
+    sub is_c0 {
+        my $char = substr(shift,0,1);
+        $char =~ /[\000-\037]/;
+    }
+
+    sub is_print_ascii {
+        my $char = substr(shift,0,1);
+        $char =~ /[\040-\176]/;
+    }
+
+    sub is_delete {
+        my $char = substr(shift,0,1);
+        $char eq "\177";
+    }
+
+    sub is_c1 {
+        my $char = substr(shift,0,1);
+        $char =~ /[\200-\237]/;
+    }
+
+    sub is_latin_1 {
+        my $char = substr(shift,0,1);
+        $char =~ /[\240-\377]/;
+    }
+
+The above would be adequate if the concern was only with numeric codepoints.
+However, we may actually be concerned with characters rather than codepoints 
+and on an EBCDIC machine would like for constructs such as 
+C<if (is_print_ascii("A")) {print "A is a printable character\n";}> to print
+out the expected message.  One way to represent the above collection
+of character classification subs that is capable of working across the
+four coded character sets discussed in this document is as follows:
+
+    sub Is_c0 {
+        my $char = substr(shift,0,1);
+        if (ord('^')==94)  { # ascii
+            return $char =~ /[\000-\037]/;
+        } 
+        if (ord('^')==176) { # 37
+            return $char =~ /[\000-\003\067\055-\057\026\005\045\013-\023\074\075\062\046\030\031\077\047\034-\037]/;
+        }
+        if (ord('^')==95 || ord('^')==106) { # 1047 || posix-bc
+            return $char =~ /[\000-\003\067\055-\057\026\005\025\013-\023\074\075\062\046\030\031\077\047\034-\037]/;
+        }
+    }
+
+    sub Is_print_ascii {
+        my $char = substr(shift,0,1);
+        $char =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/;
+    }
+
+    sub Is_delete {
+        my $char = substr(shift,0,1);
+        if (ord('^')==94)  { # ascii
+            return $char eq "\177";
+        }
+        else  {              # ebcdic
+            return $char eq "\007";
+        }
+    }
+
+    sub Is_c1 {
+        my $char = substr(shift,0,1);
+        if (ord('^')==94)  { # ascii
+            return $char =~ /[\200-\237]/;
+        }
+        if (ord('^')==176) { # 37
+            return $char =~ /[\040-\044\025\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
+        }
+        if (ord('^')==95)  { # 1047
+            return $char =~ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
+        }
+        if (ord('^')==106) { # posix-bc
+            return $char =~ 
+              /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\137]/;
+        }
+    }
+
+    sub Is_latin_1 {
+        my $char = substr(shift,0,1);
+        if (ord('^')==94)  { # ascii
+            return $char =~ /[\240-\377]/;
+        }
+        if (ord('^')==176) { # 37
+            return $char =~ 
+              /[\101\252\112\261\237\262\152\265\275\264\232\212\137\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/;
+        }
+        if (ord('^')==95)  { # 1047
+            return $char =~
+              /[\101\252\112\261\237\262\152\265\273\264\232\212\260\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\272\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/; 
+        }
+        if (ord('^')==106) { # posix-bc
+            return $char =~ 
+              /[\101\252\260\261\237\262\320\265\171\264\232\212\272\312\257\241\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\340\376\335\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\300\336\333\334\215\216\337]/;
+        }
+    }
+
+Note however that only the C<Is_print_ascii()> sub is really independent 
+of coded character set.  Another way to write C<Is_latin_1()> would be 
+to use the characters in the range explicitly:
+
+    sub Is_latin_1 {
+        my $char = substr(shift,0,1);
+        $char =~ /[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
+    }
+
+That form may, however, run into trouble in network transit (due to the 
+presence of 8 bit characters) or on non ISO-Latin character sets.
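+
+The C<ord('^')> tests repeated in the subs above could also be factored 
+into a single helper.  The following is only a sketch: the names 
+C<native_code_set()> and C<Is_delete_2()> are hypothetical, and an 
+unrecognized code set is treated as EBCDIC, just as the C<else> branch 
+of C<Is_delete()> does.
+
+    # Hypothetical helper: identify the native coded character set
+    # using the same ord('^') values as the subs above.
+    sub native_code_set {
+        my $caret = ord('^');
+        return 'ascii'    if $caret == 94;
+        return '0037'     if $caret == 176;
+        return '1047'     if $caret == 95;
+        return 'posix-bc' if $caret == 106;
+        return 'unknown';
+    }
+
+    # <DELETE> is octal 177 on ASCII and octal 007 on the EBCDIC sets.
+    sub Is_delete_2 {
+        my $char = substr(shift,0,1);
+        return $char eq (native_code_set() eq 'ascii' ? "\177" : "\007");
+    }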
+ 
+
+=head1 SOCKETS
+
+Most socket programming assumes ASCII character encodings in network
+byte order.  Exceptions can include CGI script writing under a
+host web server where the server may take care of translation for you.
+Most host web servers convert EBCDIC data to ISO-8859-1 or Unicode on
+output.
+
+=head1 SORTING
+
+One big difference between ASCII based character sets and EBCDIC ones
+is the relative positions of upper and lower case letters and of the
+letters compared to the digits.  If sorted on an ASCII based machine the
+two letter abbreviation for a physician comes before the two letter
+abbreviation for drive, that is:
+
+    @sorted = sort(qw(Dr. dr.));  # @sorted holds qw(Dr. dr.) on ASCII,
+                                  # qw(dr. Dr.) on EBCDIC
+
+The property of lower case before uppercase letters in EBCDIC is
+even carried to the Latin 1 EBCDIC pages such as 0037 and 1047.
+An example would be that <E WITH DIAERESIS> (203) comes before
+<e WITH DIAERESIS> (235) on an ASCII machine, but the latter (83) 
+comes before the former (115) on an EBCDIC machine.  (Astute readers will 
+note that the upper case version of <SMALL LETTER SHARP S> is 
+simply "SS" and that the upper case version of <y WITH DIAERESIS> 
+is not in the 0..255 range but it is at U+0178 in Unicode).
+
+The sort order will cause differences between results obtained on
+ASCII machines versus EBCDIC machines.  What follows are some suggestions
+on how to deal with these differences.
+
+=head2 Ignore ASCII vs EBCDIC sort differences.
+
+This is the least computationally expensive strategy.  It may require
+some user education.
+
+=head2 MONOCASE then sort data.
+
+In order to minimize the expense of monocasing mixed case text, try to
+C<tr///> towards the character set case most employed within the data.
+If the data are primarily UPPERCASE non Latin 1 then apply C<tr/a-z/A-Z/>
+then sort().  If the data are primarily lowercase non Latin 1 then
+apply C<tr/A-Z/a-z/> before sorting.  If the data are primarily UPPERCASE
+and include Latin-1 characters then apply:  C<tr/a-z/A-Z/>; 
+XXX
+
+This strategy does not preserve the case of the data and may not be
+acceptable.
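+
+If the case of the data must survive, one alternative (a sketch only, not
+benchmarked here) is to leave the data alone and monocase inside the
+comparison, at the cost of calling lc() for every comparison.  Strings
+that differ only in case compare as equal, so their relative order is 
+not determined by this comparison:
+
+    my @words  = qw(banana Apple cherry);
+    my @sorted = sort { lc($a) cmp lc($b) } @words;
+    print "@sorted\n";   # Apple banana cherry, on ASCII and on EBCDIC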
+
+=head2 Convert, sort data, then reconvert.
+
+This is the most expensive proposition that does not employ a network
+connection.
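+
+A closely related approach, sketched here under stated assumptions, is to
+sort on converted keys so that the data themselves never need to be
+reconverted.  The sub name C<sort_by_table()> is hypothetical.  On a real
+EBCDIC machine the table passed in would map native code numbers to ASCII
+ones (for instance the inverse of the C<@a2e_1047> array shown under
+L<perlebcdic/URL ENCODING and DECODING>); the toy table below merely swaps
+the codes of upper and lower case letters so that the effect can be seen
+on any machine:
+
+    # Hypothetical helper.  $n2a is a reference to a 256 element array
+    # that maps native code numbers to the code numbers to sort by.
+    sub sort_by_table {
+        my ($n2a, @strings) = @_;
+        my %key;
+        for my $s (@strings) {
+            $key{$s} = join '', map { chr($n2a->[ord($_)]) } split(//, $s);
+        }
+        return sort { $key{$a} cmp $key{$b} } @strings;
+    }
+
+    # Toy table for demonstration only: identity, with the codes of
+    # upper and lower case letters swapped.
+    my @swap_case = (0 .. 255);
+    for my $lc ('a' .. 'z') {
+        my $uc = uc($lc);
+        @swap_case[ord($lc), ord($uc)] = (ord($uc), ord($lc));
+    }
+
+    # With this table the native order of "Dr." and "dr." is reversed.
+    my @sorted = sort_by_table(\@swap_case, qw(Dr. dr.));
+    print "@sorted\n";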
+
+=head2 Perform sorting on one type of machine only.
+
+This strategy can employ a network connection.  As such
+it would be computationally expensive.
+
+=head1 URL ENCODING and DECODING
+
+Note that some URLs have hexadecimal ASCII codepoints in them in an
+attempt to overcome character limitation issues.  For example the
+tilde character is not on every keyboard hence a URL of the form:
+
+    http://www.pvhp.com/~pvhp/
+
+may also be expressed as either of:
+
+    http://www.pvhp.com/%7Epvhp/
+
+    http://www.pvhp.com/%7epvhp/
+
+where 7E is the hexadecimal ASCII codepoint for '~'.  Here is an example
+of decoding such a URL under CCSID 1047:
+
+    $url = 'http://www.pvhp.com/%7Epvhp/';
+    # this array assumes code page 1047
+    my @a2e_1047 = (
+          0,  1,  2,  3, 55, 45, 46, 47, 22,  5, 21, 11, 12, 13, 14, 15,
+         16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31,
+         64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97,
+        240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111,
+        124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214,
+        215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109,
+        121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150,
+        151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161,  7,
+         32, 33, 34, 35, 36, 37,  6, 23, 40, 41, 42, 43, 44,  9, 10, 27,
+         48, 49, 26, 51, 52, 53, 54,  8, 56, 57, 58, 59,  4, 20, 62,255,
+         65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188,
+        144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171,
+        100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119,
+        172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89,
+         68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87,
+        140, 73,205,206,203,207,204,225,112,221,222,219,220,141,142,223
+    );
+    # "C" (unsigned char) is used because the table holds values up to 255
+    $url =~ s/%([0-9a-fA-F]{2})/pack("C",$a2e_1047[hex($1)])/ge;
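+
+The same substitution can be wrapped in a sub that takes the translation
+table as an optional argument, so that one script can serve both kinds of
+machine.  This is only a sketch; the name C<url_decode()> and its calling
+convention are assumptions made for illustration:
+
+    # Hypothetical helper.  $a2n, if given, is a reference to an array
+    # mapping ASCII code numbers to native ones, such as \@a2e_1047.
+    sub url_decode {
+        my ($url, $a2n) = @_;
+        $url =~ s/%([0-9a-fA-F]{2})/chr($a2n ? $a2n->[hex($1)] : hex($1))/ge;
+        return $url;
+    }
+
+    # On an ASCII machine no table is needed:
+    print url_decode('http://www.pvhp.com/%7Epvhp/'), "\n";
+    # On an EBCDIC machine running code page 1047:
+    #   print url_decode('http://www.pvhp.com/%7Epvhp/', \@a2e_1047), "\n";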
+
+=head1 I18N AND L10N
+
+Internationalization (I18N) and localization (L10N) are supported at least 
+in principle even on EBCDIC machines.  The details are system dependent 
+and discussed under the L<perlebcdic/OS ISSUES> section below.
+
+=head1 MULTI OCTET CHARACTER SETS
+
+Double byte EBCDIC code pages (?) XXX.
+
+UTF-8, UTF-EBCDIC, (?) XXX.
+
+=head1 OS ISSUES
+
+There may be a few system dependent issues 
+of concern to EBCDIC Perl programmers.
+
+=head2 OS/400 
+
+=over 8
+
+=item IFS access
+
+XXX.
+
+=back
+
+=head2 OS/390 
+
+=over 8
+
+=item dataset access
+
+For sequential data set access try:
+
+    my @ds_records = `cat //DSNAME`;
+
+or:
+
+    my @ds_records = `cat //'HLQ.DSNAME'`;
+
+See also the OS390::Stdio module on CPAN.
+
+=item locales
+
+On OS/390 see L<locale> for information on locales.  The L10N files
+are in F</usr/nls/locale>.  $Config{d_setlocale} is 'define' on OS/390.
+
+=back
+
+=head2 VM/ESA?
+
+XXX.
+
+=head2 POSIX-BC?
+
+XXX.
+
+=head1 REFERENCES
+
+http://anubis.dkuug.dk/i18n/charmaps
+
+L<perllocale>.
+
+http://www.unicode.org/
+
+http://www.unicode.org/unicode/reports/tr16/
+
+B<The Unicode Standard Version 2.0> The Unicode Consortium, 
+ISBN 0-201-48345-9, Addison Wesley Developers Press, July 1996. 
+
+B<CDRA: IBM - Character Data Representation Architecture - 
+Reference and Registry>, IBM SC09-2190-00, December 1996. 
+
+"Demystifying Character Sets", Andrea Vine, Multilingual Computing 
+& Technology, B<#26 Vol. 10 Issue 4>, August/September 1999;
+ISSN 1523-0309; Multilingual Computing Inc. Sandpoint ID, USA.
+
+=head1 AUTHOR
+
+Peter Prymmer E<lt>pvhp@best.comE<gt> wrote this in 1999 and 2000 
+with CCSID 0819 and 0037 help from Chris Leach and 
+Andre' Pirard E<lt>A.Pirard@ulg.ac.beE<gt> as well as POSIX-BC 
+help from Thomas Dorner E<lt>Thomas.Dorner@start.deE<gt>.
+Thanks also to Philip Newton and Vickie Cooper.  Trademarks, registered 
+trademarks, service marks and registered service marks used in this 
+document are the property of their respective owners.
+
+