author | Karl Williamson <public@khwilliamson.com> | 2013-03-11 21:13:38 -0600 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2013-03-11 21:21:03 -0600 |
commit | 4b9734bf16232aac75ed56df6352c09d1caad7b3 (patch) | |
tree | b1fe580a02d6ae63b54c758d033c9722519e86b5 | |
parent | 020c4f9110283940e8755ca2f70f6e943b42efe3 (diff) | |
download | perl-4b9734bf16232aac75ed56df6352c09d1caad7b3.tar.gz |
EBCDIC has the Unicode bug too
We have not had a working modern Perl on EBCDIC for some years. When I
started out, comments and code led me to conclude, erroneously, that
Perl on EBCDIC natively supported semantics for all 256 characters 0-255.
It turns out that I was wrong: natively (at least on some platforms) it
applies the same rules (essentially none) to the characters that don't
correspond to ASCII ones as ASCII platforms apply to non-ASCII characters.
This commit is documentation only, mostly just removing the special
mentions of EBCDIC.
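
The behavior this message refers to as the Unicode bug can be sketched in Perl. This is a minimal illustration, not part of the commit: it assumes an ASCII platform and Perl 5.14 or later (needed for the C</u> modifier), and the helper names `matches_word_default`, `matches_word_unicode`, `uc_native`, and `uc_unicode` are hypothetical, chosen only for this sketch.

```perl
use strict;
use warnings;

# Illustrative only: assumes an ASCII platform and Perl 5.14+.
# 0xE9 is LATIN SMALL LETTER E WITH ACUTE in Latin-1/Unicode.
my $byte_str = "\xE9";      # stored as one native byte (not UTF-8 internally)
my $utf8_str = "\xE9";
utf8::upgrade($utf8_str);   # same character, but UTF-8 encoded internally

# Hypothetical helpers, named for this sketch only.
# Default (/d) rules: the result depends on the string's internal encoding.
sub matches_word_default { my ($s) = @_; return $s =~ /\w/ ? 1 : 0 }

# Explicit Unicode (/u) rules: the internal encoding no longer matters.
sub matches_word_unicode { my ($s) = @_; return $s =~ /\w/u ? 1 : 0 }

# Case change without and with the unicode_strings feature.
sub uc_native  { my ($s) = @_; return uc $s }
sub uc_unicode { my ($s) = @_; use feature 'unicode_strings'; return uc $s }

printf "default \\w: bytes=%d utf8=%d\n",
    matches_word_default($byte_str), matches_word_default($utf8_str);
printf "unicode \\w: bytes=%d utf8=%d\n",
    matches_word_unicode($byte_str), matches_word_unicode($utf8_str);
printf "uc native: 0x%X, uc unicode_strings: 0x%X\n",
    ord(uc_native($byte_str)), ord(uc_unicode($byte_str));
```

Under those assumptions, the non-UTF-8 string fails the default `\w` match and is left unchanged by `uc`, while the upgraded string, the `/u` match, and the `unicode_strings` scope all follow Unicode rules. The commit's point is that EBCDIC platforms show the same split for characters with no ASCII equivalent.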
-rw-r--r-- | autodoc.pl | 5 |
-rw-r--r-- | handy.h | 22 |
-rw-r--r-- | pod/perlfunc.pod | 20 |
-rw-r--r-- | pod/perlre.pod | 12 |
-rw-r--r-- | pod/perlrecharclass.pod | 16 |
-rw-r--r-- | pod/perlunicode.pod | 9 |
-rw-r--r-- | pod/perlunifaq.pod | 4 |
7 files changed, 21 insertions, 67 deletions
diff --git a/autodoc.pl b/autodoc.pl
index 3b39696af5..925f2f541f 100644
--- a/autodoc.pl
+++ b/autodoc.pl
@@ -415,11 +415,6 @@ But the ordinals of characters differ between ASCII, EBCDIC, and
 the UTF- encodings, and a string encoded in UTF-EBCDIC may occupy more bytes
 than in UTF-8.

-Also, on some EBCDIC machines, functions that are documented as operating on
-US-ASCII (or Basic Latin in Unicode terminology) may in fact operate on all
-256 characters in the EBCDIC range, not just the subset corresponding to
-US-ASCII.
-
 The listing below is alphabetical, case insensitive.

 _EOB_
diff --git a/handy.h b/handy.h
--- a/handy.h
+++ b/handy.h
@@ -489,19 +489,15 @@ Perl rules. If the input is a number that doesn't fit in an octet, FALSE is
 always returned.

 Variant C<isFOO_A> (e.g., C<isALPHA_A()>) will return TRUE only if the input is
-also in the ASCII character set. For ASCII platforms, the base function with
-no suffix and the one with the C<_A> suffix are identical. On EBCDIC
-platforms, the C<_A> suffix function will not return true unless the specified
-character also has an ASCII equivalent.
-
-Variant C<isFOO_L1> operates on the full Latin1 character set. For EBCDIC
-platforms, the base function with no suffix and the one with the C<_L1> suffix
-are identical. For ASCII platforms, the C<_L1> suffix imposes the Latin-1
-character set onto the platform. That is, the code points that are ASCII are
-unaffected, since ASCII is a subset of Latin-1. But the non-ASCII code points
-are treated as if they are Latin-1 characters. For example, C<isSPACE_L1()>
-will return true when called with the code point 0xA0, which is the Latin-1
-NO-BREAK SPACE.
+also in the ASCII character set. The base function with no suffix and the one
+with the C<_A> suffix are identical.
+
+Variant C<isFOO_L1> imposes the Latin-1 (or EBCDIC equivlalent) character set
+onto the platform. That is, the code points that are ASCII are unaffected,
+since ASCII is a subset of Latin-1. But the non-ASCII code points are treated
+as if they are Latin-1 characters. For example, C<isWORDCHAR_L1()> will return
+true when called with the code point 0xDF, which is a word character in both
+ASCII and EBCDIC (though it represent different characters in each).

 Variant C<isFOO_uni> is like the C<isFOO_L1> variant, but accepts any UV code
 point as input. If the code point is larger than 255, Unicode rules are used
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 2c1e5f41fb..468b6b06b6 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -3287,19 +3287,9 @@ What gets returned depends on several factors:

 =item If C<use bytes> is in effect:

-=over
-
-=item On EBCDIC platforms
-
-The results are what the C language system call C<tolower()> returns.
-
-=item On ASCII platforms
-
 The results follow ASCII semantics. Only characters C<A-Z> change, to C<a-z>
 respectively.

-=back
-
 =item Otherwise, if C<use locale> (but not C<use locale ':not_characters'>) is in effect:

 Respects current LC_CTYPE locale for code points < 256; and uses Unicode
@@ -3326,21 +3316,11 @@ Unicode semantics are used for the case change.

 =item Otherwise:

-=over
-
-=item On EBCDIC platforms
-
-The results are what the C language system call C<tolower()> returns.
-
-=item On ASCII platforms
-
 ASCII semantics are used for the case change. The lowercase of any character
 outside the ASCII range is the character itself.

 =back

-=back
-
 =item lcfirst EXPR
 X<lcfirst> X<lowercase>

diff --git a/pod/perlre.pod b/pod/perlre.pod
index 60173d60a0..80aa306b23 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -272,14 +272,6 @@ presenting another potential security issue. See
 L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
 security issues.

-On the EBCDIC platforms that Perl handles, the native character set is
-equivalent to Latin-1. Thus this modifier changes behavior only when
-the C<"/i"> modifier is also specified, and it turns out it affects only
-two characters, giving them full Unicode semantics: the C<MICRO SIGN>
-will match the Greek capital and small letters C<MU>, otherwise not; and
-the C<LATIN CAPITAL LETTER SHARP S> will match any of C<SS>, C<Ss>,
-C<sS>, and C<ss>, otherwise not.
-
 This modifier may be specified to be the default by C<use feature
 'unicode_strings>, C<use locale ':not_characters'>, or
 C<L<use 5.012|perlfunc/use VERSION>> (or higher),
@@ -326,8 +318,8 @@ results. See L<perlunicode/The "Unicode Bug">.
 The Unicode Bug has become rather infamous, leading to yet another (printable)
 name for this modifier, "Dodgy".

-On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
-(at least the ones that Perl handles), they are Latin-1.
+Unless the pattern or string are encoded in UTF-8, only ASCII characters
+can match positively.

 Here are some examples of how that works on an ASCII platform:

diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 681cd06e2a..2611618916 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -174,7 +174,7 @@ are generally used to add auxiliary markings to letters.
 C<\w> matches the platform's native underscore character plus whatever
 the locale considers to be alphanumeric.

-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if Unicode rules are in effect ...

 C<\w> matches exactly what C<\p{Word}> matches.

@@ -232,7 +232,7 @@ in the table below.

 C<\s> matches whatever the locale considers to be whitespace.

-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if Unicode rules are in effect ...

 C<\s> matches exactly the characters shown with an "s" column in the table
 below.
@@ -289,8 +289,8 @@ C<\s>, C<\h> and C<\v> as of Unicode 6.0.

 The first column gives the Unicode code point of the character (in hex format),
 the second column gives the (Unicode) name. The third column indicates
-by which class(es) the character is matched (assuming no locale or EBCDIC code
-page is in effect that changes the C<\s> matching).
+by which class(es) the character is matched (assuming no locale is in
+effect that changes the C<\s> matching).

 0x0009 CHARACTER TABULATION h s
 0x000a LINE FEED (LF) vs
@@ -732,10 +732,6 @@ the terminal somehow: for example, newline and backspace are control characters.
 In the ASCII range, characters whose code points are between 0 and 31 inclusive,
 plus 127 (C<DEL>) are control characters.

-On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]>
-to be the EBCDIC equivalents of the ASCII controls, plus the controls
-that in Unicode have code points from 128 through 159.
-
 =item [3]

 Any character that is I<graphical>, that is, visible. This class consists
@@ -815,7 +811,7 @@ The POSIX class matches according to the locale, except that
 C<word> uses the platform's native underscore character, no matter what
 the locale is.

-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if Unicode rules are in effect ...

 The POSIX class matches the same as the Full-range counterpart.

@@ -834,7 +830,7 @@ L<perlre/Which character set modifier is in effect?>.

 It is proposed to change this behavior in a future release of Perl so that
 whether or not Unicode rules are in effect would not change the
-behavior: Outside of locale or an EBCDIC code page, the POSIX classes
+behavior: Outside of locale, the POSIX classes
 would behave like their ASCII-range counterparts. If you wish to
 comment on this proposal, send email to C<perl5-porters@perl.org>.

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 7a0b91593e..7a98285acc 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -98,13 +98,8 @@ while C<use locale ':not_characters'> effectively also selects
 C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
 Otherwise, Perl uses the platform's
 native byte semantics for characters whose code points are less than 256, and
-Unicode semantics for those greater than 255. On EBCDIC platforms, this
-is almost seamless, as the EBCDIC code pages that Perl handles are
-equivalent to Unicode's first 256 code points. (The exception is that
-EBCDIC regular expression case-insensitive matching rules are not as
-as robust as Unicode's.) But on ASCII platforms, Perl uses US-ASCII
-(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
-whose ordinal numbers are in the range 128 - 255 are undefined except for their
+Unicode semantics for those greater than 255. That means that non-ASCII
+characters are undefined except for their
 ordinal numbers. This means that none have case (upper and lower), nor are any
 a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
 to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod
index ca3a180cfd..f952d1a3f9 100644
--- a/pod/perlunifaq.pod
+++ b/pod/perlunifaq.pod
@@ -149,8 +149,8 @@ rely on the way things worked before Unicode came along. Those older
 programs knew only about the ASCII character set, and so may not work
 properly for additional characters. When a string is encoded in UTF-8,
 Perl assumes that the program is prepared to deal with Unicode, but when
-the string isn't, Perl assumes that only ASCII (unless it is an EBCDIC
-platform) is wanted, and so those characters that are not ASCII
+the string isn't, Perl assumes that only ASCII
+is wanted, and so those characters that are not ASCII
 characters aren't recognized as to what they would be in Unicode.
 C<use feature 'unicode_strings'> tells Perl to treat all characters as
 Unicode, whether the string is encoded in UTF-8 or not, thus avoiding