Document special EBCDIC [...] literal range handling

author: Karl Williamson <khw@cpan.org> 2014-10-06 12:14:36 -0600
committer: Karl Williamson <khw@cpan.org> 2014-10-07 08:51:11 -0600
commit: 09e4339761388239d17da23bf3fa0c882a0b04bf (patch)
tree: a978a3c7d9e007c0d8fc6d6757ffa9c47bdde128 /pod/perlrecharclass.pod
parent: 423df6e4ea0fd95811eb041174e9e88a3e25975f (diff)
download: perl-09e4339761388239d17da23bf3fa0c882a0b04bf.tar.gz
1 files changed, 30 insertions, 5 deletions
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 3a38e5626c..4ab99ac54b 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -600,11 +600,6 @@ your set of characters to be matched and its position in the class is such
 that it could be considered part of a range, you must escape that hyphen
 with a backslash.
 
-The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
-they always match exactly the 26 upper/lower case letters, regardless
-of the platform (this only effects EBCDIC, which would otherwise include 
-some non-letters).
-
 Examples:
 
  [a-z]       #  Matches a character that is a lower case ASCII letter.
@@ -616,6 +611,36 @@ Examples:
  ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
              #  (But not on an EBCDIC platform).
 
+Perl guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
+subranges of these match what an English-only speaker would expect them
+to match.  That is, C<[A-Z]> matches the 26 ASCII uppercase letters;
+C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10
+digits.  Subranges, like C<[h-k]>, match correspondingly, in this case
+just the four letters C<"h">, C<"i">, C<"j">, and C<"k">.  This is the
+natural behavior on ASCII platforms where the code points (ordinal
+values) for C<"h"> through C<"k"> are consecutive integers (0x68 through
+0x6B).  But special handling to achieve this may be needed on platforms
+with a non-ASCII native character set.  For example, on EBCDIC
+platforms, the code point for C<"h"> is 0x88, C<"i"> is 0x89, C<"j"> is
+0x91, and C<"k"> is 0x92.  Perl specially treats C<[h-k]> to exclude the
+seven code points in the gap: 0x8A through 0x90.  This special handling is
+only invoked when the range is a subrange of one of the ASCII uppercase,
+lowercase, or digit ranges, AND each end of the range is expressed
+either as a literal, like C<"A">, or as a named character (C<\N{...}>,
+including the C<\N{U+...}> form).
+
+EBCDIC Examples:
+
+ [i-j]               #  Matches either "i" or "j"
+ [i-\N{LATIN SMALL LETTER J}]  # Same
+ [i-\N{U+6A}]        #  Same
+ [\N{U+69}-\N{U+6A}] #  Same
+ [\x{89}-\x{91}]     #  Matches 0x89 ("i"), 0x8A .. 0x90, 0x91 ("j")
+ [i-\x{91}]          #  Same
+ [\x{89}-j]          #  Same
+ [i-J]               #  Matches 0x89 ("i") .. 0xD1 ("J"); special
+                     #  handling doesn't apply because range is mixed
+                     #  case
 
 =head3 Negation
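
The gap described in the added paragraph can be checked outside Perl. The sketch below is only an illustrative model, not Perl's implementation: it assumes the cp037 code page (one EBCDIC variant that ships with Python's standard codecs), and the helper names `ebcdic_ord`, `naive_range`, and `portable_range` are hypothetical. It compares a range taken over native code points with a range taken over the ASCII letters and then mapped to EBCDIC, which is the result the patch documents for C<[h-k]>.

```python
# Illustrative model only; this is not how Perl's regex compiler is written.
CODEC = "cp037"  # assumption: one common EBCDIC code page available in Python


def ebcdic_ord(ch):
    """Return the cp037 code point of a single ASCII character."""
    return ch.encode(CODEC)[0]


def naive_range(lo, hi):
    """Every native code point from lo to hi, gap included."""
    return list(range(ebcdic_ord(lo), ebcdic_ord(hi) + 1))


def portable_range(lo, hi):
    """Only the code points of the ASCII characters lo..hi."""
    return [ebcdic_ord(chr(c)) for c in range(ord(lo), ord(hi) + 1)]


naive = naive_range("h", "k")
portable = portable_range("h", "k")
gap = sorted(set(naive) - set(portable))

print("portable [h-k]:", [hex(c) for c in portable])
print("excluded gap:  ", [hex(c) for c in gap])
```

Under this assumption the portable range holds the four code points for "h", "i", "j", and "k", and the excluded set is the seven code points 0x8A through 0x90 named in the patch text.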