author: Karl Williamson <khw@cpan.org> 2014-10-06 12:14:36 -0600
committer: Karl Williamson <khw@cpan.org> 2014-10-07 08:51:11 -0600
commit: 09e4339761388239d17da23bf3fa0c882a0b04bf
tree: a978a3c7d9e007c0d8fc6d6757ffa9c47bdde128 /pod/perlrecharclass.pod
parent: 423df6e4ea0fd95811eb041174e9e88a3e25975f
Document special EBCDIC [...] literal range handling
Diffstat (limited to 'pod/perlrecharclass.pod')
-rw-r--r-- pod/perlrecharclass.pod 35
1 file changed, 30 insertions, 5 deletions
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 3a38e5626c..4ab99ac54b 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -600,11 +600,6 @@ your set of characters to be matched and its position in the class is such
that it could be considered part of a range, you must escape that hyphen
with a backslash.
-The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
-they always match exactly the 26 upper/lower case letters, regardless
-of the platform (this only effects EBCDIC, which would otherwise include
-some non-letters).
-
Examples:
[a-z] # Matches a character that is a lower case ASCII letter.
@@ -616,6 +611,36 @@ Examples:
['-?] # Matches any of the characters '()*+,-./0123456789:;<=>?
# (But not on an EBCDIC platform).
+Perl guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
+subranges of these match what an English-only speaker would expect them
+to match. That is, C<[A-Z]> matches the 26 ASCII uppercase letters;
+C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10
+digits. Subranges, like C<[h-k]>, match correspondingly, in this case
+just the four letters C<"h">, C<"i">, C<"j">, and C<"k">. This is the
+natural behavior on ASCII platforms where the code points (ordinal
+values) for C<"h"> through C<"k"> are consecutive integers (0x68 through
+0x6B). But special handling to achieve this may be needed on platforms
+with a non-ASCII native character set. For example, on EBCDIC
+platforms, the code point for C<"h"> is 0x88, C<"i"> is 0x89, C<"j"> is
+0x91, and C<"k"> is 0x92. Perl specially treats C<[h-k]> to exclude the
+seven code points in the gap: 0x8A through 0x90. This special handling is
+only invoked when the range is a subrange of one of the ASCII uppercase,
+lowercase, and digit ranges, AND each end of the range is expressed
+either as a literal, like C<"A">, or as a named character (C<\N{...}>,
+including the C<\N{U+...}> form).
+
+EBCDIC Examples:
+
+ [i-j] # Matches either "i" or "j"
+ [i-\N{LATIN SMALL LETTER J}] # Same
+ [i-\N{U+6A}] # Same
+ [\N{U+69}-\N{U+6A}] # Same
+ [\x{89}-\x{91}] # Matches 0x89 ("i"), 0x8A .. 0x90, 0x91 ("j")
+ [i-\x{91}] # Same
+ [\x{89}-j] # Same
+ [i-J] # Matches 0x89 ("i") .. 0xD1 ("J"); special
+ # handling doesn't apply because range is mixed
+ # case
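
The guarantee described above can be checked with a small script. This is only a sketch and not part of the patch: the helper name matched_letters is hypothetical, and code page 1047 (reached through the core Encode module) is assumed here as a representative EBCDIC encoding. The script lists which lowercase letters a literal range matches, then prints the EBCDIC code points of "h" through "k" so that the gap between "i" and "j" is visible.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the letters in "a".."z" that the given
# bracketed character class matches, joined into one string.
sub matched_letters {
    my ($class) = @_;
    return join '', grep { $_ =~ $class } 'a' .. 'z';
}

# Both ends of the range are literals, so the documented special
# handling applies and only the letters themselves should match.
print matched_letters(qr/[h-k]/), "\n";

# Show where the same letters sit in EBCDIC (cp1047 is an assumption;
# other EBCDIC code pages may differ for non-letter characters).
for my $ch ('h' .. 'k') {
    printf "%s => 0x%02X\n", $ch, ord encode('cp1047', $ch);
}
```

On an ASCII platform the first line of output simply reflects consecutive code points. The second part shows why an EBCDIC platform needs the special case: "i" and "j" are not adjacent there.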
=head3 Negation