author | Karl Williamson <khw@cpan.org> | 2014-11-24 13:19:21 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2014-11-24 13:43:07 -0700 |
commit | c7d255944c0b238f9cec18e728822535d42a9ed2 (patch) | |
tree | 4ac5dfc5e6cbd25c3a26fad3f166b37ab639acca /pod | |
parent | 22e7ef05c1f7a7fcd58d10d6e720579b9bbea728 (diff) | |
Make /[\N{}-\N{}]/ match Unicodely on EBCDIC
This makes [\N{U+06}-\N{U+09}] match U+06, U+07, U+08, U+09 even on
EBCDIC platforms, allowing one to write portable ranges. On code page
1047 EBCDIC, those are the native code points 0x2E, 0x2F, 0x16, and 0x05.
Thanks to Yaroslav Kuzmin for finding a bug in an earlier incarnation
of this patch.
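As a rough illustration of the behavior described in the commit message (it is not part of the patch), the sketch below lists which Unicode code points below U+100 the range matches. It assumes a perl that includes this change. The names `$range` and `@matched` are illustrative only. On an ASCII platform native and Unicode code points coincide, so the script prints U+06 through U+09 there with or without the patch. The point of the commit is that an EBCDIC build should print the same four lines.

```perl
use strict;
use warnings;

# Range whose endpoints are written as Unicode code points.
my $range = qr/[\N{U+06}-\N{U+09}]/;

# Walk the Unicode code points U+00..U+FF. utf8::unicode_to_native()
# converts each one to the platform's native code point (an identity
# mapping on ASCII platforms) so that chr() yields the intended character.
my @matched = grep { chr(utf8::unicode_to_native($_)) =~ $range } 0x00 .. 0xFF;

printf "U+%02X\n", $_ for @matched;
```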
Diffstat (limited to 'pod')
-rw-r--r-- | pod/perlre.pod | 10 |
-rw-r--r-- | pod/perlrecharclass.pod | 18 |
2 files changed, 23 insertions, 5 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 891eb34644..f11e5ff268 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -2312,8 +2312,14 @@ Note also that the whole range idea is rather unportable between
 character sets--and even within character sets they may cause results
 you probably didn't expect. A sound principle is to use only ranges
 that begin from and end at either alphabetics of equal case ([a-e],
-[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
-spell out the character sets in full.
+[A-E]), or digits ([0-9]). Anything else is unsafe or unclear. If in
+doubt, spell out the character sets in full. Specifying the end points
+of the range using the C<\N{...}> syntax, using Unicode names or code
+points makes the range portable, but still likely not easily
+understandable to someone reading the code. For example,
+C<[\N{U+04}-\N{U+07}]> means to match the Unicode code points
+C<\N{U+04}>, C<\N{U+05}>, C<\N{U+06}>, and C<\N{U+07}>, whatever their
+native values may be on the platform.
 
 Characters may be specified using a metacharacter syntax much like that
 used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index c79c9a0399..fb5868d521 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -608,10 +608,22 @@ Examples:
                 # hyphen ('-'), or the letter 'm'.
  ['-?]          # Matches any of the characters '()*+,-./0123456789:;<=>?
                 # (But not on an EBCDIC platform).
-
-Perl guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
+ [\N{APOSTROPHE}-\N{QUESTION MARK}]
+                # Matches any of the characters '()*+,-./0123456789:;<=>?
+                # even on an EBCDIC platform.
+ [\N{U+27}-\N{U+3F}] # Same. (U+27 is "'", and U+3F is "?"
+
+As the final two examples above show, you can achieve portablity to
+non-ASCII platforms by using the C<\N{...}> form for the range
+endpoints. These indicate that the specified range is to be interpreted
+using Unicode values, so C<[\N{U+27}-\N{U+3F}]> means to match
+C<\N{U+27}>, C<\N{U+28}>, C<\N{U+29}>, ..., C<\N{U+3D}>, C<\N{U+3E}>,
+and C<\N{U+3F}>, whatever the native code point versions for those are.
+
+Perl also guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
 subranges of these match what an English-only speaker would expect them
-to match. That is, C<[A-Z]> matches the 26 ASCII uppercase letters;
+to match on any platform. That is, C<[A-Z]> matches the 26 ASCII
+uppercase letters;
 C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10
 digits. Subranges, like C<[h-k]>, match correspondingly, in this case
 just the four letters C<"h">, C<"i">, C<"j">, and C<"k">. This is the
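The two portable spellings documented in the perlrecharclass hunk, together with the guaranteed `[h-k]` subrange, can be tried out with a small script such as the one below. This is only a sketch under stated assumptions: the sample strings and the variable names (`$named`, `$numeric`, `@hk`) are made up for illustration, and the EBCDIC behavior assumes a perl containing this commit.

```perl
use strict;
use warnings;
use charnames ':full';   # only required for \N{NAME} on perls older than 5.16

# The same range, spelled with Unicode names and with code points.
my $named   = qr/^[\N{APOSTROPHE}-\N{QUESTION MARK}]+$/;
my $numeric = qr/^[\N{U+27}-\N{U+3F}]+$/;

# Hypothetical sample inputs: the first consists solely of characters
# between "'" and "?", the second contains letters outside that range.
for my $s ("(1+2)*3", "a+b") {
    printf "%-8s named=%d numeric=%d\n", $s,
        ($s =~ $named   ? 1 : 0),
        ($s =~ $numeric ? 1 : 0);
}

# [h-k] is a subrange of a-z, so it should select just h, i, j and k.
my @hk = grep { /[h-k]/ } 'a' .. 'z';
print "@hk\n";
```

Both patterns should agree on every input, since they describe the same set of Unicode code points.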