author | Karl Williamson <public@khwilliamson.com> | 2011-02-19 10:20:50 -0700 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2011-02-19 11:47:42 -0700 |
commit | 17580e7a366d68c68fa37fe63c284c1d83b245fe (patch) | |
tree | 083c8313722786593172d0a6ff14a4767c9d520c /pod/perlre.pod | |
parent | 41ce0a5ea4efb0067c8416f074bb757a70d2faa1 (diff) | |
Fix locale caseless matching and utf8
As explained in the doc changes of this patch, under /l, caseless
matching of code points less than 256 now uses locale rules regardless
of the utf8ness of the pattern or string. This now matches the behavior
of things like \w in this regard.
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- | pod/perlre.pod | 14 |
1 files changed, 11 insertions, 3 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 56e42f8088..31e881703c 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -52,7 +52,8 @@ X<regular expression, case-insensitive>
 Do case-insensitive pattern matching.
 
 If C<use locale> is in effect, the case map is taken from the current
-locale. See L<perllocale>.
+locale for code points less than 255, and from Unicode rules for larger
+code points. See L<perllocale>.
 
 =item x
 X</x>
@@ -655,8 +656,15 @@ locale, and can differ from one match to another if there is an intervening
 call of the
 L<setlocale() function|perllocale/The setlocale function>.
 This modifier is automatically set if the regular expression is compiled
-within the scope of a C<"use locale"> pragma. Results are not
-well-defined when using this and matching against a utf8-encoded string.
+within the scope of a C<"use locale"> pragma.
+Perl only allows single-byte locales. This means that code points above
+255 are treated as Unicode no matter what locale is in effect.
+Under Unicode rules, there are a few case-insensitive matches that cross the
+boundary 255/256 boundary. These are disallowed. For example,
+0xFF does not caselessly match the character at 0x178, LATIN CAPITAL
+LETTER Y WITH DIAERESIS, because 0xFF may not be LATIN SMALL LETTER Y
+in the current locale, and Perl has no way of knowing if that character
+even exists in the locale, much less what code point it is.
 
 C<"u"> means to use Unicode semantics when pattern matching. It is
 automatically set if the regular expression is encoded in utf8, or is