author Karl Williamson <public@khwilliamson.com> 2011-02-19 10:20:50 -0700
committer Karl Williamson <public@khwilliamson.com> 2011-02-19 11:47:42 -0700
commit 17580e7a366d68c68fa37fe63c284c1d83b245fe (patch)
tree 083c8313722786593172d0a6ff14a4767c9d520c /pod/perlre.pod
parent 41ce0a5ea4efb0067c8416f074bb757a70d2faa1 (diff)
Fix locale caseless matching and utf8
As explained in the doc changes of this patch, under /l, caseless matching of code points less than 256 now uses locale rules regardless of the utf8ness of the pattern or string. This matches the behavior of things like \w in this regard.
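The behavior described above can be sketched with a short script. This is an illustration only, not part of the patch: the helper name caseless_match is hypothetical, the script assumes a Perl new enough to accept the /l and /u modifiers (5.14 or later), and it pins LC_CTYPE to the "C" locale because later Perls treat UTF-8 locales differently. Going by the doc text in this patch, the /iu case should match and the /il case should not.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Assumption: pin LC_CTYPE to "C" so the result does not depend on the
# caller's environment.
setlocale(LC_CTYPE, "C");

my $y_small = "\x{FF}";     # LATIN SMALL LETTER Y WITH DIAERESIS under Unicode rules
my $y_cap   = "\x{178}";    # LATIN CAPITAL LETTER Y WITH DIAERESIS

# Hypothetical helper: returns 1 if $string caselessly matches the single
# character $char under the requested rules ('l' = locale, 'u' = Unicode),
# else 0.
sub caseless_match {
    my ($rules, $string, $char) = @_;
    # Some Perls may warn about wide characters under a non-UTF-8 locale;
    # warnings are switched off inside this helper only.
    no warnings;
    return $rules eq 'l'
        ? ($string =~ /\A$char\z/il ? 1 : 0)
        : ($string =~ /\A$char\z/iu ? 1 : 0);
}

printf "/iu: %d\n", caseless_match('u', $y_small, $y_cap);
printf "/il: %d\n", caseless_match('l', $y_small, $y_cap);
```

The two calls differ only in the modifier, which isolates the rule the patch documents: caseless pairs that cross the 255/256 boundary are refused under locale rules.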
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- pod/perlre.pod 14
1 files changed, 11 insertions, 3 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 56e42f8088..31e881703c 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -52,7 +52,8 @@ X<regular expression, case-insensitive>
Do case-insensitive pattern matching.
If C<use locale> is in effect, the case map is taken from the current
-locale. See L<perllocale>.
+locale for code points less than 256, and from Unicode rules for larger
+code points. See L<perllocale>.
=item x
X</x>
@@ -655,8 +656,15 @@ locale, and can differ from one match to another if there is an
intervening call of the
L<setlocale() function|perllocale/The setlocale function>.
This modifier is automatically set if the regular expression is compiled
-within the scope of a C<"use locale"> pragma. Results are not
-well-defined when using this and matching against a utf8-encoded string.
+within the scope of a C<"use locale"> pragma.
+Perl only allows single-byte locales. This means that code points above
+255 are treated as Unicode no matter what locale is in effect.
+Under Unicode rules, there are a few case-insensitive matches that cross the
+255/256 boundary. These are disallowed. For example,
+0xFF does not caselessly match the character at 0x178, LATIN CAPITAL
+LETTER Y WITH DIAERESIS, because 0xFF may not be LATIN SMALL LETTER Y
+in the current locale, and Perl has no way of knowing if that character
+even exists in the locale, much less what code point it is.
C<"u"> means to use Unicode semantics when pattern matching. It is
automatically set if the regular expression is encoded in utf8, or is