author | Karl Williamson <public@khwilliamson.com> | 2011-08-21 11:49:28 -0600 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2011-08-22 08:25:39 -0600 |
commit | 5d1892be338d6bf95fbda32ca575c0395a66f1a7 (patch) | |
tree | 64f99cb03a474e36ecd602d429101565a394dc1c /pod/perlunicode.pod | |
parent | 3712c77bc49a626e2717a6289ed2e3948b64814e (diff) | |
download | perl-5d1892be338d6bf95fbda32ca575c0395a66f1a7.tar.gz |
Remove user-defined casing feature
This feature was deprecated in 5.14 and scheduled to remove in 5.16. A
CPAN module was written to provide better functionality without the
significant drawbacks of this implementation.
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 207 |
1 files changed, 12 insertions, 195 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 4779cc5dca..5e1ff36074 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -260,11 +260,12 @@ complement B<and> the full character-wide bit complement.
 
 =item *
 
-You can define your own mappings to be used in C<lc()>,
-C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their double-quoted string inlined
-versions such as C<\U>). See
-L<User-Defined Case-Mappings|/"User-Defined Case Mappings (for serious hackers only)">
-for more details.
+There is a CPAN module, L<Unicode::Casing>, which allows you to define
+your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>, and
+C<ucfirst()> (or their double-quoted string inlined versions such as
+C<\U>). (Prior to Perl 5.16, this functionality was partially provided
+in the Perl core, but suffered from a number of insurmountable
+drawbacks, so the CPAN module was written instead.)
 
 =back
 
@@ -915,190 +916,12 @@ would be intersecting with nothing, resulting in an empty set.
 
 =head2 User-Defined Case Mappings (for serious hackers only)
 
-B<This featured is deprecated and is scheduled to be removed in Perl
-5.16.>
-The CPAN module L<Unicode::Casing> provides better functionality
-without the drawbacks described below.
-
-You can define your own mappings to be used in C<lc()>,
-C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their string-inlined versions,
-C<\L>, C<\l>, C<\U>, and C<\u>). The mappings are currently only valid
-on strings encoded in UTF-8, but see below for a partial workaround for
-this restriction.
-
-The principle is similar to that of user-defined character
-properties: define subroutines that do the mappings.
-C<ToLower> is used for C<lc()>, C<\L>, C<lcfirst()>, and C<\l>; C<ToTitle> for
-C<ucfirst()> and C<\u>; and C<ToUpper> for C<uc()> and C<\U>.
-
-C<ToUpper()> should look something like this:
-
-    sub ToUpper {
-        return <<END;
-    0061\t007A\t0041
-    0101\t\t0100
-    END
-    }
-
-This sample C<ToUpper()> has the effect of mapping "a-z" to "A-Z", 0x101
-to 0x100, and all other characters map to themselves. The first
-returned line means to map the code point at 0x61 ("a") to 0x41 ("A"),
-the code point at 0x62 ("b") to 0x42 ("B"), ..., 0x7A ("z") to 0x5A
-("Z"). The second line maps just the code point 0x101 to 0x100. Since
-there are no other mappings defined, all other code points map to
-themselves.
-
-This mechanism is not well behaved as far as affecting other packages
-and scopes. All non-threaded programs have exactly one uppercasing
-behavior, one lowercasing behavior, and one titlecasing behavior in
-effect for utf8-encoded strings for the duration of the program. Each
-of these behaviors is irrevocably determined the first time the
-corresponding function is called to change a utf8-encoded string's case.
-If a corresponding C<To-> function has been defined in the package that
-makes that first call, the mapping defined by that function will be the
-mapping used for the duration of the program's execution across all
-packages and scopes. If no corresponding C<To-> function has been
-defined in that package, the standard official mapping will be used for
-all packages and scopes, and any corresponding C<To-> function anywhere
-will be ignored. Threaded programs have similar behavior. If the
-program's casing behavior has been decided at the time of a thread's
-creation, the thread will inherit that behavior. But, if the behavior
-hasn't been decided, the thread gets to decide for itself, and its
-decision does not affect other threads nor its creator.
-
-As shown by the example above, you have to furnish a complete mapping;
-you can't just override a couple of characters and leave the rest
-unchanged. You can find all the official mappings in the directory
-C<$Config{privlib}>F</unicore/To/>. The mapping data is returned as the
-here-document. The C<utf8::ToSpecI<Foo>> hashes in those files are special
-exception mappings derived from
-C<$Config{privlib}>F</unicore/SpecialCasing.txt>. (The "Digit" and
-"Fold" mappings that one can see in the directory are not directly
-user-accessible, one can use either the L<Unicode::UCD> module, or just match
-case-insensitively, which is what uses the "Fold" mapping. Neither are user
-overridable.)
-
-If you have many mappings to change, you can take the official mapping data,
-change by hand the affected code points, and place the whole thing into your
-subroutine. But this will only be valid on Perls that use the same Unicode
-version. Another option would be to have your subroutine read the official
-mapping files and overwrite the affected code points.
-
-If you have only a few mappings to change, starting in 5.14 you can use the
-following trick, here illustrated for Turkish.
-
-    use Config;
-    use charnames ":full";
-
-    sub ToUpper {
-        my $official = do "$Config{privlib}/unicore/To/Upper.pl";
-        $utf8::ToSpecUpper{'i'} =
-            "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}";
-        return $official;
-    }
-
-This takes the official mappings and overrides just one, for "LATIN SMALL
-LETTER I". The keys to the hash must be the bytes that form the UTF-8
-(on EBCDIC platforms, UTF-EBCDIC) of the character, as illustrated by
-the inverse function.
-
-    sub ToLower {
-        my $official = do $lower;
-        $utf8::ToSpecLower{"\xc4\xb0"} = "i";
-        return $official;
-    }
-
-This example is for an ASCII platform, and C<\xc4\xb0> is the string of
-bytes that together form the UTF-8 that represents C<\N{LATIN CAPITAL
-LETTER I WITH DOT ABOVE}>, C<U+0130>. You can avoid having to figure out
-these bytes, and at the same time make it work on all platforms by
-instead writing:
-
-    sub ToLower {
-        my $official = do $lower;
-        my $sequence = "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}";
-        utf8::encode($sequence);
-        $utf8::ToSpecLower{$sequence} = "i";
-        return $official;
-    }
-
-This works because C<utf8::encode()> takes the single character and
-converts it to the sequence of bytes that constitute it. Note that we took
-advantage of the fact that C<"i"> is the same in UTF-8 or UTF_EBCIDIC as not;
-otherwise we would have had to write
-
-    $utf8::ToSpecLower{$sequence} = "\N{LATIN SMALL LETTER I}";
-
-in the ToLower example, and in the ToUpper example, use
-
-    my $sequence = "\N{LATIN SMALL LETTER I}";
-    utf8::encode($sequence);
-
-A big caveat to the above trick and to this whole mechanism in general,
-is that they work only on strings encoded in UTF-8. You can partially
-get around this by using C<use subs>. (But better to just convert to
-use L<Unicode::Casing>.) For example:
-(The trick illustrated here does work in earlier releases, but only if all the
-characters you want to override have ordinal values of 256 or higher, or
-if you use the other tricks given just below.)
-
-The mappings are in effect only for the package they are defined in, and only
-on scalars that have been marked as having Unicode characters, for example by
-using C<utf8::upgrade()>. Although probably not advisable, you can
-cause the mappings to be used globally by importing into C<CORE::GLOBAL>
-(see L<CORE>).
-
-You can partially get around the restriction that the source strings
-must be in utf8 by using C<use subs> (or by importing into C<CORE::GLOBAL>) by:
-
-    use subs qw(uc ucfirst lc lcfirst);
-
-    sub uc($) {
-        my $string = shift;
-        utf8::upgrade($string);
-        return CORE::uc($string);
-    }
-
-    sub lc($) {
-        my $string = shift;
-        utf8::upgrade($string);
-
-        # Unless an I is before a dot_above, it turns into a dotless i.
-        # (The character class with the combining classes matches non-above
-        # marks following the I. Any number of these may be between the
-        # 'I'and the dot_above, and the dot_above will still apply to the
-        # 'I'.
-        use charnames ":full";
-        $string =~
-            s/I
-              (?! [^\p{ccc=0}\p{ccc=Above}]* \N{COMBINING DOT ABOVE} )
-             /\N{LATIN SMALL LETTER DOTLESS I}/gx;
-
-        # But when the I is followed by a dot_above, remove the
-        # dot_above so the end result will be i.
-        $string =~ s/I
-                     ([^\p{ccc=0}\p{ccc=Above}]* )
-                     \N{COMBINING DOT ABOVE}
-                    /i$1/gx;
-        return CORE::lc($string);
-    }
-
-These examples (also for Turkish) make sure the input is in UTF-8, and then
-call the corresponding official function, which will use the C<ToUpper()> and
-C<ToLower()> functions you have defined.
-(For Turkish, there are other required functions: C<ucfirst>, C<lcfirst>,
-and C<ToTitle>. These are very similar to the ones given above.)
-
-The reason this is only a partial fix is that it doesn't affect the C<\l>,
-C<\L>, C<\u>, and C<\U> case-change operations in regular expressions,
-which still require the source to be encoded in utf8 (see L</The "Unicode
-Bug">). (Again, use L<Unicode::Casing> instead.)
-
-The C<lc()> example shows how you can add context-dependent casing. Note
-that context-dependent casing suffers from the problem that the string
-passed to the casing function may not have sufficient context to make
-the proper choice. Also, it will not be called for C<\l>, C<\L>, C<\u>,
-and C<\U>.
+B<This feature has been removed as of Perl 5.16.>
+The CPAN module L<Unicode::Casing> provides better functionality without
+the drawbacks that this feature had. If you are using a Perl earlier
+than 5.16, this feature was most fully documented in the 5.14 version of
+this pod:
+L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
 
 =head2 Character Encodings for Input and Output
 
@@ -1557,12 +1380,6 @@ In C<quotemeta> or its inline equivalent C<\Q>, no characters
 code points above 127 are quoted in UTF-8 encoded strings, but in
 byte encoded strings, code points between 128-255 are always quoted.
 
-=item *
-
-User-defined case change mappings. You can create a C<ToUpper()> function, for
-example, which overrides Perl's built-in case mappings. The scalar must be
-encoded in utf8 for your function to actually be invoked.
-
 =back
 
 This behavior can lead to unexpected results in which a string's semantics