author    | Karl Williamson <public@khwilliamson.com> | 2014-01-27 15:35:00 -0700
committer | Karl Williamson <public@khwilliamson.com> | 2014-01-27 23:03:48 -0700
commit    | 31f05a37c4e9c37a7263491f2fc0237d836e1a80 (patch)
tree      | 7537c7e179350243b3de0f3a99d6747c9c7812e6 /lib
parent    | cea315b64e0c4b1890867df0c925cafc8823ba38 (diff)
download  | perl-31f05a37c4e9c37a7263491f2fc0237d836e1a80.tar.gz
Work properly under UTF-8 LC_CTYPE locales
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
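For the user-visible effect, here is a minimal sketch (the en_US.UTF-8
locale name is an assumption; any installed UTF-8 LC_CTYPE locale works
the same way):

    use strict;
    use warnings;
    use locale;                        # LC_CTYPE rules apply in this scope
    use POSIX qw(setlocale LC_CTYPE);

    # Assumes a UTF-8 locale is installed on the system.
    setlocale(LC_CTYPE, "en_US.UTF-8") or die "no UTF-8 locale available\n";

    my $e_acute = "\xE9";              # LATIN SMALL LETTER E WITH ACUTE
    printf "uc: U+%04X\n", ord(uc $e_acute);        # case changing: expect U+00C9
    print "word character\n" if $e_acute =~ /\w/;   # classification: expect a match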
More work had to be done for regular expressions. There are three
cases.
1) The character classes (\w, [[:punct:]], and so on) needed no extra
work, as the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
above-Latin1 characters that match only themselves, the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we then have a fixed string, known at compile time,
that must be present in the target string for a match. Similarly, if
all characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU
node (match, folding, using Unicode rather than locale rules). The
code changes for this could have been made without fully accepting
UTF-8 locales, but there were edge cases that would have needed
different handling had I stopped there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers needed more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
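For illustration, a runtime multi-character fold under a UTF-8 locale
(a sketch; the de_DE.UTF-8 locale name is an assumption):

    use locale;
    use POSIX qw(setlocale LC_CTYPE);
    setlocale(LC_CTYPE, "de_DE.UTF-8") or die "no UTF-8 locale available\n";

    # "\xDF" is LATIN SMALL LETTER SHARP S. Its fold is locale-dependent and
    # multi-character ("ss"), so it stays unfolded in the EXACTFL node at
    # compile time and is folded by regexec using the locale then in effect.
    print "matched\n" if "STRASSE" =~ /stra\xDFe/i;    # expect a match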
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
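For example (a sketch; the fr_FR.UTF-8 locale name is an assumption):

    use locale;
    use POSIX qw(setlocale LC_CTYPE);
    setlocale(LC_CTYPE, "fr_FR.UTF-8") or die "no UTF-8 locale available\n";

    # The bracketed class compiles to a locale-dependent ANYOF-style node;
    # which characters it matches is decided from the locale at run time.
    my $str = "d\xE9j\xE0 vu";
    my $count = () = $str =~ /[[:alpha:]'-]/g;
    print "$count\n";    # expect 6 under a UTF-8 locale: d, e-acute, j, a-grave, v, u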
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not match
the CAP MU case-insensitively. This could be special-cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
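The resulting behavior can be checked from Perl (a sketch; the locale
name is an assumption):

    use locale;
    use POSIX qw(setlocale LC_CTYPE);
    setlocale(LC_CTYPE, "en_US.UTF-8") or die "no UTF-8 locale available\n";

    my $micro  = "\xB5";       # MICRO SIGN
    my $cap_mu = "\x{39C}";    # GREEK CAPITAL LETTER MU
    # Both fold to GREEK SMALL LETTER MU (U+03BC), so each must match the
    # other case-insensitively even though the node remains EXACTFL.
    print "micro matches cap mu\n" if $micro  =~ /$cap_mu/i;    # expect a match
    print "cap mu matches micro\n" if $cap_mu =~ /$micro/i;     # expect a match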
Diffstat (limited to 'lib')
-rw-r--r-- | lib/locale.pm        | 18
-rw-r--r-- | lib/locale.t         | 8
-rw-r--r-- | lib/unicore/mktables | 75
3 files changed, 73 insertions, 28 deletions
diff --git a/lib/locale.pm b/lib/locale.pm
index fbb4a185af..f7575f5007 100644
--- a/lib/locale.pm
+++ b/lib/locale.pm
@@ -27,24 +27,6 @@ expressions, LC_COLLATE for string comparison, and LC_NUMERIC for number
 formatting). Each "use locale" or "no locale" affects statements to the end
 of the enclosing BLOCK.
 
-Starting in Perl 5.16, a hybrid mode for this pragma is available,
-
-    use locale ':not_characters';
-
-which enables only the portions of locales that don't affect the character
-set (that is, all except LC_COLLATE and LC_CTYPE). This is useful when mixing
-Unicode and locales, including UTF-8 locales.
-
-    use locale ':not_characters';
-    use open ":locale";           # Convert I/O to/from Unicode
-    use POSIX qw(locale_h);       # Import the LC_ALL constant
-    setlocale(LC_ALL, "");        # Generally required for the next
-                                  # statement to take effect
-    printf "%.2f\n", 12345.67'    # Locale-defined formatting
-    @x = sort @y;                 # Unicode-defined sorting order.
-                                  #    (Note that you will get better
-                                  #    results using Unicode::Collate.)
-
 See L<perllocale> for more detailed information on how Perl supports
 locales.
 
diff --git a/lib/locale.t b/lib/locale.t
index 8afbeab1cd..987d19a092 100644
--- a/lib/locale.t
+++ b/lib/locale.t
@@ -602,10 +602,10 @@ foreach my $Locale (@Locale) {
         next;
     }
 
-    # We test UTF-8 locales only under ':not_characters'; otherwise they have
-    # documented deficiencies. Non- UTF-8 locales are tested only under plain
-    # 'use locale', as otherwise we would have to convert everything in them
-    # to Unicode.
+    # We test UTF-8 locales only under ':not_characters'; It is easier to
+    # test them in other test files than here. Non- UTF-8 locales are tested
+    # only under plain 'use locale', as otherwise we would have to convert
+    # everything in them to Unicode.
 
     my %UPPER = ();    # All alpha X for which uc(X) == X and lc(X) != X
     my %lower = ();    # All alpha X for which lc(X) == X and uc(X) != X
diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index 8ab0b4629e..8f6def0729 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -13849,6 +13849,50 @@ sub compile_perl() {
     my $any_folds = $perl->add_match_table("_Perl_Any_Folds",
                         Description => "Code points that particpate in some fold",
                         );
+    my $loc_problem_folds = $perl->add_match_table(
+                        "_Perl_Problematic_Locale_Folds",
+                        Description =>
+                        "Code points that are in some way problematic under locale",
+                        );
+
+    # This allows regexec.c to skip some work when appropriate. Some of the
+    # entries in _Perl_Problematic_Locale_Folds are multi-character folds,
+    my $loc_problem_folds_start = $perl->add_match_table(
+                        "_Perl_Problematic_Locale_Foldeds_Start",
+                        Description =>
+                        "The first character of every sequence in _Perl_Problematic_Locale_Folds",
+                        );
+
+    my $cf = property_ref('Case_Folding');
+
+    # Every character 0-255 is problematic because what each folds to depends
+    # on the current locale
+    $loc_problem_folds->add_range(0, 255);
+    $loc_problem_folds_start += $loc_problem_folds;
+
+    # Also problematic are anything these fold to outside the range. Likely
+    # forever the only thing folded to by these outside the 0-255 range is the
+    # GREEK SMALL MU (from the MICRO SIGN), but it's easy to make the code
+    # completely general, which should catch any unexpected changes or errors.
+    # We look at each code point 0-255, and add its fold (including each part
+    # of a multi-char fold) to the list. See the commit message for these
+    # changes for a more complete description of the MU issue.
+    foreach my $range ($loc_problem_folds->ranges) {
+        foreach my $code_point ($range->start .. $range->end) {
+            my $fold_range = $cf->containing_range($code_point);
+            next unless defined $fold_range;
+
+            my @hex_folds = split " ", $fold_range->value;
+            my $start_cp = hex $hex_folds[0];
+            foreach my $i (0 .. @hex_folds - 1) {
+                my $cp = hex $hex_folds[$i];
+                next unless $cp > 255;  # Already have the < 256 ones
+
+                $loc_problem_folds->add_range($cp, $cp);
+                $loc_problem_folds_start->add_range($start_cp, $start_cp);
+            }
+        }
+    }
 
     my $folds_to_multi_char = $perl->add_match_table(
                         "_Perl_Folds_To_Multi_Char",
@@ -13856,19 +13900,38 @@ sub compile_perl() {
                         "Code points whose fold is a string of more than one character",
                         );
 
-    foreach my $range (property_ref('Case_Folding')->ranges) {
+    # Look through all the known folds to populate these tables.
+    foreach my $range ($cf->ranges) {
         my $start = $range->start;
         my $end = $range->end;
         $any_folds->add_range($start, $end);
 
-        my @hex_code_points = split " ", $range->value;
-        if (@hex_code_points > 1) {
+        my @hex_folds = split " ", $range->value;
+        if (@hex_folds > 1) {   # Is multi-char fold
             $folds_to_multi_char->add_range($start, $end);
         }
 
-        foreach my $i (0 .. @hex_code_points - 1) {
-            my $code_point = hex $hex_code_points[$i];
-            $any_folds->add_range($code_point, $code_point);
+        my $found_locale_problematic = 0;
+
+        # Look at each of the folded-to characters...
+        foreach my $i (0 .. @hex_folds - 1) {
+            my $cp = hex $hex_folds[$i];
+            $any_folds->add_range($cp, $cp);
+
+            # The fold is problematic if any of the folded-to characters is
+            # already considered problematic.
+            if ($loc_problem_folds->contains($cp)) {
+                $loc_problem_folds->add_range($start, $end);
+                $found_locale_problematic = 1;
+            }
+        }
+
+        # If this is a problematic fold, add to the start chars the
+        # folding-from characters and first folded-to character.
+        if ($found_locale_problematic) {
+            $loc_problem_folds_start->add_range($start, $end);
+            my $cp = hex $hex_folds[0];
+            $loc_problem_folds_start->add_range($cp, $cp);
         }
     }