author | Karl Williamson <khw@cpan.org> | 2016-05-17 20:50:55 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2016-05-24 10:28:37 -0600 |
commit | a4a439fb9cd74c575855119abb55dc091955bdf4 (patch) | |
tree | 72bf312a124186367d08e573bddca86d01126788 /embedvar.h | |
parent | ff52fcf1dae90deb49f680d7cdbf78a04458ac47 (diff) | |
Do better locale collation in UTF-8 locales

On some platforms, the libc strxfrm() works reasonably well on UTF-8
locales, giving a default collation ordering. It will assume that every
string passed to it is in UTF-8. This commit changes Perl to make sure
that strxfrm's expectations are met.

Likewise under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8
string. And this commit makes sure of that as well.

So, simply meeting strxfrm's expectations allows Perl to start
supporting default collation in UTF-8 locales, and fixes it to work on
single-byte locales with UTF-8 input. (Unicode::Collate provides
tailorable functionality and is portable to platforms where strxfrm
isn't as intelligent, but is a much more heavy-weight solution that may
not be needed for particular applications.)
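
As a rough illustration of "meeting strxfrm's expectations" in the UTF-8 locale case, the sketch below re-encodes a Latin-1 byte string as UTF-8 before handing it to strxfrm(). The helper names latin1_to_utf8() and xfrm_for_utf8_locale() are hypothetical and are not taken from the Perl source; this is a minimal sketch of the idea, not the code the commit adds.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: re-encode `len` Latin-1 bytes as UTF-8.
 * Returns a malloc'd, NUL-terminated string (caller frees), or NULL
 * if allocation fails.  Each input byte needs at most two output bytes. */
char *latin1_to_utf8(const unsigned char *src, size_t len)
{
    char *out = malloc(2 * len + 1);
    size_t j = 0;

    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            out[j++] = (char)c;                    /* ASCII is unchanged */
        } else {
            out[j++] = (char)(0xC0 | (c >> 6));    /* lead byte */
            out[j++] = (char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    out[j] = '\0';
    return out;
}

/* Hypothetical helper: assuming the current collation locale is UTF-8,
 * upgrade the Latin-1 input first, then transform it.  Returns what
 * strxfrm() returns, or (size_t)-1 if the upgrade could not be done. */
size_t xfrm_for_utf8_locale(const unsigned char *src, size_t len,
                            char *dst, size_t dstsize)
{
    char *utf8 = latin1_to_utf8(src, len);
    size_t n;

    if (utf8 == NULL)
        return (size_t)-1;
    n = strxfrm(dst, utf8, dstsize);
    free(utf8);
    return n;
}
```

The opposite direction (a UTF-8 string under a single-byte locale) would need a downgrade step instead, which is where the above-255 problem discussed in the commit message arises.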

There is a problem in non-UTF-8 locales if the passed string contains
code points representable only in UTF-8 (that is, code points above
255). This commit causes them to be changed, before being passed to
strxfrm, into the highest collating character in the locale that doesn't
require UTF-8. They then will sort the same as that character, that is,
after every other character in the locale. In strings that don't
contain that character, this will generally provide exactly correct
operation. There still is a problem if that character, in the given
locale, combines with adjacent characters to form a specially weighted
sequence. Then, the change of these above-255 code points into that
character can skew the results.
See the commit message for 6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e for
more on this. But it is really an illegal situation to have above-255
code points in a single-byte locale, so this behavior is a reasonable
degradation when given illegal input. If two transformed strings
compare exactly equal, Perl already uses the un-transformed versions to
break ties, and there, these faked-up strings will collate so the
above-255 code points sort after everything else, and in code point
order amongst themselves.
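
The substitution and tie-breaking described above can be sketched roughly as follows. All function names here (xfrm_key, highest_collating_byte, squash_above_255, collate_cmp) are hypothetical, strings are modelled as arrays of code points rather than Perl scalars, and embedded NUL code points are not handled; this is an illustration under those assumptions, not the commit's implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd strxfrm() key for s (caller frees), or NULL. */
static char *xfrm_key(const unsigned char *s)
{
    size_t need = strxfrm(NULL, (const char *)s, 0) + 1;
    char *key = malloc(need);

    if (key != NULL)
        strxfrm(key, (const char *)s, need);
    return key;
}

/* Probe bytes 1..255 and return the one whose key sorts highest in the
 * current locale, or 0 if no key could be built. */
unsigned char highest_collating_byte(void)
{
    unsigned char best = 0;
    char *best_key = NULL;

    for (int c = 1; c < 256; c++) {
        unsigned char s[2] = { (unsigned char)c, '\0' };
        char *key = xfrm_key(s);

        if (key == NULL)
            continue;
        if (best_key == NULL || strcmp(key, best_key) > 0) {
            free(best_key);
            best_key = key;
            best = (unsigned char)c;
        } else {
            free(key);
        }
    }
    free(best_key);
    return best;
}

/* Copy n code points into dst (n + 1 bytes), replacing anything above
 * 255 with max_char. */
void squash_above_255(const uint32_t *cps, size_t n,
                      unsigned char max_char, unsigned char *dst)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = cps[i] > 255 ? max_char : (unsigned char)cps[i];
    dst[n] = '\0';
}

/* Compare by transformed keys; on a tie (or if allocation fails),
 * fall back to the original code points. */
int collate_cmp(const uint32_t *a, size_t na,
                const uint32_t *b, size_t nb, unsigned char max_char)
{
    unsigned char *sa = malloc(na + 1), *sb = malloc(nb + 1);
    char *ka = NULL, *kb = NULL;
    int r = 0;

    if (sa != NULL && sb != NULL) {
        squash_above_255(a, na, max_char, sa);
        squash_above_255(b, nb, max_char, sb);
        ka = xfrm_key(sa);
        kb = xfrm_key(sb);
        if (ka != NULL && kb != NULL)
            r = strcmp(ka, kb);
    }
    /* Tie-break: above-255 code points compare in code point order. */
    for (size_t i = 0; r == 0 && i < na && i < nb; i++)
        if (a[i] != b[i])
            r = a[i] < b[i] ? -1 : 1;
    if (r == 0 && na != nb)
        r = na < nb ? -1 : 1;
    free(sa); free(sb); free(ka); free(kb);
    return r;
}
```

A caller would typically compute highest_collating_byte() once per locale change and pass the result as max_char. That mirrors the role a cached value such as PL_strxfrm_max_cp could play, though the exact caching strategy here is an assumption.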
Diffstat (limited to 'embedvar.h')
-rw-r--r-- | embedvar.h | 1 |
1 files changed, 1 insertions, 0 deletions
diff --git a/embedvar.h b/embedvar.h
index 67383680f5..c2831d642a 100644
--- a/embedvar.h
+++ b/embedvar.h
@@ -310,6 +310,7 @@
 #define PL_stdingv (vTHX->Istdingv)
 #define PL_strtab (vTHX->Istrtab)
 #define PL_strxfrm_is_behaved (vTHX->Istrxfrm_is_behaved)
+#define PL_strxfrm_max_cp (vTHX->Istrxfrm_max_cp)
 #define PL_strxfrm_min_char (vTHX->Istrxfrm_min_char)
 #define PL_sub_generation (vTHX->Isub_generation)
 #define PL_subline (vTHX->Isubline)
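
The added macro follows the pattern of the existing entries in embedvar.h: a PL_ name expands to a field of the current interpreter structure, so each interpreter carries its own copy. The toy model below shows that pattern in isolation. The struct toy_interpreter, the field types, and the accessor functions are assumptions made for illustration, and vTHX is simply defined here as a local pointer named my_perl rather than taken from Perl's headers.

```c
#include <stdint.h>

/* Toy stand-in for the interpreter structure; the field types are
 * assumptions for this sketch, not copied from the Perl source. */
struct toy_interpreter {
    uint8_t Istrxfrm_min_char;
    uint8_t Istrxfrm_max_cp;
};

/* In this sketch, vTHX is whatever pointer named my_perl is in scope. */
#define vTHX my_perl
#define PL_strxfrm_min_char (vTHX->Istrxfrm_min_char)
#define PL_strxfrm_max_cp   (vTHX->Istrxfrm_max_cp)

/* Code written against the PL_ name reads and writes per-interpreter
 * state without naming the struct field directly. */
void set_max_cp(struct toy_interpreter *my_perl, uint8_t c)
{
    PL_strxfrm_max_cp = c;
}

uint8_t get_max_cp(struct toy_interpreter *my_perl)
{
    return PL_strxfrm_max_cp;
}
```

With two toy_interpreter instances, setting the value through one leaves the other untouched, which is the point of routing the variable through the interpreter pointer.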