Do better locale collation in UTF-8 locales

On some platforms, the libc strxfrm() works reasonably well on UTF-8 locales, giving a default collation ordering. It assumes that every string passed to it is in UTF-8. This commit changes Perl to make sure that strxfrm's expectations are met.

Likewise, under a non-UTF-8 locale, strxfrm expects a non-UTF-8 string, and this commit makes sure of that as well.

So, simply meeting strxfrm's expectations allows Perl to start supporting default collation in UTF-8 locales, and fixes it to work on single-byte locales with UTF-8 input. (Unicode::Collate provides tailorable functionality and is portable to platforms where strxfrm isn't as intelligent, but is a much more heavy-weight solution that may not be needed for particular applications.)

There is a problem in non-UTF-8 locales if the passed string contains code points representable only in UTF-8. This commit causes them to be changed, before being passed to strxfrm, into the highest collating character in the locale that doesn't require UTF-8. They will then sort the same as that character, which means after all other characters in the locale but that one. In strings that don't contain that character, this will generally provide exactly correct operation.

There is still a problem if that character, in the given locale, combines with adjacent characters to form a specially weighted sequence. Then the change of these above-255 code points into that character can skew the results. See the commit message for 6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e for more on this. But it is really an illegal situation to have above-255 code points in a single-byte locale, so this behavior is a reasonable degradation when given illegal input.

If two transformed strings compare exactly equal, Perl already uses the un-transformed versions to break ties, and there, these faked-up strings will collate so that the above-255 code points sort after everything else, and in code point order amongst themselves.
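
The substitution described above can be sketched in standalone C. This is only an illustration under stated assumptions, not the code in this commit: clamp_to_single_byte() and collate_cmp() are hypothetical names, the input is taken as an array of code points rather than Perl's internal string representation, and max_char stands in for whatever highest-collating character the locale setup has determined (the role the new PL_strxfrm_max_cp variable appears to play).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: write `len` code points into `dst` as a
 * NUL-terminated single-byte string.  Any code point above 255 is
 * replaced by `max_char`, assumed to be the highest collating
 * character of the current single-byte locale.  Assumes no code point
 * is zero.  Returns 0 on success, -1 if `dst` is too small. */
int
clamp_to_single_byte(const uint32_t *cps, size_t len,
                     unsigned char max_char, char *dst, size_t dst_size)
{
    size_t i;

    assert(cps != NULL && dst != NULL);
    if (dst_size < len + 1)
        return -1;
    for (i = 0; i < len; i++)
        dst[i] = (char)(cps[i] > 0xFF ? max_char : (unsigned char)cps[i]);
    dst[len] = '\0';
    return 0;
}

/* Hypothetical comparison: order two strings by their strxfrm()
 * transforms under the current locale.  If the transforms are equal,
 * fall back to a byte comparison of the strings passed in (Perl's own
 * tie-break uses its un-transformed strings). */
int
collate_cmp(const char *a, const char *b)
{
    /* With a size of 0, strxfrm() only reports the length it needs. */
    size_t na = strxfrm(NULL, a, 0) + 1;
    size_t nb = strxfrm(NULL, b, 0) + 1;
    char *xa = malloc(na);
    char *xb = malloc(nb);
    int result;

    if (xa == NULL || xb == NULL) {
        /* Out of memory: degrade to a plain byte comparison. */
        free(xa);
        free(xb);
        return strcmp(a, b);
    }
    strxfrm(xa, a, na);
    strxfrm(xb, b, nb);
    result = strcmp(xa, xb);
    free(xa);
    free(xb);

    if (result == 0)
        result = strcmp(a, b);  /* break ties on the untransformed bytes */
    return result;
}
```

A caller would first select a locale with setlocale(LC_COLLATE, ...), clamp each string, and then pass the clamped buffers to collate_cmp(). Without a setlocale() call the program runs in the "C" locale, where the comparison reduces to byte order.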
author: Karl Williamson <khw@cpan.org> 2016-05-17 20:50:55 -0600
committer: Karl Williamson <khw@cpan.org> 2016-05-24 10:28:37 -0600
commit: a4a439fb9cd74c575855119abb55dc091955bdf4 (patch)
tree: 72bf312a124186367d08e573bddca86d01126788 /embedvar.h
parent: ff52fcf1dae90deb49f680d7cdbf78a04458ac47 (diff)
download: perl-a4a439fb9cd74c575855119abb55dc091955bdf4.tar.gz
1 files changed, 1 insertions, 0 deletions
diff --git a/embedvar.h b/embedvar.h
index 67383680f5..c2831d642a 100644
--- a/embedvar.h
+++ b/embedvar.h
@@ -310,6 +310,7 @@
 #define PL_stdingv		(vTHX->Istdingv)
 #define PL_strtab		(vTHX->Istrtab)
 #define PL_strxfrm_is_behaved	(vTHX->Istrxfrm_is_behaved)
+#define PL_strxfrm_max_cp	(vTHX->Istrxfrm_max_cp)
 #define PL_strxfrm_min_char	(vTHX->Istrxfrm_min_char)
 #define PL_sub_generation	(vTHX->Isub_generation)
 #define PL_subline		(vTHX->Isubline)