author: Karl Williamson <khw@cpan.org> 2016-05-17 20:50:55 -0600
committer: Karl Williamson <khw@cpan.org> 2016-05-24 10:28:37 -0600
commit: a4a439fb9cd74c575855119abb55dc091955bdf4
tree: 72bf312a124186367d08e573bddca86d01126788 /embed.fnc
parent: ff52fcf1dae90deb49f680d7cdbf78a04458ac47
Do better locale collation in UTF-8 locales

On some platforms, the libc strxfrm() works reasonably well on UTF-8 locales, giving a default collation ordering. It will assume that every string passed to it is in UTF-8. This commit changes Perl to make sure that strxfrm's expectations are met.

Likewise under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8 string. And this commit makes sure of that as well.

So, simply meeting strxfrm's expectations allows Perl to start supporting default collation in UTF-8 locales, and fixes it to work on single-byte locales with UTF-8 input. (Unicode::Collate provides tailorable functionality and is portable to platforms where strxfrm isn't as intelligent, but is a much more heavy-weight solution that may not be needed for particular applications.)

There is a problem in non-UTF-8 locales if the passed string contains code points representable only in UTF-8. This commit causes them to be changed, before being passed to strxfrm, into the highest collating character in the locale that doesn't require UTF-8. They then will sort the same as that character, which means after all other characters in the locale but that one. In strings that don't have that character, this will generally provide exactly correct operation.

There still is a problem, if that character, in the given locale, combines with adjacent characters to form a specially weighted sequence. Then, the change of these above-255 code points into that character can skew the results. See the commit message for 6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e for more on this. But it is really an illegal situation to have above-255 code points in a single-byte locale, so this behavior is a reasonable degradation when given illegal input.

If two transformed strings compare exactly equal, Perl already uses the un-transformed versions to break ties, and there, these faked-up strings will collate so the above-255 code points sort after everything else, and in code point order amongst themselves.
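
The comparison scheme described above (transform both strings with strxfrm(), compare the transformed forms, and fall back to the untransformed strings on a tie) can be sketched in plain C. This is only an illustration under stated assumptions, not Perl's code in locale.c or sv.c: the helper names `xfrm_dup` and `collate_cmp` are hypothetical, and the inputs are assumed to be NUL-terminated strings already in the encoding the current LC_COLLATE locale expects.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'ed copy of s transformed by
 * strxfrm() under the current LC_COLLATE locale, or NULL if the
 * allocation fails.  The caller frees the result. */
char *xfrm_dup(const char *s)
{
    assert(s != NULL);

    /* With a size of 0 the destination may be NULL; the return value
     * is the length the transformed string needs (excluding the NUL). */
    size_t need = strxfrm(NULL, s, 0) + 1;
    char *buf = malloc(need);

    if (buf != NULL)
        strxfrm(buf, s, need);
    return buf;
}

/* Hypothetical helper: compare a and b in locale collation order.
 * If the transformed strings are equal, the untransformed strings
 * break the tie, as the commit message says Perl already does. */
int collate_cmp(const char *a, const char *b)
{
    char *xa = xfrm_dup(a);
    char *xb = xfrm_dup(b);
    int result;

    if (xa == NULL || xb == NULL) {
        result = strcmp(a, b);          /* out of memory: raw comparison */
    } else {
        result = strcmp(xa, xb);        /* primary: transformed strings */
        if (result == 0)
            result = strcmp(a, b);      /* tie-break: original strings */
    }

    free(xa);
    free(xb);
    return result;
}
```

A caller would normally select the locale first, for example with `setlocale(LC_COLLATE, "")`, and then pass `collate_cmp` (through a small wrapper) to `qsort()`. Whether the resulting order is sensible for UTF-8 input depends on the platform's libc, which is the caveat the commit message opens with.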
Diffstat (limited to 'embed.fnc')
-rw-r--r-- embed.fnc 6
1 files changed, 6 insertions, 0 deletions
diff --git a/embed.fnc b/embed.fnc
index 85166eb558..a874970001 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -910,6 +910,12 @@ Ap |I32 * |markstack_grow
p |int |magic_setcollxfrm|NN SV* sv|NN MAGIC* mg
: Defined in locale.c, used only in sv.c
p |char* |mem_collxfrm |NN const char* input_string|STRLEN len|NN STRLEN* xlen
+# if defined(PERL_IN_LOCALE_C) || defined(PERL_IN_SV_C)
+pM |char* |_mem_collxfrm |NN const char* input_string \
+ |STRLEN len \
+ |NN STRLEN* xlen \
+ |bool utf8
+# endif
#endif
Afpd |SV* |mess |NN const char* pat|...
Apd |SV* |mess_sv |NN SV* basemsg|bool consume
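
The new `bool utf8` parameter lets the caller state how `input_string` is encoded. For the case the commit message describes, UTF-8 input under a single-byte locale, the preprocessing step might look roughly like the sketch below. It is not the code from locale.c. The function name `downgrade_for_single_byte_locale` is hypothetical, the placeholder byte is passed in by the caller (finding the highest collating character of a locale is not shown), and malformed UTF-8 is simply treated like an above-255 code point.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: convert UTF-8 input into single-byte form for a
 * non-UTF-8 locale.  Code points 0..255 become one byte each; every
 * code point above 255 becomes `placeholder`.  Returns a malloc'ed,
 * NUL-terminated buffer (caller frees) and stores its length in
 * *out_len, or returns NULL if the allocation fails. */
char *downgrade_for_single_byte_locale(const char *input, size_t len,
                                       unsigned char placeholder,
                                       size_t *out_len)
{
    assert(input != NULL && out_len != NULL);

    const unsigned char *p = (const unsigned char *)input;
    const unsigned char *end = p + len;
    char *out = malloc(len + 1);    /* output is never longer than input */
    size_t n = 0;

    if (out == NULL)
        return NULL;

    while (p < end) {
        unsigned char c = *p;

        if (c < 0x80) {
            /* ASCII: copy through unchanged. */
            out[n++] = (char)c;
            p += 1;
        } else if ((c == 0xC2 || c == 0xC3) && p + 1 < end) {
            /* Two-byte sequence encoding U+0080..U+00FF. */
            out[n++] = (char)(((c & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            /* Code point above 255 (or malformed input): emit one
             * placeholder and skip any continuation bytes. */
            out[n++] = (char)placeholder;
            p += 1;
            while (p < end && (*p & 0xC0) == 0x80)
                p++;
        }
    }

    out[n] = '\0';
    *out_len = n;
    return out;
}
```

The downgraded buffer is what would then be handed to strxfrm(). Because every above-255 code point maps to the same byte, such characters all transform alike, and only the tie-break on the original strings can order them among themselves.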