author Karl Williamson <khw@cpan.org> 2021-03-12 10:30:53 -0700
committer Karl Williamson <khw@cpan.org> 2022-11-29 12:53:39 -0700
commit f101e19ae74be4d1d912e4d176a6e7a305bee770
tree 8d7bf1779484d1cb8202789d431f653407cb7a80
parent 4d597d86ca65c369680ced9b942edbde4a6ec0a7
Fix POSIX::strxfrm()
This commit does two things. Most simply, it extends strxfrm() to handle strings containing NUL characters. Previously the transformation stopped at the first NUL encountered.

Second, it combines the implementation of this with the existing implementation used for the 'cmp' operator, eliminating existing discrepancies and preventing future ones.

This function takes an SV containing a PV. The encoding of that PV is determined by the LC_CTYPE locale. It really doesn't make sense to collate using the sequencing of a different locale, which prior to this commit it would do (but not for 'cmp') if the LC_COLLATE locale were different.

As an example, consider the string:

    my $string = quotemeta join "", map { chr } (1..255);

with LC_CTYPE=8859-1 (Latin-1, used for several Western European languages) and LC_COLLATE set to ja_JP.utf8. This combination doesn't make much sense outside of specialty uses such as a lazy implementation of a Japanese/French dictionary, or quoting snippets in one language in a document written in the other. ('Lazy' because such code should really be switching locales to the language of the snippet currently being worked on.)

Nevertheless, Perl should do something as sensible as possible, and this commit changes POSIX::strxfrm() to use the method already in use by the code implementing 'cmp'.
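To illustrate the NUL problem at the C level, here is a minimal sketch of one way to get a collation key for a buffer with embedded NULs. It is not the code this commit adds: the helper name xfrm_with_nuls() and the choice of byte 0x01 as a stand-in for NUL are assumptions made only for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not Perl's implementation: build a collation key for
 * a buffer of 'len' bytes that may contain NULs.  Assumption: every NUL is
 * replaced by the byte 0x01 before strxfrm() is called, so the transform
 * sees the whole input instead of stopping at the first NUL. */
char *xfrm_with_nuls(const char *src, size_t len)
{
    assert(src != NULL);

    /* Work on a NUL-free, NUL-terminated copy of the input. */
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
        copy[i] = (src[i] == '\0') ? '\001' : src[i];
    copy[len] = '\0';

    /* With a size of 0, strxfrm() only reports how long the key will be. */
    size_t need = strxfrm(NULL, copy, 0);
    char *key = malloc(need + 1);
    if (key != NULL)
        strxfrm(key, copy, need + 1);

    free(copy);
    return key;   /* caller frees; NULL on allocation failure */
}
```

A plain strxfrm() call on "a\0b" would produce the same key as for "a"; with the substitution the trailing "b" contributes to the key. The cost of this simple approach is that NUL and 0x01 become indistinguishable in the result.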
Prior to this commit, POSIX::strxfrm($string) yielded on glibc 12.1: ^\3^\4^\5^\6^\a^\b^\t^\n^\13^\f^\r^\16^\17^\20^\21^\22^\23^\24^\25^\26^\27^\30^\31^\32^\e^\34^\35^\36^\37^ ^!^\"^#^\$^%^&^'^(^)^*^+^,^-^.^/^0^123456789:;^<^=^>^?^\@^A^BCDEFGHIJKLMNOPQRSTUVWXYZ[\\^]^^^_^`a^bcdefghijklmnopqrstuvwxyz{|^}^~^\177^\302\200^\302\201^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3 These are effectively a sorting order, and it is not meant to be human understandable. But it is clear that most of the characters had the same weight of 3, so a libc sort would mark them as ties in sorting order. And after, ^\3^\4^\5^\6^\a^\b^\t^\n^\13^\f^\r^\16^\17^\20^\21^\22^\23^\24^\25^\26^\27^\30^\31^\32^\e^\34^\35^\36^\37^ ^!^\"^#^\$^%^&^'^(^)^*^+^,^-^.^/^0^123456789:;^<^=^>^?^\@^A^BCDEFGHIJKLMNOPQRSTUVWXYZ[\\^]^^^_^`a^bcdefghijklmnopqrstuvwxyz{|^}^~which shows that most of the ties have been resolved, and hence the results are more sensible
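The reason identical weights matter is that strxfrm() keys are meant to be compared bytewise: two inputs whose keys are equal sort as ties. A minimal C sketch of that comparison follows. The names make_key() and compare_by_key() are made up for this example and do not appear in Perl's source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a freshly allocated strxfrm() key for s,
 * or NULL if allocation fails. */
static char *make_key(const char *s)
{
    size_t need = strxfrm(NULL, s, 0);   /* size 0: just measure the key */
    char *key = malloc(need + 1);
    if (key != NULL)
        strxfrm(key, s, need + 1);
    return key;
}

/* Compare two strings in the current LC_COLLATE locale by comparing their
 * transformed keys with strcmp().  Equal keys mean the strings tie.
 * In this sketch an allocation failure is also reported as 0. */
int compare_by_key(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    char *ka = make_key(a);
    char *kb = make_key(b);
    int result = 0;
    if (ka != NULL && kb != NULL)
        result = strcmp(ka, kb);

    free(ka);
    free(kb);
    return result;
}
```

Without a setlocale() call the program runs in the "C" locale, where this gives the same ordering as strcmp() on the inputs. Under a locale that assigns many characters the same weight, as in the output before this commit, compare_by_key() would return 0 for strings that differ only in those characters.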
Diffstat (limited to 'embed.fnc')
-rw-r--r-- embed.fnc 1
1 files changed, 1 insertions, 0 deletions
diff --git a/embed.fnc b/embed.fnc
index 70656e6dfe..81e3952cad 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1382,6 +1382,7 @@ Cp |I32 * |markstack_grow
 #if defined(USE_LOCALE_COLLATE)
 p |int |magic_setcollxfrm|NN SV* sv|NN MAGIC* mg
 p |int |magic_freecollxfrm|NN SV* sv|NN MAGIC* mg
+EXop |SV * |strxfrm |NN SV * src
 : Defined in locale.c, used only in sv.c
 # if defined(PERL_IN_LOCALE_C) || defined(PERL_IN_SV_C) || defined(PERL_IN_MATHOMS_C)
 Ep |char* |mem_collxfrm_ |NN const char* input_string \