author | Karl Williamson <khw@cpan.org> | 2021-03-12 10:30:53 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2022-11-29 12:53:39 -0700 |
commit | f101e19ae74be4d1d912e4d176a6e7a305bee770 (patch) | |
tree | 8d7bf1779484d1cb8202789d431f653407cb7a80 /embed.fnc | |
parent | 4d597d86ca65c369680ced9b942edbde4a6ec0a7 (diff) | |
download | perl-f101e19ae74be4d1d912e4d176a6e7a305bee770.tar.gz |
Fix POSIX::strxfrm()
This commit does two things.
Most simply it extends strxfrm() to handle strings containing NUL
characters. Previously the transformation stopped at the first NUL
encountered.
Second, it combines the implementation of this with the existing
implementation used for the 'cmp' operator, eliminating existing
discrepancies and preventing future ones.
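The NUL limitation described above comes from libc's strxfrm() taking a NUL-terminated string. The C sketch below only illustrates that limitation and one naive way around it (transforming each NUL-separated segment on its own). The helper names are hypothetical and this is not the code the commit adds to locale.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Perl.  Reports the transformed length
 * libc gives for the buffer `s`.  strxfrm() stops at the first NUL, so the
 * `len` argument has no effect on the result. */
size_t naive_xfrm_len(const char *s, size_t len)
{
    assert(s != NULL && s[len] == '\0');   /* buffer must be terminated */
    return strxfrm(NULL, s, 0);            /* n == 0: only ask for the size */
}

/* Hypothetical helper, not part of Perl.  Walks every NUL-separated
 * segment of the `len`-byte buffer, summing the transformed length of each
 * and counting one byte for every embedded NUL. */
size_t segmented_xfrm_len(const char *s, size_t len)
{
    size_t total = 0;
    size_t pos = 0;

    assert(s != NULL && s[len] == '\0');
    while (pos <= len) {
        total += strxfrm(NULL, s + pos, 0);
        pos += strlen(s + pos) + 1;        /* step past this segment's NUL */
        if (pos <= len)
            total += 1;                    /* that NUL was an embedded one */
    }
    return total;
}
```

For the six-byte literal "ab\0cd" with len 5, naive_xfrm_len() reports the same size as for "ab" alone, while segmented_xfrm_len() also accounts for the NUL and for "cd".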
This function takes an SV containing a PV. The encoding of that
PV is determined by the LC_CTYPE locale. It really doesn't
make sense to collate based on the sequencing of a different locale,
which prior to this commit it would do (but not for 'cmp') if the
LC_COLLATE locale were different.
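As a rough illustration of the mismatch described above, a C program can ask libc which locale each category currently names. The helper below is a hypothetical sketch, not code from this commit, and it only compares names, so two different names for the same underlying locale would still be reported as a mismatch.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Perl.  Returns true when the current
 * LC_COLLATE and LC_CTYPE categories name different locales. */
bool collate_differs_from_ctype(void)
{
    char ctype_name[256];
    const char *name = setlocale(LC_CTYPE, NULL);    /* query only */

    assert(name != NULL);
    /* Copy the name: a later setlocale() call may overwrite the buffer. */
    snprintf(ctype_name, sizeof ctype_name, "%s", name);

    name = setlocale(LC_COLLATE, NULL);              /* query only */
    assert(name != NULL);
    return strcmp(ctype_name, name) != 0;
}
```

With every category set to the same locale the helper returns false; it returns true in a setup like the Latin-1/Japanese one discussed in this message, provided both locales are installed.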
As an example, consider the string:
my $string = quotemeta join "", map { chr } (1..255);
and with LC_CTYPE=8859-1 (Latin-1, used for several Western European
languages), LC_COLLATE set to ja_JP.utf8. This doesn't make much sense,
outside of specialty uses such as a lazy implementation of a
Japanese/French dictionary, or for quoting snippets in one language in a
document written in the other. ('lazy' because such text should really
be changing locales to the language of the snippet currently being
worked on.) Nevertheless Perl should do something as sensible as
possible, and this commit changes POSIX::strxfrm() to use the method
already in use by the code implementing 'cmp'. Prior to this commit,
POSIX::strxfrm($string) yielded on glibc 12.1:
^\3^\4^\5^\6^\a^\b^\t^\n^\13^\f^\r^\16^\17^\20^\21^\22^\23^\24^\25^\26^\27^\30^\31^\32^\e^\34^\35^\36^\37^ ^!^\"^#^\$^%^&^'^(^)^*^+^,^-^.^/^0^123456789:;^<^=^>^?^\@^A^BCDEFGHIJKLMNOPQRSTUVWXYZ[\\^]^^^_^`a^bcdefghijklmnopqrstuvwxyz{|^}^~^\177^\302\200^\302\201^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3^\3
This output is effectively a sorting order, and is not meant to be human
readable. But it is clear that most of the characters had the
same weight of 3, so a libc sort would treat them as ties in sorting
order.
And after,
^\3^\4^\5^\6^\a^\b^\t^\n^\13^\f^\r^\16^\17^\20^\21^\22^\23^\24^\25^\26^\27^\30^\31^\32^\e^\34^\35^\36^\37^ ^!^\"^#^\$^%^&^'^(^)^*^+^,^-^.^/^0^123456789:;^<^=^>^?^\@^A^BCDEFGHIJKLMNOPQRSTUVWXYZ[\\^]^^^_^`a^bcdefghijklmnopqrstuvwxyz{|^}^~
which shows that most of the ties have been resolved, and hence the
results are more sensible.
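The two dumps above are sort keys: comparing them bytewise gives the same order as a locale-aware comparison of the original strings, and identical keys mean a tie. A minimal C sketch of that relationship follows, using only libc's strxfrm() and strcmp(). The helper names are hypothetical and are not Perl internals.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Perl.  Returns a malloc'ed sort key for
 * `s` under the current LC_COLLATE locale, or NULL on allocation failure.
 * The caller frees the result. */
char *make_sort_key(const char *s)
{
    assert(s != NULL);
    size_t need = strxfrm(NULL, s, 0) + 1;   /* size query, plus the NUL */
    char *key = malloc(need);

    if (key != NULL)
        strxfrm(key, s, need);
    return key;
}

/* Hypothetical helper, not part of Perl.  Compares two strings through
 * their sort keys and returns -1, 0 or 1.  A result of 0 is a tie (or, in
 * this simplified sketch, an allocation failure). */
int compare_by_key(const char *a, const char *b)
{
    char *ka = make_sort_key(a);
    char *kb = make_sort_key(b);
    int result = 0;

    if (ka != NULL && kb != NULL) {
        int c = strcmp(ka, kb);              /* bytewise key comparison */
        result = (c > 0) - (c < 0);
    }
    free(ka);
    free(kb);
    return result;
}
```

Under this scheme, any two characters whose keys are both the single weight 3, as in the first dump, compare as equal.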
Diffstat (limited to 'embed.fnc')
-rw-r--r-- | embed.fnc | 1 |
1 files changed, 1 insertions, 0 deletions
@@ -1382,6 +1382,7 @@ Cp |I32 * |markstack_grow
 #if defined(USE_LOCALE_COLLATE)
 p |int |magic_setcollxfrm|NN SV* sv|NN MAGIC* mg
 p |int |magic_freecollxfrm|NN SV* sv|NN MAGIC* mg
+EXop |SV * |strxfrm |NN SV * src
 : Defined in locale.c, used only in sv.c
 # if defined(PERL_IN_LOCALE_C) || defined(PERL_IN_SV_C) || defined(PERL_IN_MATHOMS_C)
 Ep |char* |mem_collxfrm_ |NN const char* input_string \