Change mem_collxfrm() algorithm for embedded NULs

One of the problems in implementing Perl is that the C library routines forbid embedded NUL characters, which Perl accepts. This is true of strxfrm(), which handles collation under locale.

The best solution, as far as functionality goes, would be for Perl to write its own strxfrm() replacement that handles Perl's specific needs. But that is not going to happen, because of the huge complexity of handling it across many platforms. We would have to know the location and format of the locale definition files for every such platform. Some might follow POSIX guidelines, some might not.

strxfrm() creates a transformation of its input into a new string consisting of weight bytes. In the typical but general case, a 3-character NUL-terminated input string 'A B C 00' (spaces added for readability) gets transformed into something like:

    A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00

where the superscripted characters are weights for the corresponding input characters. Superscript 1 represents (essentially) the primary sorting key; 2, the secondary; and so on, for as many levels as the locale definition gives. The 01 byte is likely to be the separator between levels, but not necessarily, and other mechanisms could be in use on various platforms.

To handle embedded NULs, the simplest thing would be to just remove them before passing the string to strxfrm(). Then they would be entirely ignored, which might not be what you want. You might want them to have some weight at the tertiary level, for example. Removal also causes problems because strxfrm() is very context sensitive. The locale definition can define weights for specific sequences of any length (and the weights can be multi-byte). Removing a NUL makes two characters adjacent that weren't adjacent in the input, and they could now form one of those special sequences and thus throw things off.

Another way to handle NULs, one that seemingly ignores them but actually doesn't, is the mechanism in use prior to this commit. The input string is split at the NULs, the substrings are passed independently to strxfrm(), and the results are concatenated together. This doesn't work either. In our example 'A B C 00', suppose B is a NUL and should have some weight at the tertiary level. What we want is:

    A¹ C¹ 01 A² C² 01 A³ B³ C³ 00

But that's not at all what you get. Instead it is:

    A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00

The primary weight of C comes immediately after the tertiary weight of A. More importantly, a NUL, instead of being ignored at the primary levels, is significant at all levels, so that "a\0c" would sort before "ab".

Still another possibility is to replace the NUL with some other character before passing the string to strxfrm(). That was my original plan: to replace each NUL with the character that this code determines has the lowest collation order for the current locale. On strings that don't contain that character, the results would be as good as it gets for that locale. That character is likely to be ignored at higher weight levels, but to have some small non-ignored weight at the lowest ones. And hopefully the character would rarely be encountered in practice. When it does occur, it and NUL would sort identically; hardly the end of the world. If the entire strings sorted identically, the NUL-containing one would come out before the other one, since the original Perl strings are used as a tie breaker.

However, testing showed a problem with this. If that other character is part of a sequence that has special weighting, the results won't be correct. With gcc, U+00B4 ACUTE ACCENT is the lowest collating character in many UTF-8 locales. In Romanian and Vietnamese it combines with some other characters to change weights, and hence changing NULs into U+00B4 breaks the ordering.

What I have finally settled on is a modification of this last approach, in which the possible NUL replacements are limited to characters that are controls in the locale. NULs are replaced by the lowest collating control. It would really be a defective locale if this control combined with some other character to form a special sequence. Often the character will be 01, START OF HEADING. In the very unlikely case that there are absolutely no controls in the locale, 01 is used, because we have to replace the NUL with something.

The code added by this commit is mostly UTF-8 ready. A few commits from now will make Perl work properly with UTF-8 (if the platform supports it). Until then, this isn't a full implementation; it only looks for the lowest-sorting control that is invariant, where the UTF8ness doesn't matter. The added tests are marked as TODO until then.
author: Karl Williamson <khw@cpan.org> 2016-04-09 15:52:05 -0600
committer: Karl Williamson <khw@cpan.org> 2016-05-24 10:26:29 -0600
commit: 6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e (patch)
tree: 33c10ae363a06ff143829278473a93ff4e9e2e1a /pod/perllocale.pod
parent: 59c018b996263ec705a1e7182f7fa996b72207da (diff)
download: perl-6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e.tar.gz
1 files changed, 8 insertions, 7 deletions
diff --git a/pod/perllocale.pod b/pod/perllocale.pod
index 0c7e769111..d842a0781a 100644
--- a/pod/perllocale.pod
+++ b/pod/perllocale.pod
@@ -1567,13 +1567,14 @@ called, and whatever it does is what you get.
 
 =head2 Collation of strings containing embedded C<NUL> characters
 
-Perl handles C<NUL> characters in the middle of strings.  In many
-locales, control characters are ignored unless the strings otherwise
-compare equal.  Unlike other control characters, C<NUL> characters are
-never ignored.   For example, if given that C<"b"> sorts after
-C<"\001">, and C<"c"> sorts after C<"b">, C<"a\0c"> always sorts before
-C<"ab">.  This is true even in locales in which C<"ab"> sorts before
-C<"a\001c">.
+C<NUL> characters will sort the same as the lowest collating control
+character does, or to C<"\001"> in the unlikely event that there are no
+control characters at all in the locale.  In cases where the strings
+don't contain this non-C<NUL> control, the results will be correct, and
+in many locales, this control, whatever it might be, will rarely be
+encountered.  But there are cases where a C<NUL> should sort before this
+control, but doesn't.  If two strings do collate identically, the one
+containing the C<NUL> will sort to earlier.
 
 =head2 Broken systems