path: root/embedvar.h
Commit message (author, date; files changed, lines -deleted/+added)

* Add hook for Unicode private use override (Karl Williamson, 2019-03-07; 1 file, -0/+2)

  I am starting to write a Unicode::Private_Use module, which will allow one
  to specify the Unicode properties of private use code points, thus making
  them actually useful.  This commit adds a hook to regcomp.c to accommodate
  this module.  The changes are pretty minimal.  This way we don't have to
  wait another release cycle to get it out there.  I don't want to document
  this interface until it's proven.

* fix thread issue with PERL_GLOBAL_STRUCT (David Mitchell, 2019-02-19; 1 file, -1/+4)

  The MY_CXT subsystem allows per-thread pseudo-static data storage.  Part of
  the implementation for this involves each XS module being assigned a unique
  index in its my_cxt_index static var when first loaded.

  Because PERL_GLOBAL_STRUCT bans any static vars, under those builds there
  is instead a table which maps the MY_CXT_KEY identifying string to an
  index.  Unfortunately, this table was allocated per-interpreter rather than
  globally, meaning that if multiple threads tried to load the same XS
  module, crashes could ensue.  This manifested itself in failures in
  ext/XS-APItest/t/keyword_plugin_threads.t.

  The fix is relatively straightforward: allocate PL_my_cxt_keys globally
  rather than per-interpreter.  Also record the size of this struct in a new
  var, PL_my_cxt_keys_size, rather than doing double duty on PL_my_cxt_size.
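
  As background, the MY_CXT pattern an XS module uses looks roughly like this
  (following the interface documented in perlxs; the module name and struct
  contents are hypothetical):

      #define PERL_NO_GET_CONTEXT
      #include "EXTERN.h"
      #include "perl.h"
      #include "XSUB.h"

      /* Unique identifying string; under PERL_GLOBAL_STRUCT builds this is
       * what the (now global) PL_my_cxt_keys table maps to an index. */
      #define MY_CXT_KEY "My::Module::_guts" XS_VERSION

      typedef struct {
          int call_count;       /* hypothetical per-thread data */
      } my_cxt_t;

      START_MY_CXT

      static void
      bump_count(pTHX)
      {
          dMY_CXT;              /* locate this thread's my_cxt_t */
          MY_CXT.call_count++;
      }

  MY_CXT_INIT is run once in the module's BOOT: section, and MY_CXT_CLONE in
  a CLONE sub, so that each thread ends up with its own copy of the struct.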

* foo_cloexec() under PERL_GLOBAL_STRUCT_PRIVATE (David Mitchell, 2019-02-19; 1 file, -0/+18)

  Fix the various Perl_PerlSock_dup2_cloexec() type functions so that
  t/porting/libperl.t passes under -DPERL_GLOBAL_STRUCT_PRIVATE builds.

  In these builds it is forbidden to have any static variables, but each of
  these functions (via convoluted macros) has a static var called 'strategy'
  which records, for each function, whether a run-time probe has been done to
  determine the best way of achieving close-on-exec functionality, and the
  result.

  Replace them all with 'global' vars: PL_strategy_dup2 etc.

  NB: these vars aren't thread-safe, but it doesn't really matter, as the
  worst that can happen is for a redundant probe or two to be done before a
  suitable "don't probe any more" value is written to the var and seen by all
  the threads.
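
  A minimal sketch of the probe-and-remember idiom described above, assuming
  a platform with dup3(); the names here are made up rather than perl's
  actual ones:

      #define _GNU_SOURCE           /* dup3() on glibc */
      #include <errno.h>
      #include <fcntl.h>
      #include <unistd.h>

      /* Hypothetical global replacing the per-function static 'strategy';
       * in perl these became PL_strategy_dup2 etc. */
      static int strategy_dup2 = 0; /* 0 unprobed, 1 dup3 works, 2 fallback */

      int
      dup2_cloexec(int oldfd, int newfd)
      {
          if (strategy_dup2 != 2) {
              int fd = dup3(oldfd, newfd, O_CLOEXEC);
              if (fd >= 0) {
                  strategy_dup2 = 1;  /* remember that the fast path works */
                  return fd;
              }
              if (errno != ENOSYS)
                  return fd;          /* real failure, not a missing call */
              strategy_dup2 = 2;      /* probe failed: use fallback below */
          }
          /* Portable fallback: dup2 then set the flag; unlike dup3 this
           * leaves a window before the flag is set. */
          if (dup2(oldfd, newfd) < 0)
              return -1;
          fcntl(newfd, F_SETFD, FD_CLOEXEC);
          return newfd;
      }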

* Add global hash to handle \p{user-defined} (Karl Williamson, 2019-02-14; 1 file, -0/+4)

  A global hash has to be specially handled.  The keys can't be shared, and
  all the SVs stored into it must be in its thread.  This commit adds the
  hash, the initialization, and macros for context change, but doesn't use
  them.  The code to deal with this is entirely confined to regcomp.c.

* Add mutex for dealing with qr/\p{user-defined}/ (Karl Williamson, 2019-02-14; 1 file, -0/+2)

  This will be used in future commits.

* Add variable for if the current UTF-8 locale is Turkic (Karl Williamson, 2019-02-05; 1 file, -0/+1)

  It currently is always set false, until later in this series of commits.

* regen/mk_invlists.pl: Create new inversion list (Karl Williamson, 2019-02-05; 1 file, -0/+2)

  This will be used in a future commit.

* Change name of PL_NonL1NonFinalFold (Karl Williamson, 2018-12-25; 1 file, -2/+2)

  The inversion list this refers to now includes the Latin 1 range, so the
  name was misleading.

* Change name of PL_utf8_foldable variable (Karl Williamson, 2018-12-25; 1 file, -2/+2)

  This variable's name was out-of-date and misleading.  It is the name of an
  inversion list that contains all the code points in the current version of
  Unicode that participate in any way in a /i type of fold.

* regen/mk_invlists.pl: Add new table (Karl Williamson, 2018-12-07; 1 file, -0/+2)

  This table contains all the code points that are in any multi-character
  fold (not the folded-from character, but what that character folds to).
  It will be used in a future commit.

* Make global two interpreter variables (Karl Williamson, 2018-07-14; 1 file, -2/+4)

  These variables are constant, once initialized, through the life of a
  program, so having them be per-instance is a waste of time and space.

* regcomp.c: Simplify (Karl Williamson, 2018-06-25; 1 file, -0/+2)

  Under /a pattern matching, the matches of the [:posix:] classes are
  restricted to the ASCII range.  Previously, in a time/space trade-off that
  favored space, we created the list of matching characters at pattern
  compilation time by ANDing the full-range POSIX class with the set of
  ASCII characters.

  But now the tables for just the ASCII-range classes are generated anyway,
  so there's no need to do that compilation-time intersection.  This slightly
  simplifies the code.

* Use compiled-in C structure for inverted case folds (Karl Williamson, 2018-03-31; 1 file, -1/+2)

  This commit changes the code to use the C data structures generated by the
  previous commit to compute what characters fold to a given one.  This is
  used to find out what things should match under /i.

  This now avoids the expensive start-up cost of switching to the perl-level
  utf8_heavy.pl, loading a file from disk, and constructing a hash from it.

* Remove obsolete variables (Karl Williamson, 2018-03-31; 1 file, -1/+0)

  These were for when some of the POSIX character classes were implemented
  as swashes, which is no longer the case, so these can be removed.

* Use charnames inversion lists (Karl Williamson, 2018-03-31; 1 file, -2/+4)

  This commit makes the inversion lists for parsing character names global
  instead of interpreter-level, so they can be initialized once per process,
  and no copies are created upon new thread instantiation.  More importantly,
  this is another instance where utf8_heavy.pl no longer needs to be loaded,
  and the definition files read from disk.

* Move case change invlists from interpreter to global (Karl Williamson, 2018-03-26; 1 file, -5/+10)

  These are now constant through the life of the program, so don't need to
  be duplicated at each new thread instantiation.

* Move UTF-8 case changing data into core (Karl Williamson, 2018-03-26; 1 file, -0/+1)

  Prior to this commit, if a program wanted to compute the case-change of a
  character above 0xFF, the C code would switch to perl, loading
  lib/utf8_heavy.pl, then read another file from disk, and then create a
  hash.  Future references would use the hash, but the start-up cost is
  quite large.  There are five case change types: uc, lc, tc, fc, and simple
  fc.  Only the first encountered requires loading of utf8_heavy, but each
  required switching to utf8_heavy and reading the appropriate file from
  disk.

  This commit changes these functions to use compiled-in C data structures
  (inversion maps) to represent the data.  To look something up requires a
  binary search instead of a hash lookup.

  An individual hash lookup tends to be faster than a binary search, but the
  differences are small for small sizes.  I did some benchmarking some years
  ago (commit message 87367d5f9dc9bbf7db1a6cf87820cea76571bf1a), and the
  results were that for fewer than 512 entries, the binary search was just
  as fast as a hash, if not actually faster.  Now I've done some more
  benchmarks on blead, using the tool benchmark.pl, which wasn't available
  back then.  The results below indicate that the differences are minimal
  up through 2047 entries, which all Unicode properties are well within.

  A hash, PL_foldclosures, is still constructed at runtime for the case of
  regular expression /i matching, and this could be generated at Perl
  compile time, as a further enhancement for later.  But reading a file from
  disk is no longer required to do this.

  ======================= benchmarking results =======================

  Key:
      Ir     Instruction read
      Dr     Data read
      Dw     Data write
      COND   conditional branches
      IND    indirect branches
      _m     branch predict miss
      _m1    level 1 cache miss
      _mm    last cache (e.g. L3) miss
      -      indeterminate percentage (e.g. 1/0)

  The numbers represent raw counts per loop iteration.

  "\x{10000}" =~ qr/\p{CWKCF}/

               swash    invlist   Ratio %
               fetch    search
               ------   -------   -------
      Ir       2259.0   2264.0     99.8
      Dr        665.0    664.0    100.2
      Dw        406.0    404.0    100.5
      COND      406.0    405.0    100.2
      IND        17.0     15.0    113.3
      COND_m      8.0      8.0    100.0
      IND_m       4.0      4.0    100.0
      Ir_m1       8.9     17.0     52.4
      Dr_m1       4.5      3.4    132.4
      Dw_m1       1.9      1.2    158.3
      Ir_mm       0.0      0.0    100.0
      Dr_mm       0.0      0.0    100.0
      Dw_mm       0.0      0.0    100.0

  These were constructed by using the file whose contents are below, which
  uses the property in Unicode that currently has the largest number of
  entries in its inversion list, > 1600.  The test was run on blead, -O2, no
  debugging, no threads.  Then the cut-off boundary was changed from 512 to
  2047 for when we use a hash vs. an inversion list, and the test run again.
  This yields the difference between a hash fetch and an inversion list
  binary search.

  ===================== The benchmark file is below ===============

      no warnings 'once';
      my @benchmarks;
      push @benchmarks, 'swash' => {
          desc  => '"\x{10000}" =~ qr/\p{CWKCF}/"',
          setup => 'no warnings "once"; my $re = qr/\p{CWKCF}/;
                    my $a = "\x{10000}";',
          code  => '$a =~ $re;',
      };
      \@benchmarks;
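
  As a rough illustration of the inversion-map idea, a self-contained sketch
  with made-up data (perl's real tables are machine-generated and far
  larger):

      #include <stddef.h>

      /* An inversion map is two parallel arrays: range start points, and
       * what each range maps to.  This toy map says code points 0x61..0x7A
       * ("a".."z") uppercase by subtracting 32; all else maps to itself. */
      static const unsigned int start[] = { 0x00, 0x61, 0x7B };
      static const int          delta[] = { 0,    -32,  0    };

      static unsigned int
      toy_toupper(unsigned int cp)
      {
          /* Binary search for the greatest start[] entry <= cp. */
          size_t lo = 0, hi = sizeof(start) / sizeof(start[0]);
          while (hi - lo > 1) {
              size_t mid = (lo + hi) / 2;
              if (start[mid] <= cp)
                  lo = mid;
              else
                  hi = mid;
          }
          return cp + delta[lo];
      }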

* Make Unicode data structures global (Karl Williamson, 2018-03-14; 1 file, -19/+38)

  These structures are read-only, use const C strings, and are truly global,
  so there is no need to have them be interpreter-level.  This saves
  duplicating and freeing them as threads come and go.

  In doing this, I noticed that not every one was properly being
  copied/deallocated, so this fixes some potential unreported bugs and leaks.

* Add thread-safe locale handling (Karl Williamson, 2018-02-18; 1 file, -0/+1)

  This (large) commit allows locales to be used in threaded perls on
  platforms that support it.  This includes recent Windows and POSIX 2008
  ones.

* Latch LC_NUMERIC during critical sections (Karl Williamson, 2018-02-18; 1 file, -0/+3)

  It is possible for operations on threaded perls which don't 'use locale'
  to still change the locale.  This happens when calling POSIX::localeconv()
  and I18N::Langinfo(), and in earlier perls it can happen for other
  operations when perl has been initialized with an environment that causes
  the various locale categories to not have a uniform locale.

  This commit wraps the areas where the locale for this category must
  predictably be in one state or the other in a critical section, so that
  another thread can't interrupt and change it.  This uses a separate mutex,
  so that only these particular operations will be held up.

* Add Perl_setlocale() (Karl Williamson, 2018-02-18; 1 file, -0/+2)

  khw could not find any modules on CPAN that correctly use the C library
  function setlocale().  (The very few that do try do not use it correctly,
  looking at the return value incorrectly, so they are broken.)  This
  analysis does not include modules that call non-Perl libraries that may
  call setlocale().

  And a future commit will render the setlocale() function useless in some
  configurations on some platforms.

  So this commit adds Perl_setlocale() for XS code to call, which is always
  effective, but it should not be used to alter the locale except on
  platforms where the predefined variable ${^SAFE_LOCALES} evaluates to 1.

  This function is also what POSIX::setlocale() calls to do the real work.
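
  A sketch of how XS code might call it, assuming a perl new enough to
  provide Perl_setlocale(); see perlapi for the authoritative signature:

      #include <locale.h>           /* LC_* category constants */
      #define PERL_NO_GET_CONTEXT
      #include "EXTERN.h"
      #include "perl.h"
      #include "XSUB.h"

      static void
      show_numeric_locale(pTHX)
      {
          /* A NULL locale queries without changing anything, mirroring
           * libc setlocale(). */
          const char *cur = Perl_setlocale(LC_NUMERIC, NULL);
          PerlIO_printf(PerlIO_stdout(), "LC_NUMERIC is %s\n",
                        cur ? cur : "(unknown)");
      }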

* Avoid changing locale when finding radix char (Karl Williamson, 2018-01-30; 1 file, -0/+1)

  On systems that have the POSIX 2008 operations, including nl_langinfo_l(),
  this commit causes them to not have to actually change the locale when
  determining what the decimal point character is.  The locale may have to
  change during the printing/reading of numbers, but eventually we can use
  sprintf_l(), if available, to avoid that too.

* Cache locale UTF8-ness lookups (Karl Williamson, 2018-01-30; 1 file, -0/+1)

  Some locales are UTF-8, some are not.  Knowledge of this is needed in
  various circumstances.  This commit saves the results of the last several
  lookups so they don't have to be recalculated each time.

  The full generality of POSIX locales is such that you can have error
  messages be displayed in one locale, say Spanish, while other things are
  in French.  To accommodate this generality, the program can loop through
  all the locale categories, finding the UTF8ness of the locale each points
  to.  However, in almost all instances, people are going to be in either
  French or in Spanish, not in some combination.  Suppose it is a French
  UTF-8 locale for all categories.  This new cache will know that the French
  locale is UTF-8, and the queries for all but the first category can return
  that immediately.  This simple cache avoids the overhead of hashes.

  This also fixes a bug I realized exists in threaded perls, but haven't
  reproduced.  We do not support locales in such perls, and the user must
  not change the locale or 'use locale'.  But perl itself could change the
  locale behind the scenes, leading to segfaults or incorrect results.  One
  such instance is the determination of UTF8ness.  But this could only
  happen if the full generality of locales is used, so that the categories
  are not all in the same locale.  This could only happen (if the user
  doesn't change locales) if the environment is such that the perl program
  is started up with the categories in such a state.

  This commit fixes this potential bug by caching the UTF8ness of each
  category at startup, before any threads are instantiated, so that checking
  for it later just looks it up in the cache, without perl changing the
  locale.
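
  The sort of tiny fixed-size cache described might look like this sketch;
  all names are hypothetical, and the probe shown is a crude stand-in for
  the real UTF8ness determination:

      #include <string.h>

      #define CACHE_SLOTS 8

      typedef struct {
          char name[64];
          int  is_utf8;
      } locale_utf8ness_t;

      static locale_utf8ness_t cache[CACHE_SLOTS];
      static int cache_used = 0;

      static int
      probe_locale_is_utf8(const char *name)
      {
          /* Stand-in for the expensive real check. */
          return strstr(name, "UTF-8") != NULL || strstr(name, "utf8") != NULL;
      }

      static int
      locale_is_utf8(const char *name)
      {
          int i, slot, result;
          for (i = 0; i < cache_used; i++)
              if (strcmp(cache[i].name, name) == 0)
                  return cache[i].is_utf8;   /* hit: no recalculation */
          result = probe_locale_is_utf8(name);
          /* Overwrite slot 0 when full; a fancier scheme could evict LRU. */
          slot = (cache_used < CACHE_SLOTS) ? cache_used++ : 0;
          strncpy(cache[slot].name, name, sizeof(cache[slot].name) - 1);
          cache[slot].name[sizeof(cache[slot].name) - 1] = '\0';
          cache[slot].is_utf8 = result;
          return result;
      }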

* Avoid some unnecessary changing of locales (Karl Williamson, 2018-01-30; 1 file, -0/+1)

  The LC_NUMERIC locale category is kept so that generally the decimal point
  (radix) is a dot.  For some (mostly) output purposes, it needs to be
  swapped into the program's current underlying locale so that a non-dot can
  be printed.  This commit changes things so that if the current underlying
  locale uses a decimal point, the swap doesn't happen, as it's not needed.

* Remove unused interpreter variable (Karl Williamson, 2017-12-26; 1 file, -1/+0)

  This somehow became unused or never got used; I didn't do the research.

* Add script_run regex feature (Karl Williamson, 2017-12-24; 1 file, -0/+1)

  As explained in the docs, this helps detect spoofing attacks.

* make exec keep its argument list more reliably (Zefram, 2017-12-14; 1 file, -2/+0)

  Bits of the exec code were putting the constructed commands into the
  globals PL_Argv and PL_Cmd, which could then be clobbered by reentrancy.
  These are only global in order to manage their freeing, but that's better
  managed by using the scope stack.  So replace them with automatic
  variables, with ENTER/SAVEFREEPV/LEAVE to free the memory.  Also copy the
  strings acquired from SVs, to avoid magic clobbering the buffers of SVs
  already read.  Fixes [perl #129888].
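
  The scope-stack idiom referred to, sketched for a hypothetical XS-level
  helper:

      static void
      with_scoped_buffer(pTHX_ SV *sv)
      {
          ENTER;                    /* open a new scope */

          /* savepv() takes a copy, so later magic on sv can't clobber it. */
          char *buf = savepv(SvPV_nolen(sv));
          SAVEFREEPV(buf);          /* freed automatically at scope exit,
                                       even if something croak()s first */

          /* ... use buf ... */

          LEAVE;                    /* buf is freed here */
      }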

* add wrap_keyword_plugin function (RT #132413) (Lukas Mai, 2017-11-11; 1 file, -0/+2)

* Change name of locale per-interpreter variable (Karl Williamson, 2017-11-08; 1 file, -1/+1)

  The real purpose of this internal variable is to give the name of the
  locale that is the underlying one for the C program.  Various macros
  already indicate that.  This furthers the process.

* (perl #127663) create a separate random source for internal use (Tony Cook, 2017-09-11; 1 file, -0/+1)

  and use it to initialize hash randomization and to inoculate against
  quadratic behaviour in pp_sort

* Add API function Perl_langinfo() (Karl Williamson, 2017-09-09; 1 file, -0/+2)

  This is designed to generally replace nl_langinfo() in XS code.  It is
  thread-safer, hides the quirks of perl's LC_NUMERIC handling, and can be
  used on systems lacking nl_langinfo.
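
  Illustrative usage, assuming the RADIXCHAR item that nl_langinfo() also
  takes; perlapi documents the exact set of items available:

      #define PERL_NO_GET_CONTEXT
      #include "EXTERN.h"
      #include "perl.h"
      #include "XSUB.h"
      #include "perl_langinfo.h"    /* instead of <langinfo.h> */

      static void
      show_radix(pTHX)
      {
          /* The decimal-point string for the current numeric locale,
           * without the caller juggling LC_NUMERIC state itself. */
          const char *radix = Perl_langinfo(RADIXCHAR);
          PerlIO_printf(PerlIO_stdout(), "radix is '%s'\n", radix);
      }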

* Make immortal SVs contiguous (David Mitchell, 2017-07-27; 1 file, -0/+1)

  Ensure that PL_sv_yes, PL_sv_undef, PL_sv_no and PL_sv_zero are allocated
  adjacently in memory.  This allows the SvIMMORTAL() test to be more
  efficient, and will (in the next commit) allow SvTRUE() to be more
  efficient.

  In MULTIPLICITY builds the constraint is already met by virtue of them
  being adjacent items in the interpreter struct.  For non-MULTIPLICITY
  builds, they were just 4 global vars with no guarantees of where they
  would be allocated.  For this case, the four are deleted as global vars
  and replaced with a new global var, PL_sv_immortals[4], with
  #define PL_sv_yes (PL_sv_immortals[0]) etc. in their place.

* add PL_sv_zero (David Mitchell, 2017-07-27; 1 file, -0/+1)

  It's like PL_sv_no, except that its string value is "0" rather than "".
  It can be used, for example, where a pp function wants to push a zero
  return value on the stack.  The next commit will start to use it.

  Also update SvIMMORTAL() to be more efficient: it now checks whether the
  SV's address is in a range rather than individually checking against
  &PL_sv_undef, &PL_sv_no, etc.
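
  The range test can be sketched as below; a simplified stand-in, not the
  actual macro from sv.h:

      /* With the four immortals contiguous in PL_sv_immortals[], membership
       * reduces to one range comparison instead of four pointer compares. */
      #define MY_SvIMMORTAL(sv) \
          ((sv) >= &PL_sv_immortals[0] && (sv) <= &PL_sv_immortals[3])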
* Restore "Add new hashing and "hash with state" infrastructure"Yves Orton2017-06-011-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit e6a172f358c0f48c4b744dbd5e9ef6ff0b4ff289, which was a revert of a3bf60fbb1f05cd2c69d4ff0a2ef99537afdaba7. Add new hashing and "hash with state" infrastructure This adds support for three new hash functions: StadtX, Zaphod32 and SBOX, and reworks some of our hash internals infrastructure to do so. SBOX is special in that it is designed to be used in conjuction with any other hash function for hashing short strings very efficiently and very securely. It features compile time options on how much memory and startup time are traded off to control the length of keys that SBOX hashes. This also adds support for caching the hash values of single byte characters which can be used in conjuction with any other hash, including SBOX, although SBOX itself is as fast as the lookup cache, so typically you wouldnt use both at the same time. This also *removes* support for Jenkins One-At-A-Time. It has served us well, but it's day is done. This patch adds three new files: zaphod32_hash.h, stadtx_hash.h, sbox32_hash.h

* Eliminate remaining uses of PL_statbuf (Dagfinn Ilmari Mannsåker, 2017-06-01; 1 file, -1/+0)

  Give Perl_nextargv its own statbuf and pass a pointer to it into
  Perl_do_open_raw and thence S_openn_cleanup when needed.  Also reduce the
  scope of the existing statbuf in Perl_nextargv to make it clear it's
  distinct from the one populated by do_open_raw.

  Fix perldelta entry for PL_statbuf removal.
* Revert "Add new hashing and "hash with state" infrastructure"Yves Orton2017-04-231-4/+0
| | | | | | This reverts commit a3bf60fbb1f05cd2c69d4ff0a2ef99537afdaba7. Accidentally pushed work pending unfreeze.

* Add new hashing and "hash with state" infrastructure (Yves Orton, 2017-04-23; 1 file, -0/+4)

  This adds support for three new hash functions: StadtX, Zaphod32 and SBOX,
  and reworks some of our hash internals infrastructure to do so.

  SBOX is special in that it is designed to be used in conjunction with any
  other hash function for hashing short strings very efficiently and very
  securely.  It features compile-time options on how much memory and startup
  time are traded off to control the length of keys that SBOX hashes.

  This also adds support for caching the hash values of single-byte
  characters, which can be used in conjunction with any other hash,
  including SBOX, although SBOX itself is as fast as the lookup cache, so
  typically you wouldn't use both at the same time.

  This also *removes* support for Jenkins One-At-A-Time.  It has served us
  well, but its day is done.

  This patch adds three new files: zaphod32_hash.h, stadtx_hash.h,
  sbox32_hash.h

* Create inversion list for Assigned code points (Karl Williamson, 2016-12-23; 1 file, -0/+1)

  This will be used in a future commit.

* Deprecate isFOO_utf8() macros (Karl Williamson, 2016-12-23; 1 file, -0/+1)

  These macros are being replaced by a safe version; they now generate a
  deprecation message at each call site upon the first use there in each
  program run.

* Change name of PL_ variable (Karl Williamson, 2016-11-28; 1 file, -1/+1)

  This variable really means the character that replaces any embedded NULs
  when doing collation.  Change the name accordingly.  (Embedded NULs must
  be replaced because the libc function strxfrm is used, and it operates on
  C strings, which have no embedded NULs.)

* rework perl #129903 - inf recursion from use of empty pattern in regex code block (Yves Orton, 2016-11-01; 1 file, -0/+1)

  FC didn't like my previous patch for this issue, so here is the one he
  likes better.  With tests and etc. :-)

  The basic problem is that code like this:

      /(?{ s!!! })/

  can trigger infinite recursion on the C stack (not the normal perl stack)
  when the last successful pattern in scope is itself.  Since the C stack
  overflows, this manifests as an untrappable error/segfault, which then
  kills perl.

  We avoid the segfault by simply forbidding the use of the empty pattern
  when it would resolve to the currently executing pattern.

  I imagine with a bit of effort someone can trigger the original SEGV,
  unlike my original fix, which forbade use of the empty pattern in a regex
  code block.  So if someone actually reports such a bug, we might have to
  revert to the older approach of prohibiting this.

* Make PERLLIB_SEP dynamic on VMS (Craig A. Berry, 2016-09-01; 1 file, -0/+4)

  Because if we're running under a Unix shell, the path separator is likely
  to meet the expectations of Unix shell scripts better if it's the Unix ':'
  rather than the VMS '|'.  There is no change when running under DCL.

* Remove PL_maxo (Father Chrysostomos, 2016-08-14; 1 file, -1/+0)

  We have an interpreter variable using memory, PL_maxo, which is defined to
  be the same as MAXO, a #defined constant.  As far as I can tell, it is
  never used in lvalue context, in core or on CPAN, except for the
  initialisation in intrpvar.h.  It can simply be removed and replaced with
  a macro defined as equivalent to MAXO.

  It was added in this commit:

      commit 84ea024ac9cdf20f21223e686dddea82d5eceb4f
      Author: Perl 5 Porters <perl5-porters.nicoh.com>
      Date:   Tue Jan 2 23:21:55 1996 +0000

          perl 5.002beta1h patch: perl.h

          5.002beta1 attempted some memory optimizations, but unfortunately
          they can result in a memory leak problem.  This can be avoided by
          #define STRANGE_MALLOC.  I do that here until consensus is reached
          on a better strategy for handling the memory optimizations.

          Include maxo for the maximum number of operations (needed for the
          Safe extension).

  But apparently it is not needed for the Safe extension (tests pass without
  it).
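
  The replacement described amounts to a one-line alias; a sketch (the
  actual definition in the perl source may differ):

      /* PL_maxo no longer occupies per-interpreter memory; reads of it
       * now compile down to the constant. */
      #define PL_maxo  MAXO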

* locale.c: Revamp my_strerror() for thread-safeness (Karl Williamson, 2016-07-29; 1 file, -0/+2)

  This commit is the first step in making locale handling thread-safe.

  [perl #127708] was solved for 5.24 by adding a mutex in this function.
  That bug was caused by the code changing the locale even if the calling
  program is not consciously using locales.

  POSIX 2008 introduced thread-safe locale functions.  This commit changes
  this function to use them if the perl is threaded and the platform has
  them available.  This means that the mutex is avoided on modern platforms.

  It restructures the function to return a mortal copy of the error message.
  This is a step towards making the function completely thread-safe.  Right
  now, as documented, if you do 'use locale', locale handling isn't
  thread-safe.

  A global C locale object is created and used here if necessary.  It is
  destroyed at the end of the program.

  Note that some platforms have a strerror_r(), which is automatically used
  instead of strerror() if available.  It differs from straight strerror()
  by taking a buffer in which to place the returned string, so the return
  does not point to internal static storage.  One could test for the
  existence of this and avoid the mortal copy.
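
  A sketch of the strerror_r() pattern mentioned, assuming the POSIX (XSI)
  variant that returns an int (glibc also ships a GNU variant returning
  char *, which is one reason probing for behavior is needed):

      #define _POSIX_C_SOURCE 200809L   /* request the XSI strerror_r */
      #include <stdio.h>
      #include <string.h>

      /* Thread-safe lookup: the text lands in a caller-owned buffer rather
       * than libc's internal static storage. */
      static const char *
      my_strerror(int errnum, char *buf, size_t buflen)
      {
          if (strerror_r(errnum, buf, buflen) != 0)
              snprintf(buf, buflen, "Unknown error %d", errnum);
          return buf;
      }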

* Remove PL_(lex_)encoding and all dependent code (Father Chrysostomos, 2016-07-13; 1 file, -2/+0)

* Do better locale collation in UTF-8 locales (Karl Williamson, 2016-05-24; 1 file, -0/+1)

  On some platforms, the libc strxfrm() works reasonably well on UTF-8
  locales, giving a default collation ordering.  It will assume that every
  string passed to it is in UTF-8.  This commit changes Perl to make sure
  that strxfrm's expectations are met.

  Likewise, under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8
  string.  And this commit makes sure of that as well.

  So, simply meeting strxfrm's expectations allows Perl to start supporting
  default collation in UTF-8 locales, and fixes it to work on single-byte
  locales with UTF-8 input.  (Unicode::Collate provides tailorable
  functionality and is portable to platforms where strxfrm isn't as
  intelligent, but is a much more heavy-weight solution that may not be
  needed for particular applications.)

  There is a problem in non-UTF-8 locales if the passed string contains code
  points representable only in UTF-8.  This commit causes them to be
  changed, before being passed to strxfrm, into the highest-collating
  character in the locale that doesn't require UTF-8.  They then will sort
  the same as that character, which means after all other characters in the
  locale but that one.  In strings that don't have that character, this will
  generally provide exactly correct operation.

  There still is a problem if that character, in the given locale, combines
  with adjacent characters to form a specially weighted sequence.  Then the
  change of these above-255 code points into that character can skew the
  results.  See the commit message of
  6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e for more on this.  But it is
  really an illegal situation to have above-255 code points in a single-byte
  locale, so this behavior is a reasonable degradation when given illegal
  input.

  If two transformed strings compare exactly equal, Perl already uses the
  un-transformed versions to break ties, and there, these faked-up strings
  will collate so the above-255 code points sort after everything else, and
  in code point order amongst themselves.

* locale.c: Change algorithm for strxfrm() trials (Karl Williamson, 2016-05-24; 1 file, -0/+1)

  It's kind of guesswork deciding how big a buffer to give to strxfrm().
  If you give it too small a one, it will fail.  Prior to this commit, the
  buffer size was doubled and then strxfrm() was called again, looping until
  it worked, or we used too much memory.

  Each time a new locale is made, we try to minimize the necessity of doing
  this by calculating numbers 'm' and 'b' that can be plugged into the
  equation

      mx + b

  where 'x' is the size of the string passed to strxfrm().  strxfrm() is
  roughly linear with respect to its input's length, so this generally works
  without us having to do many loops to get a large enough size.

  But on many systems, strxfrm(), in failing, returns how much space you
  should have given it.  On such systems, we can just use that number on the
  second try and not have to keep guessing.  This commit changes to do that.

  But on other systems this doesn't work.  So the original method is
  retained if we determine that there are problems with strxfrm(), either
  from previous experience or because using the size returned from the first
  trial didn't work.
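
  The C-standard behavior being relied on is that strxfrm() returns the
  length the transformation needs, whether or not it fit.  A minimal sketch,
  ignoring the per-platform quirks the commit works around:

      #include <stdlib.h>
      #include <string.h>

      /* Transform s for collation; caller frees the result.  Returns NULL
       * on allocation failure. */
      static char *
      xfrm_dup(const char *s)
      {
          /* With size 0 the call just reports the needed length. */
          size_t needed = strxfrm(NULL, s, 0);
          char *buf = malloc(needed + 1);
          if (buf)
              strxfrm(buf, s, needed + 1);
          return buf;
      }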

* Change mem_collxfrm() algorithm for embedded NULs (Karl Williamson, 2016-05-24; 1 file, -0/+1)

  One of the problems in implementing Perl is that the C library routines
  forbid embedded NUL characters, which Perl accepts.  This is true for the
  case of strxfrm(), which handles collation under locale.

  The best solution as far as functionality goes would be for Perl to write
  its own strxfrm replacement, which would handle the specific needs of
  Perl.  But that is not going to happen because of the huge complexity in
  handling it across many platforms.  We would have to know the location and
  format of the locale definition files for every such platform.  Some might
  follow POSIX guidelines, some might not.

  strxfrm creates a transformation of its input into a new string consisting
  of weight bytes.  In the typical but general case, a 3-character
  NUL-terminated input string 'A B C 00' (spaces added for readability) gets
  transformed into something like:

      A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00

  where the superscripted characters are weights for the corresponding input
  characters.  Superscript 1 represents (essentially) the primary sorting
  key; 2 the secondary, etc., for as many levels as the locale definition
  gives.  The 01 byte is likely to be the separator between levels, but not
  necessarily, and there could be some other mechanisms used on various
  platforms.

  To handle embedded NULs, the simplest thing would be to just remove them
  before passing the string in to strxfrm().  Then they would be entirely
  ignored, which might not be what you want.  You might want them to have
  some weight at the tertiary level, for example.  It also causes problems
  because strxfrm is very context-sensitive.  The locale definition can
  define weights for specific sequences of any length (and the weights can
  be multi-byte), and by removing a NUL, two characters now become adjacent
  that weren't in the input, and they could now form one of those special
  sequences and thus throw things off.

  Another way to handle NULs, one that seemingly ignores them but actually
  doesn't, is the mechanism in use prior to this commit.  The input string
  is split at the NULs, and the substrings are independently passed to
  strxfrm, and the results concatenated together.  This doesn't work either.
  In our example 'A B C 00', suppose B is a NUL and should have some weight
  at the tertiary level.  What we want is:

      A¹ C¹ 01 A² C² 01 A³ B³ C³ 00

  But that's not at all what you get.  Instead it is:

      A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00

  The primary weight of C comes immediately after the tertiary weight of A,
  but more importantly, a NUL, instead of being ignored at the primary
  levels, is significant at all levels, so that "a\0c" would sort before
  "ab".

  Still another possibility is to replace the NUL with some other character
  before passing it to strxfrm.  That was my original plan: to replace each
  NUL with the character that this code determines has the lowest collation
  order for the current locale.  On strings that don't contain that
  character, the results would be as good as it gets for that locale.  That
  character is likely to be ignored at higher weight levels, but have some
  small non-ignored weight at the lowest ones.  And hopefully the character
  would rarely be encountered in practice.  When it does happen, it and NUL
  would sort identically; hardly the end of the world.  If the entire
  strings sorted identically, the NUL-containing one would come out before
  the other one, since the original Perl strings are used as a tie breaker.

  However, testing showed a problem with this.  If that other character is
  part of a sequence that has special weighting, the results won't be
  correct.  With gcc, U+00B4 ACUTE ACCENT is the lowest collating character
  in many UTF-8 locales.  It combines in Romanian and Vietnamese with some
  other characters to change weights, and hence changing NULs into U+B4
  screws things up.

  What I have finally come to is a modification of this final approach,
  where the possible NUL replacements are limited to just characters that
  are controls in the locale.  NULs are replaced by the lowest collating
  control.  It would really be a defective locale if this control combined
  with some other character to form a special sequence.  Often the character
  will be a 01, START OF HEADING.  In the very unlikely case that there are
  absolutely no controls in the locale, 01 is used, because we have to
  replace it with something.

  The code added by this commit is mostly utf8-ready.  A few commits from
  now will make Perl properly work with UTF-8 (if the platform supports it).
  But until that time, this isn't a full implementation; it only looks for
  the lowest-sorting control that is invariant, where the UTF8ness doesn't
  matter.  The added tests are marked as TODO until then.

* Keep track of if collation locale is UTF-8 or not (Karl Williamson, 2016-05-24; 1 file, -0/+1)

  This will be used in future commits.

* Add locale mutex (Karl Williamson, 2016-04-09; 1 file, -0/+2)

  This adds a new mutex for use in the next commit, for use with locale
  handling.