| Commit message | Author | Age | Files | Lines |
| |
I am starting to write a Unicode::Private_Use module which will allow
one to specify the Unicode properties of private use code points, thus
making them actually useful. This commit adds a hook to regcomp.c to
accommodate this module. The changes are pretty minimal. This way we
don't have to wait another release cycle to get it out there.
I don't want to document this interface until it's proven.
| |
The MY_CXT subsystem allows per-thread pseudo-static data storage.
Part of the implementation for this involves each XS module being
assigned a unique index in its my_cxt_index static var when first
loaded.
Because PERL_GLOBAL_STRUCT bans any static vars, under those builds
there is instead a table which maps the MY_CXT_KEY identifying string to
an index.
Unfortunately, this table was allocated per-interpreter rather than
globally, meaning if multiple threads tried to load the same XS module,
crashes could ensue.
This manifested itself in failures in
ext/XS-APItest/t/keyword_plugin_threads.t
The fix is relatively straightforward: allocate PL_my_cxt_keys globally
rather than per-interpreter.
Also record the size of this struct in a new var, PL_my_cxt_keys_size,
rather than doing double duty on PL_my_cxt_size.
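For context, a minimal sketch of the MY_CXT pattern as an XS module's C
code uses it (macro names are from perlxs; the struct contents here are
hypothetical):

    #define MY_CXT_KEY "My::Module::_guts"

    typedef struct {
        int call_count;      /* hypothetical per-thread payload */
    } my_cxt_t;

    START_MY_CXT             /* on default builds, declares the static
                                my_cxt_index that the key table in this
                                commit substitutes for */

    /* in the module's BOOT code: */
    MY_CXT_INIT;             /* allocates this thread's copy and, on
                                first load, assigns the module's index */

    /* in any XSUB needing the data: */
    dMY_CXT;                 /* looks up this interpreter's copy */
    MY_CXT.call_count++;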
| |
Fix the various Perl_PerlSock_dup2_cloexec() type functions so that
t/porting/libperl.t passes under -DPERL_GLOBAL_STRUCT_PRIVATE builds.
In these builds it is forbidden to have any static variables, but each
of these functions (via convoluted macros) has a static var called
'strategy' which records, for each function, whether a run-time probe
has been done to determine the best way of achieving close-on-exec
functionality, and the result.
Replace them all with 'global' vars: PL_strategy_dup2 etc.
NB these vars aren't thread-safe but it doesn't really matter, as the
worst that can happen is for a redundant probe or two to be done before
a suitable "don't probe any more" value is written to the var and seen
by all the threads.
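A rough sketch of the probe-once idiom involved (the strategy values and
fallback here are illustrative, not perl's actual code):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Before: each function had its own 'static int strategy'.
     * After: a global such as PL_strategy_dup2, shared by all threads.
     * The race is benign: at worst a few threads probe redundantly
     * before everyone sees the settled value. */
    int PL_strategy_dup2 = 0;            /* 0 = not yet probed */

    int dup2_cloexec_sketch(int oldfd, int newfd) {
    #ifdef F_DUP2FD_CLOEXEC
        if (PL_strategy_dup2 != 2) {     /* try the atomic way first */
            int fd = fcntl(oldfd, F_DUP2FD_CLOEXEC, newfd);
            if (fd >= 0 || errno != EINVAL) {
                PL_strategy_dup2 = 1;    /* the kernel supports it */
                return fd;
            }
            PL_strategy_dup2 = 2;        /* it doesn't; don't retry */
        }
    #endif
        if (dup2(oldfd, newfd) < 0)      /* non-atomic fallback */
            return -1;
        return fcntl(newfd, F_SETFD, FD_CLOEXEC) == -1 ? -1 : newfd;
    }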
| |
A global hash has to be specially handled. The keys can't be shared,
and all the SVs stored into it must belong to its own thread. This
commit adds the hash, its initialization, and macros for context change,
but doesn't yet use them. The code to deal with this is entirely
confined to regcomp.c.
| |
This will be used in future commits.
| |
It is currently always set to false; that will change later in this
series of commits.
| |
This will be used in a future commit.
| |
The inversion list this refers to now includes the Latin 1 range, so the
name was misleading.
| |
This variable's name was out-of-date and misleading. It is the name of
an inversion list that contains all the code points in the current
version of Unicode that participate in any way in a /i type of fold.
| |
This table contains all the code points that are in any multi-character
fold (not the folded-from character, but what that character folds to).
It will be used in a future commit.
| |
These variables are constant, once initialized, through the life of a
program, so having them be per-instance is a waste of time and space.
| |
Under /a pattern matching, the matches of the [:posix:] classes are
restricted to the ASCII range. Previously, in a time/space trade-off
that favored space, we created the list of matching characters at
pattern compilation time by ANDing the full-range Posix class with the
set of ASCII characters.
But now the tables for just the ASCII-range classes are generated
anyway, so there's no need for that compile-time intersection.
This slightly simplifies the code.
| |
This commit switches to using the C data structures generated by the
previous commit to compute which characters fold to a given one. This
is used to find out what things should match under /i.
This avoids the expensive start-up cost of switching to the perl-level
utf8_heavy.pl, loading a file from disk, and constructing a hash from
it.
| |
These were for when some of the Posix character classes were implemented
as swashes, which is no longer the case, so these can be removed.
| |
This commit makes the inversion lists for parsing character names
global instead of interpreter-level, so they can be initialized once per
process, and no copies are created upon new thread instantiation. More
importantly, this is another instance where utf8_heavy.pl no longer
needs to be loaded and the definition files read from disk.
| |
These are now constant through the life of the program, so don't need to
be duplicated at each new thread instantiation.
| |
Prior to this commit, if a program wanted to compute the case-change of
a character above 0xFF, the C code would switch to perl, load
lib/utf8_heavy.pl, read another file from disk, and create a hash.
Future references would use the hash, but the start-up cost is quite
large. There are five case-change types: uc, lc, tc, fc, and simple
fc. Only the first one encountered requires loading utf8_heavy.pl, but
each requires switching to it and reading the appropriate file from
disk.
This commit changes these functions to use compiled-in C data structures
(inversion maps) to represent the data. To look something up requires a
binary search instead of a hash lookup.
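To sketch what such a lookup involves (a simplification; perl's real
inversion-map structures differ in detail):

    #include <stddef.h>

    typedef struct {
        unsigned int start;  /* first code point of this range */
        int          delta;  /* added to map a code point; 0 = self */
    } invmap_entry;

    /* Binary-search for the range containing cp, then apply that
     * range's mapping.  Assumes cp >= map[0].start. */
    static unsigned int
    invmap_lookup(const invmap_entry *map, size_t len, unsigned int cp)
    {
        size_t lo = 0, hi = len;
        while (lo + 1 < hi) {            /* find last start <= cp */
            size_t mid = (lo + hi) / 2;
            if (map[mid].start <= cp)
                lo = mid;
            else
                hi = mid;
        }
        return cp + (unsigned int)map[lo].delta;
    }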
An individual hash lookup tends to be faster than a binary search, but
the differences are small for small sizes. I did some benchmarking some
years ago (commit message 87367d5f9dc9bbf7db1a6cf87820cea76571bf1a), and
the results were that for fewer than 512 entries, the binary search was
just as fast as a hash, if not actually faster. Now, I've done some
more benchmarks on blead, using the tool benchmark.pl, which wasn't
available back then. The results below indicate that the differences
are minimal up through 2047 entries, which all Unicode properties are
well within.
A hash, PL_foldclosures, is still constructed at runtime for the case of
regular expression /i matching, and this could be generated at Perl
compile time, as a further enhancement for later. But reading a file
from disk is no longer required to do this.
======================= benchmarking results =======================
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
_m branch predict miss
_m1 level 1 cache miss
_mm last cache (e.g. L3) miss
- indeterminate percentage (e.g. 1/0)
The numbers represent raw counts per loop iteration.
"\x{10000}" =~ qr/\p{CWKCF}/"
           swash   invlist   Ratio %
           fetch    search
          ------   -------   -------
Ir        2259.0    2264.0      99.8
Dr         665.0     664.0     100.2
Dw         406.0     404.0     100.5
COND       406.0     405.0     100.2
IND         17.0      15.0     113.3
COND_m       8.0       8.0     100.0
IND_m        4.0       4.0     100.0
Ir_m1        8.9      17.0      52.4
Dr_m1        4.5       3.4     132.4
Dw_m1        1.9       1.2     158.3
Ir_mm        0.0       0.0     100.0
Dr_mm        0.0       0.0     100.0
Dw_mm        0.0       0.0     100.0
These were constructed by using the file whose contents are below, which
uses the property in Unicode that currently has the largest number of
entries in its inversion list, > 1600. The test was run on blead -O2,
no debugging, no threads. Then the cut-off boundary was changed from
512 to 2047 for when we use a hash vs an inversion list, and the test
run again. This yields the difference between a hash fetch and an
inversion list binary search.
===================== The benchmark file is below ===============
no warnings 'once';
my @benchmarks;
push @benchmarks, 'swash' => {
desc => '"\x{10000}" =~ qr/\p{CWKCF}/"',
setup => 'no warnings "once"; my $re = qr/\p{CWKCF}/; my $a =
"\x{10000}";',
code => '$a =~ $re;',
};
\@benchmarks;
| |
These structures are read-only, use const C strings, and are truly
global, so no need to have them be interpreter level. This saves
duplicating and freeing them as threads come and go.
In doing this, I noticed that not every one was properly being
copied/deallocated, so this fixes some potential unreported bugs and
leaks.
| |
This (large) commit allows locales to be used in threaded perls on
platforms that support it, which include recent Windows and POSIX 2008
systems.
| |
It is possible for operations on threaded perls which don't 'use locale'
to still change the locale. This happens when calling
POSIX::localeconv() and I18N::Langinfo(), and, in earlier perls, it can
happen for other operations when perl has been initialized with an
environment that leaves the various locale categories without a uniform
locale.
This commit wraps the areas where the locale for this category must
predictably be in one state or the other in a critical section that
another thread can't interrupt to change it. This uses a separate
mutex, so that only these particular operations are held up.
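The shape of the critical section, sketched with a plain POSIX mutex
(perl's actual macros and variable names differ):

    #include <locale.h>
    #include <pthread.h>
    #include <string.h>

    static pthread_mutex_t lc_numeric_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Swap to the underlying locale, read, and swap back, without
     * another thread being able to change LC_NUMERIC mid-flight. */
    static char *
    localeconv_radix_sketch(const char *underlying)
    {
        char *radix;
        pthread_mutex_lock(&lc_numeric_mutex);
        setlocale(LC_NUMERIC, underlying);
        radix = strdup(localeconv()->decimal_point);
        setlocale(LC_NUMERIC, "C");
        pthread_mutex_unlock(&lc_numeric_mutex);
        return radix;                    /* caller frees */
    }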
| |
khw could not find any modules on CPAN that correctly use the C library
function setlocale(). (The very few that do try misuse the return
value, so they are broken.) This analysis does not include modules that
call non-Perl libraries that may call setlocale().
And a future commit will render the setlocale() function useless in
some configurations on some platforms.
So this commit adds Perl_setlocale() for XS code to call. It is always
effective, but it should not be used to alter the locale except on
platforms where the predefined variable ${^SAFE_LOCALES} evaluates to
1.
This function is also what POSIX::setlocale() calls to do the real work.
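A sketch of XS usage (the locale name is illustrative):

    #include <locale.h>          /* for the LC_* categories */

    static void
    setlocale_demo(void)
    {
        /* Querying the current locale is always safe: */
        const char *cur = Perl_setlocale(LC_NUMERIC, NULL);

        /* Changing it should be confined to platforms where
         * ${^SAFE_LOCALES} is 1: */
        if (cur && strNE(cur, "fr_FR.UTF-8"))
            Perl_setlocale(LC_ALL, "fr_FR.UTF-8");
    }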
| |
On systems that have the POSIX 2008 operations, including
nl_langinfo_l(), this commit lets them avoid actually changing the
locale when determining what the decimal point character is.
The locale may still have to change during the printing/reading of
numbers, but eventually we can use sprintf_l(), if available, to avoid
that too.
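Roughly, the POSIX 2008 interfaces allow the query without any global
locale change; a sketch:

    #include <langinfo.h>
    #include <locale.h>

    /* Get the radix character of a named locale without touching the
     * program's (or even this thread's) current locale. */
    static char
    radix_of_sketch(const char *locale_name)
    {
        char radix = '.';
        locale_t loc = newlocale(LC_NUMERIC_MASK, locale_name,
                                 (locale_t) 0);
        if (loc != (locale_t) 0) {
            radix = *nl_langinfo_l(RADIXCHAR, loc);
            freelocale(loc);
        }
        return radix;
    }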
| |
Some locales are UTF-8, some are not. Knowledge of this is needed in
various circumstances. This commit saves the results of the last
several lookups so they don't have to be recalculated each time.
The full generality of POSIX locales is such that you can have error
messages be displayed in one locale, say Spanish, while other things are
in French. To accommodate this generality, the program can loop through
all the locale categories, finding the UTF8ness of the locale each points
to. However, in almost all instances, people are going to be in either
French or in Spanish, and not in some combination. Suppose it is a
French UTF-8 locale for all categories. This new cache will know that
the French locale is UTF-8, and the queries for all but the first
category can return that immediately.
This simple cache avoids the overhead of hashes.
This also fixes a bug I realized exists in threaded perls, but haven't
reproduced. We do not support locales in such perls, and the user must
not change the locale or 'use locale'. But perl itself could change the
locale behind the scenes, leading to segfaults or incorrect results.
One such instance is the determination of UTF8ness. But this could
only happen if the full generality of locales is used, so that the
categories are not all in the same locale. That can only occur (if the
user doesn't change locales) when the environment at startup puts the
categories in such a state. This commit
fixes this potential bug by caching the UTF8ness of each category at
startup, before any threads are instantiated, and so checking for it
later just looks it up in the cache, without perl changing the locale.
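A sketch of the kind of small cache meant (the size and names here are
invented; perl's real one differs in detail):

    #include <stdbool.h>
    #include <string.h>

    #define UTF8NESS_SLOTS 8             /* invented size */

    static struct {
        char name[64];
        bool is_utf8;
    } utf8ness_cache[UTF8NESS_SLOTS];
    static int utf8ness_used = 0;

    /* A linear scan of a handful of entries beats a hash here, since
     * nearly every query repeats one of the last few locales seen. */
    static bool
    locale_is_utf8(const char *name, bool (*compute)(const char *))
    {
        int i;
        for (i = 0; i < utf8ness_used; i++)
            if (strcmp(utf8ness_cache[i].name, name) == 0)
                return utf8ness_cache[i].is_utf8;

        bool result = compute(name);     /* the expensive determination */
        i = (utf8ness_used < UTF8NESS_SLOTS) ? utf8ness_used++ : 0;
        strncpy(utf8ness_cache[i].name, name,
                sizeof utf8ness_cache[i].name - 1);
        utf8ness_cache[i].is_utf8 = result;
        return result;
    }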
| |
The LC_NUMERIC locale category is generally kept so that the decimal
point (radix) is a dot. For some (mostly) output purposes, it needs to
be swapped into the program's current underlying locale so that a
non-dot can be printed.
This commit changes things so that if the current underlying locale uses
a decimal point, the swap doesn't happen, as it's not needed.
| |
This somehow became unused or never got used; I didn't do the research.
| |
As explained in the docs, this helps detect spoofing attacks.
| |
Bits of exec code were putting the constructed commands into globals
PL_Argv and PL_Cmd, which could then be clobbered by reentrancy.
These are only global in order to manage their freeing, but that's
better managed by using the scope stack. So replace them with automatic
variables, with ENTER/SAVEFREEPV/LEAVE to free the memory. Also copy
the strings acquired from SVs, to avoid magic clobbering the buffers of
SVs already read. Fixes [perl #129888].
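The idiom, roughly, in simplified XSUB-style code (items and ST() are
the usual XS conveniences):

    ENTER;
    {
        /* automatic, not a global: safe against reentrancy */
        char **argv;
        int i;
        Newx(argv, items + 1, char *);
        SAVEFREEPV(argv);            /* freed at LEAVE, even via croak() */

        for (i = 0; i < items; i++) {
            /* copy the PV so later magic can't clobber what we read */
            char *copy = savepv(SvPV_nolen(ST(i)));
            SAVEFREEPV(copy);
            argv[i] = copy;
        }
        argv[items] = NULL;
        /* ... execv(argv[0], argv) or similar ... */
    }
    LEAVE;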
| |
The real purpose of this internal variable is to give the name of the
locale that is the underlying one for the C program. Various macros
already indicate that. This furthers the process.
| |
and use it to initialize hash randomization and to inoculate against
quadratic behaviour in pp_sort
| |
This is designed to generally replace nl_langinfo() in XS code. It is
thread-safer, hides the quirks of perl's LC_NUMERIC handling, and can be
used on systems lacking nl_langinfo().
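Hypothetical usage from XS code (RADIXCHAR and THOUSEP are standard
langinfo.h items):

    #include <langinfo.h>        /* nl_item values such as RADIXCHAR */

    static void
    langinfo_demo(void)
    {
        /* No toggling of LC_NUMERIC first, unlike raw nl_langinfo() */
        const char *radix = Perl_langinfo(RADIXCHAR);
        const char *thou  = Perl_langinfo(THOUSEP);
        PerlIO_printf(PerlIO_stdout(), "radix=%s thousands=%s\n",
                      radix, thou);
    }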
| |
Ensure that PL_sv_yes, PL_sv_undef, PL_sv_no and PL_sv_zero are allocated
adjacently in memory.
This allows the SvIMMORTAL() test to be more efficient, and will (in the
next commit) allow SvTRUE() to be more efficient.
In MULTIPLICITY builds the constraint is already met by virtue of them
being adjacent items in the interpreter struct. For non-MULTIPLICITY
builds, they were just 4 global vars with no guarantees of where
they would be allocated. For this case, the four vars are deleted
as globals and replaced with a new global array, PL_sv_immortals[4],
with
#define PL_sv_yes (PL_sv_immortals[0])
etc in their place.
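Sketched for the non-MULTIPLICITY case (the array order shown is an
assumption):

    /* One array guarantees adjacency: */
    SV PL_sv_immortals[4];
    #define PL_sv_yes   (PL_sv_immortals[0])
    #define PL_sv_undef (PL_sv_immortals[1])   /* assumed order */
    #define PL_sv_no    (PL_sv_immortals[2])
    #define PL_sv_zero  (PL_sv_immortals[3])

    /* ...which turns four pointer comparisons into one range check: */
    #define SvIMMORTAL(sv) \
        ((SV *)(sv) >= &PL_sv_immortals[0] && \
         (SV *)(sv) <  &PL_sv_immortals[4])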
| |
It's like PL_sv_no, except that its string value is "0" rather than "".
It can be used, for example, where a pp function wants to push a zero
return value on the stack. The next commit will start to use it.
Also update the SvIMMORTAL() macro to be more efficient: it now checks whether
the SV's address is in a range rather than individually checking against
&PL_sv_undef, &PL_sv_no etc.
| |
This reverts commit e6a172f358c0f48c4b744dbd5e9ef6ff0b4ff289,
which was a revert of a3bf60fbb1f05cd2c69d4ff0a2ef99537afdaba7.
Add new hashing and "hash with state" infrastructure
This adds support for three new hash functions: StadtX, Zaphod32 and SBOX,
and reworks some of our hash internals infrastructure to do so.
SBOX is special in that it is designed to be used in conjunction with any
other hash function for hashing short strings very efficiently and very
securely. It features compile-time options on how much memory and startup
time are traded off to control the length of keys that SBOX hashes.
This also adds support for caching the hash values of single-byte characters,
which can be used in conjunction with any other hash, including SBOX, although
SBOX itself is as fast as the lookup cache, so typically you wouldn't use both
at the same time.
This also *removes* support for Jenkins One-At-A-Time. It has served us
well, but its day is done.
This patch adds three new files: zaphod32_hash.h, stadtx_hash.h,
sbox32_hash.h
| |
Give Perl_nextargv its own statbuf and pass a pointer to it into
Perl_do_open_raw and thence S_openn_cleanup when needed.
Also reduce the scope of the existing statbuf in Perl_nextargv to make
it clear it's distinct from the one populated by do_open_raw.
Fix perldelta entry for PL_statbuf removal
| |
This reverts commit a3bf60fbb1f05cd2c69d4ff0a2ef99537afdaba7.
Accidentally pushed work pending unfreeze.
| |
This adds support for three new hash functions: StadtX, Zaphod32 and SBOX,
and reworks some of our hash internals infrastructure to do so.
SBOX is special in that it is designed to be used in conjunction with any
other hash function for hashing short strings very efficiently and very
securely. It features compile-time options on how much memory and startup
time are traded off to control the length of keys that SBOX hashes.
This also adds support for caching the hash values of single-byte characters,
which can be used in conjunction with any other hash, including SBOX, although
SBOX itself is as fast as the lookup cache, so typically you wouldn't use both
at the same time.
This also *removes* support for Jenkins One-At-A-Time. It has served us
well, but its day is done.
This patch adds three new files: zaphod32_hash.h, stadtx_hash.h,
sbox32_hash.h
| |
This will be used in a future commit.
| |
These macros are being replaced by a safe version; they now generate a
deprecation message at each call site upon the first use there in each
program run.
| |
This variable really holds the character that replaces any embedded NULs
when doing collation. Change the name accordingly. (Embedded NULs must
be replaced because the libc function strxfrm is used, and it operates
on C strings which have no embedded NULs.)
| |
FC didn't like my previous patch for this issue, so here is the
one he likes better, with tests etc. :-)
The basic problem is that code like /(?{ s!!! })/ can trigger
infinite recursion on the C stack (not the normal perl stack) when the
last successful pattern in scope is itself. Since the C stack
overflows, this manifests as an untrappable error/segfault, which then
kills perl.
We avoid the segfault by simply forbidding the use of the empty pattern
when it would resolve to the currently executing pattern.
I imagine with a bit of effort someone can trigger the original SEGV,
unlike my original fix which forbade use of the empty pattern in a
regex code block. So if someone actually reports such a bug we might
have to revert to the older approach of prohibiting this.
| |
Because if we're running under a Unix shell, the path separator is
likely to meet the expectations of Unix shell scripts better if it's
the Unix ':' rather than the VMS '|'. There is no change when
running under DCL.
| |
We have an interpreter variable using memory, PL_maxo, which is
defined to be the same as MAXO, a #defined constant. As far as I can
tell, it is never used in lvalue context, in core or on CPAN, except
for the initialisation in intrpvar.h.
It can simply be removed and replaced with a macro defined as
equivalent to MAXO.
It was added in this commit:
commit 84ea024ac9cdf20f21223e686dddea82d5eceb4f
Author: Perl 5 Porters <perl5-porters.nicoh.com>
Date: Tue Jan 2 23:21:55 1996 +0000
perl 5.002beta1h patch: perl.h
5.002beta1 attempted some memory optimizations, but unfortunately
they can result in a memory leak problem. This can be
avoided by #define STRANGE_MALLOC. I do that here until
consensus is reached on a better strategy for handling the
memory optimizations.
Include maxo for the maximum number of operations (needed
for the Safe extension).
But apparently it is not needed for the Safe extension (tests pass
without it).
| |
This commit is the first step in making locale handling thread-safe.
[perl #127708] was solved for 5.24 by adding a mutex in this function.
That bug was caused by the code changing the locale even if the calling
program is not consciously using locales.
Posix 2008 introduced thread-safe locale functions. This commit changes
this function to use them if the perl is threaded and the platform has
them available. This means that the mutex is avoided on modern
platforms.
It restructures the function to return a mortal copy of the error
message. This is a step towards making the function completely thread
safe. Right now, as documented, if you do 'use locale', locale handling
isn't thread-safe.
A global C locale object is created and used here if necessary. It is
destroyed at the end of the program.
Note that some platforms have a strerror_r(), which is automatically
used instead of strerror() if available. It differs from plain
strerror() by taking a buffer in which to place the returned string, so
the return does not point to internal static storage. One could test
for the existence of this and avoid the mortal copy.
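A sketch of the POSIX 2008 approach (simplified; perl's real function
also handles 'use locale' and the unthreaded cases):

    #include <locale.h>
    #include <string.h>

    static locale_t C_locale_obj;   /* created once at startup with
                                       newlocale(LC_ALL_MASK, "C",
                                       (locale_t) 0); freed at exit */

    static const char *
    my_strerror_sketch(int errnum)
    {
        /* uselocale() changes only this thread's view: no mutex */
        locale_t old = uselocale(C_locale_obj);
        const char *msg = strerror(errnum);
        uselocale(old);
        return msg;    /* may point to static storage, hence the
                          mortal copy that perl actually returns */
    }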
| |
On some platforms, the libc strxfrm() works reasonably well on UTF-8
locales, giving a default collation ordering. It will assume that every
string passed to it is in UTF-8. This commit changes Perl to make sure
that strxfrm's expectations are met.
Likewise, under a non-UTF-8 locale strxfrm expects a non-UTF-8 string,
and this commit makes sure of that as well.
So, simply meeting strxfrm's expectations allows Perl to start
supporting default collation in UTF-8 locales, and fixes it to work on
single-byte locales with UTF-8 input. (Unicode::Collate provides
tailorable functionality and is portable to platforms where strxfrm
isn't as intelligent, but is a much more heavy-weight solution that may
not be needed for particular applications.)
There is a problem in non-UTF-8 locales if the passed string contains
code points representable only in UTF-8. This commit causes them to be
changed, before being passed to strxfrm, into the highest collating
character in the locale that doesn't require UTF-8. They then will sort
the same as that character, which means after all other characters in
the locale but that one. In strings that don't have that character,
this will generally provide exactly correct operation. There still is a
problem, if that character, in the given locale, combines with adjacent
characters to form a specially weighted sequence. Then, the change of
these above-255 code points into that character can skew the results.
See the commit message for 6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e for
more on this. But it is really an illegal situation to have above-255
code points in a single-byte locale, so this behavior is a reasonable
degradation when given illegal input. If two transformed strings
compare exactly equal, Perl already uses the un-transformed versions to
break ties, and there, these faked-up strings will collate so the
above-255 code points sort after everything else, and in code point
order amongst themselves.
| |
It's kind of guesswork deciding how big a buffer to give to strxfrm().
If you give it too small a buffer, it will fail. Prior to this commit,
the buffer size was doubled and then strxfrm() was called again, looping
until it worked, or we used too much memory.
Each time a new locale is made, we try to minimize the necessity of
doing this by calculating numbers 'm' and 'b' that can be plugged into
the equation
mx + b
where 'x' is the size of the string passed to strxfrm(). strxfrm() is
roughly linear with respect to its input's length, so this generally
works without us having to do many loops to get a large enough size.
But on many systems, strxfrm(), in failing, returns how much space you
should have given it. On such systems, we can just use that number on
the 2nd try and not have to keep guessing. This commit changes the
code to do that.
But on other systems this doesn't work. So the original method is
retained if we determine that there are problems with strxfrm(), either
from previous experience, or because using the size returned from the
first trial didn't work.
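In outline (the sanity flag and the m/b bookkeeping are simplified
here):

    #include <stdlib.h>
    #include <string.h>

    /* Predict the transformed size as m*x + b; on a miss, either trust
     * strxfrm()'s returned "needed" size (well-behaved platforms) or
     * fall back to doubling (suspect ones). */
    static char *
    mem_collxfrm_sketch(const char *s, double m, double b,
                        int xfrm_is_sane)
    {
        size_t size = (size_t)(m * strlen(s) + b) + 1;
        char *buf = malloc(size);
        while (buf) {
            size_t needed = strxfrm(buf, s, size);
            if (needed < size)           /* it fit; done */
                return buf;
            size = xfrm_is_sane ? needed + 1 : size * 2;
            char *tmp = realloc(buf, size);
            if (!tmp)
                free(buf);
            buf = tmp;
        }
        return NULL;                     /* allocation failed */
    }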
| |
One of the problems in implementing Perl is that the C library routines
forbid embedded NUL characters, which Perl accepts. This is true for
the case of strxfrm() which handles collation under locale.
The best solution, as far as functionality goes, would be for Perl to
write its own strxfrm replacement which would handle the specific needs
of Perl. But that is not going to happen because of the huge complexity
in handling it across many platforms. We would have to know the
location and format of the locale definition files for every such
platform. Some might follow POSIX guidelines, some might not.
strxfrm creates a transformation of its input into a new string
consisting of weight bytes. In the typical but general case, a 3
character NUL-terminated input string 'A B C 00' (spaces added for
readability) gets transformed into something like:
A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00
where the superscripted characters are weights for the corresponding
input characters. Superscript 1 represents (essentially) the primary
sorting key; 2, the secondary, etc, for as many levels as the locale
definition gives. The 01 byte is likely to be the separator between
levels, but not necessarily, and there could be some other mechanisms
used on various platforms.
To handle embedded NULs, the simplest thing would be to just remove them
before passing the string to strxfrm(). Then they would be entirely ignored,
which might not be what you want. You might want them to have some
weight at the tertiary level, for example. It also causes problems
because strxfrm is very context sensitive. The locale definition can
define weights for specific sequences of any length (and the weights can
be multi-byte), and by removing a NUL, two characters now become
adjacent that weren't in the input, and they could now form one of those
special sequences and thus throw things off.
Another way to handle NULs, that seemingly ignores them, but actually
doesn't, is the mechanism in use prior to this commit. The input string
is split at the NULs, and the substrings are independently passed to
strxfrm, and the results concatenated together. This doesn't work
either. In our example 'A B C 00', suppose B is a NUL, and should have
some weight at the tertiary level. What we want is:
A¹ C¹ 01 A² C² 01 A³ B³ C³ 00
But that's not at all what you get. Instead it is:
A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00
The primary weight of C comes immediately after the tertiary weight of A,
but more importantly, a NUL, instead of being ignored at the primary
level, is significant at all levels, so that "a\0c" would sort before
"ab".
Still another possibility is to replace the NUL with some other
character before passing it to strxfrm. That was my original plan, to
replace each NUL with the character that this code determines has the
lowest collation order for the current locale. On strings that don't
contain that character, the results would be as good as it gets for that
locale. That character is likely to be ignored at higher weight levels,
but have some small non-ignored weight at the lowest ones. And
hopefully the character would rarely be encountered in practice. When
it does happen, it and NUL would sort identically; hardly the end of the
world. If the entire strings sorted identically, the NUL-containing one
would come out before the other one, since the original Perl strings are
used as a tie breaker. However, testing showed a problem with this. If
that other character is part of a sequence that has special weighting,
the results won't be correct. With glibc, U+00B4 ACUTE ACCENT is the
lowest collating character in many UTF-8 locales. It combines in
Romanian and Vietnamese with some other characters to change weights,
and hence changing NULs into U+B4 screws things up.
What I finally settled on is a modification of this final approach,
where the possible NUL replacements are limited to just
characters that are controls in the locale. NULs are replaced by the
lowest collating control. It would really be a defective locale if this
control combined with some other character to form a special sequence.
Often the character will be a 01, START OF HEADING. In the very
unlikely case that there are absolutely no controls in the locale, 01 is
used, because we have to replace it with something.
The code added by this commit is mostly utf8-ready. A few commits from
now will make Perl properly work with UTF-8 (if the platform supports
it). But until that time, this isn't a full implementation; it only
looks for the lowest-sorting control that is invariant, where the
UTF8ness doesn't matter. The added tests are marked as TODO until
then.
| |
This will be used in future commits.
| |
This adds a new mutex, for use with locale handling in the next
commit.