| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
| |
The central doc change is in perlsec.pod. This not only explains
that you can build a perl that doesn't support taint,
but shows how you can check whether your perl supports taint or not.
The other doc changes are mainly to note that taint might not
be supported, and to refer the reader to perlsec for more details.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
For: https://github.com/Perl/perl5/pull/18201
Committer: Samanta Navarro is now a Perl author.
To keep 'make test_porting' happy: Increment $VERSION in several files.
Regenerate uconfig.h via './perl -Ilib regen/uconfig_h.pl'.
|
|
|
|
|
| |
Mostly in comments and docs, but some in diagnostic messages and one
case of 'or die die'.
|
|
|
|
|
| |
(Committer changed author address to the already-known @cpan one, so
this would pass porting tests.)
|
| |
|
| |
|
|
|
|
|
|
| |
GitHub issue tracker
The perlbug utility and perlbug@perl.org should no longer be used to submit bug reports or patches.
|
| |
|
| |
|
|
|
|
| |
Not by name.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
| |
These thread-safe functions are not normally used on unthreaded builds,
retaining the use of the library functions that have long been used.
But, it is now possible to tell Configure to use them on unthreaded
builds.
|
|
|
|
| |
Using Z<> instead of NBSP improves the pod spacing.
|
| |
|
|
|
|
|
|
| |
This (large) commit allows locales to be used in threaded perls on
platforms that support it. This includes recent Windows and Posix 2008
ones.
|
| |
|
| |
|
|
|
|
|
| |
This automatically fixes the bug where it always returned a dot for the
decimal point character.
|
| |
|
| |
|
|
|
|
|
| |
Some locales aren't compatible with Perl. Note the potential bad
consequences of using them.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On some platforms, the libc strxfrm() works reasonably well on UTF-8
locales, giving a default collation ordering. It will assume that every
string passed to it is in UTF-8. This commit changes Perl to make sure
that strxfrm's expectations are met.
Likewise under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8
string. And this commit makes sure of that as well.
So, simply meeting strxfrm's expectations allows Perl to start
supporting default collation in UTF-8 locales, and fixes it to work on
single-byte locales with UTF-8 input. (Unicode::Collate provides
tailorable functionality and is portable to platforms where strxfrm
isn't as intelligent, but is a much more heavy-weight solution that may
not be needed for particular applications.)
There is a problem in non-UTF-8 locales if the passed string contains
code points representable only in UTF-8. This commit causes them to be
changed, before being passed to strxfrm, into the highest collating
character in the locale that doesn't require UTF-8. They then will sort
the same as that character, which means after all other characters in
the locale but that one. In strings that don't have that character,
this will generally provide exactly correct operation. There still is a
problem, if that character, in the given locale, combines with adjacent
characters to form a specially weighted sequence. Then, the change of
these above-255 code points into that character can skew the results.
See the commit message for 6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e for
more on this. But it is really an illegal situation to have above-255
code points in a single-byte locale, so this behavior is a reasonable
degradation when given illegal input. If two transformed strings
compare exactly equal, Perl already uses the un-transformed versions to
break ties, and there, these faked-up strings will collate so the
above-255 code points sort after everything else, and in code point
order amongst themselves.
|
|
|
|
|
| |
Two html anchors in this pod were identical, which isn't a problem
unless you try to link to one of them, as the next commit does
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Every time a new collation locale is set, two coefficients are calculated
that are used in predicting how much space is needed in the
transformation of a string by strxfrm(). The transformed string is
roughly linear with the the length of the input string, so we are
calcaulating 'm' and 'b' such that
transformed_length = m * input_length + b
Space is allocated based on this prediction. If it is too small, the
strxfrm() will fail, and we will have to increase the allotted amount
and try again. It's better to get the prediction right to avoid
multiple, expensive strxfrm() calls.
Prior to this commit, the calculation was not rigorous, and failed on
some platforms that don't have a fully conforming strxfrm().
This commit changes to not panic if a locale has an apparent defective
collation, but instead silently change to use C-locale collation. It
could be argued that a warning should additionally be raised.
This commit fixes [perl #121734].
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
One of the problems in implementing Perl is that the C library routines
forbid embedded NUL characters, which Perl accepts. This is true for
the case of strxfrm() which handles collation under locale.
The best solution as far as functionality goes, would be for Perl to
write its own strxfrm replacement which would handle the specific needs
of Perl. But that is not going to happen because of the huge complexity
in handling it across many platforms. We would have to know the
location and format of the locale definition files for every such
platform. Some might follow POSIX guidelines, some might not.
strxfrm creates a transformation of its input into a new string
consisting of weight bytes. In the typical but general case, a 3
character NUL-terminated input string 'A B C 00' (spaces added for
readability) gets transformed into something like:
A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00
where the superscripted characters are weights for the corresponding
input characters. Superscript 1 represents (essentially) the primary
sorting key; 2, the secondary, etc, for as many levels as the locale
definition gives. The 01 byte is likely to be the separator between
levels, but not necessarily, and there could be some other mechanisms
used on various platforms.
To handle embedded NULs, the simplest thing would be to just remove them
before passing in to strxfrm(). Then they would be entirely ignored,
which might not be what you want. You might want them to have some
weight at the tertiary level, for example. It also causes problems
because strxfrm is very context sensitive. The locale definition can
define weights for specific sequences of any length (and the weights can
be multi-byte), and by removing a NUL, two characters now become
adjacent that weren't in the input, and they could now form one of those
special sequences and thus throw things off.
Another way to handle NULs, that seemingly ignores them, but actually
doesn't, is the mechanism in use prior to this commit. The input string
is split at the NULs, and the substrings are independently passed to
strxfrm, and the results concatenated together. This doesn't work
either. In our example 'A B C 00', suppose B is a NUL, and should have
some weight at the tertiary level. What we want is:
A¹ C¹ 01 A² C² 01 A³ B³ C³ 00
But that's not at all what you get. Instead it is:
A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00
The primary weight of C comes immediately after the teriary weight of A,
but more importantly, a NUL, instead of being ignored at the primary
levels, is significant at all levels, so that "a\0c" would sort before
"ab".
Still another possibility is to replace the NUL with some other
character before passing it to strxfrm. That was my original plan, to
replace each NUL with the character that this code determines has the
lowest collation order for the current locale. On strings that don't
contain that character, the results would be as good as it gets for that
locale. That character is likely to be ignored at higher weight levels,
but have some small non-ignored weight at the lowest ones. And
hopefully the character would rarely be encountered in practice. When
it does happen, it and NUL would sort identically; hardly the end of the
world. If the entire strings sorted identically, the NUL-containing one
would come out before the other one, since the original Perl strings are
used as a tie breaker. However, testing showed a problem with this. If
that other character is part of a sequence that has special weighting,
the results won't be correct. With gcc, U+00B4 ACUTE ACCENT is the
lowest collating character in many UTF-8 locales. It combines in
Romanian and Vietnamese with some other characters to change weights,
and hence changing NULs into U+B4 screws things up.
What I finally have come to is to do is a modification of this final
approach, where the possible NUL replacements are limited to just
characters that are controls in the locale. NULs are replaced by the
lowest collating control. It would really be a defective locale if this
control combined with some other character to form a special sequence.
Often the character will be a 01, START OF HEADING. In the very
unlikely case that there are absolutely no controls in the locale, 01 is
used, because we have to replace it with something.
The code added by this commit is mostly utf8-ready. A few commits from
now will make Perl properly work with UTF-8 (if the platform supports
it). But until that time, this isn't a full implementation; it only
looks for the lowest-sorting control that is invariant, where the
the UTF8ness doesn't matter. The added tests are marked as TODO until
then.
|
|
|
|
| |
And add a TODO test, because this shortly will be improved upon
|
|
|
|
| |
We say something here that is no longer true; update it.
|
|
|
|
|
| |
This comes from our increased understanding of their perils, given
ticket #127708
|
| |
|
|
|
|
| |
Commit modifies 4 of 5 files in patch submitted by author in RT #124335.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some questions and loose ends:
XXX gv.c:S_gv_magicalize - why are we using SSize_t for paren?
XXX mg.c:Perl_magic_set - need appopriate error handling for $)
XXX regcomp.c:S_reg - need to check if we do the right thing if parno
was not grokked
Perl_get_debug_opts should probably return something unsigned; not sure
if that's something we can change.
|
|
|
|
| |
Extracted from patch submitted by Lajos Veres in RT #123693.
|
| |
|
|
|
|
|
|
|
|
|
| |
See http://nntp.perl.org/group/perl.perl5.porters/211909
Something is quite likely wrong with the logic if say in a Greek locale,
Unicode characters (especially Greek ones) are encountered. The same
character will be represented by two different code points. This
warning alerts the user to this undesirable state of affairs.
|
| |
|
|
|
|
|
| |
This reverts commit 1244bd171b8d1fd4b6179e537f7b95c38bd8f099,
thus reinstating commit 3d3a881c1b0eb9c855d257a2eea1f72666e30fbc.
|
|
|
|
|
|
|
| |
This reverts commit 3d3a881c1b0eb9c855d257a2eea1f72666e30fbc.
Win32 with a 1252 code page was failing blead. Revert until I have time
to look at it.
|
| |
|
|
|
|
|
|
|
|
|
| |
Perl only supports single-byte locales (except for UTF-8 ones), and has
poor support for 7-bit locales that aren't supersets of ASCII (these
should be exceedingly rare these days).
This commit raises warnings in the new locale warning category when
such a locale is entered.
|
| |
|