This reverts commit e6a172f358c0f48c4b744dbd5e9ef6ff0b4ff289,
which was a revert of a3bf60fbb1f05cd2c69d4ff0a2ef99537afdaba7.
Add new hashing and "hash with state" infrastructure
This adds support for three new hash functions: StadtX, Zaphod32 and SBOX,
and reworks some of our hash internals infrastructure to do so.
SBOX is special in that it is designed to be used in conjunction with any
other hash function for hashing short strings very efficiently and very
securely. It features compile-time options controlling how much memory and
startup time are traded off against the length of keys that SBOX hashes.
This also adds support for caching the hash values of single-byte characters,
which can be used in conjunction with any other hash, including SBOX, although
SBOX itself is as fast as the lookup cache, so typically you wouldn't use both
at the same time.
This also *removes* support for Jenkins One-At-A-Time. It has served us
well, but its day is done.
This patch adds three new files: zaphod32_hash.h, stadtx_hash.h,
sbox32_hash.h
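
As a concrete illustration, here is a minimal sketch of the SBOX idea, not
the real sbox32_hash.h code (the table name and length cap are invented):
one table of seeded random words per key position, so hashing is a single
load and XOR per byte, and a one-byte key is exactly one lookup — which is
why the single-byte cache adds nothing over SBOX itself.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative SBOX-style hash: one table of seeded random words per
     * key position, up to a compile-time cap that trades memory and
     * startup time against the longest key handled. */
    #define SBOX_MAX_LEN 24              /* invented compile-time cap */
    static uint32_t sbox_table[SBOX_MAX_LEN][256];  /* seeded at startup */

    static uint32_t
    sbox_hash(const unsigned char *key, size_t len)
    {
        uint32_t hash = 0;               /* real code folds in a seed */
        size_t i;
        /* callers use the general hash when len > SBOX_MAX_LEN */
        for (i = 0; i < len; i++)
            hash ^= sbox_table[i][key[i]];
        return hash;
    }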
Give Perl_nextargv its own statbuf and pass a pointer to it into
Perl_do_open_raw and thence S_openn_cleanup when needed.
Also reduce the scope of the existing statbuf in Perl_nextargv to make
it clear it's distinct from the one populated by do_open_raw.
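
A rough sketch of the shape of this change (all names below are
illustrative, not the actual Perl_do_open_raw signature): the caller owns
the struct stat and hands its address down to whatever fills it.

    #include <stdio.h>
    #include <sys/stat.h>

    /* Sketch: the helper fills a caller-owned struct stat handed down by
     * pointer, instead of sharing one interpreter-global buffer. */
    static int
    open_and_stat(const char *path, struct stat *statbufp)
    {
        return stat(path, statbufp) == 0;
    }

    int
    main(void)
    {
        struct stat st;                  /* caller-owned, tightly scoped */
        if (open_and_stat("/etc/hosts", &st))
            printf("size: %ld\n", (long)st.st_size);
        return 0;
    }
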
Fix perldelta entry for PL_statbuf removal
This reverts commit a3bf60fbb1f05cd2c69d4ff0a2ef99537afdaba7.
Accidentally pushed work pending unfreeze.
This adds support for three new hash functions: StadtX, Zaphod32 and SBOX,
and reworks some of our hash internals infrastructure to do so.
SBOX is special in that it is designed to be used in conjunction with any
other hash function for hashing short strings very efficiently and very
securely. It features compile-time options controlling how much memory and
startup time are traded off against the length of keys that SBOX hashes.
This also adds support for caching the hash values of single-byte characters,
which can be used in conjunction with any other hash, including SBOX, although
SBOX itself is as fast as the lookup cache, so typically you wouldn't use both
at the same time.
This also *removes* support for Jenkins One-At-A-Time. It has served us
well, but its day is done.
This patch adds three new files: zaphod32_hash.h, stadtx_hash.h,
sbox32_hash.h
This will be used in a future commit.
These macros are being replaced by a safe version; they now generate a
deprecation message at each call site upon the first use there in each
program run.
This variable really means the character that replaces any embedded NULs
when doing collation. Change the name accordingly. (Embedded NULs must
be replaced because the libc function strxfrm is used, and it operates
on C strings which have no embedded NULs.)
FC didn't like my previous patch for this issue, so here is the
one he likes better. With tests and etc. :-)
The basic problem is that code like this: /(?{ s!!! })/ can trigger
infinite recursion on the C stack (not the normal perl stack) when the
last successful pattern in scope is itself. Since the C stack overflows
this manifests as an untrappable error/segfault, which then kills perl.
We avoid the segfault by simply forbidding the use of the empty pattern
when it would resolve to the currently executing pattern.
I imagine with a bit of effort someone can trigger the original SEGV,
unlike my original fix which forbade use of the empty pattern in a
regex code block. So if someone actually reports such a bug we might
have to revert to the older approach of prohibiting this.
Because if we're running under a Unix shell, the path separator is
likely to meet the expectations of Unix shell scripts better if it's
the Unix ':' rather than the VMS '|'. There is no change when
running under DCL.
We have an interpreter variable using memory, PL_maxo, which is
defined to be the same as MAXO, a #defined constant. As far as I can
tell, it is never used in lvalue context, in core or on CPAN, except
for the initialisation in intrpvar.h.
It can simply be removed and replaced with a macro defined as
equivalent to MAXO.
It was added in this commit:
commit 84ea024ac9cdf20f21223e686dddea82d5eceb4f
Author: Perl 5 Porters <perl5-porters.nicoh.com>
Date: Tue Jan 2 23:21:55 1996 +0000
perl 5.002beta1h patch: perl.h
5.002beta1 attempted some memory optimizations, but unfortunately
they can result in a memory leak problem. This can be
avoided by #define STRANGE_MALLOC. I do that here until
consensus is reached on a better strategy for handling the
memory optimizations.
Include maxo for the maximum number of operations (needed
for the Safe extension).
But apparently it is not needed for the Safe extension (tests pass
without it).
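
A minimal sketch of that replacement (the exact name and placement in the
real patch may differ):

    /* Sketch: the per-interpreter variable becomes a compile-time
     * constant, freeing its slot in the interpreter struct. */
    #define PL_maxo  MAXO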
This commit is the first step in making locale handling thread-safe.
[perl #127708] was solved for 5.24 by adding a mutex in this function.
That bug was caused by the code changing the locale even if the calling
program is not consciously using locales.
POSIX 2008 introduced thread-safe locale functions. This commit changes
this function to use them if the perl is threaded and the platform has
them available. This means that the mutex is avoided on modern
platforms.
It restructures the function to return a mortal copy of the error
message. This is a step towards making the function completely thread
safe. Right now, as documented, if you do 'use locale', locale handling
isn't thread-safe.
A global C locale object is created and used here if necessary. It is
destroyed at the end of the program.
Note that some platforms have a strerror_r(), which is automatically
used instead of strerror() if available. It differs from straight
strerror() by taking a buffer in which to place the returned string, so
the return does not point to internal static storage. One could test
for the existence of this and avoid the mortal copy.
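
A sketch of the POSIX 2008 mechanism, assuming a platform that provides
newlocale()/uselocale(); the function and variable names are illustrative,
not the actual Perl internals:

    #include <locale.h>
    #include <string.h>

    /* One global C locale object, created once (and freed with
     * freelocale() at program end), as described above. */
    static locale_t C_locale;

    static void
    init_locale_objects(void)
    {
        C_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    }

    /* uselocale() switches only the calling thread, so no mutex is
     * needed.  A copy is returned because strerror()'s buffer is
     * static (Perl returns a mortal SV rather than a strdup'd string). */
    static char *
    my_strerror(int errnum)
    {
        locale_t old = uselocale(C_locale);
        char *msg = strdup(strerror(errnum));
        uselocale(old);              /* restore the thread's locale */
        return msg;                  /* caller frees */
    }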
On some platforms, the libc strxfrm() works reasonably well on UTF-8
locales, giving a default collation ordering. It will assume that every
string passed to it is in UTF-8. This commit changes Perl to make sure
that strxfrm's expectations are met.
Likewise under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8
string. And this commit makes sure of that as well.
So, simply meeting strxfrm's expectations allows Perl to start
supporting default collation in UTF-8 locales, and fixes it to work on
single-byte locales with UTF-8 input. (Unicode::Collate provides
tailorable functionality and is portable to platforms where strxfrm
isn't as intelligent, but is a much more heavy-weight solution that may
not be needed for particular applications.)
There is a problem in non-UTF-8 locales if the passed string contains
code points representable only in UTF-8. This commit causes them to be
changed, before being passed to strxfrm, into the highest collating
character in the locale that doesn't require UTF-8. They then will sort
the same as that character, which means after all other characters in
the locale but that one. In strings that don't have that character,
this will generally provide exactly correct operation. There still is a
problem, if that character, in the given locale, combines with adjacent
characters to form a specially weighted sequence. Then, the change of
these above-255 code points into that character can skew the results.
See the commit message for 6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e for
more on this. But it is really an illegal situation to have above-255
code points in a single-byte locale, so this behavior is a reasonable
degradation when given illegal input. If two transformed strings
compare exactly equal, Perl already uses the un-transformed versions to
break ties, and there, these faked-up strings will collate so the
above-255 code points sort after everything else, and in code point
order amongst themselves.
It's kind of guesswork deciding how big a buffer to give to strxfrm().
If you give it too small a buffer, it will fail. Prior to this commit,
the buffer size was doubled and then strxfrm() was called again, looping
until it worked or we used too much memory.
Each time a new locale is made, we try to minimize the necessity of
doing this by calculating numbers 'm' and 'b' that can be plugged into
the equation
mx + b
where 'x' is the size of the string passed to strxfrm(). strxfrm() is
roughly linear with respect to its input's length, so this generally
works without us having to do many loops to get a large enough size.
But on many systems, strxfrm(), in failing, returns how much space you
should have given it. On such systems, we can just use that number on
the 2nd try and not have to keep guessing. This commit changes to do
that.
But on other systems this doesn't work. So the original method is
retained if we determine that there are problems with strxfrm(), either
from previous experience, or because using the size returned from the
first trial didn't work.
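
A sketch of the trust-the-return-value strategy, simplified from what is
described above (no m*x + b estimate, no doubling loop; the names are
invented):

    #include <stdlib.h>
    #include <string.h>

    /* C99 strxfrm() returns the needed length when the buffer is too
     * small, so the second call can be sized exactly.  Returns NULL if
     * the platform's strxfrm() breaks that promise, signalling that the
     * doubling loop is needed instead. */
    static char *
    xfrm_copy(const char *src)
    {
        size_t needed = strxfrm(NULL, src, 0);   /* ask for the size */
        char *buf = malloc(needed + 1);
        if (buf && strxfrm(buf, src, needed + 1) > needed) {
            free(buf);                           /* first answer was wrong */
            return NULL;                         /* fall back to doubling */
        }
        return buf;
    }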
One of the problems in implementing Perl is that the C library routines
forbid embedded NUL characters, which Perl accepts. This is true for
the case of strxfrm() which handles collation under locale.
The best solution as far as functionality goes, would be for Perl to
write its own strxfrm replacement which would handle the specific needs
of Perl. But that is not going to happen because of the huge complexity
in handling it across many platforms. We would have to know the
location and format of the locale definition files for every such
platform. Some might follow POSIX guidelines, some might not.
strxfrm creates a transformation of its input into a new string
consisting of weight bytes. In the typical but general case, a 3
character NUL-terminated input string 'A B C 00' (spaces added for
readability) gets transformed into something like:
A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00
where the superscripted characters are weights for the corresponding
input characters. Superscript 1 represents (essentially) the primary
sorting key; 2, the secondary, etc, for as many levels as the locale
definition gives. The 01 byte is likely to be the separator between
levels, but not necessarily, and there could be some other mechanisms
used on various platforms.
To handle embedded NULs, the simplest thing would be to just remove them
before passing in to strxfrm(). Then they would be entirely ignored,
which might not be what you want. You might want them to have some
weight at the tertiary level, for example. It also causes problems
because strxfrm is very context sensitive. The locale definition can
define weights for specific sequences of any length (and the weights can
be multi-byte), and by removing a NUL, two characters now become
adjacent that weren't in the input, and they could now form one of those
special sequences and thus throw things off.
Another way to handle NULs, that seemingly ignores them, but actually
doesn't, is the mechanism in use prior to this commit. The input string
is split at the NULs, and the substrings are independently passed to
strxfrm, and the results concatenated together. This doesn't work
either. In our example 'A B C 00', suppose B is a NUL, and should have
some weight at the tertiary level. What we want is:
A¹ C¹ 01 A² C² 01 A³ B³ C³ 00
But that's not at all what you get. Instead it is:
A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00
The primary weight of C comes immediately after the tertiary weight of A,
but more importantly, a NUL, instead of being ignored at the primary
levels, is significant at all levels, so that "a\0c" would sort before
"ab".
Still another possibility is to replace the NUL with some other
character before passing it to strxfrm. That was my original plan, to
replace each NUL with the character that this code determines has the
lowest collation order for the current locale. On strings that don't
contain that character, the results would be as good as it gets for that
locale. That character is likely to be ignored at higher weight levels,
but have some small non-ignored weight at the lowest ones. And
hopefully the character would rarely be encountered in practice. When
it does happen, it and NUL would sort identically; hardly the end of the
world. If the entire strings sorted identically, the NUL-containing one
would come out before the other one, since the original Perl strings are
used as a tie breaker. However, testing showed a problem with this. If
that other character is part of a sequence that has special weighting,
the results won't be correct. With gcc, U+00B4 ACUTE ACCENT is the
lowest collating character in many UTF-8 locales. It combines in
Romanian and Vietnamese with some other characters to change weights,
and hence changing NULs into U+B4 screws things up.
What I have finally come to do is a modification of this final
approach, where the possible NUL replacements are limited to just
characters that are controls in the locale. NULs are replaced by the
lowest collating control. It would really be a defective locale if this
control combined with some other character to form a special sequence.
Often the character will be a 01, START OF HEADING. In the very
unlikely case that there are absolutely no controls in the locale, 01 is
used, because we have to replace it with something.
The code added by this commit is mostly utf8-ready. A few commits from
now will make Perl properly work with UTF-8 (if the platform supports
it). But until that time, this isn't a full implementation; it only
looks for the lowest-sorting control that is invariant, where the
UTF8ness doesn't matter. The added tests are marked as TODO until
then.
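
A minimal sketch of the replacement step, assuming the lowest-collating
control has already been computed for the locale (the parameter name is
invented):

    #include <stddef.h>

    /* Overwrite each embedded NUL with the locale's lowest-collating
     * control before calling strxfrm().  The value is often 0x01,
     * START OF HEADING, as noted above. */
    static void
    replace_nuls(char *s, size_t len, char lowest_coll_ctrl)
    {
        size_t i;
        for (i = 0; i < len; i++)
            if (s[i] == '\0')
                s[i] = lowest_coll_ctrl;
    }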
This will be used in future commits.
This adds a new mutex, for use with locale handling in the next commit.
The regex engine, when displaying debugging info, say under -Dr, will
elide data in order to keep the output from getting too long. For
example, the number of code points in all of Unicode matched by \w is
quite large, and so when displaying a pattern that matches this, only a
limited number of them is printed, and the rest are truncated,
represented by "...".
Sometimes one wants to see more than the maximum compiled into the
engine shows. This commit creates code to read this environment
variable to override the default max lengths. This
changes the lengths for everything to the input number, even if they
have different compiled maximums in the absence of this variable.
I'm not currently documenting this variable, as I don't think it works
properly under threads, and we may want to alter the behavior in various
ways as a result of gaining experience with using it.
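
A minimal sketch of such an override; the real variable's name is not
given in this excerpt, so the one below is purely hypothetical:

    #include <stdlib.h>

    /* One env var overrides every compiled-in truncation limit.
     * "PERL_RE_DUMP_MAX" is a made-up name. */
    static int
    dump_max_len(int compiled_default)
    {
        const char *s = getenv("PERL_RE_DUMP_MAX");
        return s ? atoi(s) : compiled_default;
    }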
This adds the final Unicode boundary type previously missing from core
Perl: the LineBreak one. This feature is already available in the
Unicode::LineBreak module, but I've been told that there are portability
and some other issues with that module. What's added here is a
light-weight version that is lacking the customizable features of the
module.
This implements the default Line Breaking algorithm, but with the
customizations that Unicode is expecting everybody to add, as their
test file tests for them. In other words, this passes Unicode's fairly
extensive furnished tests, but wouldn't if it didn't include certain
customizations specified by Unicode beyond the basic algorithm.
The implementation uses a look-up table of the characters surrounding a
boundary to see if it is a suitable place to break a line. In a few
cases, context needs to be taken into account, so there is code in
addition to the lookup table to handle those.
This should meet the needs for line breaking of many applications,
without having to load the module.
The algorithm is somewhat independent of the Unicode version, just like
the other boundary types. Only if new rules are added, or existing ones
modified is there need to go in and change this code. Otherwise,
running regen/mk_invlists.pl should be sufficient when a new Unicode
release is done to keep it up-to-date, again like the other Unicode
boundary types.
Saves memory in interp struct.
These two commits:
v5.21.3-759-gff2a62e "Skip no-common-vars optimisation for aliases"
v5.21.4-210-gc997e36 "Make list assignment respect foreach aliasing"
added a run-time mechanism to detect aliased package variables,
by either "*pkg = ...," or "for $pkg (...)", and used that information
to enable the OPpASSIGN_COMMON mechanism at runtime for detecting common
elements in a list assign, e.g.
for $alias ($a, ...) {
($a,$b) = (1,$alias);
}
The previous commit but one changed the OPpASSIGN_COMMON mechanism such
that it no longer uses PL_sawalias. So this var and the mechanism for
setting it can now be removed.
This commit removes:
* the PL_sawalias variable
* the GPf_ALIASED_SV GP flag
* the SAVEt_GP_ALIASED_SV and save_aliased_sv() save type.
A previous commit changed how \X is implemented, and now we don't need
these anymore.
A function implements seeing if the space between any two characters is
a grapheme cluster break. After I wrote this, I realized that an array
lookup might be a better implementation, but the deadline for v5.22 was
too close to change it. I did see that my gcc optimized it down to
an array lookup.
This makes the implementation of \X go from being complicated to
trivial.
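
A sketch of the array-lookup formulation mentioned above; the enum and
table contents are illustrative stand-ins for the generated Unicode data,
not the real tables:

    #include <stdbool.h>

    /* Classify each character by its Grapheme_Cluster_Break property,
     * then answer break/no-break with one 2-D table load. */
    typedef enum { GCB_Other, GCB_CR, GCB_LF, GCB_Extend, GCB_COUNT } gcb_t;

    static const bool gcb_break_table[GCB_COUNT][GCB_COUNT] = {
        [GCB_CR][GCB_LF]        = false,  /* CR x LF: never break */
        [GCB_CR][GCB_Other]     = true,   /* break after CR otherwise */
        [GCB_Other][GCB_Extend] = false,  /* Extend glues to its base */
        [GCB_Other][GCB_Other]  = true,
    };

    static bool
    is_grapheme_break(gcb_t before, gcb_t after)
    {
        return gcb_break_table[before][after];
    }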
Commit 8c6180a91de91a1194f427fc639694f43a903a78 added a warning message
for when Perl determines that the underlying locale the program has just
switched into is poorly supported. At the time it was thought that this
would be an extremely rare occurrence. However, a bug in HP-UX
B.11.00/64 causes this message to be raised for the "C" locale. A
workaround was done that silenced those. However, before it got fixed,
this message would occur gobs of times executing the test suite. It was
raised even if the script is not locale-aware, so that the underlying
locale was completely irrelevant. There is a good prospect that someone
using an older Asian locale as their default would get this message
inappropriately, even if they don't use locales, or switch to a
supported one before using them.
This commit causes the message to be raised only if it actually is
relevant. When not in the scope of 'use locale', the message is stored,
not raised. Upon the first locale-dependent operation within a bad
locale, the saved message is raised, and the storage cleared. I was
able to do this without adding extra branching to the main-line
non-locale execution code. This was done by adding regnodes which get
jumped to by switch statements, and refactoring some existing C tests so
they exclude non-locale right off the bat.
These changes would have been necessary for another locale warning that
I previously agreed to implement, and which is coming a few commits from
now.
I do not know of any way to add tests in the test suite for this. It is
in fact rare for modern locales to have these issues. The way I tested
this was to temporarily change the C code so that all locales are viewed
as defective, and manually note that the warnings came out where
expected, and only where expected.
I chose not to try to output this warning on any POSIX functions called.
I believe that all that are affected are deprecated or scheduled to be
deprecated anyway. And POSIX is closer to the hardware of the machine.
For convenience, I also don't output the message for some zero-length
pattern matches. If something is going to be matched, the message will
likely very soon be raised anyway.
This op is an optimisation for any series of one or more array or hash
lookups and dereferences, where the key/index is a simple constant or
package/lexical variable. If the first-level lookup is of a simple
array/hash variable or scalar ref, then that is included in the op too.
So all of the following are replaced with a single op:
$h{foo}
$a[$i]
$a[5][$k][$i]
$r->{$k}
local $a[0][$i]
exists $a[$i]{$k}
delete $h{foo}
while these aren't:
$a[0] already handled by OP_AELEMFAST
$a[$x+1] not a simple index
and these are partially replaced:
(expr)->[0]{$k} the bit following (expr) is replaced
$h{foo}[$x+1][0] the first and third lookups are each done with
a multideref op, while the $x+1 expression and
middle lookup are done by existing add, aelem etc
ops.
Up until now, aggregate dereferencing has been very heavyweight in ops; for
example, $r->[0]{$x} is compiled as:
gv[*r] s
rv2sv sKM/DREFAV,1
rv2av[t2] sKR/1
const[IV 0] s
aelem sKM/DREFHV,2
rv2hv sKR/1
gvsv[*x] s
helem vK/2
When executing this, in addition to the actual calls to av_fetch() and
hv_fetch(), there is a lot of overhead of pushing SVs on and off the
stack, and calling lots of little pp() functions from the runops loop
(each with its potential indirect branch miss).
The multideref op avoids that by running all the code in a loop in a
switch statement. It makes use of the new UNOP_AUX type to hold an array
of
typedef union {
PADOFFSET pad_offset;
SV *sv;
IV iv;
UV uv;
} UNOP_AUX_item;
In something like $a[7][$i]{foo}, the GVs or pad offsets for @a and $i are
stored as items in the array, along with a pointer to a const SV holding
'foo', and the UV 7 is stored directly. Along with this, some UVs are used
to store a sequence of actions (several actions are squeezed into a single
UV).
Then the main body of pp_multideref is a big while loop round a switch,
which reads actions and values from the AUX array. The two big branches in
the switch are ones that are effectively unrolled (/DREFAV, rv2av, aelem)
and (/DREFHV, rv2hv, helem) triplets. The other branches are various entry
points that handle retrieving the different types of initial value; for
example 'my %h; $h{foo}' needs to get %h from the pad, while '(expr)->{foo}'
needs to pop expr off the stack.
Note that there is a slight complication with /DEREF; in the example above
of $r->[0]{$x}, the aelem op is actually
aelem sKM/DREFHV,2
which means that the aelem, after having retrieved a (possibly undef)
value from the array, is responsible for autovivifying it into a hash,
ready for the next op. Similarly, the rv2sv that retrieves $r from the
typeglob is responsible for autovivifying it into an AV. This action
of doing the next op's work for it complicates matters somewhat. Within
pp_multideref, the autovivification action is instead included as the
first step of the current action.
In terms of benchmarking with Porting/bench.pl, a simple lexical
$a[$i][$j] shows a reduction of approx 40% in numbers of instructions
executed, while $r->[0][0][0] uses 54% fewer. The speed-up for hash
accesses is relatively more modest, since the actual hash lookup (i.e.
hv_fetch()) is more expensive than an array lookup. A lexical $h{foo}
uses 10% fewer, while $r->{foo}{bar}{baz} uses 34% fewer instructions.
Overall,
bench.pl --tests='/expr::(array|hash)/' ...
gives:
PRE POST
------ ------
Ir 100.00 145.00
Dr 100.00 165.30
Dw 100.00 175.74
COND 100.00 132.02
IND 100.00 171.11
COND_m 100.00 127.65
IND_m 100.00 203.90
with cache misses unchanged at 100%.
In general, the more lookups done, the bigger the proportionate saving.
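
To illustrate just the action-packing trick described above, here is a
self-contained sketch; the bit widths and action names are invented, not
the real multideref encoding:

    #include <stdint.h>
    #include <stdio.h>

    /* Several small actions squeezed into one word, peeled off by a
     * switch inside a loop — no per-step pp() call or stack traffic. */
    enum { ACT_BITS = 4, ACT_MASK = (1 << ACT_BITS) - 1 };
    enum { ACT_END = 0, ACT_FETCH_AV, ACT_FETCH_HV, ACT_DEREF };

    static void
    run_actions(uint64_t actions)
    {
        unsigned act;
        while ((act = (unsigned)(actions & ACT_MASK)) != ACT_END) {
            switch (act) {
            case ACT_FETCH_AV: puts("array fetch"); break;
            case ACT_FETCH_HV: puts("hash fetch");  break;
            case ACT_DEREF:    puts("dereference"); break;
            }
            actions >>= ACT_BITS;        /* next action in the same word */
        }
    }

    int
    main(void)
    {
        /* encode: fetch AV, then deref, then fetch HV */
        uint64_t prog = ACT_FETCH_AV
                      | (uint64_t)ACT_DEREF    << ACT_BITS
                      | (uint64_t)ACT_FETCH_HV << (2 * ACT_BITS);
        run_actions(prog);
        return 0;
    }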
This reverts commit 8771da69db30134352181c38401c7e50753a7ee8.
Pad lists need to carry IDs around with them, so that when something
tries to close over a pad, it is possible to confirm that the right
pad is being closed over (either the original outer pad, or a clone of
it). (See the commit message of db4cf31d1, in which commit I added an
ID to the padlist struct.)
In 8771da69 I found that I could use the memory address of the pad’s
name list (name lists are shared) and avoid the extra field.
Some time after 8771da69 I realised that a pad list could be freed,
and the same address reused for another pad list, so using a memory
address may not be so wise. I thought it highly unlikely, though, and
put it on the back burner.
I have just run into that. t/comp/form_scope.t is now failing
for me with test 13, added by db4cf31d1. It bisects to 3d6de2cd1
(PERL_PADNAME_MINIMAL), but that’s a red herring. Trivial changes
to the script make the problem go away. And it only happens on non-
debugging builds, and only on my machine. Stepping through with gdb
shows that the format-cloning is following the format prototype’s
outside pointer and confirming that it has the correct pad (yes, the
memory addresses are the same), which I know it doesn’t, because I can
see what the test is doing.
While generation numbers can still fall afoul of the same problem, it
is much less likely.
Anyway, the worst thing about 8771da69 is the typo in the first word
of the commit message.
These will replace the current use of &PL_sv_undef and &PL_sv_no as
pad names.
The encoding pragma is deprecated, but in the meantime it causes spooky
action at a distance with other modules that it may be combined with.
In these modules, operations such as chr(), ord(), and utf8::upgrade()
will suddenly start doing the wrong thing.
The documentation for 'encoding' has said to call it after loading other
modules, but this may be impractical. This is especially bad with
anything that auto-loads at first use, like \N{} does now for charnames.
There is an issue with combining this with setting the variable
${^ENCODING} directly. The potential for conflicts has always been
there, and remains. This commit introduces a shadow hidden variable,
subservient to ${^ENCODING} (to preserve backwards compatibility) that
has lexical scope validity.
The pod for 'encoding' has been revamped to be more concise, clear, use
more idiomatic English, and to speak from a modern perspective.
- this improves the error message on ABI incompatibility, per
[perl #123136]
- reduce the number of gv_fetchfile calls in newXS when registering many
XSUBs
- "v" is not stripped from PERL_API_VERSION_STRING, since the string
"vX.XX.X\0" (a typical version number) is 8 bytes long and is aligned
to 4/8 bytes by most compilers in an image; a double-digit maint
release is extremely unlikely
- newXS_deffile saves on machine code in bootstrap functions by not
passing the filename arg
- move newXS to where the rest of the newXS*()s live
- move the "no address" panic closer to the start to get it out of the
way sooner flow-wise (it has nothing to do with var gv or cv)
- move CvANON_on to not check var name twice
- change the die message to use %p, which is more efficient on 32-bit
ptr/64-bit IV platforms; see ML post "about commit "util.c: fix
comiler warnings""
- vars cv/xs_spp (stack pointer pointer)/xs_interp exist for inspection
by a C debugger in an unoptimized build
Commit 0e42d607f5 made PL_apiversion unused. Remove it to save memory in
interp struct.
This prevents perl recursing infinitely when an overloaded object is
assigned to $DB::single, $DB::trace or $DB::signal.
This is done by referencing their values as IVs instead of as SVs in
dbstate, and by adding magic to those variables so that assignments to
the scalars update the PL_DBcontrol array.
The ‘no common vars’ optimisation allows perl to copy the values
straight from the rhs to the lhs in a list assignment.
In ($a,$b) = ($c,$d), that means $c gets assigned to $a,
then $d to $b.
If the same variable occurs on both sides of the expression
(($a,$b)=($b,$a)), then it is necessary to make temporary copies of
the variables on the rhs, before assigning them to the left.
If some variables have been aliased to others, then the common vars
detection can be fooled:
*x = *y;
$x = 3;
($x, $z) = (1, $y);
That assigns 1 to $x, and then goes to assign $y to $z, but $y is
the same as $x, which has just been clobbered. So 1 gets assigned
instead of 3.
This commit solves this by recording in each typeglob whether the
scalar is an alias of a scalar from elsewhere.
If such a glob is encountered, then the entire expression is ‘tainted’
such that list assignments will assume there might be common vars.
|
This commit allows Perl to be compiled with a bitmap size that is larger
than 256. This bitmap is used to directly look up whether a character
matches or not, without having to do a binary search or hash lookup. It
might improve the performance for some installations that have a lot of
use of scripts that are above the Latin1 range.
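
A minimal sketch of what such a bitmap buys, with an invented size:
membership is one shift and mask, no binary search or hash lookup.

    #include <stdint.h>

    /* 256 bits is the historical size; the commit allows a larger
     * compile-time choice, which simply extends the array. */
    #define BITMAP_BITS 1024             /* illustrative, > 256 */
    static uint8_t bitmap[BITMAP_BITS / 8];

    static int
    bitmap_test(unsigned cp)
    {
        return cp < BITMAP_BITS && ((bitmap[cp >> 3] >> (cp & 7)) & 1);
    }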
Because some platforms (like HP-UX 10.*) have HUGE_VAL as DBL_MAX,
which, while large, is not quite infinity. So have our very own
infinity.
Similarly for NV_NAN.
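
A sketch of one way to roll your own infinity for an IEEE-754 64-bit
double; the real probing belongs to Configure, this just shows the idea:

    #include <stdint.h>
    #include <string.h>

    /* Build infinity from its bit pattern (sign 0, exponent all ones,
     * mantissa 0) rather than trusting HUGE_VAL. */
    static double
    my_inf(void)
    {
        const uint64_t bits = UINT64_C(0x7FF0000000000000);
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }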
PL_padix keeps track of the position in the pad when pad_alloc has to
start scanning for an available slot.
The availability of a slot is determined differently for targets
(which may reuse slots that are already targets from previous
statements, at least when pad_reset is enabled) and constants (which may
not reuse targets).
Having the same index for both may require scanning the entire pad for
allocating a constant or GV.
t/re/uniprops.t was running far too slowly under USE_BROKEN_PAD_RESET
because of this. pad_reset would reset PL_padix to point to the
beginning of a pad with a few hundred thousand entries. pad_alloc
would then have to scan the entire pad before adding a GV to the end.
It is still too slow, even with this commit, but for other reasons.
(This is just a partial fix.)
MAD = Misc Attribute Decoration; unmaintained attempt at preserving
the Perl parse tree more faithfully so that automatic conversion to
Perl 6 would have been easier.
This was scheduled to be removed in 5.20, but was forgotten.
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
characters above-Latin1 that match only themselves, that the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers, had to have more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case
insensitively match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
This global array is no longer used, its uses having been removed in
previous commits in this series.
Since it is a global, consideration need be given to possible uses of it
outside the core. It has never been externally documented, and is an
opaque structure whose internals have changed with every release. The
functions used to access it are almost all static to regcomp.c; those
few that aren't have been hidden from all but the few .c files that need
to have access to them, via #if's.
This reverts commit c1cec775e9019cc8ae244d4db239a7ea5c0b343e.
See ticket #120864.
The cop address for each breakable line was being stored in the IVX
slot of ${"_<$file"}[$line]. This value itself, writable from Perl
space, was being used as the address of the op to be flagged, whenever
a breakpoint was set.
This meant writing to ${"_<$file"}[$line] and assigning a number (like
42) would cause perl to use 42 as an op address, and crash when trying
to flag the op.
Furthermore, since the array holding the lines could outlive the ops,
setting a breakpoint on the op could write to freed memory or to an
unrelated op (even a different type), potentially changing the
behaviour of unrelated code.
This commit solves those pitfalls by moving breakpoints into a global
breakpoint bitfield. Dbstate ops now have an extra field on the end
holding a sequence number, representing which bit holds the breakpoint
for that op.
PL_ASCII contains an inversion list to match the ASCII-range code
points. It is unusable outside the core regular expression code because
all the functions that manipulate inversion lists are defined only
within a few core files. Therefore no outside code should be depending
on it.
It turns out that there are arrays of similar inversion lists, and these
all have slots which should have this inversion list in them. This
commit fills them, instead of using PL_ASCII.
This is the upper half of the Latin1 range. This simplifies some code
very slightly, but will be of use in future commits.
Added as an experiment in 462e5cf6, it never quite worked, and
recently wasn't even using registers.
Based on Yves's random branch work.
This version makes the new random number visible to external modules,
for example, List::Util's XS shuffle() implementation.
I've also added a 64-bit implementation when HAS_QUAD is true, this
should be significantly faster, even on 32-bit CPUs. This is intended to
produce exactly the same sequence as the original implementation.
The original version of this commit retained the "freebsd" name from
Yves's original work for the function and data structure names. I've
removed "freebsd" from most function names so the name isn't an issue
if we choose to replace the implementation.
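
A sketch of a drand48-style step done with a single 64-bit multiply, the
kind of thing HAS_QUAD enables: a 48-bit linear congruential generator
scaled into [0, 1). The constants are the standard drand48 ones, but the
code is illustrative, not Perl's actual implementation:

    #include <stdint.h>

    #define D48_MULT UINT64_C(0x5DEECE66D)
    #define D48_ADD  UINT64_C(0xB)
    #define D48_MASK ((UINT64_C(1) << 48) - 1)

    static uint64_t d48_state;           /* 48 significant bits */

    static double
    my_drand48(void)
    {
        /* one 64-bit multiply replaces the 16-bit-limb arithmetic
         * needed on pure 32-bit builds */
        d48_state = (D48_MULT * d48_state + D48_ADD) & D48_MASK;
        return d48_state / 281474976710656.0;    /* divide by 2**48 */
    }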
PL_hints stores the hints at compile time that get copied into the
cop_hints field of each COP (in newSTATEOP).
Since perl-5.8.0-8053-gd5ec298, COPs have stored all the hints.
Before that, COPs used to store only some of the hints. The hints
were copied here and there into PL_compiling, a static COP-shaped
buffer used during compilation, so that things like constant folding
would see the correct hints. a0ed51b3 back in 1998 did that.
Now that COPs can store all the hints, we can just use
PL_compiling.cop_hints to avoid having to copy them from PL_hints from
time to time.
This simplifies the code and avoids creating bugs like those that
a547fd219 and 1c75beb82 fixed.
This reverts commit c82ecf346.
It turned out to be faulty, because a location shared between threads
(the cop) was holding a reference count on a pad entry in a particular
thread. So when you free the cop, how do you know where to do
SvREFCNT_dec?
In reverting c82ecf346, this commit still preserves the bug fix from
1311cfc0a7b, but shifts it around.
|