| Commit message | Author | Age | Files | Lines |
|
This function is equivalent to sv_setsv(sv, &PL_sv_undef), but more
efficient.
Also change the obvious places in the core to use the new idiom.
|
RT #130132
My recent commit v5.25.6-266-ga083329 made it so that perl could
sometimes avoid mortalising the referent when assigning to a reference
(e.g. for $ref1 = $ref2, where $$ref1 has a ref count of 1).
Unfortunately it turns out that list assign relied on this behaviour
to avoid premature freeing, e.g.
($ref1, $x) = ($y, $$ref1);
where $$ref1 needs to continue to live for at least the rest of the
assign.
This commit fixes it by mortalising the referent in pp_assign when
required.
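A minimal sketch of the failure mode (names hypothetical):
    my $ref1 = do { \(my $tmp = "payload") };   # $$ref1 has a ref count of 1
    my ($y, $x) = ("new", undef);
    # $$ref1 on the RHS must outlive the overwriting of $ref1 on the LHS:
    ($ref1, $x) = ($y, $$ref1);
    print $x;   # "payload" - without the mortalise this could read freed memory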
|
When assigning to a ref, the old referent is mortalised if its refcount
is 1, to avoid a premature free on things like $r = $$r or $r = $r->[0].
For the shortcut case where $ref1 and $ref2 are simple refs (no magic etc),
it's possible to do the assign and then SvREFCNT_dec() the old value without
having to mortalise it, which is faster.
Even when it doesn't have to be mortalised (RC > 1) this commit makes it
slightly faster as it no longer calls sv_unref_flags().
Conversely, this commit also makes the short-cut integer assign code path
infinitesimally slower.
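A sketch of the case being guarded against:
    my $r = [ [ "inner" ] ];
    $r = $r->[0];    # the outer array's only reference is $r itself, so the
                     # old referent must survive until the assign completes
    print $r->[0];   # "inner"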
|
rename them from expr::assign::* to expr::sassign::* so as to more easily
distinguish them from expr::aassign::, and move them to the correct place
in the file
|
'index' is a perl function, so it should really be in the 'func'
top-level namespace of benchmarks
|
[perl #127999] Slowdown in split + list assign
Re-implement the code that handles e.g.
(..., @a) = (...);
(..., %h) = (...);
to make it a lot faster - more than reversing a performance regression
introduced in 5.24.0 - and fix some bugs. In particular, it now
special-cases an empty RHS, which just clears the aggregate, e.g.
(..., @a) = ()
Getting list assignment correct is quite tricky, due to the possibility of
premature frees, like @a = ($a[0]), and magic/tied values on the LHS or
RHS being triggered too soon/late, which might have side-effects. This
often requires making a copy of each RHS element (and indeed for assigning
to an array or hash, the values need copying anyway). But copying too soon
can result in leaked SVs if magic (such as calling FETCH()) dies. This
usually involves mortalising all the copies, which slows things down.
Further, a bug fix in 5.24.0 added the SV_NOSTEAL flag when copying SVs.
This meant that in something like @a = (split(...))[0,0], where the two SvTEMPs
on the RHS are the same, the first copy is no longer allowed to steal the
PVX buffer, which would have made the second SV undef. But this means that
PVX buffers are now always copied, which resulted in the slowdown seen in
RT #127999.
Amongst the general rewriting and optimising, this commit does the
following specific things to boost performance (and fix RT #127999).
* If the SVs on the RHS are non-magical SvTEMPs with a ref count of 1, then
the SV isn't copied; instead it is stored directly in the array/hash. This
more than undoes the cost of SV_NOSTEAL.
* The tmps stack is now used as a temporary refcounted version of the
argument stack frame, meaning that args placed there will be freed on
croak. In something like @a = (....), each RHS element is copied, with
the copy placed on the temps stack. Then @a is cleared. Then the elements
on the tmps stack are stored in the array, and removed from the temps
stack (with the ownership of 1 reference count transferring from the temps
stack to the array). Normally by the time pp_aassign() returns, there is
nothing left on the tmps stack and tmps_free() isn't called - this is the
novel element that distinguishes this from the normal use of mortalising.
* For hash assignment, the keys and values are processed in separate
loops, with keys not normally being copied.
* The ENTER/SAVEFREESV(ary/hash)/LEAVE has been removed, and the array or
hash kept temporarily alive by using the temps stack along with all the
other copied SVs.
* The main 'for each LHS element' loop has been split into two loops: the
second one is run when there are no more RHS elements to consume. The second
loop is much simpler, and makes things like @a = () much faster.
Here are the average expr::aassign:: benchmarks for selected perls
(raw numbers - lower is better)
5.6.1 5.22.0 5.24.0 5.25.5 this
------ ------ ------ ------ ------
Ir 1355.9 1497.8 1387.0 1382.0 1146.6
Dr 417.2 454.2 410.1 411.1 335.2
Dw 260.6 270.8 249.0 246.8 194.5
COND 193.5 223.2 212.0 207.7 174.4
IND 25.3 17.6 10.8 10.8 10.0
COND_m 4.1 3.1 3.1 3.7 2.8
IND_m 8.9 6.1 5.5 5.5 5.5
And this code:
my @a;
for my $i (1..10_000_000) {
@a = (1,2,3);
#@a = ();
}
with the empty assign is 33% faster than blead, and without is 12% faster
than blead.
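A sketch of the SV_NOSTEAL case described above:
    my @a = (split /,/, "x,y")[0, 0];
    print "@a";   # "x x" - the first copy must not steal the shared SvTEMP's
                  # string buffer, or the second element would end up undef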
|
RT #127999 Slowdown in split + list assign
The compile-time common-value detection mechanism for OP_ASSIGN
was getting OP_SPLIT wrong.
It was assuming that OP_SPLIT was always dangerous. In fact,
OP_SPLIT is usually completely safe, not passing through any of its
arguments, except where the assign in (@a = split()) has been optimised
away and the array attached directly to the OP_SPLIT op, or the ops that
produce the array have been appended as an extra child of the OP_SPLIT op
(OPf_STACKED).
|
There are currently two optimisations for when the results of a split
are assigned to an array.
For the first,
@array = split(...);
the aassign and padav/rv2av are optimised away, and pp_split() directly
assigns to the array attached to the split op (via op_pmtargetoff or
op_pmtargetgv).
For the second,
my @array = split(...);
local @array = split(...);
@{$expr} = split(...);
The aassign is optimised away, but the padav/rv2av is kept as an additional
arg to split. pp_split itself then uses the first arg popped off the stack
as the array. (This was introduced by FC with v5.21.4-409-gef7999f.)
This commit moves these two:
my @array = split(...);
local @array = split(...);
from the second case to the first case, by simply setting OPpLVAL_INTRO
on the OP_SPLIT, and making pp_split() do SAVECLEARSV() or save_ary()
as appropriate.
This makes my @a = split(...) a few percent faster.
|
Most ops that execute a regex, such as match and subst, are of type PMOP.
A PMOP allows the actual regex to be attached directly to that op, due
to its extra fields.
OP_SPLIT is different; it is just a plain LISTOP, but it always has an
OP_PUSHRE as its first child, which *is* a PMOP and which has the regex
attached.
At runtime, pp_pushre()'s only job is to push itself (i.e. the current
PL_op) onto the stack. Later pp_split() pops this to get access to the
regex it wants to execute.
This is a bit unpleasant, because we're pushing an OP* onto the stack,
which is supposed to be an array of SV*'s. As a bit of a hack, on
DEBUGGING builds we push a PVLV with the PL_op address embedded instead,
but this still isn't very satisfactory.
Now that regexes are first-class SVs, we could push a REGEXP onto the
stack rather than PL_op. However, there is an optimisation of @array =
split which eliminates the assign and embeds the array's GV/padix directly
in the PUSHRE op. So split still needs access to that op. But the pushre
op will always be splitop->op_first anyway, so one possibility is to just
skip executing the pushre altogether, and make pp_split just directly
access op_first instead to get the regex and @array info.
But if we're doing that, then why not just go the whole hog and make
OP_SPLIT into a PMOP, and eliminate the OP_PUSHRE op entirely: with the
data that was spread across the two ops now combined into just the one
split op.
That is exactly what this commit does.
For a simple compile-time pattern like split(/foo/, $s, 1), the optree
looks like:
before:
<@> split[t2] lK
</> pushre(/"foo"/) s/RTIME
<0> padsv[$s:1,2] s
<$> const(IV 1) s
after:
</> split(/"foo"/)[t2] lK/RTIME
<0> padsv[$s:1,2] s
<$> const[IV 1] s
while for a run-time expression like split(/$pat/, $s, 1),
before:
<@> split[t3] lK
</> pushre() sK/RTIME
<|> regcomp(other->8) sK
<0> padsv[$pat:2,3] s
<0> padsv[$s:1,3] s
<$> const(IV 1) s
after:
</> split()[t3] lK/RTIME
<|> regcomp(other->8) sK
<0> padsv[$pat:2,3] s
<0> padsv[$s:1,3] s
<$> const[IV 1] s
This makes the code faster and simpler.
At the same time, two new private flags have been added for OP_SPLIT -
OPpSPLIT_ASSIGN and OPpSPLIT_LEX - which make it explicit that the
assign op has been optimised away, and if so, whether the array is
lexical.
Also, deparsing of split has been improved, to the extent that
perl TEST -deparse op/split.t
now passes.
Also, a couple of panic messages in pp_split() have been replaced with
asserts().
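Optree dumps like those above come from B::Concise and can be reproduced
with something like (exact output varies by version):
    $ perl -MO=Concise -e 'my $s; split(/foo/, $s, 1)'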
|
When converting a POK SV to an IOK SV, short-cut the relatively
common case of a string that is only one char long and consists of a
single digit, e.g. "0", thus skipping all the floating-point, infinity,
whitespace etc. complexity.
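For example (the shortcut itself is internal to the string-to-number code):
    my $s = "0";
    my $n = $s + 1;   # numifying the one-digit string "0" now takes the fast path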
|
RT #128340
The in-place sorting optimisation, @a = sort @a, was preserving the
elements of @a rather than (logically) making copies. So make a copy
of any element whose refcount is greater than 1. This may not be the
perfect condition, but keeps performance for the common cases.
Note that several of the tests in t/op/sort.t actually relied on this
behaviour to test whether the sort was being in-placed, so I've added
tests for in-placing to t/perf/opcount.t instead.
See the previous commit for a general discussion of performance;
to the A, B, C columns in that commit message a fourth column is added here:
D is like C but with this commit applied:
A B C D
------ ------ ------ ------
Ir 5238.0 2324.0 2772.0 2801.0
Dr 1464.0 649.0 765.0 765.0
Dw 919.0 298.0 370.0 380.0
COND 782.0 320.0 405.0 405.0
IND 25.0 25.0 26.0 26.0
COND_m 14.9 13.0 17.0 17.1
IND_m 8.0 5.0 5.0 5.0
so it has little effect on performance.
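A sketch of the aliasing case that the copying now handles:
    my @a = (3, 1, 2);
    my $alias = \$a[0];            # this element's ref count is now > 1
    @a = sort { $a <=> $b } @a;
    $$alias = 99;                  # must not affect @a: sort stored a copy
    print "@a";                    # "1 2 3"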
|
There's currently an optimisation that converts at compile-time
@a = sort { .... } @a
into (approximately)
sort { ... } \@a
Then at run time, rather than passing an svp pointer to the appropriate
sort routine which points to a list of SV*'s on the stack, pp_sort()
passes a pointer to @a's AvARRAY. This allows the array to be sorted
in-place, which is more efficient.
However, it has some issues. First, the @a visible to the sort routine
varies during the sort, whereas logically it should still hold the original
list of values until after the '@a = ...' assignment.
Secondly, the mergesort algorithm currently used internally, when in
mid-sort, temporarily stores pointers in the array which aren't pointers
to SVs - this means that if @a elements are accessed mid-sort, it can
crash.
The solution to both these problems is for pp_sort() to push the elements
of @a onto the stack at the beginning, sort the stack (like normal sorts
do), then copy back to @a at the end. This is less efficient than before,
but is still a lot more efficient than executing separate padav and
aassign ops.
Here are benchmark results in raw instruction counts etc (lower is better)
for the sort line in this code block:
my (@a, @b);
@a = reverse 1..10;
@b = sort { $a <=> $b } @a;
A is for a non-in-place sort, i.e. @b = sort ... @a;
B and C are for an inline sort, i.e. as above, but @a = sort ... @a;
where B is blead before this commit and C is this commit.
A B C
------ ------ ------
Ir 5238.0 2324.0 2772.0
Dr 1464.0 649.0 765.0
Dw 919.0 298.0 370.0
COND 782.0 320.0 405.0
IND 25.0 25.0 26.0
COND_m 14.9 13.0 17.0
IND_m 8.0 5.0 5.0
As can be seen, this partial pessimisation slows down in-place sorting by
around 20%, but overall in-place is still nearly twice as fast as without
the optimisation.
These are the figures for a plain numeric sort (which is optimised to use
a C comparison function); for other types of sort, the cost of the
comparator dominates, and so the slowdown is much less marked.
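A sketch of the mid-sort access this makes safe (the sort now works on a
stack copy rather than on @a's own AvARRAY):
    my @a = (3, 1, 2);
    @a = sort { print "@a\n"; $a <=> $b } @a;   # reading @a mid-sort no longer
                                                # risks seeing non-SV pointers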
|
I forgot to do this earlier.
|
Commit 11fa937b2766784f0f812687a7f81dd3761e605f, from August 1999, changed
constant folding to treat barewords as unfoldable. Public mailing-list
archives from that period are incomplete, unfortunately, and this change in
particular is no longer well documented; the only message still available
seems to be <199907140903.CAA07505@activestate.com>, with subject "Re:
Unwanted perl helpfullness". That message quotes a bug reporter pointing out
that a constant-folded bareword would fail to trigger the "not allowed while
strict subs in use" error under "use strict".
However, there's no information available about why the bug was fixed in this
way, rather than by merely ensuring that constant folding would produce that
error when necessary. The no_bareword_allowed() routine did already exist at
that point, for example. This change therefore adopts that approach.
This causes one minor change in behaviour: since barewords now take part in
constant folding, concatenating a bareword with another constant in void
context now produces a warning for the concatenated string, rather than for
the concatenation op. This seems like an acceptable change, given that
non-bareword constants already behave in the same way.
|
If a 'next' without a label appears directly in the scope of the current
loop, then skip searching the context stack for a suitable LOOP context.
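For example:
    for my $i (1 .. 10) {
        next if $i % 2;   # unlabelled 'next' directly inside the loop body:
                          # no context-stack search needed
    }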
|
Currently pp_leavesub() doesn't call FREETMPS. Presumably this is due to
the danger of freeing temps that need to exist beyond the return, such
as the mortal copies we make of return args, or return args that are
already temps.
The down side of this is that temps aren't freed until the next nextstate
is executed following the function call. If the function isn't near a
statement boundary, then it may be a while before temps are freed; e.g.
in f(g()), when g() returns, its temps continue to exist during the call
to f(). For recursive subs it gets worse, although there is already a
specific hack in pp_leavesub() which says: in scalar context, if CvDEPTH > 1,
temporarily RC++ the single return value, then do a FREETMPS. This
can in theory leak if something dies during the FREETMPS.
This commit provides a more general solution. During the course of
processing (usually mortal copying) the return args, it divides the
current temps stack frame into two halves, with the temps that need
keeping migrating to the bottom half. It then does a FREETMPS equivalent
only of the top half.
This is actually more efficient than it sounds; but in addition, the code
has been heavily optimised: in particular the call to sv_mortalcopy()
has been unrolled and inlined, and within that code common cases (such as
IV, RV) handled directly, and the mortal stack is only checked/extended
once, rather than for every arg.
Also, the arg adjust / freetmps code has been moved out of pp_leavesub()
and into a separate static function.
|
This commit does two things.
First, it streamlines and re-orders some of the initial tests,
such as 'is sstr already freed?'.
Second, it looks for a reasonably common case where both sstr and dstr are
SVt_NULL/SVt_IV. This covers undef, int and refs, where the SV hasn't
previously been used for other things (like strings).
With just SVt_NULL/SVt_IV, we know that the SV won't have a real body and
won't need one, so the copy can be very quick.
The check for SVt_NULL/SVt_IV is a single compare-and-branch, so
has a minimal effect on users of sv_setsv_flags() that have more complex
types.
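A sketch of assignments covered by the new fast path:
    my ($x, $y) = (undef, 42);
    $x = $y;      # undef/int SVs never used as strings: fast path
    my $s = "str";
    $y = $s;      # $s has a string body, so this takes the general path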
|
Normally at the start of a 'for' iteration (pp_enteriter), we store the
address of the GV or the address of the pad slot of the iteration variable
in the CX struct, so we can quickly access and update it in pp_iter. For
the ITHREADS case however, the pad may change if the thread is cloned, so
instead the address of PL_comppad was stored, and then updated during
cloning. This meant that on each iter, we would have to retrieve the saved
PL_comppad address from the ctx struct, retrieve the targ from the saved
my_op, and do a (perlish) array indexing.
This commit makes this faster by, for the ITHREADS case, storing both
PL_comppad and svp = &PL_curpad[targ]. pp_iter() is then fast, directly
accessing *svp, while storing PL_comppad allows us to update both it and
svp during thread cloning.
This requires one extra pointer in the block_loop part of the context
union under threads, but this is already smaller than some other parts of
the union, so has no effect.
Note that the oldcomppad field was formerly part of the itervar_u union,
but is now a separate field within the larger block_loop struct (but only
on threaded builds).
Note also that the tests I've added I retrieved from an old WIP private
branch that contained a forerunner of this commit, so they may not be
entirely relevant to the current form of this commit. But extra tests
can never hurt, so I've included them.
|
pp_preinc() handles both ++$x and --$x (and the integer variants
pp_i_preinc/dec). Split it into two separate functions, as handling
both inc and dec in the same function requires 3 extra conditionals.
At the same time make the code more efficient.
As currently written it:
1) checked for "bad" SVs (such as read-only) and croaked;
2) checked for a IOK-only SV and directly incremented the IVX slot;
3) else called out to sv_inc() to handle the more complex cases.
This commit combines the checks in (1) and (2) into one single big
check of flags, and anything "bad" simply skips the IOK-only code
and calls sv_dec(), which can do its own checking of read-only etc
and croak if necessary. Porting/bench.pl shows the following raw numbers
for ++$x ($x lexical and holding an integer):
before after
-------- --------
Ir 77.0 56.0
Dr 30.0 24.0
Dw 10.0 10.0
COND 12.0 9.0
IND 2.0 2.0
COND_m -0.1 0.0
IND_m 2.0 2.0
Even having split the function into two, the combined size of the two new
functions is smaller than the single previous function.
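For example:
    my $x = 1;
    ++$x;   # a lexical holding a plain integer: one flags check, then the
            # IVX slot is incremented directly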
|
In pp_add, pp_subtract and pp_multiply, special-case the following:
* both args IV, neither arg large enough to under/overflow
* both args NV.
Starting in 5.8.0, the implementation of the arithmetic pp functions
became a lot more complex (and famously, much slower), due to the need
to support 64-bit integers.
For example, formerly pp_add just converted both its args to an NV and
returned an NV. On 64-bit systems, that could give bad results if the
mantissa of an NV was < 64 bits; for example:
$ perl561 -e'$x = 0x1000000000000000; printf "%x\n", $x+1'
1000000000000000
$ perl580 -e'$x = 0x1000000000000000; printf "%x\n", $x+1'
1000000000000001
This led to a lot of complex code that covered all the possibilities
of overflow etc.
This commit adds some special casing to these three common arithmetic ops.
It does some quick checks (mainly involving fast boolean and bit ops)
to determine if both args are valid IVs (and not UVs), are not magic,
and aren't very big (+ve or -ve). In this case, the result is simply
SvIVX(svl) + SvIVX(svr) (or - or *) with no possibility of overflow.
Failing that, if both args are NV's and not magic, then if both NVs
can be converted to IVs without loss, handle as for the IV case; failing
that, just return SvNVX(svl) + SvNVX(svr);
For all other cases, such as mixed IV and NV or PV, fall back to the old
code.
On my platform (x86_64), it (along with the previous commit) reduces the
execution time of the nbody benchmark (lots of floating-point vector
arithmetic) by a third and in fact makes it 10% faster than 5.6.1.
|
TL;DR: on platforms with a libc memchr() implementation which makes good
use of underlying hardware support, patterns which include fixed
substrings will now often be much faster; for example with glibc on a
recent x86_64 CPU, this:
$s = "a" x 1000 . "wxyz";
$s =~ /wxyz/ for 1..30000
is now about 7 times faster. On systems with slow memchr(), e.g. 32-bit
ARM Raspberry Pi, there will be only a small speedup, if any. Conversely,
some pathological cases, such as "ab" x 1000 =~ /aa/ will be slower now;
up to 3 times slower on the rPi, 1.5x slower on x86_64.
In detail:
The perl core includes a Boyer-Moore substring matcher, Perl_fbm_instr(),
which is used to quickly find a fixed substring within a longer string,
where a table of lookups is pre-computed from the substring. As well as
being used in index() when the substring is a constant, its main use
is in patterns. When the regex engine compiles a pattern, it typically
takes note of the two longest fixed substrings within the pattern; for
example in
/[01]abc\w+de\d+fghij+/
the two longest are "abc" and "fghij". The engine uses Perl_fbm_instr() to
scan for these two strings before running the full NFA. This often allows
the string to be quickly rejected, or to find a suitable minimum starting
point to run the NFA.
However, Perl_fbm_instr() was written about 16 years ago and has been
virtually untouched since, so it could do with some love.
It currently special-cases strings of length 1 and 2, using roll-your-own
loops along the lines of
while (s < end) { if (*s++ == c1) ... }
while strings of length 3+ use the Boyer-Moore algorithm. The big
advantage of BM is that in a best-case, where none of the characters from
the substring are found in this region of the string, it only has to test
every N'th char, where N is length of the substring. For example when
searching for wxyz in abcdefghijkl..., it just reads and tests d, h, l, ...
However these days some platforms have decent memchr() implementations.
For example, glibc has assembly-level implementations for i386, x86_64,
sparc32/64, powerpc32/64, s390-32/64, arm, m68k and ia64 by the looks of
it. These can often be substantially faster than a C-level implementation.
This commit makes Perl_fbm_instr() use memchr() where possible.
For the length == 1 special case, it just calls memchr() directly rather
than using a loop as previously.
For the length == 2 special case, it continues to distinguish the cases
where the two chars of the substring are the same or differ. For the
former it calls memchr() after an initial direct failure, i.e.
if (*s != c) { s++; s = memchr(....); ... }
For the latter case it does a similar direct test first (to avoid the
costly overhead of a call to memchr() when the next char is the one we
seek anyway), but in addition, on each failure to find the second char
following a found first char, it swaps which char it's searching for.
This means that in something like "aaaaaaa..." =~ /ab/, it won't keep
hopping one char position at a time with memchr(s,'a'); after the first hop it
will do memchr(s,'b') and skip lots of chars in one go. This helps reduce
the number of pathological cases.
For the length >= 3 cases (normal BM), it keeps using BM, but after each
iteration where the pointer has been incremented by the skip determined by
the BM algorithm, it now does an additional
if (*s != c) { s++; s = memchr(....); ... }
step before running the next iteration of BM.
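As noted above, index() with a constant substring also goes through
Perl_fbm_instr(), so it benefits too:
    my $s = "a" x 1000 . "wxyz";
    my $pos = index $s, "wxyz";   # 1000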
|
With 808ce5578203, I tweaked the OPpASSIGN_COMMON flagging to mark as safe
when the LHS or RHS contains only one var. This turned out to be
flawed for the RHS logic, as position as well as oneness is important:
(undef, $x) = ...; # only 1 scalar on LHS: always safe
($x, $y) = (undef, $x); # 2 scalars on RHS: unsafe
So this commit makes undef on the RHS count towards the scalar var count.
|
make the tests in the benchmark file be compilable back to 5.004_05. (To
go further back, it would need to avoid package names that start with
digits, such as 'call::sub::3_args').
Basically avoid // and our.
|
Avoid setting common scalar flags in these cases:
($x) = (...);
(...) = ($x);
|
Technically in
my ($scalar,...) = @_
due to closure/goto tricks, it's possible for $scalar to appear on both
the LHS and RHS, so we currently set the OPpASSIGN_COMMON_RC1 flag.
However, this imposes extra overhead; for example 5% extra instruction
reads and 11% extra conditional branches for
my ($x,$y,$z) = @_;
Given what an important construct this is, disable this flag in the
specific case of only my's on the LHS and only @_ on the RHS.
It's technically incorrect, but it's the same behaviour we've always had
(it was only the previous commit which made it safe but slower).
We still set the OPpASSIGN_COMMON_AGG flag for
my ($...,@a) = @_
since in the normal case this only adds the small additional runtime
overhead of checking that @a is already empty.
|
This commit almost completely replaces the current mechanism
for detecting and handling common vars in list assignment, e.g.
($a,$b) = ($b,$a);
In general outline: it creates more false positives at compile-time
than before, but also no longer misses some false negatives. In
compensation, it considerably reduces the run-time cost of handling
potential and real commonality.
It does this firstly by splitting the OPpASSIGN_COMMON flag into 3
separate flags:
OPpASSIGN_COMMON_AGG
OPpASSIGN_COMMON_RC1
OPpASSIGN_COMMON_SCALAR
which indicate different classes of commonality that can be handled
in different ways at runtime.
Most importantly, it distinguishes between two basic cases. Firstly,
common scalars (OPpASSIGN_COMMON_SCALAR), e.g.
($x,....) = (....,$x,...)
where $x is modified and then sometime later its value is used again,
but that value has changed in the meantime. In this case, we need to
replace such vars on the RHS with mortal copies before processing the
assign.
The second case is an aggregate on the LHS (OPpASSIGN_COMMON_AGG), e.g.
(...,@a) = (...., $a[0],...)
In this case, the issue is instead that when @a is cleared, it may free
items on the RHS (due to the stack not being ref counted). What is
required here is that rather than making a copy of each RHS element and
storing it in the array as we progress, we make *all* the copies *before*
clearing the array, but mortalise them in case we die in the meantime.
We can further distinguish two scalar cases; sometimes it's possible
to confirm non-commonality at run-time merely by checking that all
the LHS scalars have a reference count of 1. If this is possible,
we set the OPpASSIGN_COMMON_RC1 flag rather than the
OPpASSIGN_COMMON_SCALAR flag.
The major improvement in the run-time performance in the
OPpASSIGN_COMMON_SCALAR case (or OPpASSIGN_COMMON_RC1 if rc>1 scalars are
detected), is to use a mark-and-sweep scan of the two lists using the
SVf_BREAK flag, to determine which elements are common, and only make
mortal copies of those elements. This has a very big effect on run-time
performance; for example in the classic
($a,$b) = ($b,$a);
it would formerly make temp copies of both $a and $b; now it only
copies $a.
In more detail, the mark and sweep mechanism in pp_aassign works by
looping through each LHS and RHS SV pair in parallel. It temporarily marks
each LHS SV with the SVf_BREAK flag, then makes a copy of each RHS element
only if it has the SVf_BREAK flag set. When the scan is finished, the flag
is unset on all LHS elements.
One major change in compile-time flagging is that package scalar vars are
now treated as if they could always be aliased. So we don't bother any
more to do the compile-time PL_generation checking on package vars (we
still do it on lexical vars). We also no longer make use of the run-time
PL_sawalias mechanism for detecting aliased package vars (and indeed the
next commit but one will remove that mechanism). This means that more list
assignment expressions which feature package vars will now need to
do a runtime mark-and-sweep (or where appropriate, RC1) test. In
compensation, we no longer need to test for aliasing and set PL_sawalias
in pp_gvsv and pp_gv, nor reset PL_sawalias in every pp_nextstate.
Part of the reasoning behind this is that it's nearly impossible to detect
all possible package var aliasing; for example PL_sawalias would fail to
detect XS code doing GvSV(gv) = sv.
Note that we now scan the two children of the OP_AASSIGN separately,
and in particular we mark lexicals with PL_generation only on the
LHS and test only on the RHS. So something like
($x,$y) = ($default, $default)
will no longer be regarded as having common vars.
In terms of performance, running Porting/perlbench.pl on the new
expr::aassign:: tests in t/perf/benchmarks show that the biggest slowdown
is around 13% more instruction reads and 20% more conditional branches in
this:
setup => 'my ($v1,$v2,$v3) = 1..3; ($x,$y,$z) = 1..3;',
code => '($x,$y,$z) = ($v1,$v2,$v3)',
where this is now a false positive due to the presence of package
variables.
The biggest speedup is 50% less instruction reads and conditional branches
in this:
setup => '@_ = 1..3; my ($x,$y,$z)',
code => '($x,$y,$z) = @_',
because formerly the presence of @_ pessimised things if the LHS wasn't
a my declaration (it's still pessimised, but the runtime's faster now).
Conversely, we pessimise the 'my' variant too now:
setup => '@_ = 1..3;',
code => 'my ($x,$y,$z) = @_',
this gives 5% more instruction reads and 11% more conditional branches now.
But see the next commit, which will cheat for that particular construct.
|
Commit v5.21.6-352-gc46871e caused all padsv ops to be marked with
scalar context. This prevented the padrange optimisation from happening
with a combination of scalars and aggregates in list context.
Adjust the padrange algorithm to ignore the distinction between list
and scalar context. We only do this optimisation when, even in list
context, each op will push just one thing on to the stack. (That’s
why we check OPf_REF on padav and padhv.) So the list/scalar
distinction is irrelevant.
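For example:
    my ($x, @y, %z) = @_;   # scalars and aggregates together in list context:
                            # eligible for the padrange optimisation again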
|
This op is an optimisation for any series of one or more array or hash
lookups and dereferences, where the key/index is a simple constant or
package/lexical variable. If the first-level lookup is of a simple
array/hash variable or scalar ref, then that is included in the op too.
So all of the following are replaced with a single op:
$h{foo}
$a[$i]
$a[5][$k][$i]
$r->{$k}
local $a[0][$i]
exists $a[$i]{$k}
delete $h{foo}
while these aren't:
$a[0] already handled by OP_AELEMFAST
$a[$x+1] not a simple index
and these are partially replaced:
(expr)->[0]{$k} the bit following (expr) is replaced
$h{foo}[$x+1][0] the first and third lookups are each done with
a multideref op, while the $x+1 expression and
middle lookup are done by existing add, aelem etc
ops.
Up until now, aggregate dereferencing has been very heavyweight in ops; for
example, $r->[0]{$x} is compiled as:
gv[*r] s
rv2sv sKM/DREFAV,1
rv2av[t2] sKR/1
const[IV 0] s
aelem sKM/DREFHV,2
rv2hv sKR/1
gvsv[*x] s
helem vK/2
When executing this, in addition to the actual calls to av_fetch() and
hv_fetch(), there is a lot of overhead of pushing SVs on and off the
stack, and calling lots of little pp() functions from the runops loop
(each with its potential indirect branch miss).
The multideref op avoids that by running all the code in a loop in a
switch statement. It makes use of the new UNOP_AUX type to hold an array
of
typedef union {
PADOFFSET pad_offset;
SV *sv;
IV iv;
UV uv;
} UNOP_AUX_item;
In something like $a[7][$i]{foo}, the GVs or pad offsets for @a and $i are
stored as items in the array, along with a pointer to a const SV holding
'foo', and the UV 7 is stored directly. Along with this, some UVs are used
to store a sequence of actions (several actions are squeezed into a single
UV).
Then the main body of pp_multideref is a big while loop round a switch,
which reads actions and values from the AUX array. The two big branches in
the switch are ones that are effectively unrolled (/DREFAV, rv2av, aelem)
and (/DREFHV, rv2hv, helem) triplets. The other branches are various entry
points that handle retrieving the different types of initial value; for
example 'my %h; $h{foo}' needs to get %h from the pad, while '(expr)->{foo}'
needs to pop expr off the stack.
Note that there is a slight complication with /DEREF; in the example above
of $r->[0]{$x}, the aelem op is actually
aelem sKM/DREFHV,2
which means that the aelem, after having retrieved a (possibly undef)
value from the array, is responsible for autovivifying it into a hash,
ready for the next op. Similarly, the rv2sv that retrieves $r from the
typeglob is responsible for autovivifying it into an AV. This action
of doing the next op's work for it complicates matters somewhat. Within
pp_multideref, the autovivification action is instead included as the
first step of the current action.
In terms of benchmarking with Porting/bench.pl, a simple lexical
$a[$i][$j] shows a reduction of approx 40% in numbers of instructions
executed, while $r->[0][0][0] uses 54% fewer. The speed-up for hash
accesses is relatively more modest, since the actual hash lookup (i.e.
hv_fetch()) is more expensive than an array lookup. A lexical $h{foo}
uses 10% fewer, while $r->{foo}{bar}{baz} uses 34% fewer instructions.
Overall,
bench.pl --tests='/expr::(array|hash)/' ...
gives:
PRE POST
------ ------
Ir 100.00 145.00
Dr 100.00 165.30
Dw 100.00 175.74
COND 100.00 132.02
IND 100.00 171.11
COND_m 100.00 127.65
IND_m 100.00 203.90
with cache misses unchanged at 100%.
In general, the more lookups done, the bigger the proportionate saving.
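The new op can be seen with B::Concise; something like:
    $ perl -MO=Concise,-exec -e 'my ($r, $x); $r->[0]{$x}'
should show a single multideref op where the old op sequence quoted
above used to appear.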
|
There's no need to run sv_pos_b2u_flags() if the retval is 1, as one
byte can only be one character; therefore change the test to "> 1".
This makes index() on Unicode strings that match at position 1 slightly
faster.
|
This tool runs code snippets found in t/perf/benchmarks (or similar) under
cachegrind, in order to calculate how many instruction reads, data writes,
branches, cache misses, etc. that one execution of the snippet uses. It
will run them against two or more perl executables and show how much each
test has gotten better or worse.
It is modelled on the perlbench tool, but since it measures instruction
reads etc., rather than timings, it is much more precise and reproducible.
It is also considerably faster, and is capable of running tests in
parallel. Rather than displaying a single relative percentage per
test/perl combination, it displays values for 13 different measurements,
such as instruction reads, conditional branch misses etc.
This commit also changes the format of t/perf/benchmarks slightly; it
becomes an AoH rather than a HoH (to allow checking for duplicate keys),
and the test names themselves become a :: hierarchy.
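A typical invocation looks something like this (the two perl paths are
placeholders):
    $ perl Porting/bench.pl --tests='/expr::aassign/' -- perl-old perl-new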
|
[A-z] should be [A-Z]. Spotted by Karl Williamson.
|
t/perf/benchmarks is a file intended to contain snippets of code
that can be usefully benchmarked or otherwise profiled.
The basic idea is that any time you add an optimisation that is intended
to make a particular construct faster, then you should add that construct
to this file.
Under the normal test suite, the test file benchmarks.t does a basic
compile and run of each of these snippets; not to test performance,
but just to ensure that the code doesn't have errors.
Over time, it is intended that various measurement and profiling tools
will be written that can run selected (or all) snippets in various
environments. These will not be run as part of a normal test suite run.
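A sketch of an entry, in the setup/code style quoted elsewhere in this log
(name and fields illustrative):
    'expr::aassign::2_lex_scalars' => {
        setup => 'my ($x, $y)',
        code  => '($x, $y) = (1, 2)',
    },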
|
This test file is similar to /re/speed.t, but to test general-purpose
optimisations.
The idea is to run snippets of code that are 100s or 1000s of times slower
if a particular optimisation is broken. We are not so much interested
in the individual tests passing, as in the whole file failing with a
watchdog timeout (or just observing that it runs more slowly).
|
explain (kind of) why this file is called optree.t
|
Now that we have a directory, t/perf/, for performance / optimisation
tests, move this test file there, and rename to something slightly
clearer.
|
Add a new directory designed to hold performance / optimising tests
and infrastructure, and add the first test file, opcount.t, that
checks that a sub has the right number of particular op types.
|