path: root/t/perf
Commit message  (Author, Date; Files changed, Lines -removed/+added)
* add sv_set_undef() API function  (David Mitchell, 2016-11-24; 1 file, -0/+5)

  This function is equivalent to sv_setsv(sv, &PL_sv_undef), but more
  efficient. Also change the obvious places in the core to use the new idiom.
* avoid premature free of referent in list assign  (David Mitchell, 2016-11-24; 1 file, -0/+13)

  RT #130132

  My recent commit v5.25.6-266-ga083329 made it so that perl could sometimes
  avoid mortalising the referent when assigning to a reference (e.g. for
  $ref1 = $ref2, where $$ref1 has a ref count of 1).

  Unfortunately it turns out that list assign relied on this behaviour to
  avoid premature freeing, e.g.

      ($ref1, $x) = ($y, $$ref1);

  where $$ref1 needs to continue to live for at least the rest of the assign.
  This commit fixes it by mortalising the referent in pp_assign when required.
* optimise $ref1 = $ref2 better  (David Mitchell, 2016-11-16; 1 file, -0/+10)

  When assigning to a ref, the old referent is mortalised if its refcount is
  1, to avoid a premature free on things like $r = $$r or $r = $r->[0].

  For the shortcut case where $ref1 and $ref2 are simple refs (no magic etc),
  it's possible to do the assign then SvREFCNT_dec() the old value without
  having to mortalise it. Which is faster.

  Even when it doesn't have to be mortalised (RC > 1) this commit makes it
  slightly faster as it no longer calls sv_unref_flags().

  Conversely, this commit also makes the short-cut integer assign code path
  infinitesimally slower.
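  As a minimal sketch of the premature-free hazard this guards against
  (variable names illustrative, not from the commit):

      # the old referent ($$r) has a refcount of 1, so without the handling
      # described above it could be freed before its value has been copied
      my $r = \do { my $tmp = \"bottom" };
      $r = $$r;                    # old referent must survive the assignment
      print ${$r}, "\n";           # prints "bottom"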
* perf/benchmarks: tidy scalar assign benchmarks  (David Mitchell, 2016-11-16; 1 file, -28/+32)

  rename them from expr::assign::* to expr::sassign::* so as to more easily
  distinguish them from expr::aassign::, and move them to the correct place
  in the file
* /t/perf/benchmarks: move expr::index:: to func::  (David Mitchell, 2016-11-14; 1 file, -95/+96)

  'index' is a perl function, so it should really be in the 'func' top-level
  namespace of benchmarks.
* Better optimise array and hash assignment  (David Mitchell, 2016-10-26; 1 file, -0/+76)

  [perl #127999] Slowdown in split + list assign

  Re-implement the code that handles e.g.

      (..., @a) = (...);
      (..., %h) = (...);

  to make it a lot faster - more than reversing a performance regression
  introduced in 5.24.0 - and fix some bugs. In particular, it now
  special-cases an empty RHS, which just clears the aggregate, e.g.

      (..., @a) = ()

  Getting list assignment correct is quite tricky, due to the possibility of
  premature frees, like @a = ($a[0]), and magic/tied values on the LHS or RHS
  being triggered too soon/late, which might have side-effects. This often
  requires making a copy of each RHS element (and indeed for assigning to an
  array or hash, the values need copying anyway). But copying too soon can
  result in leaked SVs if magic (such as calling FETCH()) dies. This usually
  involves mortalising all the copies, which slows things down.

  Further, a bug fix in 5.24.0 added the SV_NOSTEAL flag when copying SVs.
  This meant in something like @a = (split(...))[0,0], where the two SvTEMPs
  on the RHS are the same, the first copy is no longer allowed to steal the
  PVX buffer, which would have made the second SV undef. But this means that
  PVX buffers are now always copied, which resulted in the slowdown seen in
  RT #127999.

  Amongst the general rewriting and optimising, this commit does the following
  specific things to boost performance (and fix RT #127999).

  * If the SVs on the RHS are non-magical SvTEMPs with a ref count of 1, then
    the SV isn't copied; instead it is stored directly in the array/hash.
    This more than undoes the cost of SV_NOSTEAL.

  * The tmps stack is now used as a temporary refcounted version of the
    argument stack frame, meaning that args placed there will be freed on
    croak. In something like @a = (....), each RHS element is copied, with
    the copy placed on the temps stack. Then @a is cleared. Then the elements
    on the tmps stack are stored in the array, and removed from the temps
    stack (with the ownership of 1 reference count transferring from the
    temps stack to the array). Normally by the time pp_aassign() returns,
    there is nothing left on the tmps stack and tmps_free() isn't called -
    this is the novel element that distinguishes this from the normal use of
    mortalising.

  * For hash assignment, the keys and values are processed in separate loops,
    with keys not normally being copied.

  * The ENTER/SAVEFREESV(ary/hash)/LEAVE has been removed, and the array or
    hash kept temporarily alive by using the temps stack along with all the
    other copied SVs.

  * The main 'for each LHS element' loop has been split into two loops: the
    second one is run when there are no more RHS elements to consume. The
    second loop is much simpler, and makes things like @a = () much faster.

  Here are the average expr::aassign:: benchmarks for selected perls
  (raw numbers - lower is better):

                 5.6.1   5.22.0   5.24.0   5.25.5     this
                ------   ------   ------   ------   ------
      Ir        1355.9   1497.8   1387.0   1382.0   1146.6
      Dr         417.2    454.2    410.1    411.1    335.2
      Dw         260.6    270.8    249.0    246.8    194.5
      COND       193.5    223.2    212.0    207.7    174.4
      IND         25.3     17.6     10.8     10.8     10.0
      COND_m       4.1      3.1      3.1      3.7      2.8
      IND_m        8.9      6.1      5.5      5.5      5.5

  And this code:

      my @a;
      for my $i (1..10_000_000) {
          @a = (1,2,3);
          #@a = ();
      }

  with the empty assign is 33% faster than blead, and without is 12% faster
  than blead.
* fix common assign issue on @a = (split(), 1)  (David Mitchell, 2016-10-04; 2 files, -1/+11)

  RT #127999 Slowdown in split + list assign

  The compile-time common-value detection mechanism for OP_AASSIGN was getting
  OP_SPLIT wrong. It was assuming that OP_SPLIT was always dangerous. In fact,
  OP_SPLIT is usually completely safe, not passing through any of its
  arguments, except where the assign in (@a = split()) has been optimised away
  and the array attached directly to the OP_SPLIT op, or the ops that produce
  the array have been appended as an extra child of the OP_SPLIT op
  (OPf_STACKED).
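  For illustration (example assumed, not from the commit), the common shape
  that no longer triggers run-time common-value handling just because split
  appears on the RHS:

      my @words;
      @words = (split(/ /, "a b c"), 1);   # split() passes none of its
                                           # arguments through, so this is safe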
* Better optimise my/local @a = split()  (David Mitchell, 2016-10-04; 2 files, -2/+31)

  There are currently two optimisations for when the results of a split are
  assigned to an array.

  For the first,

      @array = split(...);

  the aassign and padav/rv2av are optimised away, and pp_split() directly
  assigns to the array attached to the split op (via op_pmtargetoff or
  op_pmtargetgv).

  For the second,

      my @array    = split(...);
      local @array = split(...);
      @{$expr}     = split(...);

  the aassign is optimised away, but the padav/rv2av is kept as an additional
  arg to split. pp_split itself then uses the first arg popped off the stack
  as the array (This was introduced by FC with v5.21.4-409-gef7999f).

  This commit moves these two:

      my @array    = split(...);
      local @array = split(...);

  from the second case to the first case, by simply setting OPpLVAL_INTRO on
  the OP_SPLIT, and making pp_split() do SAVECLEARSV() or save_ary() as
  appropriate.

  This makes my @a = split(...) a few percent faster.
* make OP_SPLIT a PMOP, and eliminate OP_PUSHRE  (David Mitchell, 2016-10-04; 1 file, -0/+20)

  Most ops that execute a regex, such as match and subst, are of type PMOP.
  A PMOP allows the actual regex to be attached directly to that op, due to
  its extra fields. OP_SPLIT is different; it is just a plain LISTOP, but it
  always has an OP_PUSHRE as its first child, which *is* a PMOP and which has
  the regex attached.

  At runtime, pp_pushre()'s only job is to push itself (i.e. the current
  PL_op) onto the stack. Later pp_split() pops this to get access to the
  regex it wants to execute. This is a bit unpleasant, because we're pushing
  an OP* onto the stack, which is supposed to be an array of SV*'s. As a bit
  of a hack, on DEBUGGING builds we push a PVLV with the PL_op address
  embedded instead, but this still isn't very satisfactory.

  Now that regexes are first-class SVs, we could push a REGEXP onto the stack
  rather than PL_op. However, there is an optimisation of @array = split
  which eliminates the assign and embeds the array's GV/padix directly in the
  PUSHRE op. So split still needs access to that op. But the pushre op will
  always be splitop->op_first anyway, so one possibility is to just skip
  executing the pushre altogether, and make pp_split just directly access
  op_first instead to get the regex and @array info.

  But if we're doing that, then why not just go the full hog and make
  OP_SPLIT into a PMOP, and eliminate the OP_PUSHRE op entirely: with the
  data that was spread across the two ops now combined into just the one
  split op. That is exactly what this commit does.

  For a simple compile-time pattern like split(/foo/, $s, 1), the optree
  looks like:

  before:

      <@> split[t2] lK
          </> pushre(/"foo"/) s/RTIME
          <0> padsv[$s:1,2] s
          <$> const(IV 1) s

  after:

      </> split(/"foo"/)[t2] lK/RTIME
          <0> padsv[$s:1,2] s
          <$> const[IV 1] s

  while for a run-time expression like split(/$pat/, $s, 1),

  before:

      <@> split[t3] lK
          </> pushre() sK/RTIME
              <|> regcomp(other->8) sK
                  <0> padsv[$pat:2,3] s
          <0> padsv[$s:1,3] s
          <$> const(IV 1) s

  after:

      </> split()[t3] lK/RTIME
          <|> regcomp(other->8) sK
              <0> padsv[$pat:2,3] s
          <0> padsv[$s:1,3] s
          <$> const[IV 1] s

  This makes the code faster and simpler.

  At the same time, two new private flags have been added for OP_SPLIT -
  OPpSPLIT_ASSIGN and OPpSPLIT_LEX - which make it explicit that the assign
  op has been optimised away, and if so, whether the array is lexical.

  Also, deparsing of split has been improved, to the extent that
  perl TEST -deparse op/split.t now passes.

  Also, a couple of panic messages in pp_split() have been replaced with
  asserts().
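  The optree shapes quoted above can be reproduced with the core B::Concise
  module; for example (exact flags in the output will vary by version):

      $ perl -MO=Concise -e 'my @a = split /foo/, $s, 1'
      $ perl -MO=Concise -e 'my @a = split /$pat/, $s, 1'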
* S_sv_2iuv_common(): optimise single digit strings  (David Mitchell, 2016-09-27; 1 file, -0/+11)

  When converting a POK SV to an IOK SV, short-cut the relatively common case
  of a string that is only one char long and consists of a single digit, e.g.
  "0". Thus skipping all the floating-point, infinity, whitespace etc
  complexity.
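  A trivial example of the case being short-cut (illustrative):

      my $s = "0";      # a POK SV: one character, a single digit
      my $n = $s + 1;   # the POK -> IOK conversion skips the float/Inf/
                        # whitespace handling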
* in-place sort preserved element lvalue identity  (David Mitchell, 2016-08-10; 1 file, -1/+47)

  RT #128340

  The in-place sorting optimisation @a = sort @a was preserving the elements
  of @a rather than (logically) making copies. So make a copy of any element
  whose refcount is greater than 1. This may not be the perfect condition,
  but keeps performance for the common cases.

  Note that several of the tests in t/op/sort.t actually relied on this
  behaviour to test whether the sort was being in-placed, so I've added tests
  for in-placing to t/perf/opcount.t instead.

  See the previous commit for a general discussion of performance; to the
  A, B, C in that commit message, here's a fourth column added. D is like C
  but with this commit added:

                    A        B        C        D
                 ------   ------   ------   ------
      Ir         5238.0   2324.0   2772.0   2801.0
      Dr         1464.0    649.0    765.0    765.0
      Dw          919.0    298.0    370.0    380.0
      COND        782.0    320.0    405.0    405.0
      IND          25.0     25.0     26.0     26.0
      COND_m       14.9     13.0     17.0     17.1
      IND_m         8.0      5.0      5.0      5.0

  so it has little effect on performance.
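  A rough sketch of the situation being fixed (assumed reduction, not the
  RT #128340 test case): taking a reference to an element raises its refcount
  above 1, so the in-place sort now copies it instead of re-using the element
  itself.

      my @a = (3, 1, 2);
      my $elem = \$a[2];              # element's refcount is now > 1
      @a = sort { $a <=> $b } @a;     # in-place optimisation still applies,
                                      # but this element gets copied
      # logically the assignment made copies, so $$elem is still the
      # original value 2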
* Partially pessimise in-place sorting  (David Mitchell, 2016-08-10; 1 file, -1/+66)

  There's currently an optimisation that converts at compile-time

      @a = sort { .... } @a

  into (approximately)

      sort { ... } \@a

  Then at run time, rather than passing an svp pointer to the appropriate
  sort routine which points to a list of SV*'s on the stack, pp_sort() passes
  a pointer to @a's AvARRAY. This allows the array to be sorted in-place,
  which is more efficient.

  However, it has some issues. First, the @a visible to the sort routine will
  be varying, whereas logically it should still hold the original list of
  values until after the '@a = ...' assignment. Secondly, the mergesort
  algorithm currently used internally, when in mid-sort, temporarily stores
  pointers in the array which aren't pointers to SVs - this means that if @a
  elements are accessed mid-sort, it can crash.

  The solution to both these problems is for pp_sort() to push the elements
  of @a onto the stack at the beginning, sort the stack (like normal sorts
  do), then copy back to @a at the end. This is less efficient than before,
  but is still a lot more efficient than executing separate padav and aassign
  ops.

  Here are benchmark results in raw instruction counts etc (lower is better)
  for the sort line in this code block:

      my (@a, @b);
      @a = reverse 1..10;
      @b = sort { $a <=> $b } @a;

  A is for a non-in-place sort, i.e. @b = sort ... @a; B and C are for an
  in-place sort, i.e. as above, but @a = sort ... @a; where B is blead before
  this commit and C is this commit.

                    A        B        C
                 ------   ------   ------
      Ir         5238.0   2324.0   2772.0
      Dr         1464.0    649.0    765.0
      Dw          919.0    298.0    370.0
      COND        782.0    320.0    405.0
      IND          25.0     25.0     26.0
      COND_m       14.9     13.0     17.0
      IND_m         8.0      5.0      5.0

  As can be seen, this partial pessimisation slows down in-place sorting by
  around 20%, but overall in-place is still nearly twice the speed as without
  the optimisation. These are the figures for a plain numeric sort (which is
  optimised to use a C comparison function); for other types of sort, the
  cost of the comparator dominates, and so the slowdown is much less marked.
* peephole optimise op_other branch in OP_ARGDEFELEM  (David Mitchell, 2016-08-03; 1 file, -1/+14)

  I forgot to do this earlier.
* Make barewords constant-foldable  (Aaron Crane, 2016-05-15; 1 file, -1/+7)

  Commit 11fa937b2766784f0f812687a7f81dd3761e605f, from August 1999, changed
  constant folding to treat barewords as unfoldable. Public mailing-list
  archives from that period are incomplete, unfortunately, and this change in
  particular is no longer well documented; the only message still available
  seems to be <199907140903.CAA07505@activestate.com>, with subject
  "Re: Unwanted perl helpfullness". That message quotes a bug reporter
  pointing out that a constant-folded bareword would fail to trigger the
  "not allowed while strict subs in use" error under "use strict".

  However, there's no information available about why the bug was fixed in
  this way, rather than by merely ensuring that constant folding would
  produce that error when necessary. The no_bareword_allowed() routine did
  already exist at that point, for example. This change therefore adopts that
  approach.

  This causes one minor change in behaviour: since barewords now take part in
  constant folding, concatenating a bareword with another constant in void
  context now produces a warning for the concatenated string, rather than for
  the concatenation op. This seems like an acceptable change, given that
  non-bareword constants already behave in the same way.
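  The minor behaviour change can be observed with something like the
  following (assumed example; the exact warning text depends on the perl
  version):

      no strict 'subs';
      use warnings;
      FOO . "bar";   # void context: the "useless use" warning now refers to
                     # the folded constant string rather than to the
                     # concatenation op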
* add a few grep and map benchmarks  (David Mitchell, 2016-02-03; 1 file, -0/+24)
* optimise bare 'next'  (David Mitchell, 2016-02-03; 1 file, -0/+6)

  If a 'next' without a label appears directly in the scope of the current
  loop, then skip searching the context stack for a suitable LOOP context.
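  The construct that takes the fast path (illustrative):

      for my $i (1 .. 10) {
          next if $i % 2 == 0;   # unlabelled 'next' directly in the current
                                 # loop's scope: no context-stack search needed
          print "$i\n";
      }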
* pp_leavesub(): call FREETMPS and optimise  (David Mitchell, 2016-02-03; 1 file, -0/+30)

  Currently pp_leavesub() doesn't call FREETMPS. Presumably this is due to
  the danger of freeing temps that need to exist beyond the return, such as
  the mortal copies we make of return args, or return args that are already
  temps.

  The down side of this is that temps aren't freed until the next nextstate
  is executed following the function call. If the function isn't near a
  statement boundary, then it may be a while before temps are freed; e.g. in
  f(g()), when g() returns, its temps continue to exist during the call to
  f(). For recursive subs it gets worse, although there is a specific hack
  already in pp_leavesub() that says in scalar context if CvDEPTH > 1, then
  temporarily RC++ the single return value then do a FREETMPS. This can in
  theory leak if something dies during the FREETMPS.

  This commit provides a more general solution. During the course of
  processing (usually mortal copying) the return args, it divides the current
  temps stack frame into two halves, with the temps that need keeping
  migrating to the bottom half. It then does a FREETMPS equivalent only of
  the top half.

  This is actually more efficient than it sounds; but in addition, the code
  has been heavily optimised: in particular the call to sv_mortalcopy() has
  been unrolled and inlined, and within that code common cases (such as IV,
  RV) handled directly, and the mortal stack is only checked/extended once,
  rather than for every arg. Also, the arg adjust / freetmps code has been
  moved out of pp_leavesub() and into a separate static function.
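  A sketch of the f(g()) situation mentioned above (sub bodies are
  illustrative):

      sub g { return "x" x 100 }        # the returned string is a temp
      sub f { return length $_[0] }
      my $n = f(g());   # temps created inside g() that aren't part of its
                        # return value can now be freed as g() returns,
                        # rather than lingering until the next statement
                        # boundary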
* optimise sv_setsv_flags()  (David Mitchell, 2016-02-03; 1 file, -1/+16)

  This commit does two things.

  First, it streamlines and re-orders some of the initial tests, such as
  'is sstr already freed?'.

  Second, it looks for a reasonably common case where both sstr and dstr are
  SVt_NULL/SVt_IV. This covers undef, int and refs, where the SV hasn't
  previously been used for other things (like strings). With just
  SVt_NULL/SVt_IV, we know that the SV won't have a real body, and won't need
  one, and can be very quick. The check for SVt_NULL/SVt_IV is a single
  compare-and-branch, so has a minimal effect on users of sv_setsv_flags()
  that have more complex types.
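  The kind of assignment that hits the new fast path (illustrative):

      my $src = 42;   # a scalar that has only ever held an IV (SVt_IV, no magic)
      my $dst;        # a fresh scalar (SVt_NULL)
      $dst = $src;    # neither side has (or needs) a real body, so the copy
                      # is very cheap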
* benchmarks: add some 'for' array iterating  (David Mitchell, 2016-02-03; 1 file, -0/+41)
* add loop benchmark tests  (David Mitchell, 2016-02-03; 1 file, -3/+101)
* t/perf/benchmarks: add a few sub and goto tests  (David Mitchell, 2016-02-03; 1 file, -2/+23)
* make "for my $lex {}" faster under ITHREADS  (David Mitchell, 2016-02-03; 1 file, -0/+6)

  Normally at the start of a 'for' iteration (pp_enteriter), we store the
  address of the GV or the address of the pad slot of the iteration variable
  in the CX struct, so we can quickly access and update it in pp_iter.

  For the ITHREADS case however, the pad may change if the thread is cloned,
  so instead the address of PL_comppad was stored, and then updated during
  cloning. This meant that on each iter, we would have to retrieve the saved
  PL_comppad address from the ctx struct, retrieve the targ from the saved
  my_op, and do a (perlish) array indexing.

  This commit makes this faster by, for the ITHREADS case, storing both
  PL_comppad and svp = &PL_curpad[targ]. In pp_iter(), we're fast by directly
  accessing *svp; while storing PL_comppad allows us to update both it and
  svp during thread cloning.

  This requires one extra pointer in the block_loop part of the context union
  under threads, but this is already smaller than some other parts of the
  union, so has no effect. Note that the oldcomppad field was formerly part
  of the itervar_u union, but is now a separate field within the larger
  block_loop struct (but only on threaded builds).

  Note also that the tests I've added I retrieved from an old WIP private
  branch that contained a forerunner of this commit, so they may not be
  entirely relevant to the current form of this commit. But extra tests can
  never hurt, so I've included them.
* split pp_predec() from pp_preinc() and improve  (David Mitchell, 2015-11-10; 1 file, -0/+22)

  pp_preinc() handles both ++$x and --$x (and the integer variants
  pp_i_preinc/dec). Split it into two separate functions, as handling both
  inc and dec in the same function requires 3 extra conditionals.

  At the same time make the code more efficient. As currently written it:

      1) checked for "bad" SVs (such as read-only) and croaked;
      2) checked for an IOK-only SV and directly incremented the IVX slot;
      3) else called out to sv_inc() to handle the more complex cases.

  This commit combines the checks in (1) and (2) into one single big check of
  flags, and anything "bad" simply skips the IOK-only code and calls
  sv_dec(), which can do its own checking of read-only etc and croak if
  necessary.

  Porting/bench.pl shows the following raw numbers for ++$x ($x lexical and
  holding an integer):

                before     after
              --------  --------
      Ir          77.0      56.0
      Dr          30.0      24.0
      Dw          10.0      10.0
      COND        12.0       9.0
      IND          2.0       2.0
      COND_m      -0.1       0.0
      IND_m        2.0       2.0

  Even having split the function into two, the combined size of the two new
  functions is smaller than the single previous function.
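  The fast path in question (illustrative):

      my $x = 41;   # a lexical holding a plain integer (IOK-only)
      ++$x;         # the IVX slot is incremented directly; anything "bad"
                    # (read-only, magic, etc.) skips this and falls through
                    # to sv_inc()/sv_dec()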
* faster add, subtract, multiply  (David Mitchell, 2015-11-10; 1 file, -0/+93)

  In pp_add, pp_subtract and pp_multiply, special-case the following:

  * both args IV, neither arg large enough to under/overflow
  * both args NV.

  Starting in 5.8.0, the implementation of the arithmetic pp functions became
  a lot more complex (and famously, much slower), due to the need to support
  64-bit integers. For example, formerly pp_add just converted both its args
  to an NV and returned an NV. On 64-bit systems, that could give bad results
  if the mantissa of an NV was < 64 bits; for example:

      $ perl561 -e'$x = 0x1000000000000000; printf "%x\n", $x+1'
      1000000000000000
      $ perl580 -e'$x = 0x1000000000000000; printf "%x\n", $x+1'
      1000000000000001

  This led to a lot of complex code that covered all the possibilities of
  overflow etc.

  This commit adds some special casing to these three common arithmetic ops.
  It does some quick checks (mainly involving fast boolean and bit ops) to
  determine if both args are valid IVs (and not UVs), are not magic, and
  aren't very big (+ve or -ve). In this case, the result is simply
  SvIVX(svl) + SvIVX(svr) (or - or *) with no possibility of overflow.

  Failing that, if both args are NVs and not magic, then if both NVs can be
  converted to IVs without loss, handle as for the IV case; failing that,
  just return SvNVX(svl) + SvNVX(svr);

  For all other cases, such as mixed IV and NV or PV, fall back to the old
  code.

  On my platform (x86_64), it (along with the previous commit) reduces the
  execution time of the nbody benchmark (lots of floating-point vector
  arithmetic) by a third and in fact makes it 10% faster than 5.6.1.
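  Sketches of the two special-cased situations (values illustrative):

      my ($i, $j) = (1_000, 2_000);   # both plain IVs, nowhere near overflow:
      my $k = $i + $j;                # result is simply SvIVX(svl) + SvIVX(svr)

      my ($x, $y) = (1.5, 2.25);      # both NVs, no magic:
      my $z = $x * $y;                # result is SvNVX(svl) * SvNVX(svr)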
* make /fixed-substr/ much faster.  (David Mitchell, 2015-10-13; 1 file, -0/+88)

  TL;DR: on platforms with a libc memchr() implementation which makes good
  use of underlying hardware support, patterns which include fixed substrings
  will now often be much faster; for example with glibc on a recent x86_64
  CPU, this:

      $s = "a" x 1000 . "wxyz";
      $s =~ /wxyz/ for 1..30000

  is now about 7 times faster. On systems with slow memchr(), e.g. 32-bit ARM
  Raspberry Pi, there will be a small or no speedup. Conversely, some
  pathological cases, such as "ab" x 1000 =~ /aa/ will be slower now; up to
  3 times slower on the rPi, 1.5x slower on x86_64.

  In detail:

  The perl core includes a Boyer-Moore substring matcher, Perl_fbm_instr(),
  which is used to quickly find a fixed substring within a longer string,
  where a table of lookups is pre-computed from the substring. As well as
  being used in index() when the substring is a constant, its main use is in
  patterns. When the regex engine compiles a pattern, it typically takes note
  of the two longest fixed substrings within the pattern; for example in

      /[01]abc\w+de\d+fghij+/

  the two longest are "abc" and "fghij". The engine uses Perl_fbm_instr() to
  scan for these two strings before running the full NFA. This often allows
  the string to be quickly rejected, or to find a suitable minimum starting
  point to run the NFA.

  However, Perl_fbm_instr() was written about 16 years ago and has been
  virtually untouched since, so it could do with some love.

  It currently special-cases strings of length 1 and 2, using roll-your-own
  loops along the lines of

      while (s < end) { if (*s++ == c1) ... }

  while strings of length 3+ use the Boyer-Moore algorithm. The big advantage
  of BM is that in a best-case, where none of the characters from the
  substring are found in this region of the string, it only has to test every
  N'th char, where N is the length of the substring. For example when
  searching for wxyz in abcdefghikl..., it just reads and tests d,h,l,...

  However these days some platforms have decent memchr() implementations. For
  example, glibc has assembly-level implementations for i386, x86_64,
  sparc32/64, powerpc32/64, s390-32/64, arm, m68k and ia64 by the looks of
  it. These can often be substantially faster than a C-level implementation.

  This commit makes Perl_fbm_instr() use memchr() where possible.

  For the length == 1 special case, it just calls memchr() directly rather
  than using a loop as previously.

  For the length == 2 special case, it continues to distinguish the cases
  where the two chars of the substring are the same or differ. For the former
  it calls memchr() after an initial direct failure, i.e.

      if (*s != c) { s++; s = memchr(....); ... }

  For the latter case it does a similar direct test first (to avoid the
  costly overhead of a call to memchr() when the next char is the one we seek
  anyway), but in addition, on each failure to find the second char following
  a found first char, it swaps which char it's searching for. This means that
  in something like "aaaaaaa..." =~ /ab/, it won't keep hopping 1 char
  position with memchr(s,'a'); after the first hop it will do memchr(s,'b')
  and skip lots of chars in one go. This helps reduce the number of
  pathological cases.

  For the length >= 3 cases (normal BM), it keeps using BM, but after each
  iteration where the pointer has been incremented by the skip determined by
  the BM algorithm, it now does an additional

      if (*s != c) { s++; s = memchr(....); ... }

  step before running the next iteration of BM.
* pp_aassign(): fix ($x,$y) = (undef, $x)  (David Mitchell, 2015-09-02; 1 file, -0/+5)

  With 808ce5578203, I tweaked the OPpASSIGN_COMMON flagging to mark as safe
  when the LHS or RHS contains only one var. This turned out to be flawed for
  the RHS logic, as position as well as oneness is important:

      (undef, $x) = ...;         # only 1 scalar on LHS: always safe
      ($x, $y) = (undef, $x);    # 2 scalars on RHS: unsafe

  So this commit makes undef on the RHS count towards the scalar var count.
* t/perf/benchmarks: 5.004 compat  (David Mitchell, 2015-08-19; 1 file, -20/+20)

  make the tests in the benchmark file be compilable back to 5.004_05. (To go
  further back, it would need to avoid package names that start with digits,
  such as 'call::sub::3_args'.) Basically avoid // and our.
* Optimise 1 arg in list assign  (David Mitchell, 2015-08-17; 2 files, -1/+14)

  Avoid setting common scalar flags in these cases:

      ($x) = (...);
      (...) = ($x);
* make my (...) = @_ non-OPpASSIGN_COMMON_RC1  (David Mitchell, 2015-08-17; 1 file, -2/+2)

  Technically in

      my ($scalar,...) = @_

  due to closure/goto tricks, it's possible for $scalar to appear on both the
  LHS and RHS, so we currently set the OPpASSIGN_COMMON_RC1 flag. However,
  this imposes extra overhead; for example 5% extra instruction reads and 11%
  extra conditional branches for

      my ($x,$y,$z) = @_;

  Given what an important construct this is, disable this flag in the
  specific case of only my's on the LHS and only @_ on the RHS. It's
  technically incorrect, but it's the same behaviour we've always had (it was
  only the previous commit which made it safe but slower).

  We still set the OPpASSIGN_COMMON_AGG flag for

      my ($...,@a) = @_

  since in the normal case this only adds the small additional runtime
  overhead of checking that @a is already empty.
* re-implement OPpASSIGN_COMMON mechanism  (David Mitchell, 2015-08-17; 2 files, -13/+441)

  This commit almost completely replaces the current mechanism for detecting
  and handling common vars in list assignment, e.g.

      ($a,$b) = ($b,$a);

  In general outline: it creates more false positives at compile-time than
  before, but also no longer misses some false negatives. In compensation, it
  considerably reduces the run-time cost of handling potential and real
  commonality.

  It does this firstly by splitting the OPpASSIGN_COMMON flag into 3 separate
  flags:

      OPpASSIGN_COMMON_AGG
      OPpASSIGN_COMMON_RC1
      OPpASSIGN_COMMON_SCALAR

  which indicate different classes of commonality that can be handled in
  different ways at runtime.

  Most importantly, it distinguishes between two basic cases. Firstly, common
  scalars (OPpASSIGN_COMMON_SCALAR), e.g.

      ($x,....) = (....,$x,...)

  where $x is modified and then sometime later its value is used again, but
  that value has changed in the meantime. In this case, we need to replace
  such vars on the RHS with mortal copies before processing the assign.

  The second case is an aggregate on the LHS (OPpASSIGN_COMMON_AGG), e.g.

      (...,@a) = (...., $a[0],...)

  In this case, the issue is instead that when @a is cleared, it may free
  items on the RHS (due to the stack not being ref counted). What is required
  here is that rather than making a copy of each RHS element and storing it
  in the array as we progress, we make *all* the copies *before* clearing the
  array, but mortalise them in case we die in the meantime.

  We can further distinguish two scalar cases; sometimes it's possible to
  confirm non-commonality at run-time merely by checking that all the LHS
  scalars have a reference count of 1. If this is possible, we set the
  OPpASSIGN_COMMON_RC1 flag rather than the OPpASSIGN_COMMON_SCALAR flag.

  The major improvement in the run-time performance in the
  OPpASSIGN_COMMON_SCALAR case (or OPpASSIGN_COMMON_RC1 if rc>1 scalars are
  detected), is to use a mark-and-sweep scan of the two lists using the
  SVf_BREAK flag, to determine which elements are common, and only make
  mortal copies of those elements. This has a very big effect on run-time
  performance; for example in the classic

      ($a,$b) = ($b,$a);

  it would formerly make temp copies of both $a and $b; now it only copies
  $a.

  In more detail, the mark and sweep mechanism in pp_aassign works by looping
  through each LHS and RHS SV pair in parallel. It temporarily marks each LHS
  SV with the SVf_BREAK flag, then makes a copy of each RHS element only if
  it has the SVf_BREAK flag set. When the scan is finished, the flag is unset
  on all LHS elements.

  One major change in compile-time flagging is that package scalar vars are
  now treated as if they could always be aliased. So we don't bother any more
  to do the compile-time PL_generation checking on package vars (we still do
  it on lexical vars). We also no longer make use of the run-time PL_sawalias
  mechanism for detecting aliased package vars (and indeed the next commit
  but one will remove that mechanism). This means that more list assignment
  expressions which feature package vars will now need to do a runtime
  mark-and-sweep (or where appropriate, RC1) test. In compensation, we no
  longer need to test for aliasing and set PL_sawalias in pp_gvsv and pp_gv,
  nor reset PL_sawalias in every pp_nextstate.

  Part of the reasoning behind this is that it's nearly impossible to detect
  all possible package var aliasing; for example PL_sawalias would fail to
  detect XS code doing GvSV(gv) = sv.

  Note that we now scan the two children of the OP_AASSIGN separately, and in
  particular we mark lexicals with PL_generation only on the LHS and test
  only on the RHS. So something like ($x,$y) = ($default, $default) will no
  longer be regarded as having common vars.

  In terms of performance, running Porting/perlbench.pl on the new
  expr::aassign:: tests in t/perf/benchmarks shows that the biggest slowdown
  is around 13% more instruction reads and 20% more conditional branches in
  this:

      setup => 'my ($v1,$v2,$v3) = 1..3; ($x,$y,$z) = 1..3;',
      code  => '($x,$y,$z) = ($v1,$v2,$v3)',

  where this is now a false positive due to the presence of package
  variables.

  The biggest speedup is 50% less instruction reads and conditional branches
  in this:

      setup => '@_ = 1..3; my ($x,$y,$z)',
      code  => '($x,$y,$z) = @_',

  because formerly the presence of @_ pessimised things if the LHS wasn't a
  my declaration (it's still pessimised, but the runtime's faster now).

  Conversely, we pessimise the 'my' variant too now:

      setup => '@_ = 1..3;',
      code  => 'my ($x,$y,$z) = @_',

  this gives 5% more instruction reads and 11% more conditional branches now.
  But see the next commit, which will cheat for that particular construct.
* No Scalar::Util under fresh miniperl.  (Jarkko Hietaniemi, 2015-02-26; 1 file, -0/+1)
* [perl #123202] speed up scalar //g against tainted strings  (Tony Cook, 2015-02-26; 1 file, -0/+42)
* Ignore cx of padsv for padrange optimisation  (Father Chrysostomos, 2014-12-21; 1 file, -1/+12)

  Commit v5.21.6-352-gc46871e caused all padsv ops to be marked with scalar
  context. This prevented the padrange optimisation from happening with a
  combination of scalars and aggregates in list context.

  Adjust the padrange algorithm to ignore the distinction between list and
  scalar context. We only do this optimisation when, even in list context,
  each op will push just one thing on to the stack. (That’s why we check
  OPf_REF on padav and padhv.) So the list/scalar distinction is irrelevant.
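  An example of the sort of declaration affected (assumed illustration):

      my ($count, @items) = (0, 'a', 'b');   # scalar + aggregate in list
                                             # context: the pad ops can again
                                             # collapse into one padrange op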
* Detypo.  (Jarkko Hietaniemi, 2014-12-07; 1 file, -1/+1)
* Add OP_MULTIDEREF  (David Mitchell, 2014-12-07; 2 files, -4/+315)

  This op is an optimisation for any series of one or more array or hash
  lookups and dereferences, where the key/index is a simple constant or
  package/lexical variable. If the first-level lookup is of a simple
  array/hash variable or scalar ref, then that is included in the op too.

  So all of the following are replaced with a single op:

      $h{foo}
      $a[$i]
      $a[5][$k][$i]
      $r->{$k}
      local $a[0][$i]
      exists $a[$i]{$k}
      delete $h{foo}

  while these aren't:

      $a[0]       already handled by OP_AELEMFAST
      $a[$x+1]    not a simple index

  and these are partially replaced:

      (expr)->[0]{$k}     the bit following (expr) is replaced
      $h{foo}[$x+1][0]    the first and third lookups are each done with a
                          multideref op, while the $x+1 expression and middle
                          lookup are done by existing add, aelem etc ops.

  Up until now, aggregate dereferencing has been very heavyweight in ops; for
  example, $r->[0]{$x} is compiled as:

      gv[*r] s
      rv2sv sKM/DREFAV,1
      rv2av[t2] sKR/1
      const[IV 0] s
      aelem sKM/DREFHV,2
      rv2hv sKR/1
      gvsv[*x] s
      helem vK/2

  When executing this, in addition to the actual calls to av_fetch() and
  hv_fetch(), there is a lot of overhead of pushing SVs on and off the stack,
  and calling lots of little pp() functions from the runops loop (each with
  its potential indirect branch miss).

  The multideref op avoids that by running all the code in a loop in a switch
  statement. It makes use of the new UNOP_AUX type to hold an array of

      typedef union {
          PADOFFSET pad_offset;
          SV        *sv;
          IV        iv;
          UV        uv;
      } UNOP_AUX_item;

  In something like $a[7][$i]{foo}, the GVs or pad offsets for @a and $i are
  stored as items in the array, along with a pointer to a const SV holding
  'foo', and the UV 7 is stored directly. Along with this, some UVs are used
  to store a sequence of actions (several actions are squeezed into a single
  UV).

  Then the main body of pp_multideref is a big while loop round a switch,
  which reads actions and values from the AUX array. The two big branches in
  the switch are ones that are effectively unrolled (/DREFAV, rv2av, aelem)
  and (/DREFHV, rv2hv, helem) triplets. The other branches are various entry
  points that handle retrieving the different types of initial value; for
  example 'my %h; $h{foo}' needs to get %h from the pad, while
  '(expr)->{foo}' needs to pop expr off the stack.

  Note that there is a slight complication with /DEREF; in the example above
  of $r->[0]{$x}, the aelem op is actually

      aelem sKM/DREFHV,2

  which means that the aelem, after having retrieved a (possibly undef) value
  from the array, is responsible for autovivifying it into a hash, ready for
  the next op. Similarly, the rv2sv that retrieves $r from the typeglob is
  responsible for autovivifying it into an AV. This action of doing the next
  op's work for it complicates matters somewhat. Within pp_multideref, the
  autovivification action is instead included as the first step of the
  current action.

  In terms of benchmarking with Porting/bench.pl, a simple lexical $a[$i][$j]
  shows a reduction of approx 40% in numbers of instructions executed, while
  $r->[0][0][0] uses 54% fewer. The speed-up for hash accesses is relatively
  more modest, since the actual hash lookup (i.e. hv_fetch()) is more
  expensive than an array lookup. A lexical $h{foo} uses 10% fewer, while
  $r->{foo}{bar}{baz} uses 34% fewer instructions.

  Overall,

      bench.pl --tests='/expr::(array|hash)/' ...

  gives:

                    PRE     POST
                 ------   ------
      Ir         100.00   145.00
      Dr         100.00   165.30
      Dw         100.00   175.74
      COND       100.00   132.02
      IND        100.00   171.11
      COND_m     100.00   127.65
      IND_m      100.00   203.90

  with cache misses unchanged at 100%. In general, the more lookups done, the
  bigger the proportionate saving.
* Tweak sv_pos_b2u_flags check in pp_index  (James Raspass, 2014-12-06; 1 file, -0/+6)

  There's no need to run sv_pos_b2u_flags if the retval is one, as one byte
  can only be one character; therefore change the test to "> 1". This makes
  index on unicode strings that match at 1 slightly faster.
* add Porting/bench.pl  (David Mitchell, 2014-11-29; 3 files, -21/+106)

  This tool runs code snippets found in t/perf/benchmarks (or similar) under
  cachegrind, in order to calculate how many instruction reads, data writes,
  branches, cache misses, etc. that one execution of the snippet uses. It
  will run them against two or more perl executables and show how much each
  test has gotten better or worse.

  It is modelled on the perlbench tool, but since it measures instruction
  reads etc., rather than timings, it is much more precise and reproducible.
  It is also considerably faster, and is capable of running tests in
  parallel. Rather than displaying a single relative percentage per test/perl
  combination, it displays values for 13 different measurements, such as
  instruction reads, conditional branch misses etc.

  This commit also changes the format of t/perf/benchmarks slightly; it
  becomes an AoH rather than a HoH (to allow checking for duplicate keys),
  and the test names themselves become a :: hierarchy.
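  An illustrative invocation (the --tests option appears in the commit
  messages above; the way the perl binaries are listed is an assumption):

      $ perl Porting/bench.pl --tests='/expr::(array|hash)/' -- ./perl-old ./perl-new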
* perf/benchmarks.t: fix regex typo  (David Mitchell, 2014-11-11; 1 file, -1/+1)

  [A-z] should be [A-Z]. Spotted by Karl Williamson.
* add t/perf/benchmarks, t/perf/benchmarks.t  (David Mitchell, 2014-10-26; 2 files, -0/+92)

  t/perf/benchmarks is a file intended to contain snippets of code that can
  be usefully benchmarked or otherwise profiled. The basic idea is that any
  time you add an optimisation that is intended to make a particular
  construct faster, then you should add that construct to this file.

  Under the normal test suite, the test file benchmarks.t does a basic
  compile and run of each of these snippets; not to test performance, but
  just to ensure that the code doesn't have errors.

  Over time, it is intended that various measurement and profiling tools will
  be written that can run selected (or all) snippets in various environments.
  These will not be run as part of a normal test suite run.
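  A minimal sketch of what an entry in t/perf/benchmarks looks like in its
  later array-of-hashes form (the setup/code keys are quoted in the commit
  messages above; the desc key and the example values are assumptions):

      [
          'expr::sassign::my_int' => {
              desc  => 'my $x = 1',
              setup => '',
              code  => 'my $x = 1',
          },
          # ... further entries, with names forming a :: hierarchy ...
      ];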
* add t/perf/speed.t  (David Mitchell, 2014-10-26; 1 file, -0/+51)

  This test file is similar to t/re/speed.t, but to test general-purpose
  optimisations. The idea is to run snippets of code that are 100s or 1000s
  of times slower if a particular optimisation is broken. We are not so much
  interested in the individual tests passing, as in the whole file failing
  with a watchdog timeout (or just observing that it runs more slowly).
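  A sketch of the style of test described (the snippet is illustrative; the
  real file relies on a watchdog timeout rather than timing individual
  tests):

      my $s = ("a" x 1_000_000) . "b";
      for (1 .. 1_000) {
          $s =~ /b/ or die "no match";   # cheap while the fixed-substring
                                         # fast path holds; orders of
                                         # magnitude slower if it breaks
      }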
* t/perf/optree.t: expand blurb  (David Mitchell, 2014-10-26; 1 file, -1/+2)

  explain (kind of) why this file is called optree.t
* rename t/op/opt.t -> t/perf/optree.t  (David Mitchell, 2014-10-26; 1 file, -0/+107)

  Now that we have a directory, t/perf/, for performance / optimisation
  tests, move this test file there, and rename to something slightly clearer.
* add t/perf/, t/perf/opcount.t  (David Mitchell, 2014-10-26; 1 file, -0/+74)

  Add a new directory designed to hold performance / optimising tests and
  infrastructure, and add the first test file, opcount.t, that checks that a
  sub has the right numbers of particular op types.
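  A rough sketch of the idea, not the actual helper used by opcount.t: walk a
  sub's optree with the core B module and count ops by name, so a test can
  then assert the expected counts.

      use B ();

      sub count_ops {
          my ($code_ref) = @_;
          my %count;
          my $walk;
          $walk = sub {
              my ($op) = @_;
              return unless $$op;                 # skip null ops
              $count{ $op->name }++;
              if ($op->flags & B::OPf_KIDS) {
                  for (my $kid = $op->first; $$kid; $kid = $kid->sibling) {
                      $walk->($kid);
                  }
              }
          };
          $walk->( B::svref_2object($code_ref)->ROOT );
          return \%count;
      }

      my $counts = count_ops(sub { my @a = (1, 2, 3) });
      printf "%-12s %d\n", $_, $counts->{$_} for sort keys %$counts;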