| Commit message | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The inline allocation version is 69% faster than the out-of-line
version, when cloning an array of 16 unit elements on a 64-bit
machine.
Comparing the new and the old primop implementations isn't
straightforward. The old version had a missing heap check that I
discovered during the development of the new version. Comparing the
old and the new version would require fixing the old version, which
in turn means reimplementing the equivalent of MAYBE_GC in StgCmmPrim.
The inline allocation threshold is configurable via
-fmax-inline-alloc-size, which gives the maximum array size, in bytes,
to allocate inline. The size does not include the closure header size.
Allowing the same primop to be either inline or out-of-line has some
implications for how we lay out heap checks. We always place a heap
check around out-of-line primops, as they may allocate outside of our
knowledge. However, for the inline primops we only allow allocation
via the standard means (i.e. virtHp). Since the clone primops might be
either inline or out-of-line, the heap check layout code now consults
shouldInlinePrimOp to know whether a primop will be inlined.
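For illustration (my own sketch, not part of the commit), the measured
case corresponds to cloning a 16-element array of boxed values, i.e. a
128-byte payload on a 64-bit machine, which falls under the
inline-allocation threshold assuming the default -fmax-inline-alloc-size:

    {-# LANGUAGE MagicHash #-}
    module CloneSketch where

    import GHC.Exts

    -- Clone 16 elements starting at offset 0; on a 64-bit machine the
    -- 128-byte payload is small enough for the allocation to be emitted
    -- inline rather than as an out-of-line RTS call.
    cloneSixteen :: Array# () -> Array# ()
    cloneSixteen arr = cloneArray# arr 0# 16#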
|
|
|
|
|
|
|
|
| |
We don't yet understand WHY commit ad15c2, which is to do with
CmmSink, causes seg-faults on Windows, but it certainly seems to. So
reverting it is a stop-gap, but we need to un-block the 7.8 release.
Many thanks to awson for identifying the offending commit.
|
|
|
|
|
|
|
| |
This results in a 57% runtime decrease when allocating an array of 128
bytes on a 64-bit machine.
Fixes #8876.
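As a rough sketch (mine, and assuming this refers to the newArray#
primop), the measured case is a 16-element array of pointers, i.e.
128 bytes of payload on a 64-bit machine:

    {-# LANGUAGE MagicHash, UnboxedTuples #-}
    module NewArraySketch where

    import GHC.Exts
    import GHC.IO (IO (..))

    -- A boxed wrapper, since an unlifted MutableArray# cannot be returned
    -- directly from IO.
    data MArr a = MArr (MutableArray# RealWorld a)

    -- Allocate a 16-element array: 16 pointers * 8 bytes = 128 bytes of
    -- payload on a 64-bit machine, the case measured above.
    newSixteen :: a -> IO (MArr a)
    newSixteen x = IO (\s -> case newArray# 16# x s of
                               (# s', arr #) -> (# s', MArr arr #))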
|
|
|
|
|
|
|
|
|
|
| |
- Move array representation knowledge into SMRep
- Separate out low-level heap-object allocation so that we can reuse
it from doNewArrayOp
- Remove card-table initialisation; we can safely ignore the card
  table for newly allocated arrays.
|
|
|
|
|
| |
I'd like to be able to pack together non-pointer fields that are less
than a word in size, and this is a necessary prerequisite.
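A hypothetical example (not from this patch) of the kind of constructor
that would benefit: each strict, unpacked Word8 field currently occupies
a full word of payload, whereas sub-word packing would let all four
share a single word.

    module PackingSketch where

    import Data.Word (Word8)

    -- Four non-pointer fields, each smaller than a word.  Today they take
    -- four words of payload; packed together they would fit in one.
    data RGBA = RGBA {-# UNPACK #-} !Word8
                     {-# UNPACK #-} !Word8
                     {-# UNPACK #-} !Word8
                     {-# UNPACK #-} !Word8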
|
|
|
|
|
|
|
|
|
| |
The end of the Cmm pipeline used to be split into two alternative
flows, depending on whether we did proc-point splitting or not. There
was a lot of code duplication between these two branches, but it
wasn't really necessary, as the differences can easily be enclosed
within an if-then-else. I observed no impact of this change on
compilation performance.
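Schematically, the refactoring looks like the sketch below (all names
are hypothetical stand-ins, not the real CmmPipeline code): one linear
pass sequence, with only the proc-point splitting step enclosed in an
if-then-else.

    module PipelineSketch where

    type CmmGraph = [String]   -- stand-in for the real graph type

    commonOpts, splitAtProcPoints, layoutStack :: CmmGraph -> CmmGraph
    commonOpts        = id    -- passes shared by both configurations
    splitAtProcPoints = id    -- runs only when splitting is enabled
    layoutStack       = id    -- more shared passes

    cmmPipeline :: Bool -> CmmGraph -> CmmGraph
    cmmPipeline doSplit g0 =
      let g1 = commonOpts g0
          g2 = if doSplit then splitAtProcPoints g1 else g1
      in layoutStack g2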
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* The CmmRewriteAssignments module was replaced by CmmSink a long
  time ago. That module is now available on the
  https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/Hoopl/Examples
  wiki page.
* The removeDeadAssignments function was not used, and it was also
  moved to the above page.
* I also nuked some commented-out debugging code that had not been
  used for 1.5 years.
|
|
|
|
|
|
|
| |
It turns out that one of the cases in the optimization pass was
a special case of another. I removed that specialization, since
doing so has no impact on compilation time and the resulting Cmm
is identical.
|
| |
|
| |
|
|
|
|
| |
By using the constant-folder to reduce it to an integer.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We occasionally need to reserve some temporary memory in a primop for
passing to a foreign function. We've been using the stack for this,
but when we moved to high-level Cmm it became quite fragile because
primops are in high-level Cmm and the stack is supposed to be under
the control of the Cmm pipeline.
So this change puts things on a firmer footing by adding a new Cmm
construct, 'reserve'. For example, in decodeFloat_Int#:
    reserve 2 = tmp {
        mp_tmp1  = tmp + WDS(1);
        mp_tmp_w = tmp;
        /* Perform the operation */
        ccall __decodeFloat_Int(mp_tmp1 "ptr", mp_tmp_w "ptr", arg);
        r1 = W_[mp_tmp1];
        r2 = W_[mp_tmp_w];
    }
reserve is described in CmmParse.y.
Unfortunately the argument to reserve must be a compile-time constant.
We might have to extend the parser to allow expressions with
arithmetic operators if this is too restrictive.
Note also that the return instruction for the procedure must be
outside the scope of the reserved stack area, so we have to extract
the values from the reserved area before we close the scope. This
means some more local variables (r1, r2 in the example above). The
generated code is more or less identical to what we had before though.
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
This bug only shows up when you are using proc-point splitting.
What was happening was:
* We generate a proc-point for the stack check
* And an info table
* We eliminate the stack check because it's redundant
* And the dangling info table caused a panic in
CmmBuildInfoTables.bundle
|
| |
|
| |
|
|
|
|
| |
Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>
|
|
|
|
| |
This reverts commit 2f5db98e90cf0cff1a11971c85f108a7480528ed.
|
|
|
|
|
|
|
| |
Inlining global registers and constants made code slightly larger in
some cases. I finally got around to looking into why, and discovered
one reason: we weren't discarding dead code in some cases. This patch
fixes it.
|
| |
|
|
|
|
| |
Fixes #8456
|
| |
|
|
|
|
|
|
|
| |
Fixes #8456. The previous version of the control flow optimisations
did not update the list of block predecessors, leading to
unnecessary duplication of blocks in some cases. See the Trac ticket
and the comments in the code for more details.
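A simplified sketch of the invariant the fix restores (hypothetical
names, not the actual CmmContFlowOpt code): whenever an edge is
redirected from one block to another, the predecessor information must
be updated before any decision to merge a block into its single
predecessor is made.

    module PredSketch where

    import qualified Data.Map.Strict as M

    type BlockId   = Int
    type PredCount = M.Map BlockId Int

    -- Redirecting an edge from block 'old' to block 'new' must update the
    -- predecessor counts of both blocks, otherwise later decisions are
    -- made with stale information.
    redirectEdge :: BlockId -> BlockId -> PredCount -> PredCount
    redirectEdge old new =
      M.insertWith (+) new 1 . M.adjust (subtract 1) old

    -- A block may be concatenated into its predecessor only if it has
    -- exactly one predecessor according to the up-to-date map.
    hasSinglePredecessor :: PredCount -> BlockId -> Bool
    hasSinglePredecessor preds b = M.findWithDefault 0 b preds == 1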
|
|
|
|
|
|
| |
The only substantive change here is to change "==" into ">=" in
the Note [Always false stack check] code. This is semantically
correct, but won't have any practical impact.
|
| |
|
|
|
|
|
| |
Fix a bug introduced in 94125c97e49987e91fa54da6c86bc6d17417f5cf.
See Note [Always false stack check]
|
|
|
|
|
|
|
| |
I am removing old loopification code that has been commented out
for a long, long time. We now have loopification implemented in
the code generator (see Note [Self-recursive tail calls]), so we
won't need to resurrect this old code.
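For context, here is a small example (mine, not from the patch) of the
kind of function the code-generator loopification applies to: the
self-recursive tail call in go can be compiled as a jump back to the
function's entry rather than a full call.

    module LoopifySketch where

    sumTo :: Int -> Int
    sumTo n = go 0 n
      where
        -- 'go' calls itself in tail position with the same arity, so the
        -- code generator can turn the call into a local jump, i.e. a loop.
        go :: Int -> Int -> Int
        go acc 0 = acc
        go acc k = go (acc + k) (k - 1)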
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When compiling a function we can determine how much stack space it will
use. We therefore need to perform only a single stack check, at the beginning
of the function, to see if we have enough stack space. Instead of referring
directly to Sp - as we used to do in the past - the code generator uses
(old + 0) in the stack check. The stack layout phase turns (old + 0) into Sp.
The idea here is that, while we need to perform only one stack check for
each function, we could in theory place more stack checks later in the
function. They would be redundant, but not incorrect (in the sense that they
should not change program behaviour). We need to make sure, however, that a
stack check inserted after incrementing the stack pointer checks for a
correspondingly smaller amount of stack space. This would not be the case if
the code generator produced direct references to Sp. By referencing (old + 0)
we make sure that we always check for the correct amount of stack: when
converting (old + 0) to Sp, the stack layout phase takes into account the
changes already made to the stack pointer. The idea for this change came from
observations made while debugging #8275.
|
| |
|
|
|
|
| |
Signed-off-by: Erik de Castro Lopo <erikd@mega-nerd.com>
|
|
|
|
|
|
| |
This way CPP conditionals can be avoided for the transition period.
Signed-off-by: Herbert Valerio Riedel <hvr@gnu.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds support for several new primitive operations which
use processor-specific instructions to help guide data and cache
locality decisions. The locality levels range from 0 to 3.
For LLVM, we generate llvm.prefetch intrinsics at the proper locality
level (similar to GCC).
For x86 we generate prefetch{NTA, t2, t1, t0} instructions. On SPARC and
PowerPC, the locality levels are ignored.
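A rough usage sketch (mine, not from the patch; the exact primop names
and whether they thread a State# token have varied between GHC
releases, so treat the signature below as an assumption). The numeric
suffix selects the locality level: 3 maps to prefetcht0 on x86, 0 to
prefetchNTA.

    {-# LANGUAGE MagicHash #-}
    module PrefetchSketch where

    import GHC.Exts

    -- Hint that the byte-array data at the given byte offset will be
    -- needed soon and should be kept in all cache levels (locality 3).
    prefetchHigh :: ByteArray# -> Int# -> State# s -> State# s
    prefetchHigh = prefetchByteArray3#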
This closes #8256.
Authored-by: Carter Tazio Schonwald <carter.schonwald@gmail.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
| |
On 32-bit platforms, the bitmap should be an array of
32-bit words, not Word64s.
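As a small, hypothetical illustration of the point (not the actual code
being fixed), the number of bitmap words needed depends on the platform
word size:

    module BitmapSketch where

    -- Number of bitmap words needed for n entries, given the platform
    -- word size in bits.  A 40-entry bitmap needs two 32-bit words on a
    -- 32-bit platform; building it from Word64s would give one 64-bit
    -- word, a layout a 32-bit RTS does not expect.
    bitmapWords :: Int -> Int -> Int
    bitmapWords wordSizeBits n = (n + wordSizeBits - 1) `div` wordSizeBits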
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
| |
LLVM's GHC calling convention only allows 128-bit SIMD vectors to be passed in
machine registers on X86-64. This may change in LLVM 3.4; the hidden flag
-fllvm-pass-vectors-in-regs causes all SIMD vector widths to be passed in
registers on both X86-64 and X86-32.
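For example (a hedged sketch, not from the commit), a function over
128-bit Float vectors such as the one below can only be compiled with
the LLVM backend; with -fllvm-pass-vectors-in-regs its arguments and
result are passed in machine registers.

    {-# LANGUAGE MagicHash #-}
    module SimdSketch where

    import GHC.Exts

    -- Element-wise addition of two 128-bit vectors of four Floats.
    addX4 :: FloatX4# -> FloatX4# -> FloatX4#
    addX4 = plusFloatX4#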
|
| |
|
| |
|
| |
|
| |
|