| Commit message (Collapse) | Author | Age | Files | Lines |
|
* Add 'dumpAction' hook to DynFlags.
It allows GHC API users to catch dumped intermediate code and
information. The format of the dump (Core, Stg, raw text, etc.) is now
reported, allowing easier automatic handling.
* Add 'traceAction' hook to DynFlags.
Some dumps go through the trace mechanism (for instance unfoldings that
have been considered for inlining). This is problematic because:
1) dumps aren't written into files even with -ddump-to-file on
2) dumps are written on stdout even with GHC API
3) in this specific case, dumping depends on unsafe globally stored
DynFlags which is bad for GHC API users
We introduce 'traceAction' hook which allows GHC API to catch those
traces and to avoid using globally stored DynFlags.
* Avoid dumping empty logs via dumpAction/traceAction (but still write
empty files to keep the existing behavior)
|
Formerly we punted on these and evaluated constructors always got a tag
of 1.
We now cascade switches because we have to check the tag first and when
it is MAX_PTR_TAG then get the precise tag from the info table and
switch on that. The only technically tricky part is that the default
case needs (logical) duplication. To do this we emit an extra label for
it and branch to that from the second switch. This avoids duplicated
codegen.
Here's a simple example of the new code gen:
data D = D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8
On a 64-bit system previously all constructors would be tagged 1. With
the new code gen D7 and D8 are tagged 7:
[Lib.D7_con_entry() {
...
{offset
c1eu: // global
R1 = R1 + 7;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}
}]
[Lib.D8_con_entry() {
...
{offset
c1ez: // global
R1 = R1 + 7;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}
}]
When switching we now look at the info table only when the tag is 7. For
example, if we derive Enum for the type above, the Cmm looks like this:
c2Le:
_s2Js::P64 = R1;
_c2Lq::P64 = _s2Js::P64 & 7;
switch [1 .. 7] _c2Lq::P64 {
case 1 : goto c2Lk;
case 2 : goto c2Ll;
case 3 : goto c2Lm;
case 4 : goto c2Ln;
case 5 : goto c2Lo;
case 6 : goto c2Lp;
case 7 : goto c2Lj;
}
// Read info table for tag
c2Lj:
_c2Lv::I64 = %MO_UU_Conv_W32_W64(I32[I64[_s2Js::P64 & (-8)] - 4]);
if (_c2Lv::I64 != 6) goto c2Lu; else goto c2Lt;
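The two-level dispatch shown in the Cmm above can be sketched in C. This is only an illustration under stated assumptions: `InfoTable`, `Closure`, `tag_closure` and `constructor_index` are hypothetical names, the info table is reduced to a single zero-based constructor tag, and a 64-bit platform with three tag bits is assumed. The real layout (a 32-bit field read at an offset from the info pointer) is what the Cmm above shows.

```c
#include <stdint.h>

#define TAG_MASK    ((uintptr_t)7)  /* low three bits on a 64-bit platform */
#define MAX_PTR_TAG 7u

/* Hypothetical, simplified info table: only the zero-based constructor tag. */
typedef struct { uint32_t con_tag; } InfoTable;

/* Hypothetical closure header; forced to 8-byte alignment so that the
   low three bits of its address are free for the tag. */
typedef struct { _Alignas(8) const InfoTable *info; } Closure;

/* Tag a pointer to an evaluated constructor: tags 1..6 are precise,
   every constructor from the seventh onwards shares MAX_PTR_TAG. */
uintptr_t tag_closure(const Closure *c)
{
    uint32_t tag = c->info->con_tag + 1;
    if (tag > MAX_PTR_TAG)
        tag = MAX_PTR_TAG;
    return (uintptr_t)c | tag;
}

/* Recover the zero-based constructor index of an evaluated closure
   (the pointer tag must be non-zero). The info table is consulted
   only when the pointer tag is MAX_PTR_TAG. */
uint32_t constructor_index(uintptr_t tagged)
{
    uint32_t tag = (uint32_t)(tagged & TAG_MASK);
    if (tag < MAX_PTR_TAG)
        return tag - 1;                      /* first-level switch suffices */
    const Closure *c = (const Closure *)(tagged & ~TAG_MASK);
    return c->info->con_tag;                 /* second-level switch input */
}
```

With the eight-constructor type `D` from the example, `D1` to `D6` are resolved from the pointer tag alone, while `D7` and `D8` both carry tag 7 and are told apart by the info table read.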
Generated Cmm sizes do not change too much, but binaries are very
slightly larger, due to the fact that the new instructions are longer in
encoded form. E.g. previously entry code for D8 above would be
00000000000001c0 <Lib_D8_con_info>:
1c0: 48 ff c3 inc %rbx
1c3: ff 65 00 jmpq *0x0(%rbp)
With this patch
00000000000001d0 <Lib_D8_con_info>:
1d0: 48 83 c3 07 add $0x7,%rbx
1d4: ff 65 00 jmpq *0x0(%rbp)
This is one byte longer.
Secondly, reading info table directly and then switching is shorter
_c1co:
movq -1(%rbx),%rax
movl -4(%rax),%eax
// Switch on info table tag
jmp *_n1d5(,%rax,8)
than doing the same switch, and then for the tag 7 doing another switch:
// When tag is 7
_c1ct:
andq $-8,%rbx
movq (%rbx),%rax
movl -4(%rax),%eax
// Switch on info table tag
...
Some changes of binary sizes in actual programs:
- In NoFib the worst case is 0.1% increase in benchmark "parser" (see
NoFib results below). All programs get slightly larger.
- Stage 2 compiler size does not change.
- In "containers" (the library) size of all object files increases
0.0005%. Size of the test program "bitqueue-properties" increases
0.03%.
nofib benchmarks kindly provided by Ömer (@osa1):
NoFib Results
=============
--------------------------------------------------------------------------------
Program Size Allocs Instrs Reads Writes
--------------------------------------------------------------------------------
CS +0.0% 0.0% -0.0% -0.0% -0.0%
CSD +0.0% 0.0% 0.0% +0.0% +0.0%
FS +0.0% 0.0% 0.0% +0.0% 0.0%
S +0.0% 0.0% -0.0% 0.0% 0.0%
VS +0.0% 0.0% -0.0% +0.0% +0.0%
VSD +0.0% 0.0% -0.0% +0.0% -0.0%
VSM +0.0% 0.0% 0.0% 0.0% 0.0%
anna +0.0% 0.0% +0.1% -0.9% -0.0%
ansi +0.0% 0.0% -0.0% +0.0% +0.0%
atom +0.0% 0.0% 0.0% 0.0% 0.0%
awards +0.0% 0.0% -0.0% +0.0% 0.0%
banner +0.0% 0.0% -0.0% +0.0% 0.0%
bernouilli +0.0% 0.0% +0.0% +0.0% +0.0%
binary-trees +0.0% 0.0% -0.0% -0.0% -0.0%
boyer +0.0% 0.0% +0.0% 0.0% -0.0%
boyer2 +0.0% 0.0% +0.0% 0.0% -0.0%
bspt +0.0% 0.0% +0.0% +0.0% 0.0%
cacheprof +0.0% 0.0% +0.1% -0.8% 0.0%
calendar +0.0% 0.0% -0.0% +0.0% -0.0%
cichelli +0.0% 0.0% +0.0% 0.0% 0.0%
circsim +0.0% 0.0% -0.0% -0.1% -0.0%
clausify +0.0% 0.0% +0.0% +0.0% 0.0%
comp_lab_zift +0.0% 0.0% +0.0% 0.0% -0.0%
compress +0.0% 0.0% +0.0% +0.0% 0.0%
compress2 +0.0% 0.0% 0.0% 0.0% 0.0%
constraints +0.0% 0.0% -0.0% -0.0% -0.0%
cryptarithm1 +0.0% 0.0% +0.0% 0.0% 0.0%
cryptarithm2 +0.0% 0.0% +0.0% -0.0% 0.0%
cse +0.0% 0.0% +0.0% +0.0% 0.0%
digits-of-e1 +0.0% 0.0% -0.0% -0.0% -0.0%
digits-of-e2 +0.0% 0.0% +0.0% -0.0% -0.0%
dom-lt +0.0% 0.0% +0.0% +0.0% 0.0%
eliza +0.0% 0.0% -0.0% +0.0% 0.0%
event +0.0% 0.0% -0.0% -0.0% -0.0%
exact-reals +0.0% 0.0% +0.0% +0.0% +0.0%
exp3_8 +0.0% 0.0% -0.0% -0.0% -0.0%
expert +0.0% 0.0% +0.0% +0.0% +0.0%
fannkuch-redux +0.0% 0.0% +0.0% 0.0% 0.0%
fasta +0.0% 0.0% -0.0% -0.0% -0.0%
fem +0.0% 0.0% +0.0% +0.0% +0.0%
fft +0.0% 0.0% +0.0% -0.0% -0.0%
fft2 +0.0% 0.0% +0.0% +0.0% +0.0%
fibheaps +0.0% 0.0% +0.0% +0.0% 0.0%
fish +0.0% 0.0% +0.0% +0.0% 0.0%
fluid +0.0% 0.0% +0.0% +0.0% +0.0%
fulsom +0.0% 0.0% +0.0% -0.0% +0.0%
gamteb +0.0% 0.0% +0.0% -0.0% -0.0%
gcd +0.0% 0.0% +0.0% +0.0% 0.0%
gen_regexps +0.0% 0.0% +0.0% -0.0% -0.0%
genfft +0.0% 0.0% -0.0% -0.0% -0.0%
gg +0.0% 0.0% 0.0% -0.0% 0.0%
grep +0.0% 0.0% +0.0% +0.0% +0.0%
hidden +0.0% 0.0% +0.0% -0.0% -0.0%
hpg +0.0% 0.0% +0.0% -0.1% -0.0%
ida +0.0% 0.0% +0.0% -0.0% -0.0%
infer +0.0% 0.0% -0.0% -0.0% -0.0%
integer +0.0% 0.0% -0.0% -0.0% -0.0%
integrate +0.0% 0.0% 0.0% +0.0% 0.0%
k-nucleotide +0.0% 0.0% -0.0% -0.0% -0.0%
kahan +0.0% 0.0% -0.0% -0.0% -0.0%
knights +0.0% 0.0% +0.0% -0.0% -0.0%
lambda +0.0% 0.0% +1.2% -6.1% -0.0%
last-piece +0.0% 0.0% +0.0% -0.0% -0.0%
lcss +0.0% 0.0% +0.0% -0.0% -0.0%
life +0.0% 0.0% +0.0% -0.0% -0.0%
lift +0.0% 0.0% +0.0% +0.0% 0.0%
linear +0.0% 0.0% +0.0% +0.0% +0.0%
listcompr +0.0% 0.0% -0.0% -0.0% -0.0%
listcopy +0.0% 0.0% -0.0% -0.0% -0.0%
maillist +0.0% 0.0% +0.0% -0.0% -0.0%
mandel +0.0% 0.0% +0.0% +0.0% +0.0%
mandel2 +0.0% 0.0% +0.0% +0.0% -0.0%
mate +0.0% 0.0% +0.0% +0.0% +0.0%
minimax +0.0% 0.0% -0.0% +0.0% -0.0%
mkhprog +0.0% 0.0% +0.0% +0.0% +0.0%
multiplier +0.0% 0.0% 0.0% +0.0% -0.0%
n-body +0.0% 0.0% +0.0% -0.0% -0.0%
nucleic2 +0.0% 0.0% +0.0% +0.0% -0.0%
para +0.0% 0.0% +0.0% +0.0% +0.0%
paraffins +0.0% 0.0% +0.0% +0.0% +0.0%
parser +0.1% 0.0% +0.4% -1.7% -0.0%
parstof +0.0% 0.0% -0.0% -0.0% -0.0%
pic +0.0% 0.0% +0.0% 0.0% -0.0%
pidigits +0.0% 0.0% -0.0% -0.0% -0.0%
power +0.0% 0.0% +0.0% -0.0% -0.0%
pretty +0.0% 0.0% +0.0% +0.0% +0.0%
primes +0.0% 0.0% +0.0% 0.0% 0.0%
primetest +0.0% 0.0% +0.0% +0.0% +0.0%
prolog +0.0% 0.0% +0.0% +0.0% +0.0%
puzzle +0.0% 0.0% +0.0% +0.0% +0.0%
queens +0.0% 0.0% 0.0% +0.0% +0.0%
reptile +0.0% 0.0% +0.0% +0.0% 0.0%
reverse-complem +0.0% 0.0% -0.0% -0.0% -0.0%
rewrite +0.0% 0.0% +0.0% 0.0% -0.0%
rfib +0.0% 0.0% +0.0% +0.0% +0.0%
rsa +0.0% 0.0% +0.0% +0.0% +0.0%
scc +0.0% 0.0% +0.0% +0.0% +0.0%
sched +0.0% 0.0% +0.0% +0.0% +0.0%
scs +0.0% 0.0% +0.0% +0.0% 0.0%
simple +0.0% 0.0% +0.0% +0.0% +0.0%
solid +0.0% 0.0% +0.0% +0.0% 0.0%
sorting +0.0% 0.0% +0.0% -0.0% 0.0%
spectral-norm +0.0% 0.0% -0.0% -0.0% -0.0%
sphere +0.0% 0.0% +0.0% -1.0% 0.0%
symalg +0.0% 0.0% +0.0% +0.0% +0.0%
tak +0.0% 0.0% +0.0% +0.0% +0.0%
transform +0.0% 0.0% +0.4% -1.3% +0.0%
treejoin +0.0% 0.0% +0.0% -0.0% 0.0%
typecheck +0.0% 0.0% -0.0% +0.0% 0.0%
veritas +0.0% 0.0% +0.0% -0.1% +0.0%
wang +0.0% 0.0% +0.0% +0.0% +0.0%
wave4main +0.0% 0.0% +0.0% 0.0% -0.0%
wheel-sieve1 +0.0% 0.0% +0.0% +0.0% +0.0%
wheel-sieve2 +0.0% 0.0% +0.0% +0.0% 0.0%
x2n1 +0.0% 0.0% +0.0% +0.0% 0.0%
--------------------------------------------------------------------------------
Min +0.0% 0.0% -0.0% -6.1% -0.0%
Max +0.1% 0.0% +1.2% +0.0% +0.0%
Geometric Mean +0.0% -0.0% +0.0% -0.1% -0.0%
NoFib GC Results
================
--------------------------------------------------------------------------------
Program Size Allocs Instrs Reads Writes
--------------------------------------------------------------------------------
circsim +0.0% 0.0% -0.0% -0.0% -0.0%
constraints +0.0% 0.0% -0.0% 0.0% -0.0%
fibheaps +0.0% 0.0% 0.0% -0.0% -0.0%
fulsom +0.0% 0.0% 0.0% -0.6% -0.0%
gc_bench +0.0% 0.0% 0.0% 0.0% -0.0%
hash +0.0% 0.0% -0.0% -0.0% -0.0%
lcss +0.0% 0.0% 0.0% -0.0% 0.0%
mutstore1 +0.0% 0.0% 0.0% -0.0% -0.0%
mutstore2 +0.0% 0.0% +0.0% -0.0% -0.0%
power +0.0% 0.0% -0.0% 0.0% -0.0%
spellcheck +0.0% 0.0% -0.0% -0.0% -0.0%
--------------------------------------------------------------------------------
Min +0.0% 0.0% -0.0% -0.6% -0.0%
Max +0.0% 0.0% +0.0% 0.0% 0.0%
Geometric Mean +0.0% +0.0% +0.0% -0.1% +0.0%
Fixes #14373
These performance regressions appear to be a fluke in CI. See the
discussion in !1742 for details.
Metric Increase:
T6048
T12234
T12425
Naperian
T12150
T5837
T13035
|
[ci skip]
|
This introduces a concurrent mark & sweep garbage collector to manage the old
generation. The concurrent nature of this collector typically results in
significantly reduced maximum and mean pause times in applications with large
working sets.
Due to the large and intricate nature of the change I have opted to
preserve the fully-buildable history, including merge commits, which is
described in the "Branch overview" section below.
Collector design
================
The full design of the collector implemented here is described in detail
in a technical note
> B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
> Compiler" (2018)
This document can be requested from @bgamari.
The basic heap structure used in this design is heavily inspired by
> K. Ueno & A. Ohori. "A fully concurrent garbage collector for
> functional programs on multicore processors." /ACM SIGPLAN Notices/
> Vol. 51. No. 9 (presented at ICFP 2016)
This design is intended to allow both marking and sweeping to run
concurrently with the execution of a multi-core mutator. Unlike the Ueno design,
which requires no global synchronization pauses, the collector
introduced here requires a stop-the-world pause at the beginning and end
of the mark phase.
To avoid heap fragmentation, the allocator consists of a number of
fixed-size /sub-allocators/. Each of these sub-allocators allocates into
its own set of /segments/, themselves allocated from the block
allocator. Each segment is broken into a set of fixed-size allocation
blocks (which back allocations) in addition to a bitmap (used to track
the liveness of blocks) and some additional metadata (also used
to track liveness).
This heap structure enables collection via mark-and-sweep, which can be
performed concurrently via a snapshot-at-the-beginning scheme (although
concurrent collection is not implemented in this patch).
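As a rough illustration of how a per-segment bitmap supports mark-and-sweep, here is a minimal C sketch. It is not the layout used in `rts/Nonmoving.c`: the `Segment` type, the function names, the 64-block segment size and the use of two separate bitmaps (one for allocation, one for marks) are all assumptions made to keep the example small.

```c
#include <stdint.h>

#define BLOCKS_PER_SEGMENT 64   /* hypothetical: one machine word per bitmap */

/* Hypothetical segment metadata; the block storage itself is omitted. */
typedef struct {
    uint64_t allocated;  /* bit i set: block i has been handed out */
    uint64_t marked;     /* bit i set: block i was reached while marking */
} Segment;

/* Hand out the first free block, or return -1 if the segment is full. */
int segment_alloc(Segment *s)
{
    for (int i = 0; i < BLOCKS_PER_SEGMENT; i++) {
        uint64_t bit = (uint64_t)1 << i;
        if (!(s->allocated & bit)) {
            s->allocated |= bit;
            return i;
        }
    }
    return -1;
}

/* Record that a block is live. Nothing is moved, only a bit is set. */
void segment_mark(Segment *s, int block)
{
    s->marked |= (uint64_t)1 << block;
}

/* Free every allocated block that was not marked, reset the mark bits
   for the next cycle, and return the number of blocks freed. */
int segment_sweep(Segment *s)
{
    uint64_t dead = s->allocated & ~s->marked;
    int freed = 0;
    for (; dead != 0; dead &= dead - 1)  /* clear lowest set bit each step */
        freed++;
    s->allocated &= s->marked;
    s->marked = 0;
    return freed;
}
```

Because blocks within a segment all have the same size, a freed block can be reused by the next allocation of that size class without compaction, which is why this structure avoids fragmentation.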
Implementation structure
========================
The majority of the collector is implemented in a handful of files:
* `rts/Nonmoving.c` is the heart of the beast. It implements the entry-point
to the nonmoving collector (`nonmoving_collect`), as well as the allocator
(`nonmoving_allocate`) and a number of utilities for manipulating the heap.
* `rts/NonmovingMark.c` implements the mark queue functionality, update
remembered set, and mark loop.
* `rts/NonmovingSweep.c` implements the sweep loop.
* `rts/NonmovingScav.c` implements the logic necessary to scavenge the
nonmoving heap.
Branch overview
===============
```
* wip/gc/opt-pause:
| A variety of small optimisations to further reduce pause times.
|
* wip/gc/compact-nfdata:
| Introduce support for compact regions into the non-moving
|\ collector
| \
| \
| | * wip/gc/segment-header-to-bdescr:
| | | Another optimization that we are considering, pushing
| | | some segment metadata into the segment descriptor for
| | | the sake of locality during mark
| | |
| * | wip/gc/shortcutting:
| | | Support for indirection shortcutting and the selector optimization
| | | in the non-moving heap.
| | |
* | | wip/gc/docs:
| |/ Work on implementation documentation.
| /
|/
* wip/gc/everything:
| A roll-up of everything below.
|\
| \
| |\
| | \
| | * wip/gc/optimize:
| | | A variety of optimizations, primarily to the mark loop.
| | | Some of these are microoptimizations but a few are quite
| | | significant. In particular, the prefetch patches have
| | | produced a nontrivial improvement in mark performance.
| | |
| | * wip/gc/aging:
| | | Enable support for aging in major collections.
| | |
| * | wip/gc/test:
| | | Fix up the testsuite to more or less pass.
| | |
* | | wip/gc/instrumentation:
| | | A variety of runtime instrumentation including statistics
| | / support, the nonmoving census, and eventlog support.
| |/
| /
|/
* wip/gc/nonmoving-concurrent:
| The concurrent write barriers.
|
* wip/gc/nonmoving-nonconcurrent:
| The nonmoving collector without the write barriers necessary
| for concurrent collection.
|
* wip/gc/preparation:
| A merge of the various preparatory patches that aren't directly
| implementing the GC.
|
|
* GHC HEAD
.
.
.
```
|
This extends the non-moving collector to allow concurrent collection.
The full design of the collector implemented here is described in detail
in a technical note
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
This extension involves the introduction of a capability-local
remembered set, known as the /update remembered set/, which tracks
objects which may no longer be visible to the collector due to mutation.
To maintain this remembered set we introduce a write barrier on
mutations which is enabled while a concurrent mark is underway.
The update remembered set representation is similar to that of the
nonmoving mark queue, being a chunked array of `MarkEntry`s. Each
`Capability` maintains a single accumulator chunk, which it flushes
when (a) the chunk is filled, or (b) the nonmoving collector enters its
post-mark synchronization phase.
While the write barrier touches a significant amount of code it is
conceptually straightforward: the mutator must ensure that the referee
of any pointer it overwrites is added to the update remembered set.
However, there are a few details:
* In the case of objects with a dirty flag (e.g. `MVar`s) we can
exploit the fact that only the *first* mutation requires a write
barrier.
* Weak references, as usual, complicate things. In particular, we must
ensure that the referee of a weak object is marked if dereferenced by
the mutator. For this we (unfortunately) must introduce a read
barrier, as described in Note [Concurrent read barrier on deRefWeak#]
(in `NonMovingMark.c`).
* Stable names are also a bit tricky as described in Note [Sweeping
stable names in the concurrent collector] (`NonMovingSweep.c`).
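The core of the write barrier described above can be sketched in C as follows. This is a simplified model, not the RTS code: `Object`, `Chunk`, `Capability`, `flush_upd_rem_set` and `write_field` are hypothetical definitions, the chunk is deliberately tiny, and "flushing" is modelled as a counter rather than a hand-off to the collector. Dirty flags, the read barrier for weak references and stable names are left out.

```c
#include <stddef.h>

#define CHUNK_ENTRIES 4  /* hypothetical; tiny so the flush path is easy to hit */

typedef struct Object Object;
struct Object { Object *field; };   /* one mutable pointer field */

typedef struct {
    Object *entries[CHUNK_ENTRIES];
    size_t  used;
} Chunk;

/* Hypothetical capability-local state. */
typedef struct {
    Chunk  accumulator;      /* the single accumulator chunk */
    size_t flushed_entries;  /* stand-in for entries handed to the collector */
    int    marking;          /* non-zero while a concurrent mark is under way */
} Capability;

/* Hand the accumulated entries over and start with an empty chunk. */
void flush_upd_rem_set(Capability *cap)
{
    cap->flushed_entries += cap->accumulator.used;
    cap->accumulator.used = 0;
}

/* Overwrite obj->field. While marking, the old referee is recorded first,
   so the collector can still reach it even though the mutator has
   dropped the pointer. */
void write_field(Capability *cap, Object *obj, Object *new_value)
{
    Object *old = obj->field;
    if (cap->marking && old != NULL) {
        cap->accumulator.entries[cap->accumulator.used++] = old;
        if (cap->accumulator.used == CHUNK_ENTRIES)
            flush_upd_rem_set(cap);          /* case (a): chunk is filled */
    }
    obj->field = new_value;
}
```

The barrier is a no-op when no mark is in progress, which matches the description that it is only enabled while a concurrent mark is underway.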
We take quite some pains to ensure that the high thread count often seen
in parallel Haskell applications doesn't affect pause times. To this end
we allow thread stacks to be marked either by the thread itself (when it
is executed or stack-underflows) or the concurrent mark thread (if the
thread owning the stack is never scheduled). There is a non-trivial
handshake to ensure that this happens without racing which is described
in Note [StgStack dirtiness flags and concurrent marking].
Co-Authored-by: Ömer Sinan Ağacan <omer@well-typed.com>
|
19 times out of 20 we already have dynflags in scope.
We could just always use `return dflags`. But this is in fact not free.
When looking at some STG code I noticed that we always allocate a
closure for this expression in the heap. Clearly a waste in these cases.
For the other cases we can either just modify the callsite to
get dynflags or use the _D variants of withTiming I added which
will use getDynFlags under the hood.
|
For backends maintaining the CFG during codegen
we can now find loops and their nesting level.
This is based on the Cmm CFG and dominator analysis.
As a result we can estimate edge frequencies a lot better
for methods, resulting in far better code layout.
Speedup on nofib: ~1.5%
Increase in compile times: ~1.9%
To make this feasible this commit adds:
* Dominator analysis based on the Lengauer-Tarjan Algorithm.
* An algorithm estimating global edge frequencies from branch
probabilities - In CFG.hs
A few static branch prediction heuristics:
* Expect to take the backedge in loops.
* Expect to take the branch NOT exiting a loop.
* Expect integer vs constant comparisons to be false.
We also treat heap/stack checks specially for branch prediction
to avoid them being treated as loops.
|
Introduces a new flag `-fmax-pmcheck-deltas` to achieve that. Deprecates
the old `-fmax-pmcheck-iter` mechanism in favor of this new flag.
From the user's guide:
Pattern match checking can be exponential in some cases. This limit makes sure
we scale polynomially in the number of patterns, by forgetting refined
information gained from a partially successful match. For example, when
matching `x` against `Just 4`, we split each incoming matching model into two
sub-models: One where `x` is not `Nothing` and one where `x` is `Just y` but
`y` is not `4`. When the number of incoming models exceeds the limit, we
continue checking the next clause with the original, unrefined model.
This also retires the incredibly hard to understand "maximum number of
refinements" mechanism, because the current mechanism is more general
and should catch the same exponential cases like PrelRules at the same
time.
-------------------------
Metric Decrease:
T11822
-------------------------
|
'withTiming' becomes a function that, when passed '-vN' (N >= 2) or
'-ddump-timings', will print timing (and possibly allocations) related
information. When additionally built with '-eventlog' and executed with
'+RTS -l', 'withTiming' will also emit both 'traceMarker' and 'traceEvent'
events to the eventlog.
'withTimingSilent' on the other hand will never print any timing information,
under any circumstance, and will only emit 'traceEvent' events to the eventlog.
As pointed out in !1672, 'traceMarker' is better suited for things that we
might want to visualize in tools like eventlog2html, while 'traceEvent'
is better suited for internal events that occur a lot more often and that we
don't necessarily want to visualize.
This addresses #17138 by using 'withTimingSilent' for all the codegen bits
that are expressed as a bunch of small computations over streams of codegen
ASTs.
|
Add StgToCmm module hierarchy. Platform modules that are used in several
other places (NCG, LLVM codegen, Cmm transformations) are put into
GHC.Platform.
|
This tightens up the kinds a bit. I use type synonyms to avoid adding
promotion ticks everywhere.
|
- Fixes crazy indentation in -ddump-debug output
- We no longer dump empty sections in -ddump-debug when a code block
does not have any generated debug info
- Minor refactoring in Debug.hs and AsmCodeGen.hs
|
Noticed by @simonmar in !1362:
If the srtEntry is Nothing, then it should be safe to omit
references to this SRT from other SRTs, even if it is a static
function.
When updating SRT map we don't omit references to static functions (see
Note [Invalid optimisation: shortcutting]), but there's no reason to add
an SRT entry for a static function if the function is not CAFFY.
(Previously we'd add SRT entries for static functions even when they're
not CAFFY)
Using 9151b99e I checked sizes of all SRTs when building GHC and
containers:
- GHC: 583736 (HEAD), 581695 (this patch). 2041 fewer SRT entries.
- containers: 2457 (HEAD), 2381 (this patch). 76 fewer SRT entries.
|
This generalizes code generators (outputAsm, outputLlvm, outputC, and
the call site codeOutput) so that they'll return the return values of
the passed Cmm streams.
This allows accumulating data during Cmm generation and returning it to
the call site in HscMain.
Previously the Cmm streams were assumed to return (), so the code
generators returned () as well.
This change is required by !1304 and !1530.
Skipping CI as this was tested before and I only updated the commit
message.
[skip ci]
|
This adds a Stream.consume function, uses it in LLVM and C code
generators, and removes the use of Stream.collect function which was
used to collect streaming Cmm generation results into a list.
LLVM and C backends now properly use streamed Cmm generation, instead of
collecting Cmm groups into a list before generating LLVM/C code.
|
These kinds of imports are necessary in some cases such as
importing instances of typeclasses or intentionally creating
dependencies in the build system, but '-Wunused-imports' can't
detect when they are no longer needed. This commit removes the
unused ones currently in the code base (not including test files
or submodules), with the hope that doing so may increase
parallelism in the build system by removing unnecessary
dependencies.
|
We introduce a PlatformWordSize type and use it in platformWordSize
field.
This removes the panic/error calls raised when the platform word size is not
32 or 64. We now check for this when reading the platform config.
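The idea of validating the word size once, at configuration time, can be sketched in C. This is an analogy only: the change itself is to a Haskell type, and the enum constants `PW4` and `PW8` as well as the functions `parse_word_size` and `word_size_in_bits` are hypothetical names chosen for this example.

```c
/* Hypothetical closed set of supported word sizes, in bytes. */
typedef enum { PW4 = 4, PW8 = 8 } PlatformWordSize;

/* Validate a word size read from the platform config. Returns 1 and
   stores the result on success, 0 for any unsupported size. This is
   the single place where a bad value has to be handled. */
int parse_word_size(int bytes, PlatformWordSize *out)
{
    switch (bytes) {
    case 4: *out = PW4; return 1;
    case 8: *out = PW8; return 1;
    default: return 0;
    }
}

/* Consumers take the validated type, so they have no error path. */
int word_size_in_bits(PlatformWordSize ws)
{
    return (int)ws * 8;
}
```

Once the value has passed `parse_word_size`, later code no longer needs a fallback branch for "neither 32 nor 64 bits".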
|
separate file and add -ddump-cmm-verbose-by-proc to keep old behaviour (#16930)
|
This prepares the way for making Int32# and Word32# the actual size they
claim to be.
Updates binary submodule for (de)serializing the new runtime reps.
|
Unfortunately this will require more work; register allocation is
quite broken.
This reverts commit acd795583625401c5554f8e04ec7efca18814011.
|
- Replace `catMaybes (map ...)` with `mapMaybe ...`
- Remove a list->set->list conversion
|
Move switch expressions into a local variable when generating switches.
This avoids duplicating the expression if we translate the switch
to a tree search. This fixes #16933.
Further we now check if all branches of a switch have the same
destination, replacing the switch with a direct branch if that
is the case.
Both of these patterns appear in the ENTER macro used by the RTS
but are unlikely to occur in intermediate Cmm generated by GHC.
Nofib result summary:
--------------------------------------------------------------------------------
Program Size Allocs Runtime Elapsed TotalMem
--------------------------------------------------------------------------------
Min -0.0% -0.0% -15.7% -15.6% 0.0%
Max -0.0% 0.0% +5.4% +5.5% 0.0%
Geometric Mean -0.0% -0.0% -1.0% -1.0% -0.0%
Compiler allocations go up slightly: +0.2%
Example output before and after the change taken from RTS code below.
All but one of the memory loads `I32[_c3::I64 - 8]` are eliminated.
Instead the data is loaded once from memory in block c6.
Also the switch in block `ud` in the original code has been
eliminated completely.
Cmm without this commit:
```
stg_ap_0_fast() { // [R1]
{ []
}
{offset
ca: _c1::P64 = R1; // CmmAssign
goto c2; // CmmBranch
c2: if (_c1::P64 & 7 != 0) goto c4; else goto c6;
c6: _c3::I64 = I64[_c1::P64];
if (I32[_c3::I64 - 8] < 26 :: W32) goto ub; else goto ug;
ub: if (I32[_c3::I64 - 8] < 15 :: W32) goto uc; else goto ue;
uc: if (I32[_c3::I64 - 8] < 8 :: W32) goto c7; else goto ud;
ud: switch [8 .. 14] (%MO_SS_Conv_W32_W64(I32[_c3::I64 - 8])) {
case 8, 9, 10, 11, 12, 13, 14 : goto c4;
}
ue: if (I32[_c3::I64 - 8] >= 25 :: W32) goto c4; else goto uf;
uf: if (%MO_SS_Conv_W32_W64(I32[_c3::I64 - 8]) != 23) goto c7; else goto c4;
c4: R1 = _c1::P64;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
ug: if (I32[_c3::I64 - 8] < 28 :: W32) goto uh; else goto ui;
uh: if (I32[_c3::I64 - 8] < 27 :: W32) goto c7; else goto c8;
ui: if (I32[_c3::I64 - 8] < 29 :: W32) goto c8; else goto c7;
c8: _c1::P64 = P64[_c1::P64 + 8];
goto c2;
c7: R1 = _c1::P64;
call (_c3::I64)(R1) args: 8, res: 0, upd: 8;
}
}
```
Cmm with this commit:
```
stg_ap_0_fast() { // [R1]
{ []
}
{offset
ca: _c1::P64 = R1;
goto c2;
c2: if (_c1::P64 & 7 != 0) goto c4; else goto c6;
c6: _c3::I64 = I64[_c1::P64];
_ub::I64 = %MO_SS_Conv_W32_W64(I32[_c3::I64 - 8]);
if (_ub::I64 < 26) goto uc; else goto uh;
uc: if (_ub::I64 < 15) goto ud; else goto uf;
ud: if (_ub::I64 < 8) goto c7; else goto c4;
uf: if (_ub::I64 >= 25) goto c4; else goto ug;
ug: if (_ub::I64 != 23) goto c7; else goto c4;
c4: R1 = _c1::P64;
call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
uh: if (_ub::I64 < 28) goto ui; else goto uj;
ui: if (_ub::I64 < 27) goto c7; else goto c8;
uj: if (_ub::I64 < 29) goto c8; else goto c7;
c8: _c1::P64 = P64[_c1::P64 + 8];
goto c2;
c7: R1 = _c1::P64;
call (_c3::I64)(R1) args: 8, res: 0, upd: 8;
}
}
```
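The effect of the change can be mirrored in C: the scrutinee is loaded once into a local variable and the comparison tree then works on that local. The sketch below follows the branch structure of the "with this commit" Cmm above, but everything else is an assumption made for illustration: the `Action` enum, `load_closure_type`, `dispatch` and the `load_count` counter are hypothetical and exist only to make the single load observable.

```c
#include <stdint.h>

/* Hypothetical names for the three outcomes in the Cmm above. */
typedef enum {
    RETURN_TO_CONTINUATION,  /* block c4 */
    ENTER_CLOSURE,           /* block c7 */
    FOLLOW_INDIRECTION       /* block c8 */
} Action;

unsigned load_count;  /* hypothetical instrumentation: memory loads performed */

uint32_t load_closure_type(const uint32_t *type_field)
{
    load_count++;
    return *type_field;
}

/* Tree search over the closure type. The load happens exactly once;
   every comparison reuses the local `ty`. */
Action dispatch(const uint32_t *type_field)
{
    uint32_t ty = load_closure_type(type_field);
    if (ty < 26) {
        if (ty < 15)
            return ty < 8 ? ENTER_CLOSURE : RETURN_TO_CONTINUATION;
        if (ty >= 25)
            return RETURN_TO_CONTINUATION;
        return ty != 23 ? ENTER_CLOSURE : RETURN_TO_CONTINUATION;
    }
    if (ty < 28)
        return ty < 27 ? ENTER_CLOSURE : FOLLOW_INDIRECTION;
    return ty < 29 ? FOLLOW_INDIRECTION : ENTER_CLOSURE;
}
```

Note that the range 8 to 14 needs no switch at all here, since every value in it leads to the same destination; this corresponds to the eliminated switch in block `ud`.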
|
This adds support for constructing vector types from Float#, Double# etc
and performing arithmetic operations on them
Cleaned-Up-By: Ben Gamari <ben@well-typed.com>
|
Here the following changes are introduced:
- A read barrier machine op is added to Cmm.
- The order in which a closure's fields are read and written is changed.
- Memory barriers are added to RTS code to ensure correctness on
out-of-order machines with weak memory ordering.
Cmm has a new CallishMachOp called MO_ReadBarrier. On weak memory machines, this
is lowered to an instruction that ensures memory reads that occur after said
instruction in program order are not performed before reads coming before said
instruction in program order. On machines with strong memory ordering properties
(e.g. X86, SPARC in TSO mode) no such instruction is necessary, so
MO_ReadBarrier is simply erased. However, such an instruction is necessary on
weakly ordered machines, e.g. ARM and PowerPC.
Weak memory ordering has consequences for how closures are observed and mutated.
For example, consider a closure that needs to be updated to an indirection. In
order for the indirection to be safe for concurrent observers to enter, said
observers must read the indirection's info table before they read the
indirectee. Furthermore, the entering observer makes assumptions about the
closure based on its info table contents, e.g. an INFO_TYPE of IND implies the
closure has an indirectee pointer that is safe to follow.
When a closure is updated with an indirection, both its info table and its
indirectee must be written. With weak memory ordering, these two writes can be
arbitrarily reordered, and perhaps even interleaved with other threads' reads
and writes (in the absence of memory barrier instructions). Consider this
example of a bad reordering:
- An updater writes to a closure's info table (INFO_TYPE is now IND).
- A concurrent observer branches upon reading the closure's INFO_TYPE as IND.
- A concurrent observer reads the closure's indirectee and enters it. (!!!)
- An updater writes the closure's indirectee.
Here the update to the indirectee comes too late and the concurrent observer has
jumped off into the abyss. Speculative execution can also cause issues.
Consider:
- An observer is about to case on a value in a closure's info table.
- The observer speculatively reads one or more of the closure's fields.
- An updater writes to the closure's info table.
- The observer takes a branch based on the new info table value, but with the
old closure fields!
- The updater writes to the closure's other fields, but it's too late.
Because of these effects, reads and writes to a closure's info table must be
ordered carefully with respect to reads and writes to the closure's other
fields, and memory barriers must be placed to ensure that reads and writes occur
in program order. Specifically, updates to a closure must follow the following
pattern:
- Update the closure's (non-info table) fields.
- Write barrier.
- Update the closure's info table.
Observing a closure's fields must follow the following pattern:
- Read the closure's info pointer.
- Read barrier.
- Read the closure's (non-info table) fields.
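The two patterns above can be sketched with C11 atomics. This is a hedged illustration rather than the RTS implementation: the `Closure` layout, the `THUNK`/`IND` constants and the function names are hypothetical, and C11 release and acquire fences are used as stand-ins for the write barrier and `MO_ReadBarrier` (the C11 fences are somewhat stronger than a pure store-store or load-load barrier).

```c
#include <stdatomic.h>
#include <stddef.h>

enum { THUNK = 0, IND = 1 };  /* hypothetical stand-ins for INFO_TYPE values */

typedef struct Closure {
    atomic_int      info;        /* stand-in for the info pointer */
    struct Closure *indirectee;  /* only meaningful once info == IND */
} Closure;

/* Updater: fields first, then the barrier, then the info word. */
void update_with_indirection(Closure *c, Closure *target)
{
    c->indirectee = target;                                      /* 1. fields */
    atomic_thread_fence(memory_order_release);                   /* 2. write barrier */
    atomic_store_explicit(&c->info, IND, memory_order_relaxed);  /* 3. info */
}

/* Observer: info word first, then the barrier, then the fields. */
Closure *follow(Closure *c)
{
    int info = atomic_load_explicit(&c->info, memory_order_relaxed);  /* 1. info */
    atomic_thread_fence(memory_order_acquire);                        /* 2. read barrier */
    if (info == IND)
        return c->indirectee;                                         /* 3. fields */
    return c;
}
```

If an observer sees `IND`, the fence pairing guarantees that it also sees the indirectee written before the updater's barrier, which rules out the bad reordering described earlier.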
This patch updates RTS code to obey this pattern. This should fix long-standing
SMP bugs on ARM (specifically newer aarch64 microarchitectures supporting
out-of-order execution) and PowerPC. This fixes issue #15449.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
ghc-pkg needs to be aware of platforms so it can figure out which
subdir within the user package db to use. This is admittedly
roundabout, but maybe Cabal could use the same notion of a platform as
GHC to good effect too.
|
("Continuation BlockIds" is referenced in CmmProcPoint)
[skip ci]
|
Previously log and exp were primitives yet log1p and expm1 were FFI
calls. Fix this non-uniformity.
|
[skip ci]
This should really be caught by the linters! (#16711)
|
After the previous commit, `Settings` is just a thin wrapper around
other groups of settings. While `Settings` is used by GHC-the-executable
to initialize `DynFlags`, in principle another consumer of
GHC-the-library could initialize `DynFlags` a different way. It
therefore doesn't make sense for `DynFlags` itself (library code) to
separate the settings that typically come from `Settings` from the
settings that typically don't.
|
Previously -ddump-cmm was generating code with unbalanced curly braces:
stg_atomically_entry() // [R1]
{ info_tbls: [(cfl,
label: stg_atomically_info
rep: tag:16 HeapRep 1 ptrs { Thunk }
srt: Nothing)]
stack_info: arg_space: 8 updfr_space: Just 8
}
{offset
cfl: // cfk
unwind Sp = Just Sp + 0;
_cfk::P64 = R1;
//tick src<rts/PrimOps.cmm:(1243,1)-(1245,1)>
R1 = I64[_cfk::P64 + 8 + 8 + 0 * 8];
call stg_atomicallyzh(R1) args: 8, res: 0, upd: 8;
}
}, <---- OPENING BRACE MISSING
After this patch:
stg_atomically_entry() { // [R1] <---- MISSING OPENING BRACE HERE
{ info_tbls: [(cfl,
label: stg_atomically_info
rep: tag:16 HeapRep 1 ptrs { Thunk }
srt: Nothing)]
stack_info: arg_space: 8 updfr_space: Just 8
}
{offset
cfl: // cfk
unwind Sp = Just Sp + 0;
_cfk::P64 = R1;
//tick src<rts/PrimOps.cmm:(1243,1)-(1245,1)>
R1 = I64[_cfk::P64 + 8 + 8 + 0 * 8];
call stg_atomicallyzh(R1) args: 8, res: 0, upd: 8;
}
},
|
1. If GHC is to be multi-target, these cannot be baked in at compile
time.
2. Compile-time flags have a higher maintenance cost than run-time flags.
3. The old way mixes build system implementation (various bootstrapping
details) with the thing being built. E.g. GHC doesn't need to care
about which integer library *will* be used---this is purely a crutch
so the build system doesn't need to pass flags later when using that
library.
4. Experience with cross compilation in Nixpkgs has shown things work
nicer when compilers can *optionally* delegate the bootstrapping to the
package manager. The package manager knows the entire end-goal build
plan, and thus can make top-down decisions on bootstrapping. GHC can
just worry about GHC, not even core libraries like base and ghc-prim!
|
When a new closure identifier is being established to a
local or exported closure already emitted into the same
module, refrain from adding an IND_STATIC closure, and
instead emit an assembly-language alias.
Inter-module IND_STATIC objects still remain, and need to be
addressed by other measures.
Binary-size savings on nofib are around 0.1%.
|
* simplifies registers to have GPR, Float and Double, by removing the SSE2 and X87 constructors
* makes -msse2 assumed/default for x86 platforms, fixing a long-standing nondeterminism in rounding
behavior in 32-bit Haskell code
* removes the 80-bit floating point representation from the supported float sizes
* there's still one tiny bit of x87 support needed,
for handling float and double return values in FFI calls wrt the C ABI on x86_32,
but this one piece does not leak into the rest of the NCG.
* Lots of code that's not been touched in a long time got deleted as a
consequence of all of this
All in all, this change paves the way towards a lot of future
improvements in how GHC handles floating point computations, along with
making the native code gen more accessible to a larger pool of contributors.
|
This commit includes the necessary changes in code and
documentation to support a primop that reverses a word's
bits. It also includes a test.
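For readers unfamiliar with the operation, here is a portable C sketch of reversing the bits of a 64-bit word. The commit message does not show the primop's name or how the code generator implements it, so `bit_reverse64` is a hypothetical name and the swap-by-halves technique below is just one common way to compute the result.

```c
#include <stdint.h>

/* Reverse the order of all 64 bits: bit 0 becomes bit 63, and so on.
   Each step swaps adjacent groups of 1, 2, 4, 8, 16 and 32 bits. */
uint64_t bit_reverse64(uint64_t w)
{
    w = ((w >> 1)  & 0x5555555555555555ULL) | ((w & 0x5555555555555555ULL) << 1);
    w = ((w >> 2)  & 0x3333333333333333ULL) | ((w & 0x3333333333333333ULL) << 2);
    w = ((w >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((w & 0x0F0F0F0F0F0F0F0FULL) << 4);
    w = ((w >> 8)  & 0x00FF00FF00FF00FFULL) | ((w & 0x00FF00FF00FF00FFULL) << 8);
    w = ((w >> 16) & 0x0000FFFF0000FFFFULL) | ((w & 0x0000FFFF0000FFFFULL) << 16);
    return (w >> 32) | (w << 32);
}
```

Applying the function twice returns the original word, which makes for a convenient property test.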
|
|
This moves all URL references to Trac Wiki to their corresponding
GitLab counterparts.
This substitution is classified as follows:
1. Automated substitution using sed with Ben's mapping rule [1]
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...
2. Manual substitution for URLs containing `#` index
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz
3. Manual substitution for strings starting with `Commentary`
Old: Commentary/XxxYyy...
New: commentary/xxx-yyy...
See also !539
[1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json
|