Ensure that the bitmap of the segment that we will clear next is in
cache by the time we reach it.
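A minimal sketch of the idea, assuming a hypothetical segment layout
(the real code lives in the nonmoving sweep logic):

    /* Prefetch the next segment's mark bitmap while clearing the current
     * one, so it is cache-resident by the time we reach it. */
    #include <stddef.h>

    #define SEGMENT_BITMAP_SIZE 1024  /* assumption: bitmap bytes per segment */

    struct Segment {
        struct Segment *next;
        unsigned char bitmap[SEGMENT_BITMAP_SIZE];
    };

    static void clear_bitmaps(struct Segment *seg)
    {
        while (seg != NULL) {
            if (seg->next != NULL)
                __builtin_prefetch(seg->next->bitmap);  /* warm the cache early */
            for (size_t i = 0; i < SEGMENT_BITMAP_SIZE; i++)
                seg->bitmap[i] = 0;
            seg = seg->next;
        }
    }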
Use memchr instead of an open-coded loop. This is nearly twice as fast in
a synthetic benchmark.
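A sketch of the shape of the change, assuming the loop in question scans
a segment's mark bitmap for the first free (zero) byte:

    #include <string.h>

    /* Before: an open-coded byte-at-a-time scan.
     * After: memchr, which libc typically vectorizes. */
    static int first_free_block(const unsigned char *bitmap, size_t len)
    {
        const unsigned char *p = memchr(bitmap, 0, len);
        return p ? (int)(p - bitmap) : -1;
    }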
|
| | | |
| | | |
| | | |
| | | | |
This shortens MarkQueueEntry by 30% (one word).
|
| | | | |
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Perf showed that this single div was capturing up to 10% of samples
in nonmovingMark. However, the overwhelming majority of cases involves
small block sizes, which we can handle explicitly, allowing the
compiler to turn the division into a significantly more efficient
division-by-constant.
While the increase in source code looks scary, this all optimises down
to very nice-looking assembler. At this point the only remaining
hotspots in nonmovingBlockCount are due to memory access.
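A hedged sketch of the trick (the sizes here are illustrative, not GHC's
actual block sizes): dispatching on the common small sizes lets the
compiler strength-reduce each division into shifts and multiplies.

    static unsigned int block_count(unsigned int seg_bytes, unsigned int blk_size)
    {
        switch (blk_size) {
        case 16: return seg_bytes / 16;  /* compiles to a shift */
        case 24: return seg_bytes / 24;  /* division by constant */
        case 32: return seg_bytes / 32;
        case 48: return seg_bytes / 48;
        case 64: return seg_bytes / 64;
        default: return seg_bytes / blk_size;  /* rare: a true div */
        }
    }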
This commit does two things:
* Allow aging of objects during the preparatory minor GC
* Refactor handling of static objects to avoid the use of a hashtable
Otherwise the census is unsafe when mutators are running due to
concurrent mutation.
This introduces a simple census of the non-moving heap (not to be
confused with the heap census used by the heap profiler). This
collects basic heap usage information (number of allocated and free
blocks), which is useful when characterising fragmentation of the
nonmoving heap.
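A minimal sketch of what such a census might record per sub-allocator
(field names are assumptions, not GHC's actual definitions):

    struct NonmovingAllocCensus {
        unsigned int n_live_blocks;  /* allocated blocks */
        unsigned int n_free_blocks;  /* free blocks in active segments */
    };

    /* Fragmentation: the fraction of blocks in active segments unused. */
    static double fragmentation(struct NonmovingAllocCensus c)
    {
        unsigned int total = c.n_live_blocks + c.n_free_blocks;
        return total == 0 ? 0.0 : (double)c.n_free_blocks / (double)total;
    }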
This introduces a few events to mark key points in the nonmoving
garbage collection cycle. These include:
* `EVENT_CONC_MARK_BEGIN`, denoting the beginning of a round of
  marking. This may happen more than once in a single major collection
  since the major collector iterates until it hits a fixed point.
* `EVENT_CONC_MARK_END`, denoting the end of a round of marking.
* `EVENT_CONC_SYNC_BEGIN`, denoting the beginning of the post-mark
  synchronization phase.
* `EVENT_CONC_UPD_REM_SET_FLUSH`, indicating that a capability has
  flushed its update remembered set.
* `EVENT_CONC_SYNC_END`, denoting that all mutators have flushed their
  update remembered sets.
* `EVENT_CONC_SWEEP_BEGIN`, denoting the beginning of the sweep portion
  of the major collection.
* `EVENT_CONC_SWEEP_END`, denoting the end of the sweep portion of the
  major collection.
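A hedged, self-contained sketch of where these events fall in one
collection cycle (trace() and the phase stubs stand in for the RTS's
real machinery):

    #include <stdbool.h>
    #include <stdio.h>

    static void trace(const char *event) { printf("%s\n", event); }
    static bool mark_round(void) { return true; }  /* stub: fixed point reached */
    static void sweep(void) { }                    /* stub */

    static void nonmoving_collect(void)
    {
        bool done;
        do {
            trace("EVENT_CONC_MARK_BEGIN");
            done = mark_round();
            trace("EVENT_CONC_MARK_END");
        } while (!done);

        trace("EVENT_CONC_SYNC_BEGIN");
        trace("EVENT_CONC_UPD_REM_SET_FLUSH");  /* emitted per capability */
        trace("EVENT_CONC_SYNC_END");

        trace("EVENT_CONC_SWEEP_BEGIN");
        sweep();
        trace("EVENT_CONC_SWEEP_END");
    }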
This required some fiddling around with the location of forward
declarations since the C sources generated by GHC's C backend include
only Stg.h.
This requires that we break nonmovingExit into two pieces: we must first
stop the collector so that it relinquishes any capabilities, then shut
down the scheduler, and only then free the nonmoving allocators.
This extends the non-moving collector to allow concurrent collection.
The full design of the collector implemented here is described in detail
in a technical note:
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
This extension involves the introduction of a capability-local
remembered set, known as the /update remembered set/, which tracks
objects which may no longer be visible to the collector due to mutation.
To maintain this remembered set we introduce a write barrier on
mutations, which is enabled while a concurrent mark is underway.
The update remembered set representation is similar to that of the
nonmoving mark queue, being a chunked array of `MarkEntry`s. Each
`Capability` maintains a single accumulator chunk, which it flushes
when (a) it is filled, or (b) the nonmoving collector enters its
post-mark synchronization phase.
While the write barrier touches a significant amount of code it is
conceptually straightforward: the mutator must ensure that the referent
of any pointer it overwrites is added to the update remembered set (a
sketch of the barrier follows the list below). However, there are a few
details:
* In the case of objects with a dirty flag (e.g. `MVar`s) we can
  exploit the fact that only the *first* mutation requires a write
  barrier.
* Weak references, as usual, complicate things. In particular, we must
  ensure that the referent of a weak object is marked if dereferenced by
  the mutator. For this we (unfortunately) must introduce a read
  barrier, as described in Note [Concurrent read barrier on deRefWeak#]
  (in `NonMovingMark.c`).
* Stable names are also a bit tricky, as described in Note [Sweeping
  stable names in the concurrent collector] (`NonMovingSweep.c`).
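A hedged sketch of the barrier's shape (names here are illustrative,
not GHC's actual RTS API):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Closure Closure;

    static bool concurrent_mark_running;   /* set during the mark phase */

    /* Stub: the real RTS would append to the capability's accumulator
     * chunk, flushing it to the collector when full. */
    static void upd_rem_set_push(Closure *old_referent) { (void)old_referent; }

    /* Snapshot-at-the-beginning barrier: before overwriting a field,
     * record its old referent so a concurrent mark cannot lose it. */
    static void write_field(Closure **field, Closure *new_value)
    {
        if (concurrent_mark_running && *field != NULL)
            upd_rem_set_push(*field);
        *field = new_value;
    }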
We take quite some pains to ensure that the high thread count often seen
in parallel Haskell applications doesn't affect pause times. To this end
we allow thread stacks to be marked either by the thread itself (when it
is executed or its stack underflows) or by the concurrent mark thread (if
the thread owning the stack is never scheduled). There is a non-trivial
handshake to ensure that this happens without racing, which is described
in Note [StgStack dirtiness flags and concurrent marking].
Co-Authored-by: Ömer Sinan Ağacan <omer@well-typed.com>
This simply runs the compile_and_run tests with `-xn`, enabling the
nonmoving oldest generation.
This implements the core heap structure and a serial mark/sweep
collector which can be used to manage the oldest-generation heap.
This is the first step towards a concurrent mark-and-sweep collector
aimed at low-latency applications.
The full design of the collector implemented here is described in detail
in a technical note:
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
The basic heap structure used in this design is heavily inspired by
K. Ueno & A. Ohori. "A fully concurrent garbage collector for
functional programs on multicore processors." /ACM SIGPLAN Notices/
Vol. 51. No. 9 (presented at ICFP 2016)
This design is intended to allow both marking and sweeping to proceed
concurrently with execution of a multi-core mutator. Unlike the Ueno
design, which requires no global synchronization pauses, the collector
introduced here requires a stop-the-world pause at the beginning and end
of the mark phase.
To avoid heap fragmentation, the allocator consists of a number of
fixed-size /sub-allocators/. Each of these sub-allocators allocates into
its own set of /segments/, themselves allocated from the block
allocator. Each segment is broken into a set of fixed-size allocation
blocks (which back allocations), in addition to a bitmap (used to track
the liveness of blocks) and some additional metadata (also used to track
liveness).
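A hedged sketch of the segment layout described above (constants and
field names are assumptions, not GHC's actual definitions):

    #include <stdint.h>

    #define NONMOVING_SEGMENT_BLOCKS 512   /* illustrative */

    struct NonmovingSegment {
        struct NonmovingSegment *link;     /* segment-list metadata */
        uint16_t next_free;                /* allocator cursor into the blocks */
        uint8_t  block_size_log2;          /* one block size per segment */
        uint8_t  bitmap[NONMOVING_SEGMENT_BLOCKS];  /* liveness, one byte/block */
        /* ...the fixed-size allocation blocks follow the header... */
    };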
This heap structure enables collection via mark-and-sweep, which can be
performed concurrently via a snapshot-at-the-beginning scheme (although
concurrent collection is not implemented in this patch).
The mark queue is a fairly straightforward chunked-array structure.
The representation is a bit more verbose than a typical mark queue to
accommodate a combination of two features:
* a mark FIFO, which improves the locality of marking, reducing one of
  the major overheads seen in mark/sweep allocators (see [1] for
  details)
* the selector optimization and indirection shortcutting, which
  requires that we track where we found each reference to an object
  in case we need to update the reference at a later point (e.g. when
  we find that it is an indirection). See Note [Origin references in
  the nonmoving collector] (in `NonMovingMark.h`) for details; a sketch
  of such an entry follows this list.
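A hedged sketch of a chunked mark queue with origin tracking (field
names are illustrative):

    #include <stddef.h>

    typedef struct Closure Closure;

    typedef struct {
        Closure  *target;   /* the object to mark */
        Closure **origin;   /* where the reference was found, so it can
                             * be updated later (selector optimization) */
    } MarkQueueEnt;

    #define MARK_CHUNK_ENTRIES 254  /* illustrative chunk capacity */

    typedef struct MarkQueueChunk {
        struct MarkQueueChunk *next;
        size_t head;                            /* next free slot */
        MarkQueueEnt entries[MARK_CHUNK_ENTRIES];
    } MarkQueueChunk;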
Beyond this the mark/sweep is fairly run-of-the-mill.
[1] R. Garner, S.M. Blackburn, D. Frampton. "Effective Prefetch for
Mark-Sweep Garbage Collection." ISMM 2007.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
This flag will enable the use of a non-moving oldest generation.
To keep the non-moving collector nicely separated from the moving
collector its scavenging phase will live in another file,
`NonMovingScav.c`. However, it will need to use these functions so
let's expose them.
This warning is a bit of a relic; there is little reason to avoid
aggregate return values in 2019.
These will be needed when we implement sweeping in the nonmoving
collector.
Merge branch 'wip/gc/aligned-block-allocation' into wip/gc/preparation
This implements support for block group allocations which are aligned to
an integral number of blocks.
This will be used by the nonmoving garbage collector, which uses the
block allocator to allocate the segments which back its heap. These
segments are a fixed number of blocks in size, with each segment being
aligned to the segment size boundary. This allows us to easily find the
segment metadata stored at the beginning of the segment.
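A hedged illustration of the payoff (the constant is illustrative):
because each segment is aligned to its own size, the metadata at the
start of a segment can be recovered from any interior pointer by masking.

    #include <stdint.h>

    #define SEGMENT_SIZE (32 * 1024)  /* must be a power of two */

    struct NonmovingSegment;  /* metadata lives at the segment's start */

    static inline struct NonmovingSegment *segment_of(void *p)
    {
        return (struct NonmovingSegment *)
            ((uintptr_t)p & ~(uintptr_t)(SEGMENT_SIZE - 1));
    }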
The concurrent mark-and-sweep will be performed by a GHC task which will
not hold a capability. This is necessary to avoid a concurrent mark from
interfering with minor generation collections.
However, the major collector must synchronize with the mutators at the
end of marking to flush their update remembered sets. This patch extends
the `requestSync` mechanism used to synchronize garbage collectors to
allow synchronization without holding a capability.
This change is fairly straightforward as the capability was previously
only required for two reasons:
1. to ensure that we don't try to re-acquire a capability that the
   sync requestor already holds.
2. to provide a way to suspend and later resume the sync request if
   there is already a sync pending.
When synchronizing without holding a capability we needn't worry about
consideration (1) at all.
(2) is slightly trickier and may happen, for instance, when a capability
requests a minor collection and shortly thereafter the non-moving mark
thread requests a post-mark synchronization. In this case we need to
ensure that the non-moving mark thread suspends its request until after
the minor GC has concluded to avoid deadlocking. For this we introduce
a condition variable, `sync_finished_cond`, which a
non-capability-bearing requestor will wait on and which is signalled
after a synchronization or GC has finished.
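A hedged sketch of the handshake (the condition variable name follows
the commit message; everything else is simplified):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t sync_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  sync_finished_cond = PTHREAD_COND_INITIALIZER;
    static bool sync_pending = false;

    /* Called by the non-capability-bearing requestor: wait out any
     * pending sync/GC, then claim the sync slot. */
    static void request_sync_without_capability(void)
    {
        pthread_mutex_lock(&sync_lock);
        while (sync_pending)
            pthread_cond_wait(&sync_finished_cond, &sync_lock);
        sync_pending = true;
        pthread_mutex_unlock(&sync_lock);
    }

    /* Called when a synchronization or GC finishes. */
    static void sync_finished(void)
    {
        pthread_mutex_lock(&sync_lock);
        sync_pending = false;
        pthread_cond_broadcast(&sync_finished_cond);
        pthread_mutex_unlock(&sync_lock);
    }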
These were previously quite unclear and will change a bit under the
non-moving collector, so let's clear this up now.
This was slightly non-obvious, so a note seems deserved.
Namely ensure that block descriptors are initialized with valid
generation numbers.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
Unfortunately this was introduced in base-4.11.0 (GHC 8.4.1)
whereas the other locking primitives were added in base-4.10.0 (GHC
8.2.1).
Previously -ddump-stg would dump pre- and post-unarise STG. Now we have
a new flag for post-unarise STG, and -ddump-stg only dumps coreToStg
output.
STG dump flags after this commit:
- -ddump-stg: dumps the CoreToStg output
- -ddump-stg-unarised: dumps the unarise output
- -ddump-stg-final: dumps the STG right before code gen (includes CSE
  and lambda lifting)
cmd uses RawCommand, which uses Windows semantics to find the executable
and sometimes seems to fail for unclear reasons.
If we invoke ghc via bash then bash will find the ghc executable and
the issue goes away.
Previously the make build system would pass things like
`-optl-optl-Wl,-x -optl-optl-Wl,noexecstack` to GHC. This would
naturally result in mass confusion, as GHC would pass `-optl-Wl,-x` to
GCC. GCC would in turn interpret this as `-o ptl-Wl,-x`, setting the
output path of the invocation.
The problem was that `-optl` was added to the command line in two places
in the build system. Fix this.
Fixes #17385.
Fixes #17382.
The "new" performance testing infrastructure resets the baseline after
every test so it's easy to miss gradual performance regressions over
time. We should at least make these numbers smaller to catch patches
which affect performance earlier.
19 times out of 20 we already have dynflags in scope.
We could just always use `return dflags`. But this is in fact not free.
When looking at some STG code I noticed that we always allocate a
closure for this expression on the heap. Clearly a waste in these cases.
For the other cases we can either just modify the call site to
get dynflags or use the _D variants of withTiming I added, which
will use getDynFlags under the hood.
Previously a partial rollback of a branch of an `orElse` was attempted
if a validation failure was observed. Validation here, however, does
not account for which part of the transaction observed the inconsistent
state. This commit fixes this by fully aborting and restarting the
transaction.
This is part two of fixing #17334.
There are two parts to this commit:
- A bugfix for computing loop levels
- A bugfix of basic block invariants in the NCG.
-----------------------------------------------------------
In the first bug we ended up with a CFG of the sort: [A -> B -> C]
This was represented via maps as fromList [(A,B),(B,C)] and later
transformed into an adjacency array. However the transformation did
not include block C in the array (since we only looked at the keys of
the map).
This was still fine until we tried to look up the successors of C and
read outside of the array bounds.
In order to prevent this in the future I refactored the code to include
all nodes as keys in the map representation, and made this an invariant
which is checked in a few places.
Overall I expect this to make the code more robust, as now any failed
lookup will represent an error, versus failed lookups sometimes being
expected and sometimes not.
In terms of performance this makes some things cheaper (getting a list
of all nodes) and others more expensive (adding a new edge). Overall
this adds up to no notable performance difference.
-----------------------------------------------------------
Part 2: When the NCG generated a new basic block, it did
not always insert a NEWBLOCK meta instruction in the stream, which
caused a quite subtle bug.
During instruction selection a statement `s`
in a block B with control of the sort: B -> C
will sometimes result in control
flow of the sort:
         ┌ < ┐
         v   ^
   B ->  B1  ┴ -> C
as is the case for some atomic operations.
Now to keep the CFG in sync when introducing B1 we clearly
want to insert it between B and C. However there is
a catch when we have to deal with self loops.
We might start with code and a CFG of these forms:
    loop:
        stmt1              ┌ < ┐
        ....               v   ^
        stmtX             loop ┘
        stmtY
        ....
        goto loop:
Now we introduce B1:
                            ┌ ─ ─ ─ ─ ─┐
    loop:                   │   ┌ <  ┐ │
        instrs              v   │    │ ^
        ....               loop ┴ B1 ┴ ┘
        instrsFromX
        stmtY
        goto loop:
This is simple, all outgoing edges from loop now simply
start from B1 instead and the code generator knows which
new edges it introduced for the self loop of B1.
Disaster strikes if the statement Y follows the same pattern.
If we apply the same rule that all outgoing edges change then
we end up with:
        loop ─> B1 ─> B2 ┬─┐
          │     │   └─<┤  │
          │     └───<───┘ │
          └───────<────────┘
This is problematic. The edge B1->B1 is modified as expected.
However the modification is wrong!
The assembly in this case looked like this:
    _loop:
        <instrs>
    _B1:
        ...
        cmpxchgq ...
        jne _B1
        <instrs>
        <end _B1>
    _B2:
        ...
        cmpxchgq ...
        jne _B2
        <instrs>
        jmp loop
There is no edge _B2 -> _B1 here. It's still a self loop onto _B1.
The problem here is that really B1 should be two basic blocks.
Otherwise we have control flow in the *middle* of a basic block.
A contradiction!
So to account for this we add yet another basic block marker:
    _B:
        <instrs>
    _B1:
        ...
        cmpxchgq ...
        jne _B1
        jmp _B1'
    _B1':
        <instrs>
        <end _B1>
    _B2:
        ...
Now when inserting B2 we will only look at the outgoing edges of B1' and
everything will work out nicely.
You might also wonder why we don't insert jumps at the end of _B1'. There is
no way another block ends up jumping to the labels _B1 or _B2 since they are
essentially invisible to other blocks. View them as control flow labels local
to the basic block if you'd like.
Not doing this ultimately caused (part 2 of) #17334.
This commit allows command name resolution for GHCi commands
with the option suffix `!`, as follows:

    ghci> :k! Int
    Int :: *
    = Int

This commit changes the implementation as follows:
Before:
* Prefix match with the full string, including the option `!` (e.g. `k!`)
After (this patch):
* Prefix match without the option suffix `!` (e.g. `k`)
* In addition, suffix match with the option `!`
See also #8305 and #8113
With this change it is possible to reconstruct the timing portion of a
`.prof` file after the fact. By logging the stacks at each time point,
a more precise execution trace of the program can be observed, rather
than all identical cost centres being conflated in the report.
There are two new events:
1. `EVENT_PROF_BEGIN` - emitted at the start of profiling to communicate
   the tick interval
2. `EVENT_PROF_SAMPLE_COST_CENTRE` - emitted on each tick to communicate
   the current call stack.
Fixes #17322