| Commit message | Author | Age | Files | Lines |
|
0e274c39bf836d5bb846f5fa08649c75f85326ac added an assertion in
`dirty_MUT_VAR` checking that the MUT_VAR being dirtied was clean.
However, this isn't necessarily the case since another thread may have
raced us to dirty the object.
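A standalone C11 sketch of the race (illustrative only, not the RTS code): with two threads dirtying the same object, the loser of the exchange observes it already DIRTY, which is exactly the state the removed assertion declared impossible.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    enum { CLEAN, DIRTY };
    static _Atomic int info = CLEAN;

    /* Sketch of a dirty_MUT_VAR-like path: the thread that loses the race
     * observes DIRTY here, so asserting the object "was clean" can fail
     * spuriously under concurrency. */
    static void *dirty_it(void *arg)
    {
        (void)arg;
        int old = atomic_exchange(&info, DIRTY);
        if (old != CLEAN)
            printf("raced: object was already dirty\n");
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, dirty_it, NULL);
        pthread_create(&b, NULL, dirty_it, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }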
|
The logic here was inverted. Reverting the commit to avoid confusion
when examining the commit history.
This reverts commit b3eacd64fb36724ed6c5d2d24a81211a161abef1.
|
This avoids a lock inversion between the storage manager mutex and
the stable pointer table mutex by not dropping the SM_MUTEX in
nonmovingCollect. This requires quite a bit of rejiggering but it
does seem like a better strategy.
|
0e274c39bf836d5bb846f5fa08649c75f85326ac added an assertion in
`dirty_MUT_VAR` checking that the MUT_VAR being dirtied was clean.
However, this isn't necessarily the case since another thread may have
raced us to dirty the object.
|
And ensure accesses to n_capabilities are atomic (although with relaxed
ordering). This is necessary as RTS API callers may concurrently call
into the RTS without holding a capability.
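A C11 sketch of that access pattern (illustrative; the RTS wraps equivalent builtins in its own relaxed-access macros):

    #include <stdatomic.h>

    static _Atomic unsigned int n_capabilities = 1;

    /* Callers that may not hold a capability must still see a torn-free
     * value; relaxed ordering suffices because no other memory is
     * synchronized through this variable. */
    unsigned int get_n_capabilities(void)
    {
        return atomic_load_explicit(&n_capabilities, memory_order_relaxed);
    }

    void set_n_capabilities(unsigned int n)
    {
        atomic_store_explicit(&n_capabilities, n, memory_order_relaxed);
    }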
|
Also introduce MUT_FIELD marker in Closures.h to document mutable
fields.
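A hedged sketch of what such a marker amounts to (the definition and types here are invented for illustration; the real marker lives in Closures.h):

    /* Expands to nothing: a purely documentary annotation for closure
     * fields that can be mutated after construction and therefore
     * interact with dirty flags and write barriers. */
    #define MUT_FIELD

    typedef struct { const void *info; } StgHeaderSketch;

    typedef struct {
        StgHeaderSketch header;
        void *var MUT_FIELD;   /* mutated by writeMutVar#, hence marked */
    } StgMutVarSketch;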
|
This patch makes flushExec a no-op on wasm32, since there's no such
thing as executable memory on wasm32 in the first place.
|
comparison (#20088)
|
is used outside of the RTS, so we do this rather than just fish it out of
the repo in an ad-hoc way, in order to make packages in this repo more
self-contained.
|
All uses of these now use ExecPage.
|
Previously the libffi Adjustor implementation would use allocateExec to
create executable mappings. However, allocateExec is also used elsewhere
in GHC to allocate things other than ffi_closures, a use case which
libffi does not support.
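For contrast, the closure-allocation path libffi does support is the ffi_closure_alloc/ffi_prep_closure_loc pair; a minimal usage sketch:

    #include <ffi.h>
    #include <stdio.h>

    /* Handler invoked whenever the generated code pointer is called. */
    static void handler(ffi_cif *cif, void *ret, void **args, void *user_data)
    {
        (void)cif; (void)user_data;
        /* Return values narrower than ffi_arg must be written as ffi_arg. */
        *(ffi_arg *)ret = *(int *)args[0] + 1;
    }

    int main(void)
    {
        ffi_cif cif;
        ffi_type *argtypes[] = { &ffi_type_sint };
        void *code;                    /* the executable entry point */
        ffi_closure *closure = ffi_closure_alloc(sizeof(ffi_closure), &code);

        if (!closure
            || ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 1,
                            &ffi_type_sint, argtypes) != FFI_OK
            || ffi_prep_closure_loc(closure, &cif, handler, NULL, code) != FFI_OK)
            return 1;

        int (*fn)(int) = (int (*)(int))code;
        printf("%d\n", fn(41));        /* prints 42 */
        ffi_closure_free(closure);
        return 0;
    }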
|
We now have two darwin flavours: AArch64-Darwin and
x86_64-darwin. The latter has proper custom adjustor
support, while the former relies on libffi.
Mixing both leads to odd crashes, as the closures might
not fit the size of the libffi closures. Hence this
needs to be guarded by the USE_LIBFFI_FOR_ADJUSTORS guard.
Original patch by Hamish Mackenzie
|
This drops allocateExec for darwin, and replaces it with
an alloc, write, mark-executable strategy instead. This prevents
us from trying to allocate an executable range and then write to
it, which W^X will prohibit on darwin.
This will *only* work if we can use mmap.
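A minimal POSIX sketch of that dance (illustrative only; the real implementation is the RTS's ExecPage machinery):

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a page read/write, copy the code in, then flip the mapping to
     * read/execute.  The mapping is never writable and executable at the
     * same time, which is exactly what W^X platforms demand. */
    static void *alloc_exec_sketch(const void *code, size_t len)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t sz = (len + page - 1) & ~(page - 1);
        void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        memcpy(p, code, len);
        if (mprotect(p, sz, PROT_READ | PROT_EXEC) != 0) {
            munmap(p, sz);
            return NULL;
        }
        return p;
    }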
|
This means it will be reposted every time the eventlog is started.
|
The way in which allocatePinned took blocks out of the nursery was
leading to horrible fragmentation in some workloads.
The strategy now is that a separate free block list is reserved for each
capability and blocks are taken from there. When it's empty the global
SM lock is taken and a fresh block of size PINNED_EMPTY_SIZE is allocated.
Fixes #19481
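A simplified, self-contained sketch of that strategy (names invented; the real code deals in bdescr block groups):

    #include <stdlib.h>
    #include <pthread.h>

    #define PINNED_EMPTY_SIZE 32   /* blocks fetched per refill; illustrative */

    typedef struct Block_ { struct Block_ *next; } Block;
    typedef struct { Block *pinned_free; } Capability;

    static pthread_mutex_t sm_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for carving a group of blocks out of the global pool. */
    static Block *alloc_block_group(int n)
    {
        Block *head = NULL;
        for (int i = 0; i < n; i++) {
            Block *b = malloc(sizeof *b);
            b->next = head;
            head = b;
        }
        return head;
    }

    /* The fast path touches only capability-local state; the global SM
     * lock is taken only when the local free list runs dry. */
    static Block *take_pinned_block(Capability *cap)
    {
        if (cap->pinned_free == NULL) {
            pthread_mutex_lock(&sm_mutex);
            cap->pinned_free = alloc_block_group(PINNED_EMPTY_SIZE);
            pthread_mutex_unlock(&sm_mutex);
        }
        Block *b = cap->pinned_free;
        cap->pinned_free = b->next;
        return b;
    }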
|
This function is exposed in RtsAPI.h so that external users have a
blessed way to traverse all the different `bdescr`s known to the RTS.
The main motivation is to use this function in ghc-debug while avoiding
having to expose the internal structure of a Capability in the API.
|
But only when profiling or DEBUG are enabled.
Fixes #17572.
|
This is necessary since the user may enable `+RTS -hT` at any time.
|
This adds the necessary logic to support aarch64 on ELF, as well
as aarch64 on Mach-O, which Apple calls arm64.
We change the architecture name to AArch64, which is the official Arm
naming scheme.
|
Since the latter wants to call getRTSStats.
|
We're now correctly computing allocated bytes on 32-bit architectures, so
we get huge increases.
Metric Increase:
haddock.Cabal
haddock.base
haddock.compiler
space_leak_001
|
The code is just more confusing than it needs to be. We don't need to mix
the threaded check with the LDV profiling check, since LDV's init already
checks for this. Hence they can be two separate checks. Taking the sanity
checking into account is also cleaner via DebugFlags.sanity; there is no
need to check the DEBUG define.
The ZERO_SLOP_FOR_LDV_PROF and ZERO_SLOP_FOR_SANITY_CHECK definitions the
old code had also made things a lot more opaque IMO, so I removed those.
|
Previously we (incorrectly) relied on failed_to_evac to be "precise".
That is, we expected it to only be true if *all* of an object's fields
lived outside of the non-moving heap. However, this does not match the
behavior of failed_to_evac, which is true if *any* of the object's
fields weren't promoted (meaning that some others *may* live in the
non-moving heap).
This is problematic as we skip the non-moving write barrier for dirty
objects (which we can only safely do if *all* fields point outside of
the non-moving heap).
Clearly this arises due to a fundamental difference in the behavior
expected of failed_to_evac in the moving and non-moving collector.
e.g., in the moving collector it is always safe to conservatively say
failed_to_evac=true whereas in the non-moving collector the safe value
is false.
This issue went unnoticed as I never wrote down the dirtiness
invariant enforced by the non-moving collector. We now define this
invariant as
An object being marked as dirty implies that all of its fields are
on the mark queue (or, equivalently, update remembered set).
To maintain this invariant we teach nonmovingScavengeOne to push the
fields of objects which we fail to evacuate to the update remembered
set. This is a simple and reasonably cheap solution and avoids the
complexity and fragility that other, more strict alternative invariants
would require.
All of this is described in a new Note, Note [Dirty flags in the
non-moving collector] in NonMoving.c.
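A schematic, self-contained sketch of the fix (all names invented; the real logic lives in nonmovingScavengeOne): when evacuation of any field failed, push every field to the update remembered set before leaving the object dirty, so the invariant above holds.

    #include <stddef.h>

    typedef struct Closure_ {
        int dirty;                   /* dirty => fields are in the rem set */
        size_t n_fields;
        struct Closure_ **fields;
    } Closure;

    /* Stub standing in for pushing to the update remembered set. */
    static void upd_rem_set_push(Closure *c) { (void)c; }

    /* If any field failed to evacuate we can no longer prove that every
     * field lives outside the non-moving heap, so push them all before
     * leaving the object marked dirty -- upholding the new invariant. */
    static void scavenge_one(Closure *c, int failed_to_evac)
    {
        if (failed_to_evac) {
            for (size_t i = 0; i < c->n_fields; i++)
                upd_rem_set_push(c->fields[i]);
            c->dirty = 1;
        } else {
            c->dirty = 0;   /* all fields now live outside the non-moving heap */
        }
    }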
|
The heap profiler currently cannot traverse pinned blocks because of
alignment slop. This used to just be a minor annoyance as the whole block
is accounted into a special cost center rather than the respective object's
CCS, cf. #7275. However for the new root profiler we would like to be able
to visit _every_ closure on the heap. We need to do this so we can get rid
of the current 'flip' bit hack in the heap traversal code.
Since info pointers are always non-zero we can in principle skip all the
slop in the profiler if we can rely on it being zeroed. This assumption
caused problems in the past though, commit a586b33f8e ("rts: Correct
handling of LARGE ARR_WORDS in LDV profiler"), part of !1118, tried to use
the same trick for BF_LARGE objects but neglected to take into account that
shrink*Array# functions don't ensure that slop is zeroed when not
compiling with profiling.
Later, commit 0c114c6599 ("Handle large ARR_WORDS in heap census (fix
#17572)") had to revisit this. Here the assumption is safe, as we will
only be assuming slop is zeroed when profiling is on.
This commit also reduces the amount of slop we introduce in the first
place by calculating the needed alignment before doing the allocation for
small objects, where we know the next available address. For large objects
we don't know how much alignment we'll have to do yet, since those details
are hidden behind the allocateMightFail function, so there we continue to
allocate the maximum additional words we'll need to do the alignment.
So that we don't have to duplicate all this logic in the Cmm code, we pull
it into the RTS allocatePinned function instead.
Metric Decrease:
T7257
haddock.Cabal
haddock.base
|
Similar to SELinux, NetBSD's "PaX mprotect" prohibits marking a page
mapping both writable and executable at the same time. Use libffi,
which knows how to work around it.
|
The below is only necessary to fix the CI perf fluke that
happened in 9897e8c8ef0b19a9571ef97a1d9bb050c1ee9121:
-------------------------
Metric Decrease:
T5837
T6048
T9020
T12425
T12234
T13035
T12150
Naperian
-------------------------
|
This commit does two things:
* Allow aging of objects during the preparatory minor GC
* Refactor handling of static objects to avoid the use of a hashtable
|
This requires that we break nonmovingExit into two pieces, since we need
to first stop the collector to relinquish any capabilities, then we need
to shut down the scheduler, then we need to free the nonmoving
allocators.
|
This extends the non-moving collector to allow concurrent collection.
The full design of the collector implemented here is described in detail
in a technical note
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
This extension involves the introduction of a capability-local
remembered set, known as the /update remembered set/, which tracks
objects which may no longer be visible to the collector due to mutation.
To maintain this remembered set we introduce a write barrier on
mutations which is enabled while a concurrent mark is underway.
The update remembered set representation is similar to that of the
nonmoving mark queue, being a chunked array of `MarkEntry`s. Each
`Capability` maintains a single accumulator chunk, which it flushes
when (a) it is filled, or (b) the nonmoving collector enters its
post-mark synchronization phase.
While the write barrier touches a significant amount of code it is
conceptually straightforward: the mutator must ensure that the referee
of any pointer it overwrites is added to the update remembered set (a
sketch follows the list below). However, there are a few details:
* In the case of objects with a dirty flag (e.g. `MVar`s) we can
exploit the fact that only the *first* mutation requires a write
barrier.
* Weak references, as usual, complicate things. In particular, we must
ensure that the referee of a weak object is marked if dereferenced by
the mutator. For this we (unfortunately) must introduce a read
barrier, as described in Note [Concurrent read barrier on deRefWeak#]
(in `NonMovingMark.c`).
* Stable names are also a bit tricky as described in Note [Sweeping
stable names in the concurrent collector] (`NonMovingSweep.c`).
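Schematically, the barrier on a pointer overwrite looks like this (a self-contained sketch with invented names; in the RTS the flag and push function are internal to the non-moving collector):

    #include <stdbool.h>

    static bool nonmoving_write_barrier_enabled;  /* true while a mark runs */

    /* Stub standing in for pushing to the update remembered set. */
    static void upd_rem_set_push(void *referee) { (void)referee; }

    /* Snapshot-at-the-beginning: the referee being overwritten may be the
     * collector's only remaining path to an unmarked object, so record it
     * before the store. */
    static void write_field(void **field, void *new_value)
    {
        if (nonmoving_write_barrier_enabled && *field != NULL)
            upd_rem_set_push(*field);
        *field = new_value;
    }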
We take quite some pains to ensure that the high thread count often seen
in parallel Haskell applications doesn't affect pause times. To this end
we allow thread stacks to be marked either by the thread itself (when it
is executed or stack-underflows) or the concurrent mark thread (if the
thread owning the stack is never scheduled). There is a non-trivial
handshake to ensure that this happens without racing which is described
in Note [StgStack dirtiness flags and concurrent marking].
Co-Authored-by: Ömer Sinan Ağacan <omer@well-typed.com>
|
This implements the core heap structure and a serial mark/sweep
collector which can be used to manage the oldest-generation heap.
This is the first step towards a concurrent mark-and-sweep collector
aimed at low-latency applications.
The full design of the collector implemented here is described in detail
in a technical note
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
The basic heap structure used in this design is heavily inspired by
K. Ueno & A. Ohori. "A fully concurrent garbage collector for
functional programs on multicore processors." /ACM SIGPLAN Notices/
Vol. 51. No. 9 (presented at ICFP 2016)
This design is intended to allow both marking and sweeping to proceed
concurrently with the execution of a multi-core mutator. Unlike the Ueno
design, which requires no global synchronization pauses, the collector
introduced here requires a stop-the-world pause at the beginning and end
of the mark phase.
To avoid heap fragmentation, the allocator consists of a number of
fixed-size /sub-allocators/. Each of these sub-allocators allocates into
its own set of /segments/, themselves allocated from the block
allocator. Each segment is broken into a set of fixed-size allocation
blocks (which back allocations) in addition to a bitmap (used to track
the liveness of blocks) and some additional metadata (also used to
track liveness).
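A rough structural sketch of one segment (field names and sizes invented for illustration; the real definitions live in the non-moving collector's headers):

    #include <stdint.h>

    #define SEGMENT_BLOCKS 1024      /* illustrative; depends on block size */

    /* One fixed-size-allocation segment: a liveness bitmap, a little
     * allocator metadata, and the allocation blocks themselves. */
    typedef struct {
        uint8_t  bitmap[SEGMENT_BLOCKS]; /* one mark byte per block */
        uint16_t next_free;              /* allocator cursor into `blocks` */
        uint16_t block_size;             /* all blocks in a segment share one size */
        uint8_t  blocks[];               /* the blocks backing allocations */
    } Segment;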
This heap structure enables collection via mark-and-sweep, which can be
performed concurrently via a snapshot-at-the-beginning scheme (although
concurrent collection is not implemented in this patch).
The mark queue is a fairly straightforward chunked-array structure.
The representation is a bit more verbose than a typical mark queue to
accommodate a combination of two features:
* a mark FIFO, which improves the locality of marking, reducing one of
the major overheads seen in mark/sweep allocators (see [1] for
details)
* the selector optimization and indirection shortcutting, which
requires that we track where we found each reference to an object
in case we need to update the reference at a later point (e.g. when
we find that it is an indirection). See Note [Origin references in
the nonmoving collector] (in `NonMovingMark.h`) for details.
Beyond this the mark/sweep is fairly run-of-the-mill.
[1] R. Garner, S.M. Blackburn, D. Frampton. "Effective Prefetch for
Mark-Sweep Garbage Collection." ISMM 2007.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
These were previously quite unclear and will change a bit under the
non-moving collector, so let's clear this up now.
|
- No need to distinguish between gcc-llvm and clang. First of all,
gcc-llvm is quite old and surely unmaintained by now. Second of all,
none of the code actually cares about that distinction!
Now, it does make sense to consider multiple C frontends for LLVM, in
the form of clang vs clang-cl (same clang, yes, but a tweaked
interface). But this is better handled in terms of "gccish vs
msvcish" and "is LLVM", yielding 4 combinations. Therefore, I don't
think it is useful to keep the existing code for that.
- Get the remaining CC_LLVM_BACKEND, and also TABLES_NEXT_TO_CODE in
mk/config.h the normal way, rather than hacking it post-hoc. No point
keeping these special cases around for no reason.
- Get rid of the hand-rolled `die` function and just use `AC_MSG_ERROR`.
- Abstract check + flag override for unregisterised and tables next to
code.
Oh, and as part of the above I also renamed/combined some variables
where it felt appropriate.
- GccIsClang -> CcLlvmBackend. This is for `AC_SUBST`, like the other
camel case ones. It was never about gcc-llvm, or Apple's renamed clang,
to be clear.
- llvm_CC_FLAVOR -> CC_LLVM_BACKEND. This is for `AC_DEFINE`, like the
other all-caps snake case ones. llvm_CC_FLAVOR was just silly
indirection *and* an odd name to boot.
|
Zeros heap memory after the GC has freed it.
|
Here the following changes are introduced:
- A read barrier machine op is added to Cmm.
- The order in which a closure's fields are read and written is changed.
- Memory barriers are added to RTS code to ensure correctness on
out-of-order machines with weak memory ordering.
Cmm has a new CallishMachOp called MO_ReadBarrier. On weak memory machines, this
is lowered to an instruction that ensures memory reads that occur after said
instruction in program order are not performed before reads coming before said
instruction in program order. On machines with strong memory ordering properties
(e.g. X86, SPARC in TSO mode) no such instruction is necessary, so
MO_ReadBarrier is simply erased. However, such an instruction is necessary on
weakly ordered machines, e.g. ARM and PowerPC.
Weak memory ordering has consequences for how closures are observed and mutated.
For example, consider a closure that needs to be updated to an indirection. In
order for the indirection to be safe for concurrent observers to enter, said
observers must read the indirection's info table before they read the
indirectee. Furthermore, the entering observer makes assumptions about the
closure based on its info table contents, e.g. an INFO_TYPE of IND implies the
closure has an indirectee pointer that is safe to follow.
When a closure is updated with an indirection, both its info table and its
indirectee must be written. With weak memory ordering, these two writes can be
arbitrarily reordered, and perhaps even interleaved with other threads' reads
and writes (in the absence of memory barrier instructions). Consider this
example of a bad reordering:
- An updater writes to a closure's info table (INFO_TYPE is now IND).
- A concurrent observer branches upon reading the closure's INFO_TYPE as IND.
- A concurrent observer reads the closure's indirectee and enters it. (!!!)
- An updater writes the closure's indirectee.
Here the update to the indirectee comes too late and the concurrent observer has
jumped off into the abyss. Speculative execution can also cause us issues,
consider:
- An observer is about to case on a value in closure's info table.
- The observer speculatively reads one or more of closure's fields.
- An updater writes to closure's info table.
- The observer takes a branch based on the new info table value, but with the
old closure fields!
- The updater writes to the closure's other fields, but it's too late.
Because of these effects, reads and writes to a closure's info table must be
ordered carefully with respect to reads and writes to the closure's other
fields, and memory barriers must be placed to ensure that reads and writes occur
in program order. Specifically, updates to a closure must follow the following
pattern:
- Update the closure's (non-info table) fields.
- Write barrier.
- Update the closure's info table.
Observing a closure's fields must follow the following pattern:
- Read the closure's info pointer.
- Read barrier.
- Read the closure's (non-info table) fields.
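In C11-atomics terms the two patterns correspond to a release store of the info pointer paired with an acquire load (a hedged sketch; the RTS spells these steps with its own barrier primitives):

    #include <stdatomic.h>
    #include <stddef.h>

    static const int IND_info_word;              /* stand-in for the IND info table */
    #define IND_info ((const void *)&IND_info_word)

    typedef struct {
        _Atomic(const void *) info;   /* info table pointer, written last */
        void *indirectee;             /* one of the "other" fields */
    } Closure;

    /* Updater: write the fields, then publish the new info pointer with a
     * release store (the "write barrier" step in the pattern above). */
    static void update_to_ind(Closure *c, void *target)
    {
        c->indirectee = target;
        atomic_store_explicit(&c->info, IND_info, memory_order_release);
    }

    /* Observer: read the info pointer first, with an acquire load (the
     * "read barrier" step), before trusting any other field. */
    static void *enter(Closure *c)
    {
        const void *info = atomic_load_explicit(&c->info, memory_order_acquire);
        return info == IND_info ? c->indirectee : NULL;
    }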
This patch updates RTS code to obey this pattern. This should fix long-standing
SMP bugs on ARM (specifically newer aarch64 microarchitectures supporting
out-of-order execution) and PowerPC. This fixes issue #15449.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
This:
- Hoists part of the condition outside of the initialization loop in
`stg_newSmallArrayzh`.
- Annotates one of the unlikely branches as unlikely, also in
`stg_newSmallArrayzh`.
- Adds a couple of annotations to `allocateMightFail` indicating which
branches are likely to be taken.
Together this gives about 5% improvement.
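In C the same hint is spelled with __builtin_expect (in .cmm source the annotation is written on the branch itself); a tiny sketch with invented parameters:

    #include <stddef.h>

    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Sketch of allocateMightFail-style hinting: tell the compiler the
     * failure branch is cold so the hot path is laid out fall-through. */
    static void *alloc_might_fail(size_t n_words, size_t limit,
                                  void *(*alloc)(size_t))
    {
        if (unlikely(n_words > limit))
            return NULL;              /* cold: over the allocation limit */
        return alloc(n_words);        /* hot path */
    }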
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
|
This moves all URL references to Trac Wiki to their corresponding
GitLab counterparts.
This substitution is classified as follows:
1. Automated substitution using sed with Ben's mapping rule [1]
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...
2. Manual substitution for URLs containing `#` index
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz
3. Manual substitution for strings starting with `Commentary`
Old: Commentary/XxxYyy...
New: commentary/xxx-yyy...
See also !539
[1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json
|
This moves all URL references to Trac tickets to their corresponding
GitLab counterparts.
|