Solves the quadratic worst case performance of freeing megablocks that
was described in issue #19897.
During GC runs, we now keep a secondary free list for megablocks that is
neither sorted, nor coalesced. That way, free becomes an O(1) operation
at the expense of not being able to reuse memory for larger allocations.
At the end of a GC run, the secondary free list is sorted and then
merged into the actual free list in a single pass.
That way, our worst-case performance is O(n log(n)) rather than O(n^2).
We postulate that temporarily losing coalescence during a single GC run
won't have any adverse effects in practice because:
- We would need to release enough memory during the GC, and then after
that (but within the same GC run) allocate a megablock group of more
than one megablock. This seems unlikely, as large objects are not
copied during GC, and so we shouldn't need such large allocations
during a GC run.
- Allocations of megablock groups of more than one megablock are rare.
They only happen when a single heap object is large enough to require
that amount of space. Any allocation areas that are supposed to hold
more than one heap object cannot use megablock groups, because only
the first megablock of a megablock group has valid `bdescr`s. Thus,
a heap object can only start in the first megablock of a group, not in
later ones.
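A minimal sketch of the scheme (illustrative only: Block, free_list and
secondary_free are simplified stand-ins for the RTS's bdescr and megablock
free-list structures):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct Block { struct Block *next; } Block;

    static Block *free_list = NULL;        /* primary list: sorted by address */
    static Block *secondary_free = NULL;   /* filled during GC: unsorted      */

    /* O(1): push onto the secondary list; no insertion-point search and no
     * coalescing while the GC is running. */
    static void free_during_gc(Block *b) {
        b->next = secondary_free;
        secondary_free = b;
    }

    static int addr_lt(const Block *a, const Block *b) {
        return (uintptr_t)a < (uintptr_t)b;
    }

    /* Merge two address-sorted lists in a single pass. */
    static Block *merge(Block *xs, Block *ys) {
        Block dummy, *tail = &dummy;
        while (xs && ys) {
            if (addr_lt(xs, ys)) { tail->next = xs; xs = xs->next; }
            else                 { tail->next = ys; ys = ys->next; }
            tail = tail->next;
        }
        tail->next = xs ? xs : ys;
        return dummy.next;
    }

    /* Standard merge sort on a singly linked list: O(n log n). */
    static Block *sort_blocks(Block *xs) {
        if (xs == NULL || xs->next == NULL) return xs;
        Block *slow = xs, *fast = xs->next;
        while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
        Block *ys = slow->next;
        slow->next = NULL;
        return merge(sort_blocks(xs), sort_blocks(ys));
    }

    /* End of the GC run: one sort of the secondary list, then one merge pass
     * into the primary list (coalescing of adjacent blocks could happen here). */
    static void flush_secondary_free_list(void) {
        free_list = merge(free_list, sort_blocks(secondary_free));
        secondary_free = NULL;
    }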
|
This fixes various violations of the newly-added RTS includes linter.
|
Previously `markCAFs` would call `markObjectCode` even in non-major GCs.
This is problematic since `prepareUnloadCheck` is not called in such
GCs, meaning that the section index has not been updated.
Fixes #21254
|
Previously we failed to untag the function closure when scavenging the
payload of a PAP, resulting in an invalid closure pointer being passed
to scavenge_large_bitmap and consequently #21254. Fix this.
Fixes #21254
|
Here we refactor WinIO's IO completion scheme, squashing a memory leak
and fixing #18382.
To fix #18382 we drop the special thread status introduced for IoPort
blocking, BlockedOnIoCompletion, as well as drop the non-threaded RTS's
special dead-lock detection logic (which is redundant to the GC's
deadlock detection logic), as proposed in #20947.
Previously WinIO relied on foreign import ccall "wrapper" to create an
adjustor thunk which could be attached to the OVERLAPPED structure passed
to the operating system. It would then use foreign import ccall
"dynamic" to recover the original continuation from the adjustor. This
roundtrip is significantly more expensive than the alternative, using a
StablePtr. Furthermore, the implementation let the adjustor leak,
meaning that every IO request would leak a page of memory.
Fixes T18382.
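A rough sketch of the StablePtr-style scheme on top of an IO completion
port (illustrative only, not the actual WinIO code; IORequest and its
completion field are assumed names):

    #include <windows.h>

    typedef struct {
        OVERLAPPED overlapped;   /* handed to ReadFile/WriteFile/...        */
        void      *completion;   /* e.g. a StablePtr to the Haskell handler */
    } IORequest;

    /* Completion loop: the OS hands back the OVERLAPPED pointer, from which
     * we recover the enclosing request and hence the handler, with no
     * "dynamic" call back through an adjustor thunk. */
    static void completion_loop(HANDLE port) {
        DWORD bytes; ULONG_PTR key; OVERLAPPED *ol;
        while (GetQueuedCompletionStatus(port, &bytes, &key, &ol, INFINITE)) {
            IORequest *req = CONTAINING_RECORD(ol, IORequest, overlapped);
            /* ... pass req->completion and bytes back into Haskell, then
               release the StablePtr and free req ... */
            (void)req;
        }
    }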
|
Despite the documented care having been taken, several bugs are fixed here.
When run with -qn1, when a SYNC_GC_PAR is requested we will have
n_gc_threads == n_capabilities && n_gc_idle_threads == (n_gc_threads - 1)
In this case we now:
* Don't increment par_collections
* Don't increment par_balanced_copied
* Don't emit debug traces for idle threads
* Take the fast path in scavenge_until_all_done, wakeup_gc_threads, and
shutdown_gc_threads.
Some ASSERTs have also been tightened.
Fixes #19685
|
comparison (#20088)
|
Previously `markCAFs` would only evacuate CAFs' indirectees. This would
allow reachable object code to be unloaded by the linker as `evacuate`
may never be called on the CAF itself, despite it being reachable via
the `{dyn,revertible}_caf_list`s.
To fix this we teach `markCAFs` to explicitly call `markObjectCode`,
ensuring that the linker is aware of objects reachable via the CAF
lists.
Fixes #20649.
|
These functions really do no marking; they merely trace pointers.
|
We need to ensure that the TRecChunk itself is marked, in addition to
the TRecs it contains.
|
The space requirements of the non-moving gc are comparable to the
compacting gc, not the copying gc.
The copying gc has a much larger space overhead.
Fixes #20475
|
While the thread ids had been changed to 64 bit words in
e57b7cc6d8b1222e0939d19c265b51d2c3c2b4c0 the return type of the foreign
import function used to retrieve these ids - namely
'GHC.Conc.Sync.getThreadId' - was never updated accordingly.
In order to fix that, this function now returns a 'CULLong'.
In addition, the types used in the thread labeling subsystem were adjusted
as well, and several format strings were modified throughout the whole RTS
to display thread ids in a consistent and correct way.
Fixes #16761
|
Previously PerformPut failed to respect the non-moving collector's
snapshot invariant, hiding references to an MVar and its new value by
overwriting a stack frame without dirtying the stack. Fix this.
PerformTake exhibited a similar bug, failing to dirty (and therefore
mark) the blocked stack before mutating it.
Closes #20399.
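For illustration, a generic snapshot-at-the-beginning write barrier of the
kind this invariant requires (a simplified stand-in, not the RTS's
dirty_STACK/mark-queue machinery):

    #include <stddef.h>

    typedef struct Closure Closure;

    #define MARK_QUEUE_CAP 1024
    static Closure *mark_queue[MARK_QUEUE_CAP];
    static size_t   mark_queue_len = 0;
    static int      concurrent_mark_running = 0;  /* set during the mark phase */

    static void push_to_mark_queue(Closure *old_value) {
        if (mark_queue_len < MARK_QUEUE_CAP)
            mark_queue[mark_queue_len++] = old_value;   /* real code would flush */
    }

    /* Every mutation of a traced field goes through a barrier like this:
     * the old referent is recorded before being overwritten, so the
     * collector's start-of-GC snapshot stays complete. Skipping the barrier
     * (as PerformPut/PerformTake effectively did) hides the old referent. */
    static void write_barrier_and_store(Closure **field, Closure *new_value) {
        if (concurrent_mark_running && *field != NULL)
            push_to_mark_queue(*field);
        *field = new_value;
    }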
|
allocateForCompact() is called when the current allocation for the
compact region does not fit in the nursery. It previously had a special
case for objects exceeding the large object threshold. In that case, it
would allocate a new compact region block just for that object. That led
to a lot of small blocks being allocated in compact regions with a
larger default block size (`autoBlockW`).
This commit removes this special case because having a lot of small
compact region blocks contributes significantly to memory fragmentation.
The removal should be valid because
- a more generic case for allocating a new compact region block follows
at the end of allocateForCompact(), and that one takes `autoBlockW`
into account
- the reason for allocating separate blocks for large objects in the
main heap seems to be to avoid copying during GCs, but once inside
the compact region, the object will never be copied anyway.
Fixes #18757.
A regression test T18757 was added.
|
In order to make the packages in this repo "reinstallable", we need to
associate source code with a specific package. Having a top level
`/includes` dir that mixes concerns (which packages' includes?) gets in
the way of this.
To start, I have moved everything to `rts/`, which is mostly correct.
There are a few things however that really don't belong in the rts (like
the generated constants haskell type, `CodeGen.Platform.h`). Those
needed to be manually adjusted.
Things of note:
- No symlinking for the sake of Windows, so we hard-link at configure time.
- `CodeGen.Platform.h` no longer has a `.hs` extension (in addition to
being moved to `compiler/`) so as not to confuse anyone, since it is
next to Haskell files.
- Blanket `-Iincludes` is gone in both build systems, include paths now
more strictly respect per-package dependencies.
- `deriveConstants` has been taught to not require a `--target-os` flag
when generating the platform-agnostic Haskell type. Make takes
advantage of this, but Hadrian has yet to.
|
is used outside of the rts, so we do this rather than just fish it out of
the repo in an ad-hoc way, in order to make packages in this repo more
self-contained.
|
Running the test suite with asserts enabled is somewhat tricky at the
moment as running it with a GHC compiled the DEBUG way has some hundred
failures from the start. These seem to be unrelated to assertions
though. So this provides a toggle to make it easier to debug failing
assertions using the test suite.
|
All uses of these now use ExecPage.
|
Previously the libffi Adjustor implementation would use allocateExec to
create executable mappings. However, allocateExec is also used elsewhere
in GHC to allocate things other than ffi_closure, a use-case which
libffi does not support.
|
We now have two darwin flavours: AArch64-Darwin and x86_64-darwin. The
latter has proper custom adjustor support, while the former relies on
libffi. Mixing both leads to odd crashes, as the closures might not fit
the size of the libffi closures. Hence this needs to be guarded by the
USE_LIBFFI_FOR_ADJUSTORS guard.
Original patch by Hamish Mackenzie
|
This drops allocateExec for darwin, and replaces it with an
alloc, write, mark-executable strategy instead. This prevents
us from trying to allocate an executable range and then write to
it, which W^X will prohibit on darwin.
This will *only* work if we can use mmap.
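A minimal sketch of the alloc, write, mark-executable sequence using
mmap/mprotect (illustrative only; install_code is a made-up helper, and
real code on Apple Silicon may additionally need MAP_JIT handling and
instruction-cache flushing):

    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns an executable copy of `code`, or NULL on failure. */
    static void *install_code(const void *code, size_t len) {
        size_t pg = (size_t)getpagesize();
        size_t sz = (len + pg - 1) & ~(pg - 1);
        /* 1. allocate a writable (not executable) anonymous mapping */
        void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return NULL;
        /* 2. write the code while the mapping is writable */
        memcpy(p, code, len);
        /* 3. flip to read+execute; the range is never writable and executable
         *    at the same time */
        if (mprotect(p, sz, PROT_READ | PROT_EXEC) != 0) {
            munmap(p, sz);
            return NULL;
        }
        return p;
    }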
|
This means it will be reposted every time the eventlog is started.
|
Related to #19381 #19359 #14702
After a spike in memory usage we have been conservative about returning
allocated blocks to the OS in case we are still allocating a lot and would
end up just reallocating them. The result of this was that up to 4 * live_bytes
of blocks would be retained once they were allocated even if memory usage ended up
a lot lower.
For a heap of size ~1.5G, this would result in OS memory reporting 6G which is
both misleading and worrying for users.
In long-lived server applications this results in consistently high memory
usage when the live data size is much more reasonable (for example ghcide).
Therefore we have a new (2021) strategy which starts by retaining up to 4 * live_bytes
of blocks before gradually returning unneeded memory back to the OS on subsequent
major GCs which are NOT caused by a heap overflow.
Each major GC which is NOT caused by heap overflow increments the consec_idle_gcs
counter, and the amount of memory which is retained is inversely proportional to this number.
By default the excess memory retained is
oldGenFactor (controlled by -F) / 2 ^ (consec_idle_gcs / returnDecayFactor)
On a major GC caused by a heap overflow, the `consec_idle_gcs` variable is reset to 0
(as we could continue to allocate more, so retaining all the memory might make sense).
Therefore setting bigger values for `-Fd` makes the rate at which memory is returned slower.
Smaller values make it get returned faster. Setting `-Fd0` disables the
memory return completely, which is the behaviour of older GHC versions.
The default is `-Fd4` which results in the following scaling:
> mapM print [(x, 1/ (2**(x / 4))) | x <- [1 :: Double ..20]]
(1.0,0.8408964152537146)
(2.0,0.7071067811865475)
(3.0,0.5946035575013605)
(4.0,0.5)
(5.0,0.4204482076268573)
(6.0,0.35355339059327373)
(7.0,0.29730177875068026)
(8.0,0.25)
(9.0,0.21022410381342865)
(10.0,0.17677669529663687)
(11.0,0.14865088937534013)
(12.0,0.125)
(13.0,0.10511205190671433)
(14.0,8.838834764831843e-2)
(15.0,7.432544468767006e-2)
(16.0,6.25e-2)
(17.0,5.255602595335716e-2)
(18.0,4.4194173824159216e-2)
(19.0,3.716272234383503e-2)
(20.0,3.125e-2)
So after 13 consecutive GCs only 0.1 of the maximum memory used will be retained.
Further to this decay factor, the amount of memory we attempt to retain is
also influenced by the GC strategy for the oldest generation. If we are using
a copying strategy then we will need at least 2 * live_bytes for copying to take
place, so we always keep that much. If using compacting or nonmoving then we need a lower number,
so we just retain at least `1.2 * live_bytes` for some protection.
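A small sketch of the retention arithmetic described above (illustrative
only, not the RTS code; the function names are made up):

    #include <math.h>

    /* Fraction of the initially-retained excess kept after `consec_idle_gcs`
     * consecutive idle major GCs. -Fd0 disables decay (old behaviour). */
    static double excess_retained_fraction(unsigned consec_idle_gcs,
                                           double return_decay_factor /* -Fd */) {
        if (return_decay_factor <= 0) return 1.0;
        return 1.0 / pow(2.0, consec_idle_gcs / return_decay_factor);
    }

    /* Minimum multiple of live_bytes kept regardless of the decay:
     * a copying oldest generation needs 2x live to collect at all;
     * compacting/nonmoving get by with ~1.2x. */
    static double min_retained_factor(int copying_oldest_gen) {
        return copying_oldest_gen ? 2.0 : 1.2;
    }

With the default `-Fd4`, excess_retained_fraction reproduces the table
above, e.g. roughly 0.105 after 13 consecutive idle GCs.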
In future we might want to make this behaviour more aggressive; some
relevant literature is
> Ulan Degenbaev, Jochen Eisinger, Manfred Ernst, Ross McIlroy, and Hannes Payer. 2016. Idle time garbage collection scheduling. SIGPLAN Not. 51, 6 (June 2016), 570–583. DOI:https://doi.org/10.1145/2980983.2908106
which describes the "memory reducer" in the V8 javascript engine which
on an idle collection immediately returns as much memory as possible.
|
The way in which allocatePinned took blocks out of the nursery was
leading to horrible fragmentation in some workloads.
The strategy now is that a separate free block list is reserved for each
capability and blocks are taken from there. When it's empty the global
SM lock is taken and a fresh block of size PINNED_EMPTY_SIZE is allocated.
Fixes #19481
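A sketch of the shape of the new scheme (illustrative only; Block,
Capability and the refill size are simplified stand-ins for the RTS
structures and the PINNED_EMPTY_SIZE constant):

    #include <pthread.h>
    #include <stdlib.h>

    #define PINNED_EMPTY_SIZE 32          /* blocks fetched per refill */

    typedef struct Block { struct Block *next; char payload[4096]; } Block;
    typedef struct { Block *pinned_free; } Capability;  /* capability-local list */

    static pthread_mutex_t sm_lock = PTHREAD_MUTEX_INITIALIZER;  /* "SM lock" */

    /* Refill under the global lock: only done when the local list runs dry,
     * so the lock (and its contention) is hit rarely. */
    static Block *refill(void) {
        Block *head = NULL;
        pthread_mutex_lock(&sm_lock);
        for (int i = 0; i < PINNED_EMPTY_SIZE; i++) {
            Block *b = malloc(sizeof *b);
            if (b == NULL) abort();
            b->next = head;
            head = b;
        }
        pthread_mutex_unlock(&sm_lock);
        return head;
    }

    /* Fast path: pop a block from the capability's own free list, no lock. */
    static Block *alloc_pinned_block(Capability *cap) {
        if (cap->pinned_free == NULL)
            cap->pinned_free = refill();
        Block *b = cap->pinned_free;
        cap->pinned_free = b->next;
        return b;
    }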
|
See #19357
The event reports the
* Current number of megablocks allocated
* The number that the RTS thinks it needs
* The number it managed to return to the OS
When current > need then the difference is returned to the OS, and the
number of mblocks successfully returned is reported by 'returned'.
In a fragmented heap current > need but returned < current - need.
|
This function is exposed in RtsAPI.h so that external users have a
blessed way to traverse all the different `bdescr`s which are known by
the RTS.
The main motivation is to use this function in ghc-debug but avoid
having to expose the internal structure of a Capability in the API.
|
We compare it to n_gc_idle_threads which is unsigned as well.
So make both signed to avoid a warning.
|
Use a timed wait on a condition variable in waitForGcThreads, and fix a
dodgy timespec calculation.
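For reference, a generic version of the timed-wait pattern with a correct
absolute-deadline calculation (the usual "dodgy" part is forgetting to
carry nanosecond overflow into seconds); this is a sketch, not the RTS's
own wrappers:

    #include <pthread.h>
    #include <time.h>

    static int wait_with_timeout(pthread_cond_t *cond, pthread_mutex_t *lock,
                                 long timeout_usec) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);   /* default condvar clock */
        deadline.tv_sec  += timeout_usec / 1000000;
        deadline.tv_nsec += (timeout_usec % 1000000) * 1000;
        if (deadline.tv_nsec >= 1000000000L) {      /* carry into seconds */
            deadline.tv_nsec -= 1000000000L;
            deadline.tv_sec  += 1;
        }
        /* Must be called with `lock` held; returns ETIMEDOUT on timeout. */
        return pthread_cond_timedwait(cond, lock, &deadline);
    }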
|
I've never observed this counter taking a non-zero value; however, I do
think its existence is justified by the comment in grab_local_todo_block.
I've not added it to RTSStats in GHC.Stats, as it doesn't seem worth the
API churn.
|
We are no longer busyish waiting, so this is no longer meaningful
|
Here we remove the schedYield loop in scavenge_until_all_done+any_work, replacing
it with a single mutex + condition variable.
Previously any_work would check todo_large_objects, todo_q,
todo_overflow of each gen for work. Comments explained that this was
checking global work in any gen. However, these must have been out of
date, because all of these locations are local to a gc thread.
We've eliminated any_work entirely, instead simply looping back into
scavenge_loop, which will quickly return if there is no work.
shutdown_gc_threads is called slightly earlier than before. This ensures
that n_gc_threads can never be observed to increase from 0 by a worker thread.
startup_gc_threads is removed. It consisted of a single variable
assignment, which has been moved inline to its single call site.
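A generic sketch of the mutex + condition variable pattern that replaces
the yield loop (illustrative only, not the RTS code):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  gc_cond = PTHREAD_COND_INITIALIZER;
    static bool work_available = false;
    static bool shutting_down  = false;

    /* Worker side: sleep until there is either work or a shutdown request.
     * The predicate is re-checked after every wakeup to tolerate spurious
     * wakeups and signals already consumed by another worker. */
    static bool wait_for_work(void) {
        pthread_mutex_lock(&gc_lock);
        while (!work_available && !shutting_down)
            pthread_cond_wait(&gc_cond, &gc_lock);
        bool have_work = work_available;
        work_available = false;
        pthread_mutex_unlock(&gc_lock);
        return have_work;            /* false means: shut down */
    }

    /* Producer side: publish work (or shutdown) under the lock, then wake
     * all sleeping workers. */
    static void signal_workers(bool shutdown) {
        pthread_mutex_lock(&gc_lock);
        if (shutdown) shutting_down = true; else work_available = true;
        pthread_cond_broadcast(&gc_cond);
        pthread_mutex_unlock(&gc_lock);
    }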
|
These are the two remaining non-atomic accesses to `wakeup` which were
missed by the original TSAN patch.
|
Solves #19147. When n_capabilities > 1 we were not correctly accounting
for gc time for sequential collections. In this case par_n_gcthreads ==
1; however, it is not guaranteed that the single gc thread is capability 0.
A similar issue for copied is addressed as well.
|
The weak pointer check in `checkGenWeakPtrList` previously failed to
account for dead weak pointers. This caused `fptr01` to fail in the
`sanity` way.
Fixes #19162.
|
But only when profiling or DEBUG are enabled.
Fixes #17572.
|
This is necessary since the user may enable `+RTS -hT` at any time.
|
Previously we would push large objects and compact regions to the mark
queue during the deadlock-detection GC, resulting in failure to detect
deadlocks.
|
Pull the cold non-moving allocation path out of alloc_for_copy.
|
Previously the deadlock-detection promotion logic in alloc_for_copy was
just plain wrong: it failed to fire when gct->evac_gen_no !=
oldest_gen->gen_no. The fix is simple: move the
|