As noted in #17606, Docker disallows the get_mempolicy syscall by
default. This caused numerous tests to fail under CI in the `debug_numa`
way. Avoid this by disabling the NUMA probing logic when --debug-numa is
in use, instead setting n_numa_nodes in RtsFlags.c.
Fixes #17606.
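
A sketch of the idea (helper and flag-field names assumed, not the literal patch): when `--debug-numa=<n>` is given the topology is already known, so the OS probe — which needs the `get_mempolicy` syscall on Linux — can be skipped entirely:

```c
#include "Rts.h"

/* Hypothetical helper; chooseNumaNodeCount and nodes_from_flag are
 * illustrative names, not from the patch. */
static uint32_t chooseNumaNodeCount(uint32_t nodes_from_flag)
{
    if (RtsFlags.DebugFlags.numa) {
        return nodes_from_flag;   /* taken from --debug-numa=<n> */
    }
    return osNumaNodes();         /* probes the OS (get_mempolicy on Linux) */
}
```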
Previously things like `+RTS --numa-debug` would enable NUMA support,
despite being an invalid flag.
A heap census can be performed with a non-profiling RTS. However, a
non-profiling RTS does not zero the superfluous bytes of shrunk arrays,
so this case must be handled specifically to avoid a crash.
Revert part of a586b33f8e8ad60b5c5ef3501c89e9b71794bbed
This seems to have regressed builds using `--with-system-libffi`
(#17520).
This reverts commit 3ce18700f80a12c48a029b49c6201ad2410071bb.
Separate word and string hash tables at the type level, and do not
store the hashing function. When a different hash function is desired,
it is provided upon accessing the table. In the worst case this
performs the same as before the change, and in the majority of cases
better. Also mark the functions for aggressive inlining to improve
performance; a sketch of the resulting API shape follows below.
Reviewers: bgamari, erikd, simonmar
Subscribers: rwbarton, thomie, carter
GHC Trac Issues: #13165
Differential Revision: https://phabricator.haskell.org/D4889
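
A sketch of the resulting rts/Hash.h API shape (signatures approximate, not verbatim): the key type is reflected in the table type, and no hash-function pointer is stored in the table itself:

```c
#include "Rts.h"
#include "Hash.h"

static void example(void *data)
{
    HashTable *words = allocHashTable();        /* StgWord keys  */
    insertHashTable(words, (StgWord) 42, data);

    StrHashTable *strs = allocStrHashTable();   /* C-string keys */
    insertStrHashTable(strs, "stg_upd_frame_info", data);
}
```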
The old flag, `-xn`, was quite cryptic. Here we add `--nonmoving-gc` in
addition.
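
In other words, `./prog +RTS --nonmoving-gc` now selects the same collector as `./prog +RTS -xn`.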
The below is only necessary to fix the CI perf fluke that
happened in 9897e8c8ef0b19a9571ef97a1d9bb050c1ee9121:
-------------------------
Metric Decrease:
T5837
T6048
T9020
T12425
T12234
T13035
T12150
Naperian
-------------------------
Previously we used allocBlockOnNode_sync in nonmovingSweepMutLists
despite the fact that we aren't in the GC and therefore the allocation
spinlock isn't in use. This meant that sweep would end up spinning until
the next minor GC, when the SM lock was moved away from the SM_MUTEX to
the spinlock. This isn't a correctness issue but it sure isn't good for
performance.
Found thanks to Ward.
Fixes #17539.
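
A sketch of the fix's shape (header and function placement approximate): outside the GC, sweep should serialise against the block allocator with the storage-manager mutex instead of spinning on the allocation spinlock:

```c
#include "Rts.h"
#include "sm/Storage.h"   /* ACQUIRE_SM_LOCK / RELEASE_SM_LOCK */

static bdescr *sweepAllocBlock(uint32_t node)
{
    /* Outside the GC the allocation spinlock is not in use, so take
     * the SM mutex rather than calling allocBlockOnNode_sync(). */
    ACQUIRE_SM_LOCK;
    bdescr *bd = allocBlockOnNode(node);
    RELEASE_SM_LOCK;
    return bd;
}
```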
Previously we would clear the bitmaps of segments which we are going to
sweep during the preparatory pause. However, this is unnecessary: the
existence of the mark epoch ensures that the sweep will correctly
identify non-reachable objects, even if we do not clear the bitmap.
We now defer clearing the bitmap to sweep, which happens concurrently
with mutation.
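
A sketch of why this is sound (helper hypothetical; segment and bitmap shapes as in rts/sm/NonMoving.h): mark bytes record the epoch in which an object was last marked, so stale bytes from earlier cycles compare unequal and read as dead without an up-front clear:

```c
/* Hypothetical helper: a block is live for this sweep iff its mark
 * byte equals the current mark epoch. */
static bool blockIsLive(const struct NonmovingSegment *seg,
                        unsigned block_idx, uint8_t current_epoch)
{
    return seg->bitmap[block_idx] == current_epoch;
}
```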
This exposes a set of interfaces from the GHC API for configuring
EventLogWriters. These can be used by consumers like
[ghc-eventlog-socket](https://github.com/bgamari/ghc-eventlog-socket).
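
On the C side this centres on the `EventLogWriter` struct of hooks; a minimal illustrative writer (destination logic elided):

```c
#include "Rts.h"   /* brings in rts/EventLogWriter.h */

/* Illustrative writer shipping eventlog bytes somewhere custom
 * (socket, ring buffer, ...); only the struct-of-hooks shape matters. */
static void myInit(void)  { /* open the destination */ }
static bool myWrite(void *buf, size_t size)
{
    (void)buf; (void)size;  /* send the bytes */
    return true;
}
static void myFlush(void) { }
static void myStop(void)  { /* close the destination */ }

static const EventLogWriter myWriter = {
    .initEventLogWriter = myInit,
    .writeEventLog      = myWrite,
    .flushEventLog      = myFlush,
    .stopEventLogWriter = myStop,
};
```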
Changes (==) to use only pointer equality. This is safe because two
threads are the same iff they have the same id.
Changes `compare` to check pointer equality first and fall back on ids
only in case of inequality.
See discussion in #16761.
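
In RTS terms the comparison looks roughly like this (after `cmp_thread` in rts/Threads.c; details approximate):

```c
#include "Rts.h"

/* Pointer equality means the very same TSO -- threads are equal iff
 * their ids are equal -- otherwise order by thread id. */
int cmp_thread(StgPtr p1, StgPtr p2)
{
    if (p1 == p2) return 0;
    StgThreadID id1 = ((StgTSO *)p1)->id;
    StgThreadID id2 = ((StgTSO *)p2)->id;
    return id1 < id2 ? -1 : 1;
}
```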
Previously we would push stack-carried return values to the new stack on
a stack overflow. While the precise reasoning for this barrier is
unfortunately lost to history, in hindsight I suspect it was prompted by
a missing barrier elsewhere (that has been since fixed).
Moreover, the redundant barrier is actively harmful: the stack may
contain non-pointer values; blindly pushing these to the mark queue will
result in a crash. This is precisely what happened in the `stack003`
test. However, because of a (now fixed) deficiency in the test this
crash did not trigger on amd64.
Previously we would reset the pointer pointing to the object to be
marked to the beginning of the block when marking a large object. This
did no harm on 64-bit but on 32-bit it broke, e.g. `arr020`, since we
align pinned ByteArray allocations such that the payload is 8-byte
aligned. This means that the object might not begin at the beginning
of the block.
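
A sketch of the fix (illustrative function):

```c
#include "Rts.h"

/* Keep the pointer we were handed.  Resetting it to the block start is
 * wrong on 32-bit, where pinned ByteArray# payloads are 8-byte aligned
 * and may therefore start past bd->start. */
static StgClosure *largeObjectToMark(StgClosure *p)
{
    /* bdescr *bd = Bdescr((StgPtr) p);
     * return (StgClosure *) bd->start;   -- the old, buggy reset */
    return p;
}
```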
The previous representation needlessly limited the array length to
16-bits on 32-bit platforms.
We were using TAG_BITS instead of TAG_MASK. This happened to work on
64-bit platforms where TAG_BITS==3 since we only use tag values 0 and
3. However, this broke on 32-bit platforms where TAG_BITS==2.
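
For reference, `TAG_MASK` is derived from `TAG_BITS`, so masking is correct on every platform:

```c
#include "Rts.h"

/* TAG_MASK == (1 << TAG_BITS) - 1, so masking isolates the tag bits. */
static StgWord closureTag(const StgClosure *p)
{
    return (StgWord) p & TAG_MASK;
    /* "(StgWord) p & TAG_BITS" only appeared to work on 64-bit, where
     * TAG_BITS == 3 and the tags actually in use were just 0 and 3. */
}
```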
Previously we used INFO_PTR_TO_STRUCT instead of
THUNK_INFO_PTR_TO_STRUCT when looking at a thunk. These two happen to be
equivalent on 64-bit architectures due to alignment considerations;
however, they differ on 32-bit platforms. This led to #17487.
To fix this we also employ a small optimization: there is only one thunk
of type WHITEHOLE (namely stg_WHITEHOLE_info). Consequently, we can just
use a plain pointer comparison instead of testing against info->type.
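
A sketch of both fixes (the macros and `stg_WHITEHOLE_info` are real; the surrounding function is illustrative):

```c
#include "Rts.h"

static void inspectThunk(const StgInfoTable *info)
{
    /* The only WHITEHOLE thunk, so pointer equality replaces the
     * info->type test. */
    if (info == &stg_WHITEHOLE_info) {
        return;
    }
    /* THUNK_INFO_PTR_TO_STRUCT, not INFO_PTR_TO_STRUCT: thunk info
     * tables carry extra fields around the common StgInfoTable, and
     * the two conversions coincide on 64-bit only by an accident of
     * alignment. */
    StgThunkInfoTable *tinfo = THUNK_INFO_PTR_TO_STRUCT(info);
    (void) tinfo;  /* ... use tinfo ... */
}
```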
This broke the Windows build.
Should finally fix #17255.
If using a pthread instead of a timer signal is more reliable, and
has no known drawbacks, then FreeBSD is also capable of supporting
this mode of operation (tested on FreeBSD 12 with GHC 8.8.1, but
no reason why it would not also work on FreeBSD 11 or GHC 8.6).
Proposed by Kevin Zhang in:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=241849
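
Illustratively, the change amounts to widening the platform condition around the RTS's pthread-based ticker (the macro name is used in the RTS's itimer code; the exact condition shown here is schematic):

```c
/* Schematic: opt FreeBSD into the pthread-based ticker alongside the
 * platforms that already use it. */
#if defined(linux_HOST_OS) || defined(freebsd_HOST_OS)
#define USE_PTHREAD_FOR_ITIMER
#endif
```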
The Windows build seems to be stricter about not providing threading
primitives in the non-threaded RTS.
I have no idea why I marked this as inline originally but clearly it
shouldn't be inlined.
In general this is the convention that we use in the RTS. On Windows
things actually fail if we break it. For instance, you see things like:
includes\stg\Types.h:26:9: error:
warning: #warning "Mismatch between __USE_MINGW_ANSI_STDIO
definitions. If using Rts.h make sure it is the first header
included." [-Wcpp]
An inconsistency in the name of m32_allocator_flush caused the build to
fail with a missing prototype error.
Fixing #17255.
These are now handled in the cabal file's include-dirs field.
This was previously unnoticed as this code-path is hit on very few
platforms (e.g. OpenBSD).
For many years the linker would simply map all of its memory with
PROT_READ|PROT_WRITE|PROT_EXEC. However, operating systems have become
increasingly reluctant to accept this practice (e.g. #17353 and
#12657), and for good reason: writable code is ripe for exploitation.
Consequently mmapForLinker now maps its memory with
PROT_READ|PROT_WRITE. After the linker has finished filling/relocating
the mapping it must then call mmapForLinkerMarkExecutable on the
sections of the mapping which contain executable code.
Moreover, to make all of this possible it was necessary to redesign the
m32 allocator. First, we gave (in an earlier commit) each ObjectCode its
own m32_allocator. This was necessary since code loading and symbol
resolution/relocation are currently interleaved, meaning that it is not
possible to enforce W^X when symbols from different objects reside in
the same page.
We then redesigned the m32 allocator to take advantage of the fact that
all of the pages allocated with the allocator die at the same time
(namely, when the owning ObjectCode is unloaded). This makes a number of
things simpler (e.g. no more page reference counting; the interface
provided by the allocator for freeing is simpler). See
Note [M32 Allocator] for details.
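
The resulting lifecycle, schematically in plain POSIX terms (in the RTS the final step is `mmapForLinkerMarkExecutable`; `loadCode` here is illustrative):

```c
#include <string.h>
#include <sys/mman.h>

static void *loadCode(const void *code, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    memcpy(p, code, size);
    /* ... perform relocations ... */
    mprotect(p, size, PROT_READ | PROT_EXEC);  /* never writable and
                                                  executable at once */
    return p;
}
```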
Sets `MiscFlags.disableDelayedOsMemoryReturn`.
See the added `Note [MADV_FREE and MADV_DONTNEED]` for details.
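
In effect the flag selects the advice passed to `madvise` when memory is returned to the OS (a sketch; `returnMemory` is hypothetical):

```c
#include "Rts.h"
#include <sys/mman.h>

/* MADV_FREE lets the kernel reclaim lazily (RSS shrinks only under
 * memory pressure); MADV_DONTNEED returns pages eagerly.  The new
 * flag opts into the eager behaviour. */
static void returnMemory(void *addr, size_t size)
{
    int advice = RtsFlags.MiscFlags.disableDelayedOsMemoryReturn
               ? MADV_DONTNEED
               : MADV_FREE;
    madvise(addr, size, advice);
}
```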
macOS Catalina is finally going to force our hand in forbidding
writable executable mappings. Unfortunately, this is quite incompatible
with the current global m32 allocator, which mixes symbols from various
objects in a single page. The problem here is that some of these
symbols may not yet be resolved (e.g. have had their relocations
performed), as this happens lazily (and therefore we can't yet make the
section read-only, and therefore executable).
The easiest way around this is to simply create one m32 allocator per
ObjectCode. This may slightly increase fragmentation for short-running
programs but I suspect will actually improve fragmentation for programs
doing lots of loading/unloading since we can always free all of the
pages allocated to an object when it is unloaded (although this ability
will only be implemented in a later patch).
AP_NOUPD entry code doesn't use the arity field, but not initializing
this field confuses printers/debuggers, and also makes testing harder as
the field's value changes randomly.
* Prefer #pragma once over guard macros
* Drop redundant #includes
* Fix order to ensure that necessary macros are defined when we
condition on them
This is a part of GHC Proposal #25: "Offer more array resizing primitives".
Resources related to the proposal:
- Discussion: https://github.com/ghc-proposals/ghc-proposals/pull/121
- Proposal: https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0025-resize-boxed.rst
Only shrinkSmallMutableArray# is implemented as a primop since a
library-space implementation of resizeSmallMutableArray# (in GHC.Exts)
is no less efficient than a primop would be. This may be replaced by
a primop in the future if someone devises a strategy for growing
arrays in-place. The library-space implementation always copies the
array when growing it.
This commit also tweaks the documentation of the deprecated
sizeofMutableByteArray#, removing the mention of concurrency. That
primop is unsound even in single-threaded applications. Additionally,
the non-negativity assertion on the existing shrinkMutableByteArray#
primop has been removed since this predicate is trivially always true.
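
Shrinking in place is cheap because only the recorded length changes; a sketch in RTS-C terms (the helper name is hypothetical — the real implementation is the primop itself):

```c
#include "Rts.h"

/* Lower the recorded length; the vacated tail becomes slop (which a
 * profiling RTS additionally zeroes). */
static void shrinkSmallArray(StgSmallMutArrPtrs *arr, StgWord new_len)
{
    ASSERT(new_len <= arr->ptrs);
    arr->ptrs = new_len;
}
```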
GCC 4.6 was released 7 years ago. I think we can finally assume that
it's available. This is a simplification prompted by #15742.
This introduces a concurrent mark & sweep garbage collector to manage the old
generation. The concurrent nature of this collector typically results in
significantly reduced maximum and mean pause times in applications with large
working sets.
Due to the large and intricate nature of the change I have opted to
preserve the fully-buildable history, including merge commits, which is
described in the "Branch overview" section below.
Collector design
================
The full design of the collector implemented here is described in detail
in a technical note
> B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
> Compiler" (2018)
This document can be requested from @bgamari.
The basic heap structure used in this design is heavily inspired by
> K. Ueno & A. Ohori. "A fully concurrent garbage collector for
> functional programs on multicore processors." /ACM SIGPLAN Notices/
> Vol. 51. No. 9 (presented at ICFP 2016)
This design is intended to allow both marking and sweeping to run
concurrently with the execution of a multi-core mutator. Unlike the Ueno design,
which requires no global synchronization pauses, the collector
introduced here requires a stop-the-world pause at the beginning and end
of the mark phase.
To avoid heap fragmentation, the allocator consists of a number of
fixed-size /sub-allocators/. Each of these sub-allocators allocates into
its own set of /segments/, themselves allocated from the block
allocator. Each segment is broken into a set of fixed-size allocation
blocks (which back allocations) in addition to a bitmap (used to track
the liveness of blocks) and some additional metadata (also used to
track liveness).
This heap structure enables collection via mark-and-sweep, which can be
performed concurrently via a snapshot-at-the-beginning scheme (although
concurrent collection is not implemented in this patch).
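
For orientation, the segment structure is roughly the following (simplified from rts/sm/NonMoving.h in this patch; fields approximate):

```c
#include <stdint.h>

typedef uint16_t nonmoving_block_idx;   /* approximate */

struct NonmovingSegment {
    struct NonmovingSegment *link;       /* segment lists               */
    struct NonmovingSegment *todo_link;
    nonmoving_block_idx next_free;       /* next unallocated block      */
    uint8_t bitmap[];                    /* one mark/liveness byte per
                                            fixed-size block            */
};
```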
Implementation structure
========================
The majority of the collector is implemented in a handful of files:
* `rts/Nonmoving.c` is the heart of the beast. It implements the entry-point
to the nonmoving collector (`nonmoving_collect`), as well as the allocator
(`nonmoving_allocate`) and a number of utilities for manipulating the heap.
* `rts/NonmovingMark.c` implements the mark queue functionality, update
remembered set, and mark loop.
* `rts/NonmovingSweep.c` implements the sweep loop.
* `rts/NonmovingScav.c` implements the logic necessary to scavenge the
nonmoving heap.
Branch overview
===============
```
* wip/gc/opt-pause:
| A variety of small optimisations to further reduce pause times.
|
* wip/gc/compact-nfdata:
| Introduce support for compact regions into the non-moving
|\ collector
| \
| \
| | * wip/gc/segment-header-to-bdescr:
| | | Another optimization that we are considering, pushing
| | | some segment metadata into the segment descriptor for
| | | the sake of locality during mark
| | |
| * | wip/gc/shortcutting:
| | | Support for indirection shortcutting and the selector optimization
| | | in the non-moving heap.
| | |
* | | wip/gc/docs:
| |/ Work on implementation documentation.
| /
|/
* wip/gc/everything:
| A roll-up of everything below.
|\
| \
| |\
| | \
| | * wip/gc/optimize:
| | | A variety of optimizations, primarily to the mark loop.
| | | Some of these are microoptimizations but a few are quite
| | | significant. In particular, the prefetch patches have
| | | produced a nontrivial improvement in mark performance.
| | |
| | * wip/gc/aging:
| | | Enable support for aging in major collections.
| | |
| * | wip/gc/test:
| | | Fix up the testsuite to more or less pass.
| | |
* | | wip/gc/instrumentation:
| | | A variety of runtime instrumentation including statistics
| | / support, the nonmoving census, and eventlog support.
| |/
| /
|/
* wip/gc/nonmoving-concurrent:
| The concurrent write barriers.
|
* wip/gc/nonmoving-nonconcurrent:
| The nonmoving collector without the write barriers necessary
| for concurrent collection.
|
* wip/gc/preparation:
| A merge of the various preparatory patches that aren't directly
| implementing the GC.
|
|
* GHC HEAD
.
.
.
```
Previously we would first move the new objects to their appropriate
non-moving GC list, then do another pass over that list to clear their
mark bits. This is needlessly expensive. First clear the mark bits of
the existing objects, then add the newly evacuated objects and, at the
same time, clear their mark bits.
This cuts the preparatory GC time in half for the Pusher benchmark with
a large queue size.
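
In outline (illustrative pseudo-C; the list node and helper names are hypothetical):

```c
/* Hypothetical node type standing in for the real object lists. */
typedef struct Obj { struct Obj *link; } Obj;

static void clearMarkBit(Obj *o);   /* hypothetical helper */

static void absorbNewObjects(Obj **existing, Obj *new_objs)
{
    /* First clear the survivors' mark bits... */
    for (Obj *o = *existing; o != NULL; o = o->link)
        clearMarkBit(o);

    /* ...then push each newly evacuated object and clear its bit in
     * the same traversal (previously this took two passes). */
    for (Obj *o = new_objs, *next; o != NULL; o = next) {
        next = o->link;
        clearMarkBit(o);
        o->link = *existing;
        *existing = o;
    }
}
```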