| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
| |
This requires that we break nonmovingExit into two pieces since we need
to first stop the collector to relinquish any capabilities, then we need
to shutdown the scheduler, then we need to free the nonmoving
allocators.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This extends the non-moving collector to allow concurrent collection.
The full design of the collector implemented here is described in detail
in a technical note
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
This extension involves the introduction of a capability-local
remembered set, known as the /update remembered set/, which tracks
objects which may no longer be visible to the collector due to mutation.
To maintain this remembered set we introduce a write barrier on
mutations which is enabled while a concurrent mark is underway.
The update remembered set representation is similar to that of the
nonmoving mark queue, being a chunked array of `MarkEntry`s. Each
`Capability` maintains a single accumulator chunk, which it flushed
when it (a) is filled, or (b) when the nonmoving collector enters its
post-mark synchronization phase.
While the write barrier touches a significant amount of code it is
conceptually straightforward: the mutator must ensure that the referee
of any pointer it overwrites is added to the update remembered set.
However, there are a few details:
* In the case of objects with a dirty flag (e.g. `MVar`s) we can
exploit the fact that only the *first* mutation requires a write
barrier.
* Weak references, as usual, complicate things. In particular, we must
ensure that the referee of a weak object is marked if dereferenced by
the mutator. For this we (unfortunately) must introduce a read
barrier, as described in Note [Concurrent read barrier on deRefWeak#]
(in `NonMovingMark.c`).
* Stable names are also a bit tricky as described in Note [Sweeping
stable names in the concurrent collector] (`NonMovingSweep.c`).
We take quite some pains to ensure that the high thread count often seen
in parallel Haskell applications doesn't affect pause times. To this end
we allow thread stacks to be marked either by the thread itself (when it
is executed or stack-underflows) or the concurrent mark thread (if the
thread owning the stack is never scheduled). There is a non-trivial
handshake to ensure that this happens without racing which is described
in Note [StgStack dirtiness flags and concurrent marking].
Co-Authored-by: Ömer Sinan Ağacan <omer@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This implements the core heap structure and a serial mark/sweep
collector which can be used to manage the oldest-generation heap.
This is the first step towards a concurrent mark-and-sweep collector
aimed at low-latency applications.
The full design of the collector implemented here is described in detail
in a technical note
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
The basic heap structure used in this design is heavily inspired by
K. Ueno & A. Ohori. "A fully concurrent garbage collector for
functional programs on multicore processors." /ACM SIGPLAN Notices/
Vol. 51. No. 9 (presented by ICFP 2016)
This design is intended to allow both marking and sweeping
concurrent to execution of a multi-core mutator. Unlike the Ueno design,
which requires no global synchronization pauses, the collector
introduced here requires a stop-the-world pause at the beginning and end
of the mark phase.
To avoid heap fragmentation, the allocator consists of a number of
fixed-size /sub-allocators/. Each of these sub-allocators allocators into
its own set of /segments/, themselves allocated from the block
allocator. Each segment is broken into a set of fixed-size allocation
blocks (which back allocations) in addition to a bitmap (used to track
the liveness of blocks) and some additional metadata (used also used
to track liveness).
This heap structure enables collection via mark-and-sweep, which can be
performed concurrently via a snapshot-at-the-beginning scheme (although
concurrent collection is not implemented in this patch).
The mark queue is a fairly straightforward chunked-array structure.
The representation is a bit more verbose than a typical mark queue to
accomodate a combination of two features:
* a mark FIFO, which improves the locality of marking, reducing one of
the major overheads seen in mark/sweep allocators (see [1] for
details)
* the selector optimization and indirection shortcutting, which
requires that we track where we found each reference to an object
in case we need to update the reference at a later point (e.g. when
we find that it is an indirection). See Note [Origin references in
the nonmoving collector] (in `NonMovingMark.h`) for details.
Beyond this the mark/sweep is fairly run-of-the-mill.
[1] R. Garner, S.M. Blackburn, D. Frampton. "Effective Prefetch for
Mark-Sweep Garbage Collection." ISMM 2007.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
|
|
|
|
| |
This were previously quite unclear and will change a bit under the
non-moving collector so let's clear this up now.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- No need to distinguish between gcc-llvm and clang. First of all,
gcc-llvm is quite old and surely unmaintained by now. Second of all,
none of the code actually care about that distinction!
Now, it does make sense to consider C multiple frontends for LLVMs in
the form of clang vs clang-cl (same clang, yes, but tweaked
interface). But this is better handled in terms of "gccish vs
mvscish" and "is LLVM", yielding 4 combinations. Therefore, I don't
think it is useful saving the existing code for that.
- Get the remaining CC_LLVM_BACKEND, and also TABLES_NEXT_TO_CODE in
mk/config.h the normal way, rather than hacking it post-hoc. No point
keeping these special cases around for now reason.
- Get rid of hand-rolled `die` function and just use `AC_MSG_ERROR`.
- Abstract check + flag override for unregisterised and tables next to
code.
Oh, and as part of the above I also renamed/combined some variables
where it felt appropriate.
- GccIsClang -> CcLlvmBackend. This is for `AC_SUBST`, like the other
Camal case ones. It was never about gcc-llvm, or Apple's renamed clang,
to be clear.
- llvm_CC_FLAVOR -> CC_LLVM_BACKEND. This is for `AC_DEFINE`, like the
other all-caps snake case ones. llvm_CC_FLAVOR was just silly
indirection *and* an odd name to boot.
|
|
|
|
| |
Zeros heap memory after gc freed it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here the following changes are introduced:
- A read barrier machine op is added to Cmm.
- The order in which a closure's fields are read and written is changed.
- Memory barriers are added to RTS code to ensure correctness on
out-or-order machines with weak memory ordering.
Cmm has a new CallishMachOp called MO_ReadBarrier. On weak memory machines, this
is lowered to an instruction that ensures memory reads that occur after said
instruction in program order are not performed before reads coming before said
instruction in program order. On machines with strong memory ordering properties
(e.g. X86, SPARC in TSO mode) no such instruction is necessary, so
MO_ReadBarrier is simply erased. However, such an instruction is necessary on
weakly ordered machines, e.g. ARM and PowerPC.
Weam memory ordering has consequences for how closures are observed and mutated.
For example, consider a closure that needs to be updated to an indirection. In
order for the indirection to be safe for concurrent observers to enter, said
observers must read the indirection's info table before they read the
indirectee. Furthermore, the entering observer makes assumptions about the
closure based on its info table contents, e.g. an INFO_TYPE of IND imples the
closure has an indirectee pointer that is safe to follow.
When a closure is updated with an indirection, both its info table and its
indirectee must be written. With weak memory ordering, these two writes can be
arbitrarily reordered, and perhaps even interleaved with other threads' reads
and writes (in the absence of memory barrier instructions). Consider this
example of a bad reordering:
- An updater writes to a closure's info table (INFO_TYPE is now IND).
- A concurrent observer branches upon reading the closure's INFO_TYPE as IND.
- A concurrent observer reads the closure's indirectee and enters it. (!!!)
- An updater writes the closure's indirectee.
Here the update to the indirectee comes too late and the concurrent observer has
jumped off into the abyss. Speculative execution can also cause us issues,
consider:
- An observer is about to case on a value in closure's info table.
- The observer speculatively reads one or more of closure's fields.
- An updater writes to closure's info table.
- The observer takes a branch based on the new info table value, but with the
old closure fields!
- The updater writes to the closure's other fields, but its too late.
Because of these effects, reads and writes to a closure's info table must be
ordered carefully with respect to reads and writes to the closure's other
fields, and memory barriers must be placed to ensure that reads and writes occur
in program order. Specifically, updates to a closure must follow the following
pattern:
- Update the closure's (non-info table) fields.
- Write barrier.
- Update the closure's info table.
Observing a closure's fields must follow the following pattern:
- Read the closure's info pointer.
- Read barrier.
- Read the closure's (non-info table) fields.
This patch updates RTS code to obey this pattern. This should fix long-standing
SMP bugs on ARM (specifically newer aarch64 microarchitectures supporting
out-of-order execution) and PowerPC. This fixes issue #15449.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This:
- Hoists part of the condition outside of the initialization loop in
`stg_newSmallArrayzh`.
- Annotates one of the unlikely branches as unlikely, also in
`stg_newSmallArrayzh`.
- Adds a couple of annotations to `allocateMightFail` indicating which
branches are likely to be taken.
Together this gives about 5% improvement.
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This moves all URL references to Trac Wiki to their corresponding
GitLab counterparts.
This substitution is classified as follows:
1. Automated substitution using sed with Ben's mapping rule [1]
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...
2. Manual substitution for URLs containing `#` index
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz
3. Manual substitution for strings starting with `Commentary`
Old: Commentary/XxxYyy...
New: commentary/xxx-yyy...
See also !539
[1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json
|
|
|
|
|
| |
This moves all URL references to Trac tickets to their corresponding
GitLab counterparts.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
We were unconditionally updating the nursery pointers to be
`nurseries[cap->no]`, but when using nursery chunks this might be
wrong. This manifested as a later assertion failure in allocate().
Test Plan: new test case
Reviewers: bgamari, niteria, erikd
Subscribers: thomie, carter
Differential Revision: https://phabricator.haskell.org/D4649
|
|
|
|
| |
[skip ci]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Rename to whitehole_gc_spin, in preparation for adding stats for the
whitehole busy-loop in SMPClosureOps.
Make whitehole_gc_spin volatile, and move it to be defined and
statically initialised in GC.c. This saves some #ifs, and I'm pretty
sure it should be volatile.
Test Plan: ./validate
Reviewers: bgamari, erikd, simonmar
Reviewed By: bgamari
Subscribers: rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4300
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Validate, add tests
Reviewers: simonmar, austin, erikd
Reviewed By: simonmar
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D4021
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
clang defines '__clear_cache' slightly differently from gcc:
rts/sm/Storage.c:1349:13: error:
error: conflicting types for '__clear_cache'
|
1349 | extern void __clear_cache(char * begin, char * end);
| ^
extern void __clear_cache(char * begin, char * end);
^
note: '__clear_cache' is a builtin with type 'void (void *, void *)'
Reported by Moritz Angermann.
While at it used '__builtin___clear_cache' if advertised by clang.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
|
|
|
|
|
|
| |
This reverts commit 9492703a5862ee8623455209e50344cf8c4de077.
Incomplete patch (missing begin, end assignments).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
clang defines '__clear_cache' slightly differently from gcc:
rts/sm/Storage.c:1349:13: error:
error: conflicting types for '__clear_cache'
|
1349 | extern void __clear_cache(char * begin, char * end);
| ^
extern void __clear_cache(char * begin, char * end);
^
note: '__clear_cache' is a builtin with type 'void (void *, void *)'
Reported by Moritz Angermann.
While at it used '__builtin___clear_cache' if advertised by clang.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Noticed when was building UNREG ghc with -optc{-Wall,-Werror}:
rts/sm/Storage.c:1359:3: error:
error: implicit declaration of function '__clear_cache'
[-Werror=implicit-function-declaration]
__clear_cache((void*)begin, (void*)end);
^~~~~~~~~~~~~
|
1359 | __clear_cache((void*)begin, (void*)end);
| ^
Left direct '__clear_cache' usage gcc toolchain before 4.4.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
|
|
|
|
| |
This reverts commit d1d3e98443cf263ef09253e2478e3e638e174e0d.
|
| |
|
|
|
|
|
|
|
|
|
| |
This reverts commit 6dd1257fdd4d18e84d32e89bf0ec664b3c8f7b93.
Change fails vaildation:
rts/sm/Storage.c:1351:20: error:
error: ‘gcc_clear_cache’ defined but not used [-Werror=unused-function]
STATIC_INLINE void gcc_clear_cache(void * begin, void * end)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Noticed when was building UNREG ghc with -optc{-Wall,-Werror}:
rts/sm/Storage.c:1359:3: error:
error: implicit declaration of function '__clear_cache'
[-Werror=implicit-function-declaration]
__clear_cache((void*)begin, (void*)end);
^~~~~~~~~~~~~
|
1359 | __clear_cache((void*)begin, (void*)end);
| ^
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
|
|
|
|
| |
Our new CPP linter enforces this.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that we throw an exception for heap overflow, we should only print
the heap overflow message in the main thread when the HeapOverflow
exception is caught, rather than as a side effect in the GC.
Stack overflows were already done this way, I just made heap overflow
consistent with stack overflow, and did some related cleanup.
Fixes broken T2592(profasm) which was reporting the heap overflow
message twice (you would only notice when building with profiling
libs enabled).
Test Plan: validate
Reviewers: bgamari, niteria, austin, DemiMarie, hvr, erikd
Reviewed By: bgamari
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3394
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
This commit makes various improvements and addresses some issues with
Compact Regions (aka Compact Normal Forms).
This was the most important thing I wanted to fix. Compaction
previously prevented GC from running until it was complete, which
would be a problem in a multicore setting. Now, we compact using a
hand-written Cmm routine that can be interrupted at any point. When a
GC is triggered during a sharing-enabled compaction, the GC has to
traverse and update the hash table, so this hash table is now stored
in the StgCompactNFData object.
Previously, compaction consisted of a deepseq using the NFData class,
followed by a traversal in C code to copy the data. This is now done
in a single pass with hand-written Cmm (see rts/Compact.cmm). We no
longer use the NFData instances, instead the Cmm routine evaluates
components directly as it compacts.
The new compaction is about 50% faster than the old one with no
sharing, and a little faster on average with sharing (the cost of the
hash table dominates when we're doing sharing).
Static objects that don't (transitively) refer to any CAFs don't need
to be copied into the compact region. In particular this means we
often avoid copying Char values and small Int values, because these
are static closures in the runtime.
Each Compact# object can support a single compactAdd# operation at any
given time, so the Data.Compact library now enforces mutual exclusion
using an MVar stored in the Compact object.
We now get exceptions rather than killing everything with a barf()
when we encounter an object that cannot be compacted (a function, or a
mutable object). We now also detect pinned objects, which can't be
compacted either.
The Data.Compact API has been refactored and cleaned up. A new
compactSize operation returns the size (in bytes) of the compact
object.
Most of the documentation is in the Haddock docs for the compact
library, which I've expanded and improved here.
Various comments in the code have been improved, especially the main
Note [Compact Normal Forms] in rts/sm/CNF.c.
I've added a few tests, and expanded a few of the tests that were
there. We now also run the tests with GHCi, and in a new test way
that enables sanity checking (+RTS -DS).
There's a benchmark in libraries/compact/tests/compact_bench.hs for
measuring compaction speed and comparing sharing vs. no sharing.
The field totalDataW in StgCompactNFData was unnecessary.
Test Plan:
* new unit tests
* validate
* tested manually that we can compact Data.Aeson data
Reviewers: gcampax, bgamari, ezyang, austin, niteria, hvr, erikd
Subscribers: thomie, simonpj
Differential Revision: https://phabricator.haskell.org/D2751
GHC Trac Issues: #12455
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Visible API changes:
* The C struct `GCDetails` gives the stats about a single GC. This is
passed to the `gcDone()` callback if one is set via the
RtsConfig. (previously we just passed a collection of values, so this
is more extensible, at the expense of breaking the existing API)
* `RTSStats` gives cumulative stats since the start of the program,
and includes the `GCDetails` for the most recent GC. This struct
can be obtained via `getRTSStats()` (the old `getGCStats()` has been
removed, and `getGCStatsEnabled()` has been renamed to
`getRTSStatsEnabled()`)
Improvements:
* The per-GC stats and cumulative stats are now cleanly separated.
* Inside the RTS we have a top-level `RTSStats` struct to keep all our
stats in, previously this was just a collection of strangely-named
variables. This struct is mostly just copied in `getRTSStats()`, so
the implementation of that function is a lot shorter.
* Types are more consistent. We use a uint64_t byte count for all
memory values, and Time for all time values.
* Names are more consistent. We use a suffix `_bytes` for all byte
counts and `_ns` for all time values.
* We now collect information about the amount of memory in large
objects and compact objects in `GCDetails`. (the latter was the reason
I started doing this patch but it seems to have ballooned a bit!)
* I fixed a bug in the calculation of the elapsed MUT time, and added
an ASSERT to stop the calculations going wrong in the future.
For now I kept the Haskell API in `GHC.Stats` the same, by
impedence-matching with the new API. We could either break that API
and make it match the C API more closely, or we could add a new API
and deprecate the old one. Opinions welcome.
This stuff is very easy to get wrong, and it's hard to test. Reviews
welcome!
Test Plan:
manual testing
validate
Reviewers: bgamari, niteria, austin, ezyang, hvr, erikd, rwbarton, Phyx
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2756
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Validate on lots of platforms
Reviewers: erikd, simonmar, austin
Reviewed By: erikd, simonmar
Subscribers: michalt, thomie
Differential Revision: https://phabricator.haskell.org/D2699
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous code passed an end pointer, but the interface takes a size
instead.
Fixes #12838.
Reviewers: austin, erikd, simonmar, bgamari
Reviewed By: simonmar, bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2711
GHC Trac Issues: #12838
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Nursery chunks help reduce the cost of GC when capabilities are unevenly
loaded, by ensuring that we use more of the available nursery.
The rationale for enabling this at -A16m is that any negative effects
due to loss of cache locality are less likely to be an issue at -A16m
and above. It's a conservative guess. If we had a lot of benchmark
data we could probably do better.
Results for nofib/parallel at -N4 -A32m with and without -n4m:
```
------------------------------------------------------------------------
Program Size Allocs Runtime Elapsed TotalMem
------------------------------------------------------------------------
blackscholes 0.0% -9.5% -9.0% -15.0% -2.2%
coins 0.0% -4.7% -3.6% -0.6% -13.6%
mandel 0.0% -0.3% +7.7% +13.1% +0.1%
matmult 0.0% +1.5% +10.0% +7.7% +0.1%
nbody 0.0% -4.1% -2.9% 0.085 0.0%
parfib 0.0% -1.4% +1.0% +1.5% +0.2%
partree 0.0% -0.3% +0.8% +2.9% -0.8%
prsa 0.0% -0.5% -2.1% -7.6% 0.0%
queens 0.0% -3.2% -1.4% +2.2% +1.3%
ray 0.0% -5.6% -14.5% -7.6% +0.8%
sumeuler 0.0% -0.4% +2.4% +1.1% 0.0%
------------------------------------------------------------------------
Min 0.0% -9.5% -14.5% -15.0% -13.6%
Max 0.0% +1.5% +10.0% +13.1% +1.3%
Geometric Mean +0.0% -2.6% -1.3% -0.5% -1.4%
```
Not conclusive, but slightly better. This matters a lot more when you
have more cores.
Test Plan: validate, nofib/paralel
Reviewers: niteria, ezyang, nh2, trofi, austin, erikd, bgamari
Reviewed By: bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2581
GHC Trac Issues: #9221
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
malloc'd memory is not guaranteed to be zeroed. On Linux, however,
it is often zeroed, leading to latent bugs. In fact, with this
patch I fix two uninitialized memory bugs stemming from this.
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
Test Plan: validate
Reviewers: simonmar, austin, Phyx, bgamari, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2455
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This brings in initial support for compact regions, as described in the
ICFP 2015 paper "Efficient Communication and Collection with Compact
Normal Forms" (Edward Z. Yang et.al.) and implemented by Giovanni
Campagna.
Some things may change before the 8.2 release, but I (Simon M.) wanted
to get the main patch committed so that we can iterate.
What documentation there is is in the Data.Compact module in the new
compact package. We'll need to extend and polish the documentation
before the release.
Test Plan:
validate
(new test cases included)
Reviewers: ezyang, simonmar, hvr, bgamari, austin
Subscribers: vikraman, Yuras, RyanGlScott, qnikst, mboes, facundominguez, rrnewton, thomie, erikd
Differential Revision: https://phabricator.haskell.org/D1264
GHC Trac Issues: #11493
|
|
|
|
|
| |
- Move the numaMap and nNumaNodes out of RtsFlags to Capability.c
- Add a test to tests/rts
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.
Linux provides a NUMA API for doing two things:
* Allocating memory local to a particular node
* Binding a thread to a particular node
When given the +RTS --numa flag, the runtime will
* Determine the number of NUMA nodes (N) by querying the OS
* Assign capabilities to nodes, so cap C is on node C%N
* Bind worker threads on a capability to the correct node
* Keep a separate free lists in the block layer for each node
* Allocate the nursery for a capability from node-local memory
* Allocate blocks in the GC from node-local memory
For example, using nofib/parallel/queens on a 24-core 2-socket machine:
```
$ ./Main 15 +RTS -N24 -s -A64m
Total time 173.960s ( 7.467s elapsed)
$ ./Main 15 +RTS -N24 -s -A64m --numa
Total time 150.836s ( 6.423s elapsed)
```
The biggest win here is expected to be allocating from node-local
memory, so that means programs using a large -A value (as here).
According to perf, on this program the number of remote memory accesses
were reduced by more than 50% by using `--numa`.
Test Plan:
* validate
* There's a new flag --debug-numa=<n> that pretends to do NUMA without
actually making the OS calls, which is useful for testing the code
on non-NUMA systems.
* TODO: I need to add some unit tests
Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2199
|
|
|
|
|
|
|
|
|
|
| |
Reviewers: austin, erikd, simonmar, bgamari
Reviewed By: bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2241
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `nat` type was an alias for `unsigned int` with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it but mark it as deprecated.
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
+RTS -AL<size> controls the total size of large objects that can be
allocated before a GC is triggered. Previously this was always just the
value of -A, and the limit mainly existed to prevent runaway allocation
in pathalogical programs that allocate a lot of large objects. However,
since the limit is shared between all cores, on a large multicore the
default becomes more restrictive, and can end up triggering GC well
before it would normally have been.
Arguably a better default would be A*N, but this is probably excessive.
Adding a flag lets you choose, and I've left the default as it was.
See docs for usage.
|
|
|
|
|
|
|
|
|
| |
After a parallel GC, it is possible to have a long list of blocks in
ws->part_list, if we did a lot of work stealing but didn't fill up the
blocks we stole. These blocks persist until the next load-balanced GC,
which might be a long time, and during every GC we were traversing this
list to find its size. The fix is to maintain the size all the time, so
we don't have to compute it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 5d52d9b64c21dcf77849866584744722f8121389 removed
global 'blackhole_queue' in favour of new mechanism:
when TSO hits blackhole TSO blocks waiting for
'MessgaeBlackhole' delivery.
Patch removed unused global and updates stale comments.
Noticed by Yuras Shumovich.
Signed-off-by: Sergei Trofimovich <siarheit@google.com>
Test Plan: build test
Reviewers: simonmar, austin, Yuras, bgamari
Reviewed By: Yuras, bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1953
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use of these helper functions was removed by
commit 18896fa2b06844407fd1e0d3f85cd3db97a96ff4
Author: Simon Marlow <marlowsd@gmail.com>
Date: Wed Feb 2 15:49:55 2011 +0000
Noticed by uselex.rb:
calcLiveBlocks: [R]: exported from:
./rts/dist/build/sm/Storage.o
calcLiveWords: [R]: exported from:
./rts/dist/build/sm/Storage.o
Signed-off-by: Sergei Trofimovich <siarheit@google.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
[Revised version of D1076 that was committed and then backed out]
In a workload with a large amount of code, zero_static_objects_list()
takes a significant amount of time, and furthermore it is in the
single-threaded part of the GC.
This patch uses a slightly fiddly scheme for marking objects on the
static object lists, using a flag in the low 2 bits that flips between
two states to indicate whether an object has been visited during this
GC or not. We also have to take into account objects that have not
been visited yet, which might appear at any time due to runtime linking.
Test Plan: validate
Reviewers: austin, ezyang, rwbarton, bgamari, thomie
Reviewed By: bgamari, thomie
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1106
|
|
|
|
| |
This reverts commit b949c96b4960168a3b399fe14485b24a2167b982.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
In a workload with a large amount of code, zero_static_objects_list()
takes a significant amount of time, and furthermore it is in the
single-threaded part of the GC.
This patch uses a slightly fiddly scheme for marking objects on the
static object lists, using a flag in the low 2 bits that flips between
two states to indicate whether an object has been visited during this
GC or not. We also have to take into account objects that have not
been visited yet, which might appear at any time due to runtime linking.
Test Plan: validate
Reviewers: austin, bgamari, ezyang, rwbarton
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1076
|
|
|
|
|
|
|
| |
getNewNursery() was unconditionally incrementing next_nursery, which
is normally fine but it broke an assumption in
storageAddCapabilities(). This manifested as an occasional crash in
the setnumcapabilities001 test.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In a situaion where we have some statically-linked code and we want to
load and unload a series of objects, we need the CAFs in the
statically-linked code to be retained indefinitely, while the CAFs in
the dynamically-linked code should be GC'd as normal, so that we can
detect when the code is unloadable. This was wrong before - we GC'd
CAFs in the static code, leading to a crash in the rare case where we
use a CAF, GC it, and then load a new object that uses it again.
I also did some tidy up: RtsConfig now has a field keep_cafs to
indicate whether we want CAFs to be retained in static code.
|