| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
| |
used timed wait on condition variable in waitForGcThreads
fix dodgy timespec calculation
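A minimal sketch of the deadline arithmetic a timed condition-variable wait
needs, assuming a POSIX-style API that takes an absolute `struct timespec`.
The helper name `deadline_after_ns` is hypothetical and this is not the RTS
code; it only shows the nanosecond carry that such a calculation has to get
right.

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical helper: add a non-negative timeout in nanoseconds to a base
 * time, keeping tv_nsec within [0, NSEC_PER_SEC). */
struct timespec deadline_after_ns(struct timespec now, long long timeout_ns)
{
    struct timespec ts = now;
    ts.tv_sec  += (time_t)(timeout_ns / NSEC_PER_SEC);
    ts.tv_nsec += (long)(timeout_ns % NSEC_PER_SEC);
    if (ts.tv_nsec >= NSEC_PER_SEC) {
        /* Carry the overflow into the seconds field. */
        ts.tv_sec  += 1;
        ts.tv_nsec -= NSEC_PER_SEC;
    }
    return ts;
}
```

Forgetting the carry leaves tv_nsec out of range, which a timed wait may
reject as an invalid argument.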
|
|
|
|
|
|
|
|
| |
I've never observed this counter taking a non-zero value, however I do
think its existence is justified by the comment in grab_local_todo_block.
I've not added it to RTSStats in GHC.Stats, as it doesn't seem worth the
api churn.
|
|
|
|
| |
We are no longer busyish waiting, so this is no longer meaningful
|
|
|
|
|
| |
These are the two remaining non-atomic accesses to `wakeup` which were
missed by the original TSAN patch.
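As an illustration only, atomic access to a multi-valued flag can be sketched
with C11 atomics. The struct, the accessor names and the four state names
below are assumptions made for this example (an older entry in this log says
only that the flag takes four values); the RTS uses its own primitives.

```c
#include <stdatomic.h>

/* Hypothetical state names for a four-valued wakeup flag. */
enum {
    THREAD_INACTIVE,
    THREAD_STANDING_BY,
    THREAD_RUNNING,
    THREAD_WAITING
};

/* Hypothetical per-thread record holding the flag. */
typedef struct {
    _Atomic unsigned wakeup;
} GcThread;

/* Publish a new state; release ordering makes earlier writes visible
 * to a thread that later observes this value with acquire ordering. */
void set_wakeup(GcThread *t, unsigned state)
{
    atomic_store_explicit(&t->wakeup, state, memory_order_release);
}

/* Read the current state without a data race. */
unsigned get_wakeup(GcThread *t)
{
    return atomic_load_explicit(&t->wakeup, memory_order_acquire);
}
```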
|
| |
This implements the core heap structure and a serial mark/sweep
collector which can be used to manage the oldest-generation heap.
This is the first step towards a concurrent mark-and-sweep collector
aimed at low-latency applications.
The full design of the collector implemented here is described in detail
in a technical note
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
The basic heap structure used in this design is heavily inspired by
K. Ueno & A. Ohori. "A fully concurrent garbage collector for
functional programs on multicore processors." /ACM SIGPLAN Notices/
Vol. 51. No. 9 (presented at ICFP 2016)
This design is intended to allow both marking and sweeping
concurrent with execution of a multi-core mutator. Unlike the Ueno design,
which requires no global synchronization pauses, the collector
introduced here requires a stop-the-world pause at the beginning and end
of the mark phase.
To avoid heap fragmentation, the allocator consists of a number of
fixed-size /sub-allocators/. Each of these sub-allocators allocates into
its own set of /segments/, themselves allocated from the block
allocator. Each segment is broken into a set of fixed-size allocation
blocks (which back allocations) in addition to a bitmap (used to track
the liveness of blocks) and some additional metadata (also used
to track liveness).
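A rough sketch of the segment layout described above, under simplifying
assumptions: one byte of bitmap per block, a fixed block count and size, and
no epochs or concurrency. All names and constants here are hypothetical and
are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_BLOCKS 64u   /* assumed number of blocks per segment */
#define BLOCK_SIZE 32u   /* assumed fixed block size in bytes */

/* Hypothetical segment: a liveness bitmap plus fixed-size blocks. */
typedef struct {
    uint8_t  bitmap[SEG_BLOCKS];             /* nonzero = block is live */
    uint8_t  blocks[SEG_BLOCKS][BLOCK_SIZE]; /* storage backing allocations */
    unsigned next_free;                      /* allocation cursor */
} Segment;

void segment_init(Segment *seg) { memset(seg, 0, sizeof *seg); }

/* Hand out the next block not marked live, or NULL if the segment is full. */
void *segment_alloc(Segment *seg)
{
    while (seg->next_free < SEG_BLOCKS && seg->bitmap[seg->next_free])
        seg->next_free++;
    if (seg->next_free >= SEG_BLOCKS)
        return NULL;
    return seg->blocks[seg->next_free++];
}

/* Mark phase: record that the block containing p is live. */
void segment_mark(Segment *seg, void *p)
{
    size_t idx = (size_t)((uint8_t *)p - &seg->blocks[0][0]) / BLOCK_SIZE;
    seg->bitmap[idx] = 1;
}

/* Sweep phase: count the dead blocks and rewind the cursor so that the
 * allocator reuses them. Nothing is moved, so live objects keep their
 * addresses. */
unsigned segment_sweep(Segment *seg)
{
    unsigned dead = 0;
    for (unsigned i = 0; i < SEG_BLOCKS; i++)
        if (!seg->bitmap[i])
            dead++;
    seg->next_free = 0;
    return dead;
}
```

Because every block in a segment has the same size, a freed block can always
satisfy the next allocation from that sub-allocator, which is how this layout
avoids fragmentation.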
This heap structure enables collection via mark-and-sweep, which can be
performed concurrently via a snapshot-at-the-beginning scheme (although
concurrent collection is not implemented in this patch).
The mark queue is a fairly straightforward chunked-array structure.
The representation is a bit more verbose than a typical mark queue to
accommodate a combination of two features:
* a mark FIFO, which improves the locality of marking, reducing one of
the major overheads seen in mark/sweep allocators (see [1] for
details)
* the selector optimization and indirection shortcutting, which
requires that we track where we found each reference to an object
in case we need to update the reference at a later point (e.g. when
we find that it is an indirection). See Note [Origin references in
the nonmoving collector] (in `NonMovingMark.h`) for details.
Beyond this the mark/sweep is fairly run-of-the-mill.
[1] R. Garner, S.M. Blackburn, D. Frampton. "Effective Prefetch for
Mark-Sweep Garbage Collection." ISMM 2007.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
| |
- No need to distinguish between gcc-llvm and clang. First of all,
gcc-llvm is quite old and surely unmaintained by now. Second of all,
none of the code actually cares about that distinction!
Now, it does make sense to consider multiple C frontends for LLVM in
the form of clang vs clang-cl (same clang, yes, but tweaked
interface). But this is better handled in terms of "gccish vs
msvcish" and "is LLVM", yielding 4 combinations. Therefore, I don't
think it is useful saving the existing code for that.
- Get the remaining CC_LLVM_BACKEND, and also TABLES_NEXT_TO_CODE in
mk/config.h the normal way, rather than hacking it post-hoc. No point
keeping these special cases around for no reason.
- Get rid of hand-rolled `die` function and just use `AC_MSG_ERROR`.
- Abstract check + flag override for unregisterised and tables next to
code.
Oh, and as part of the above I also renamed/combined some variables
where it felt appropriate.
- GccIsClang -> CcLlvmBackend. This is for `AC_SUBST`, like the other
camel case ones. It was never about gcc-llvm, or Apple's renamed clang,
to be clear.
- llvm_CC_FLAVOR -> CC_LLVM_BACKEND. This is for `AC_DEFINE`, like the
other all-caps snake case ones. llvm_CC_FLAVOR was just silly
indirection *and* an odd name to boot.
|
|
| |
This moves all URL references to Trac Wiki to their corresponding
GitLab counterparts.
This substitution is classified as follows:
1. Automated substitution using sed with Ben's mapping rule [1]
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...
2. Manual substitution for URLs containing `#` index
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz
3. Manual substitution for strings starting with `Commentary`
Old: Commentary/XxxYyy...
New: commentary/xxx-yyy...
See also !539
[1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json
|
|
| |
Summary:
GC sync is the time between a GC being initiated and all the mutator
threads finally stopping so that the GC can start. Problems that cause
the GC sync to be delayed are hard to find and can cause dramatic
slowdowns for heavily parallel programs.
The new flag --long-gc-sync=<time> helps by emitting a warning and
calling a user-overridable hook when the GC sync time exceeds the
specified threshold. A debugger can be used to set a breakpoint when
this happens and inspect the stacks of threads to find the culprit.
Test Plan:
```
$ ./inplace/bin/ghc-stage2 +RTS --long-gc-sync=0.0000001 -S
Alloc Copied Live GC GC TOT TOT Page Flts
bytes bytes bytes user elap user elap
1135856 51144 153736 0.000 0.000 0.002 0.002 0 0 (Gen: 0)
1034760 94704 188752 0.000 0.000 0.002 0.002 0 0 (Gen: 0)
1038888 134832 228888 0.009 0.009 0.011 0.011 0 0 (Gen: 1)
1025288 90128 235184 0.000 0.000 0.012 0.012 0 0 (Gen: 0)
1049088 130080 333984 0.000 0.000 0.013 0.013 0 0 (Gen: 0)
Warning: waited 0us for GC sync
1034424 73360 331976 0.000 0.000 0.013 0.013 0 0 (Gen: 0)
```
Also tested on a real production problem.
Reviewers: niteria, bgamari, erikd
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D4193
|
|
|
|
| |
Our new CPP linter enforces this.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This both says what we mean and silences a bunch of spurious CPP linting
warnings. This pragma is supported by all CPP implementations which we
support.
Reviewers: austin, erikd, simonmar, hvr
Reviewed By: simonmar
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3482
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Validate on lots of platforms
Reviewers: erikd, simonmar, austin
Reviewed By: erikd, simonmar
Subscribers: michalt, thomie
Differential Revision: https://phabricator.haskell.org/D2699
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The problem boils down to global variables: in particular gc_threads[],
which was being modified by a subsequent GC before the previous GC had
finished with it. The fix is to not use global variables.
This was causing setnumcapabilities001 to fail (again!). It's an old
bug though.
Test Plan:
Ran setnumcapabilities001 in a loop for a couple of hours. Before this
patch it had been failing after a few minutes. Not a very scientific
test, but it's the best I have.
Reviewers: bgamari, austin, fryguybob, niteria, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2654
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `nat` type was an alias for `unsigned int` with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it but mark it as deprecated.
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166
|
|
|
|
|
|
|
|
|
| |
After a parallel GC, it is possible to have a long list of blocks in
ws->part_list, if we did a lot of work stealing but didn't fill up the
blocks we stole. These blocks persist until the next load-balanced GC,
which might be a long time, and during every GC we were traversing this
list to find its size. The fix is to maintain the size all the time, so
we don't have to compute it.
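The idea can be sketched as a linked list whose length is updated on every
push and pop, so that reading the size is O(1). The `part_list` name comes
from the message above; the `n_part_blocks` counter, the types and the
helper functions are assumptions for illustration.

```c
#include <stddef.h>

/* Hypothetical block descriptor with an intrusive link. */
typedef struct Block {
    struct Block *link;
} Block;

/* Hypothetical workspace: the list plus a running count of its length. */
typedef struct {
    Block *part_list;
    size_t n_part_blocks;   /* kept in sync by push and pop */
} Workspace;

void push_part_block(Workspace *ws, Block *bd)
{
    bd->link = ws->part_list;
    ws->part_list = bd;
    ws->n_part_blocks++;
}

Block *pop_part_block(Workspace *ws)
{
    Block *bd = ws->part_list;
    if (bd != NULL) {
        ws->part_list = bd->link;
        bd->link = NULL;
        ws->n_part_blocks--;
    }
    return bd;
}
```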
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This hasn't been used for a very long time and will soon be superseded
by perf_events support.
Test Plan: validate
Reviewers: austin, simonmar
Reviewed By: austin, simonmar
Subscribers: thomie, erikd
Differential Revision: https://phabricator.haskell.org/D1493
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
[Revised version of D1076 that was committed and then backed out]
In a workload with a large amount of code, zero_static_objects_list()
takes a significant amount of time, and furthermore it is in the
single-threaded part of the GC.
This patch uses a slightly fiddly scheme for marking objects on the
static object lists, using a flag in the low 2 bits that flips between
two states to indicate whether an object has been visited during this
GC or not. We also have to take into account objects that have not
been visited yet, which might appear at any time due to runtime linking.
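A simplified sketch of the flipping-flag idea, assuming the flag lives in the
low 2 bits of a word-sized link field and that an object never touched by the
GC has those bits set to zero. The names and the exact encoding are
hypothetical; the patch itself is the reference for the real scheme.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAG_A    1u
#define FLAG_B    2u
#define FLAG_MASK 3u

/* Hypothetical static object: link word with the flag in its low 2 bits. */
typedef struct {
    uintptr_t link;
} StaticObj;

/* The flag value that means "visited during the current GC". */
unsigned current_flag = FLAG_A;

/* Starting a new GC flips the flag, which invalidates every mark from the
 * previous GC at once, with no pass over the static object list. */
void start_next_gc(void)
{
    current_flag = (current_flag == FLAG_A) ? FLAG_B : FLAG_A;
}

/* An object carrying the old flag, or no flag at all (for example one that
 * appeared through runtime linking), counts as not yet visited. */
bool visited_this_gc(const StaticObj *obj)
{
    return (obj->link & FLAG_MASK) == current_flag;
}

void mark_visited(StaticObj *obj)
{
    obj->link = (obj->link & ~(uintptr_t)FLAG_MASK) | current_flag;
}
```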
Test Plan: validate
Reviewers: austin, ezyang, rwbarton, bgamari, thomie
Reviewed By: bgamari, thomie
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1106
|
|
|
|
| |
This reverts commit b949c96b4960168a3b399fe14485b24a2167b982.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
In a workload with a large amount of code, zero_static_objects_list()
takes a significant amount of time, and furthermore it is in the
single-threaded part of the GC.
This patch uses a slightly fiddly scheme for marking objects on the
static object lists, using a flag in the low 2 bits that flips between
two states to indicate whether an object has been visited during this
GC or not. We also have to take into account objects that have not
been visited yet, which might appear at any time due to runtime linking.
Test Plan: validate
Reviewers: austin, bgamari, ezyang, rwbarton
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1076
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Hooks rely on static linking semantics, and are broken by -Bsymbolic
which we need when using dynamic linking.
Test Plan: Built it
Reviewers: austin, hvr, tibbe
Differential Revision: https://phabricator.haskell.org/D8
|
|
|
|
| |
This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
|
|
|
|
|
|
|
|
| |
This will hopefully help ensure some basic consistency going forward by
overriding buffer variables. In particular, it sets the wrap length, the
offset to 4, and turns off tabs.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary: Avoid unnecessary clock_gettime() syscalls in GC stats.
Test Plan: Use strace.
Reviewers: simonmar, austin
Reviewed By: simonmar, austin
Subscribers: simonmar, relrod, carter
Differential Revision: https://phabricator.haskell.org/D39
|
|
|
|
| |
c.f. commit 0b0fec536e35769b64b8bc5397c84138fa512155
|
| |
|
|
|
|
| |
volatile StgWord8 is not guaranteed to be atomic.
|
|
|
|
|
|
|
|
|
|
| |
rtsBool is defined to only have two inhabitants, which are true (1) and
false (0)
But the wakeup flag is set to 4 possible values, outside the range of
rtsBool. This leads Clang to warn about tautological comparisons.
Signed-off-by: Austin Seipp <aseipp@pobox.com>
|
|
|
|
|
|
|
|
|
|
|
| |
We were doing it in two different ways and asserting that the results
were the same. In most cases they were, but I found one case where
they weren't: the GC itself allocates some memory for running
finalizers, and this memory was accounted for one way but not the
other.
It was simpler to remove the old way of counting allocation than to
try to fix it up, so I did that.
|
|
|
|
|
|
|
| |
Reordering of includes in GC.c broke on OS X because gctKey is
declared in Task.h and is needed in the storage manager. This is
really the wrong place for it anyway, so I've moved the gctKey pieces
to where they should be.
|
|
|
|
|
|
|
|
|
|
|
|
| |
lnat was originally "long unsigned int" but we were using it when we
wanted a 64-bit type on a 64-bit machine. This broke on Windows x64,
where long == int == 32 bits. Using types of unspecified size is bad,
but what we really wanted was a type with N bits on an N-bit machine.
StgWord is exactly that.
lnat was mentioned in some APIs that clients might be using
(e.g. StackOverflowHook()), so we leave it defined but with a comment
to say that it's deprecated.
|
|
| |
The clearNurseries() operation resets the free pointer in each nursery
block to the start of the block, emptying the nursery. In the
parallel GC this was done on the main GC thread, but that's bad
because it accesses the bdescr of every nursery block, and moves all
those cache lines onto the CPU of the main GC thread. With large
nurseries, this can be especially bad. So instead we want to clear
each nursery in its local GC thread.
Thanks to Andreas Voellmy <andreas.voellmy@gmail.com> for identifying
the issue.
After this change and the previous patch to make the last GC a major
one, I see these results for nofib/parallel on 8 cores:
blackscholes         +0.0%     +0.0%     -3.7%     -3.3%     +0.3%
coins                +0.0%     +0.0%     -5.1%     -5.0%     +0.4%
gray                 +0.0%     +0.0%     -4.5%     -2.1%     +0.8%
mandel               +0.0%     -0.0%     -7.6%     -5.1%     -2.3%
matmult              +0.0%     +5.5%     -2.8%     -1.9%     -5.8%
minimax              +0.0%     +0.0%    -10.6%    -10.5%     +0.0%
nbody                +0.0%     -4.4%     +0.0%      0.07     +0.0%
parfib               +0.0%     +1.0%     +0.5%     +0.9%     +0.0%
partree              +0.0%     +0.0%     -2.4%     -2.5%     +1.7%
prsa                 +0.0%     -0.2%     +1.8%     +4.2%     +0.0%
queens               +0.0%     -0.0%     -1.8%     -1.4%     -4.8%
ray                  +0.0%     -0.6%    -18.5%    -17.8%     +0.0%
sumeuler             +0.0%     -0.0%     -3.7%     -3.7%     +0.0%
transclos            +0.0%     -0.0%    -25.7%    -26.6%     +0.0%
--------------------------------------------------------------------------------
Min                  +0.0%     -4.4%    -25.7%    -26.6%     -5.8%
Max                  +0.0%     +5.5%     +1.8%     +4.2%     +1.7%
Geometric Mean       +0.0%     +0.1%     -6.3%     -6.1%     -0.7%
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is an experimental tweak to the parallel GC that avoids waking up
a Capability to do parallel GC if we know that the capability has been
idle for a (tunable) number of GC cycles. The idea is that if you're
only using a few Capabilities, there's no point waking up the ones
that aren't busy.
e.g. +RTS -qi3
says "A Capability will participate in parallel GC if it was running
at all since the last 3 GC cycles."
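One possible reading of that rule, sketched with a per-capability counter of
consecutive idle GC cycles. The struct, the counter and the function name are
hypothetical, and the threshold parameter stands in for the value given to
-qi.

```c
#include <stdbool.h>

/* Hypothetical capability record: number of consecutive GC cycles during
 * which this capability did no mutator work. */
typedef struct {
    unsigned idle_gcs;
} Cap;

/* Called once per GC for each capability: update the idle counter and
 * decide whether the capability is woken up to take part in parallel GC. */
bool should_participate(Cap *cap, bool ran_since_last_gc, unsigned max_idle)
{
    if (ran_since_last_gc)
        cap->idle_gcs = 0;
    else
        cap->idle_gcs++;
    return cap->idle_gcs < max_idle;
}
```

With a threshold of 3, a capability that stops running still takes part in
the next two collections and is left asleep from the third one onwards, until
it runs again.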
Results are a bit hit and miss, and I don't completely understand why
yet. Hence, for now it is turned off by default, and also not
documented except in the +RTS -? output.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Terminology cleanup: the type "Ticks" has been renamed "Time", which
is an StgWord64 in units of TIME_RESOLUTION (currently nanoseconds).
The terminology "tick" is now used consistently to mean the interval
between timer signals.
The ticker now always ticks in realtime (actually CLOCK_MONOTONIC if
we have it). Before it used CPU time in the non-threaded RTS and
realtime in the threaded RTS, but I've discovered that the CPU timer
has terrible resolution (at least on Linux) and isn't much use for
profiling. So now we always use realtime. This should also fix
The default tick interval is now 10ms, except when profiling where we
drop it to 1ms. This gives more accurate profiles without affecting
runtime too much (<1%).
Lots of cleanups - the resolution of Time is now in one place
only (Rts.h) rather than having calculations that depend on the
resolution scattered all over the RTS. I hope I found them all.
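A small sketch of what keeping the resolution in one place can look like:
every conversion is written in terms of a single TIME_RESOLUTION constant.
`Time` and `TIME_RESOLUTION` are named in the message above; the typedef to
`uint64_t` (standing in for StgWord64) and the conversion function names are
assumptions for this example.

```c
#include <stdint.h>

typedef uint64_t Time;                   /* stand-in for StgWord64 */

/* Time units per second; nanoseconds, as stated in the message. */
#define TIME_RESOLUTION 1000000000ULL

/* Hypothetical conversions, all derived from TIME_RESOLUTION. */
Time seconds_to_time(uint64_t s)
{
    return s * TIME_RESOLUTION;
}

Time ms_to_time(uint64_t ms)
{
    return ms * (TIME_RESOLUTION / 1000);
}

uint64_t time_to_us(Time t)
{
    return t / (TIME_RESOLUTION / 1000000);
}
```

Under these definitions the default 10ms tick interval is ms_to_time(10), and
changing the resolution only requires editing the one constant.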
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a port of some of the changes from my private local-GC branch
(which is still in darcs, I haven't converted it to git yet). There
are a couple of small functional differences in the GC stats: first,
per-thread GC timings should now be more accurate, and secondly we now
report average and maximum pause times. e.g. from minimax +RTS -N8 -s:
                                  Tot time (elapsed)  Avg pause  Max pause
Gen 0   2755 colls, 2754 par        13.16s    0.93s     0.0003s    0.0150s
Gen 1    769 colls,  769 par         3.71s    0.26s     0.0003s    0.0059s
|
|
|
|
|
|
| |
Store the *number* of the destination generation in the Bdescr struct,
so that in evacuate() we don't have to deref gen to get it.
This is another improvement ported over from my GC branch.
|
|
|
|
| |
Which was being used seemed to be random
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The GC had a two-level structure, G generations each of T steps.
Steps are for aging within a generation, mostly to avoid premature
promotion.
Measurements show that more than 2 steps is almost never worthwhile,
and 1 step is usually worse than 2. In theory fractional steps are
possible, so the ideal number of steps is somewhere between 1 and 3.
GHC's default has always been 2.
We can implement 2 steps quite straightforwardly by having each block
point to the generation to which objects in that block should be
promoted, so blocks in the nursery point to generation 0, and blocks
in gen 0 point to gen 1, and so on.
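The promotion rule can be sketched as a destination number stored in each
block descriptor. The descriptor fields and the function name are
hypothetical, and the choice that blocks in the oldest generation point back
to that same generation is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical block descriptor. */
typedef struct {
    bool     in_nursery;  /* block belongs to a nursery */
    unsigned gen_no;      /* generation owning the block (if not nursery) */
    unsigned dest_no;     /* generation that survivors are copied into */
} BlockDescr;

/* Set the destination generation for survivors found in this block. */
void set_dest(BlockDescr *bd, unsigned n_gens)
{
    if (bd->in_nursery)
        bd->dest_no = 0;               /* nursery -> gen 0 */
    else if (bd->gen_no + 1 < n_gens)
        bd->dest_no = bd->gen_no + 1;  /* gen g -> gen g+1 */
    else
        bd->dest_no = bd->gen_no;      /* oldest gen stays put (assumed) */
}
```

An object allocated in the nursery therefore survives one collection in gen 0
before it can be promoted to gen 1, which gives the two-step aging without a
separate step structure.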
This commit removes the explicit step structures, merging generations
with steps, thus simplifying a lot of code. Performance is
unaffected. The tunable number of steps is now gone, although it may
be replaced in the future by a way to tune the aging in generation 0.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
This has no effect with static libraries, but when the RTS is in a
shared library it does two things:
- it prevents the function from being exposed by the shared library
- internal calls to the function can use the faster non-PLT calls,
because the function cannot be overridden at link time.
|
|
| |
The first phase of this tidyup is focussed on the header files, and in
particular making sure we are exposing publicly exactly what we need
to, and no more.
- Rts.h now includes everything that the RTS exposes publicly,
rather than a random subset of it.
- Most of the public header files have moved into subdirectories, and
many of them have been renamed. But clients should not need to
include any of the other headers directly, just #include the main
public headers: Rts.h, HsFFI.h, RtsAPI.h.
- All the headers needed for via-C compilation have moved into the
stg subdirectory, which is self-contained. Most of the headers for
the rest of the RTS APIs have moved into the rts subdirectory.
- I left MachDeps.h where it is, because it is so widely used in
Haskell code.
- I left a deprecated stub for RtsFlags.h in place. The flag
structures are now exposed by Rts.h.
- Various internal APIs are no longer exposed by public header files.
- Various bits of dead code and declarations have been removed
- More gcc warnings are turned on, and the RTS code is more
warning-clean.
- More source files #include "PosixSource.h", and hence only use
standard POSIX (1003.1c-1995) interfaces.
There is a lot more tidying up still to do, this is just the first
pass. I also intend to standardise the names for external RTS APIs
(e.g. use the rts_ prefix consistently), and declare the internal APIs
as hidden for shared libraries.
|
|
|
|
|
| |
Can't use windowed regs because the window moves during a function
call. Can't use the global regs because they're reserved for other purposes.
|
|
|
|
|
|
|
| |
With the
On x86, use thread-local storage instead of stealing a reg for gct
patch, on Windows and OS X:
error: thread-local storage not supported for this target
|
|
|
|
|
|
|
|
| |
Benchmarks show that using TLS instead of stealing a register is
better by a few percent on x86, due to the lack of registers.
This only affects -threaded; without -threaded we're (now) using
static storage for the GC data.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
New flag: "+RTS -qb" disables load-balancing in the parallel GC
(though this is subject to change, I think we will probably want to do
something more automatic before releasing this).
To get the "PARGC3" configuration described in the "Runtime support
for Multicore Haskell" paper, use "+RTS -qg0 -qb -RTS".
The main advantage of this is that it allows us to easily disable
load-balancing altogether, which turns out to be important in parallel
programs. Maintaining locality is sometimes more important than
spreading the work out in parallel GC. There is a side benefit in
that the parallel GC should have improved locality even when
load-balancing, because each processor prefers to take work from its
own queue before stealing from others.
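The policy of preferring the local queue, with stealing as an optional
fallback, can be sketched as follows. This is a single-threaded illustration
with hypothetical types and names; a real work-stealing queue needs
synchronization, which is left out here.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 16

/* Hypothetical per-thread work queue holding integer work items. */
typedef struct {
    int    items[QUEUE_CAP];
    size_t len;
} Queue;

/* Look for work on behalf of thread `me`. The local queue is tried first;
 * other queues are searched only when load balancing is enabled.
 * Returns true and stores the item in *out if work was found. */
bool find_work(Queue *qs, size_t n, size_t me, bool load_balance, int *out)
{
    if (qs[me].len > 0) {
        *out = qs[me].items[--qs[me].len];
        return true;
    }
    if (!load_balance)
        return false;   /* corresponds to disabling load-balancing */
    for (size_t i = 0; i < n; i++) {
        if (i != me && qs[i].len > 0) {
            *out = qs[i].items[--qs[i].len];
            return true;
        }
    }
    return false;
}
```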
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This turns out to be quite vital for parallel programs:
- The way we discover which threads to traverse is by finding
dirty threads via the remembered sets (aka mutable lists).
- A dirty thread will be on the remembered set of the capability
that was running it, and we really want to traverse that thread's
stack using the GC thread for the capability, because it is in
that CPU's cache. If we get this wrong, we get penalised badly by
the memory system.
Previously we had per-capability mutable lists but they were
aggregated before GC and traversed by just one of the GC threads.
This resulted in very poor performance particularly for parallel
programs with deep stacks.
Now we keep per-capability remembered sets throughout GC, which also
removes a lock (recordMutableGen_sync).
|
|
|
|
| |
This makes the build work again.
|
| |
Previously, the GC had its own pool of threads to use as workers when
doing parallel GC. There was a "leader", which was the mutator thread
that initiated the GC, and the other threads were taken from the pool.
This was simple and worked fine for sequential programs, where we did
most of the benchmarking for the parallel GC, but falls down for
parallel programs. When we have N mutator threads and N cores, at GC
time we would have to stop N-1 mutator threads and start up N-1 GC
threads, and hope that the OS schedules them all onto separate cores.
In practice it doesn't, as you might expect.
Now we use the mutator threads to do GC. This works quite nicely,
particularly for parallel programs, where each mutator thread scans
its own spark pool, which is probably in its cache anyway.
There are some flag changes:
-g<n> is removed (-g1 is still accepted for backwards compat).
There's no way to have a different number of GC threads than mutator
threads now.
-q1 Use one OS thread for GC (turns off parallel GC)
-qg<n> Use parallel GC for generations >= <n> (default: 1)
Using parallel GC only for generations >=1 works well for sequential
programs. Compiling an ordinary sequential program with -threaded and
running it with -N2 or more should help if you do a lot of GC. I've
found that adding -qg0 (do parallel GC for generation 0 too) speeds up
some parallel programs, but slows down some sequential programs.
Being conservative, I left the threshold at 1.
ToDo: document the new options.
|