| Commit message | Author | Age | Files | Lines |
| |
Summary:
The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.
Linux provides a NUMA API for doing two things:
* Allocating memory local to a particular node
* Binding a thread to a particular node
When given the +RTS --numa flag, the runtime will
* Determine the number of NUMA nodes (N) by querying the OS
* Assign capabilities to nodes, so cap C is on node C%N
* Bind worker threads on a capability to the correct node
* Keep separate free lists in the block layer, one for each node
* Allocate the nursery for a capability from node-local memory
* Allocate blocks in the GC from node-local memory
For example, using nofib/parallel/queens on a 24-core 2-socket machine:
```
$ ./Main 15 +RTS -N24 -s -A64m
Total time 173.960s ( 7.467s elapsed)
$ ./Main 15 +RTS -N24 -s -A64m --numa
Total time 150.836s ( 6.423s elapsed)
```
The biggest win here is expected to be allocating from node-local
memory, so that means programs using a large -A value (as here).
According to perf, the number of remote memory accesses on this program
was reduced by more than 50% by using `--numa`.
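For illustration only, here is a minimal C sketch of the capability-to-node
mapping and node-local allocation described above, written directly against
libnuma; the helper names are hypothetical and this is not the RTS code:
```
/* Hypothetical sketch -- not the actual RTS implementation. */
#include <numa.h>      /* libnuma; link with -lnuma */
#include <stddef.h>

static int n_numa_nodes = 1;

/* capability C lives on node C % N */
static int capNodeFor (int cap_no) { return cap_no % n_numa_nodes; }

/* allocate a capability's nursery from node-local memory */
static void *allocNurseryFor (int cap_no, size_t bytes)
{
    return numa_alloc_onnode(bytes, capNodeFor(cap_no));
}

/* bind a capability's worker thread to its node */
static void bindWorkerFor (int cap_no)
{
    numa_run_on_node(capNodeFor(cap_no));
}

int main (void)
{
    if (numa_available() != -1)
        n_numa_nodes = numa_num_configured_nodes();
    bindWorkerFor(0);
    void *nursery = allocNurseryFor(0, 64 * 1024 * 1024);   /* cf. -A64m */
    (void)nursery;
    return 0;
}
```
The real runtime allocates through the block layer's per-node free lists rather
than calling libnuma for each allocation; the sketch only shows the node
arithmetic and binding.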
Test Plan:
* validate
* There's a new flag --debug-numa=<n> that pretends to do NUMA without
actually making the OS calls, which is useful for testing the code
on non-NUMA systems.
* TODO: I need to add some unit tests
Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2199
|
| |
The `nat` type was an alias for `unsigned int`, with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it, but mark it as deprecated.
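A minimal sketch of what keeping a deprecated alias can look like (illustrative
only; the actual header may differ in detail):
```
#include <stdint.h>

/* New code should use fixed-width types such as uint32_t directly; the old
 * alias is kept for client code but flagged on use (GCC/Clang attribute). */
typedef uint32_t nat __attribute__((deprecated("use uint32_t instead")));
```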
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166
|
| |
This causes errors with some versions of gcc (4.4.7 here).
|
| |
This allows the GC to use fewer threads than the number of capabilities.
At each GC, we choose some of the capabilities to be "idle", which means
that the thread running on that capability (if any) will sleep for the
duration of the GC, and the other threads will do its work. We choose
capabilities that are already idle (if any) to be the idle capabilities.
The idea is that this helps in the following situation:
* We want to use a large -N value so as to make use of hyperthreaded
cores
* We use a large heap size, so GC is infrequent
* But we don't want to use all -N threads in the GC, because that
thrashes the memory too much.
See docs for usage.
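As a rough sketch of the selection policy described above (illustrative C, not
the actual scheduler code):
```
#include <stdbool.h>

/* cap_was_idle[i]: capability i had no running thread before this GC.
 * Mark n_caps - n_gc_threads capabilities as idle for the coming GC,
 * preferring ones that were already idle.  Sketch only. */
static void chooseIdleCaps (int n_caps, int n_gc_threads,
                            const bool cap_was_idle[],
                            bool idle_for_gc[])
{
    int want_idle = n_caps - n_gc_threads;
    int chosen = 0;

    for (int i = 0; i < n_caps; i++) idle_for_gc[i] = false;

    /* first pick capabilities that were already idle */
    for (int i = 0; i < n_caps && chosen < want_idle; i++)
        if (cap_was_idle[i]) { idle_for_gc[i] = true; chosen++; }

    /* then, if still short, idle busy capabilities as well */
    for (int i = n_caps - 1; i >= 0 && chosen < want_idle; i--)
        if (!idle_for_gc[i]) { idle_for_gc[i] = true; chosen++; }
}
```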
|
| |
This allows an OS thread to specify which capability it should run on
when it makes a call into Haskell. It is intended for a fairly
specialised use case, when the client wants to have tighter control over
the mapping between OS threads and Capabilities - perhaps 1:1
correspondence, for example.
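If the entry point is something like `rts_setInCallCapability` (the name and
signature here are assumptions; check `RtsAPI.h` for the real interface), a
client thread might use it roughly as follows before calling into Haskell:
```
/* Sketch only; function name and signature assumed. */
#include "HsFFI.h"
#include "RtsAPI.h"

extern void myHaskellEntry (void);   /* hypothetical foreign-exported function */

static void worker_for_capability (int cap_no)
{
    /* Ask the RTS to run this OS thread's in-calls on capability cap_no.
     * (Second argument: whether to also set this thread's CPU affinity.) */
    rts_setInCallCapability(cap_no, 0);
    myHaskellEntry();                 /* now runs on capability cap_no */
}
```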
|
| |
Noticed by uselex.rb:
last_free_capability: [R]: exported from:
./rts/dist/build/Capability.o
shutdownCapability: [R]: exported from:
./rts/dist/build/Capability.o
Signed-off-by: Sergei Trofimovich <siarheit@google.com>
|
| |
yieldCapability() was not prepared to be called by a Task that is neither
a worker nor a bound Task. This could happen if we ended up in
yieldCapability via this call stack:
performGC()
scheduleDoGC()
requestSync()
yieldCapability()
and there were a few other ways this could happen via requestSync.
The fix is to handle this case in yieldCapability(): when the Task is
not a worker or a bound Task, we put it on the returning_workers
queue, where it will be woken up again.
Summary of changes:
* `yieldCapability`: factored out subroutine `waitForWorkerCapability`
* `waitForReturnCapability` renamed to `waitForCapability`, and
factored out subroutine `waitForReturnCapability`
* `releaseCapabilityAndQueueWorker` renamed to `enqueueWorker`; it does
not take a lock and no longer tests if `!isBoundTask()`
* `yieldCapability` adjusted for refactorings, only change in behavior
is when it is not a worker or bound task.
Test Plan:
* new test concurrent/should_run/performGC
* validate
Reviewers: niteria, austin, ezyang, bgamari
Subscribers: thomie, bgamari
Differential Revision: https://phabricator.haskell.org/D997
GHC Trac Issues: #10545
|
| |
This reverts commit f0fcc41d755876a1b02d1c7c79f57515059f6417.
New changes: now works on 32-bit platforms too. I added some basic
support for 64-bit subtraction and comparison operations to the x86
NCG.
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
| |
This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
|
| |
Summary:
This reverts commit 4748f5936fe72d96edfa17b153dbfd84f2c4c053. The fix for #9423
was reverted because that fix introduced a C function setIOManagerControlFd()
(in Schedule.c) that was defined for all OS types, while its prototype
(in includes/rts/IOManager.h) was only included when mingw32_HOST_OS is
not defined. This broke Windows builds.
This commit restores the fix and resolves the problem by only defining
setIOManagerControlFd() when mingw32_HOST_OS is not defined. Hence the missing
prototype error should not occur on Windows.
In addition, since the io_manager_control_wr_fd field of the Capability struct is
only used by setIOManagerControlFd(), this commit includes the
io_manager_control_wr_fd field in the Capability struct only when
mingw32_HOST_OS is not defined.
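The shape of the fix is roughly the following (a sketch, not the literal patch;
the `nat cap_no` parameter is an assumption):
```
/* includes/rts/IOManager.h */
#if !defined(mingw32_HOST_OS)
void setIOManagerControlFd (nat cap_no, int fd);
#endif

/* rts/Capability.c */
#if !defined(mingw32_HOST_OS)
void setIOManagerControlFd (nat cap_no, int fd)
{
    /* record the write end of this capability's IO manager control pipe */
    capabilities[cap_no]->io_manager_control_wr_fd = fd;
}
#endif

/* rts/Capability.h, inside struct Capability_ */
#if !defined(mingw32_HOST_OS)
    int io_manager_control_wr_fd;
#endif
```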
Test Plan: Try to compile successfully on all platforms.
Reviewers: austin
Reviewed By: austin
Subscribers: simonmar, ezyang, carter
Differential Revision: https://phabricator.haskell.org/D174
|
| |
This reverts commit 7bf49f86a20f3beda0ee5fbea2db64cfef730d74.
|
| |
This reverts commit 15df6d98afb8c3813013c5b97efffe0ba8020d32.
|
| |
This should fix the Windows fallout; hopefully the original change can go
back in once that's sorted out.
This reverts commit f9f89b7884ccc8ee5047cf4fffdf2b36df6832df.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
| |
Summary:
Fix #9423.
The problem in #9423 is caused when code invoked by `hs_exit()` waits
on all foreign calls to return, but some IO managers are in `safe` foreign
calls and do not return. The previous design signaled to the timer manager
(via its control pipe) that it should "die" and when the timer manager
returned to Haskell-land, the Haskell code in timer manager then signalled
to the IO manager threads that they should return from foreign calls and
`die`. Unfortunately, in the shutdown sequence the timer manager is unable
to return to Haskell-land fast enough and so the code that signals to the
IO manager threads (via their control pipes) is never executed and the IO
manager threads remain out in the foreign calls.
This patch solves this problem by having the RTS signal to all the IO
manager threads (via their control pipes; and in addition to signalling
to the timer manager thread) that they should shut down (in `ioManagerDie()`
in `rts/Signals.c`). To do this, we arrange for each IO manager thread to
register its control pipe with the RTS (in `GHC.Thread.startIOManagerThread`).
In addition, `GHC.Thread.startTimerManagerThread` registers its control pipe.
These are registered via C functions `setTimerManagerControlFd` (in
`rts/Signals.c`) and `setIOManagerControlFd` (in `rts/Capability.c`). The IO
manager control pipe file descriptors are stored in a new field of the
`Capability_` struct.
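A rough sketch of the shutdown side (illustrative only; the real code is in
`rts/Signals.c` and differs in detail, and the wakeup byte value and names here
are assumptions):
```
#include <unistd.h>

#define IO_MANAGER_DIE 0xfe            /* assumed wakeup byte */

/* Poke every registered control pipe so that each IO manager thread
 * returns from its foreign call and shuts down.  Sketch only. */
static void ioManagerDieSketch (int n_capabilities, const int control_wr_fd[])
{
    for (int i = 0; i < n_capabilities; i++) {
        int fd = control_wr_fd[i];
        if (fd >= 0) {
            unsigned char byte = IO_MANAGER_DIE;
            ssize_t r = write(fd, &byte, 1);
            (void)r;                   /* ignore errors during shutdown */
        }
    }
}
```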
Test Plan: See the notes on #9423 to recreate the problem and to verify that it no longer occurs with the fix.
Auditors: simonmar
Reviewers: simonmar, edsko, ezyang, austin
Reviewed By: austin
Subscribers: phaskell, simonmar, ezyang, carter, relrod
Differential Revision: https://phabricator.haskell.org/D129
GHC Trac Issues: #9423, #9284
|
| |
This will hopefully help ensure some basic consistency going forward by
overriding buffer variables. In particular, it sets the wrap length, the
offset to 4, and turns off tabs.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
| |
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
|
| |
UNREG mode has a quite nasty invariant to maintain:
    capabilities[0] == &MainCapability
which is non-heap memory, while other
capabilities are dynamically allocated.
Issue #8748
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
| |
We have various problems with reallocating the array of Capabilities,
due to threads in waitForReturnCapability that are already holding a
pointer to a Capability.
Rather than add more locking to make this safer, I decided it would be
easier to ensure that we never move the Capabilities at all. The
capabilities array is now an array of pointers to Capability. There
are extra indirections, but it rarely matters - we don't often access
Capabilities via the array, normally we already have a pointer to
one. I ran the parallel benchmarks and didn't see any difference.
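In miniature, the difference looks like this (a hypothetical sketch, not the
RTS code): only the table of pointers is ever reallocated, so the Capability
objects themselves, and any pointers other threads hold to them, never move:
```
#include <stdlib.h>

typedef struct { int no; /* ... many more fields ... */ } Capability;

static Capability **capabilities = NULL;
static int n_capabilities = 0;

/* Grow the table.  Existing Capability objects stay where they are. */
static void moreCapabilities (int to)
{
    capabilities = realloc(capabilities, to * sizeof(Capability *));
    for (int i = n_capabilities; i < to; i++) {
        capabilities[i] = calloc(1, sizeof(Capability));
        capabilities[i]->no = i;
    }
    n_capabilities = to;
}
```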
|
| |
This adds some new functions: peekRunQueue, promoteInRunQueue,
singletonRunQueue and truncateRunQueue which help abstract away
manual linked list manipulation, making it easier to swap in
a new queue implementation.
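For flavour, helpers of roughly this shape (a simplified sketch with made-up
types; the real ones do more bookkeeping):
```
/* Simplified sketch -- not the actual RTS definitions. */
typedef struct StgTSO_ {
    struct StgTSO_ *_link;            /* next thread in the run queue */
} StgTSO;

typedef struct {
    StgTSO *run_queue_hd;
    StgTSO *run_queue_tl;
} Capability;

#define END_TSO_QUEUE ((StgTSO *)0)   /* stand-in for the real sentinel */

/* look at the front of the run queue without removing it */
static StgTSO *peekRunQueue (Capability *cap)
{
    return cap->run_queue_hd;
}

/* exactly one thread queued? */
static int singletonRunQueue (Capability *cap)
{
    return cap->run_queue_hd != END_TSO_QUEUE
        && cap->run_queue_hd->_link == END_TSO_QUEUE;
}
```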
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
|
| |
A companion ghc-events package commit displays task ids in the same format.
|
| |
If we are interrupted to do a GC, then we do not immediately do another
one. This avoids a starvation situation where one Capability keeps
forcing a GC and the other Capabilities make no progress at all.
|
| |
Mostly this meant getting pointer<->int conversions to use the right
sizes. lnat is now size_t, rather than unsigned long, as that seems a
better match for how it's used.
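For instance, conversions of roughly this shape (a trivial illustration, not
taken from the patch):
```
#include <stdint.h>
#include <stddef.h>

/* size_t / uintptr_t match the pointer width on both 32- and 64-bit
 * platforms, unlike unsigned long on some targets (e.g. 64-bit Windows). */
static size_t ptr_to_word (void *p)  { return (size_t)(uintptr_t)p; }
static void  *word_to_ptr (size_t w) { return (void *)(uintptr_t)w; }
```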
|
| |
Will let us do final per-cap trace events from stat_exit().
Otherwise we would end up with eventlogs with events for caps
that have already been deleted.
|
| |
In addition to the existing global method. For now we just do
it both ways and assert they give the same grand total. At some
stage we can simplify the global method to just take the sum of
the per-cap counters.
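Schematically (a sketch with assumed names, not the actual code):
```
#include <assert.h>
#include <stdint.h>

#define MAX_CAPS 64

static uint64_t g_tot_alloc;                /* existing global counter      */
static uint64_t cap_tot_alloc[MAX_CAPS];    /* new per-capability counters  */

/* For now both are maintained; the assertion checks they agree. */
static uint64_t calcTotalAllocated (int n_caps)
{
    uint64_t tot = 0;
    for (int i = 0; i < n_caps; i++) tot += cap_tot_alloc[i];
    assert(tot == g_tot_alloc);
    return tot;
}
```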
|
| |
Now that we can adjust the number of capabilities on the fly, we need
this reflected in the eventlog. Previously the eventlog had a single
startup event that declared a static number of capabilities. Obviously
that's no good anymore.
For compatibility we're keeping the EVENT_STARTUP but adding new
EVENT_CAP_CREATE/DELETE. The EVENT_CAP_DELETE is actually just the old
EVENT_SHUTDOWN but renamed and extended (using the existing mechanism
to extend eventlog events in a compatible way). So we now emit both
EVENT_STARTUP and EVENT_CAP_CREATE. One day we will drop EVENT_STARTUP.
Since reducing the number of capabilities at runtime does not really
delete them but just disables them, we also have new events for
disable/enable.
The old EVENT_SHUTDOWN was in the scheduler class of events. The new
EVENT_CAP_* events are in the unconditional class, along with the
EVENT_CAPSET_* ones. Knowing when capabilities are created and deleted
is crucial to making sense of eventlogs; you always want those events.
In any case, they're extremely low volume.
|
| |
allocator.
Prompted by a benchmark posted to parallel-haskell@haskell.org by
Andreas Voellmy <andreas.voellmy@gmail.com>. This program exhibits
contention for the block allocator when run with -N2 and greater
without the fix:
{-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-}
module Main where
import Control.Monad
import Control.Concurrent
import System.Environment
import GHC.IO
import GHC.Exts
import GHC.Conc
main = do
  [m] <- fmap (fmap read) getArgs
  n <- getNumCapabilities
  ms <- replicateM n newEmptyMVar
  sequence [ forkIO $ busyWorkerB (m `quot` n) >> putMVar mv () | mv <- ms ]
  mapM takeMVar ms

busyWorkerB :: Int -> IO ()
busyWorkerB n_loops = go 0
  where
    go !n | n >= n_loops = return ()
          | otherwise =
              do p <- (IO $ \s ->
                        case newPinnedByteArray# 1024# s of
                          { (# s', mbarr# #) ->
                              (# s', () #)
                          }
                      )
                 go (n+1)
|
| |
This patch allows setNumCapabilities to /reduce/ the number of active
capabilities as well as increase it. This is particularly tricky to
do, because a Capability is a large data structure and ties into the
rest of the system in many ways. Trying to clean it all up would be
extremely error prone.
So instead, the solution is to mark the extra capabilities as
"disabled". This has the following consequences:
- threads on a disabled capability are migrated away by the
scheduler loop
- disabled capabilities do not participate in GC
(see scheduleDoGC())
- No spark threads are created on this capability
(see scheduleActivateSpark())
- We do not attempt to migrate threads *to* a disabled
capability (see schedulePushWork()).
So a disabled capability should do no work, and does not participate
in GC, although it remains alive in other respects. For example, a
blocked thread might wake up on a disabled capability, and it will get
quickly migrated to a live capability. A disabled capability can
still initiate GC if necessary. Indeed, it turns out to be hard to
migrate bound threads, so we wait until the next GC to do this (see
comments for details).
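Schematically, the disabled state might be tracked like this (field and
variable names are assumptions, not the actual code):
```
typedef struct {
    int no;
    int disabled;     /* set on capabilities beyond the enabled count */
} Capability;

static int enabled_capabilities;   /* target set by setNumCapabilities() */

static void markDisabledCaps (Capability *caps[], int n_capabilities)
{
    for (int i = 0; i < n_capabilities; i++)
        caps[i]->disabled = (i >= enabled_capabilities);
}

/* The scheduler then migrates threads away whenever cap->disabled is set,
 * scheduleDoGC() skips disabled capabilities, and schedulePushWork() never
 * pushes work to them. */
```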
|
| |
This is an experimental tweak to the parallel GC that avoids waking up
a Capability to do parallel GC if we know that the capability has been
idle for a (tunable) number of GC cycles. The idea is that if you're
only using a few Capabilities, there's no point waking up the ones
that aren't busy.
e.g. +RTS -qi3
says "A Capability will participate in parallel GC if it was running
at all since the last 3 GC cycles."
Results are a bit hit and miss, and I don't completely understand why
yet. Hence, for now it is turned off by default, and also not
documented except in the +RTS -? output.
|
| |
At present the number of capabilities can only be *increased*, not
decreased. The latter presents a few more challenges!
|
| |
Consider this experimental for the time being. There are a lot of
things that could go wrong, but I've verified that at least it works
on the test cases we have.
I also did some API cleanups while I was here. Previously we had:
Capability * rts_eval (Capability *cap, HaskellObj p, /*out*/HaskellObj *ret);
but this API is particularly error-prone: if you forget to discard the
Capability * you passed in and use the return value instead, then
you're in for subtle bugs with +RTS -N later on. So I changed all
these functions to this form:
void rts_eval (/* inout */ Capability **cap,
               /* in    */ HaskellObj p,
               /* out   */ HaskellObj *ret)
It's much harder to use this version incorrectly, because you have to
pass the Capability in by reference.
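A hedged usage sketch of the new style (assuming the usual RtsAPI
lock/unlock pattern; `buildClosure()` is a hypothetical stand-in for however
the argument closure is constructed):
```
/* Sketch of calling the new-style API; illustrative only. */
#include "Rts.h"
#include "RtsAPI.h"

extern HaskellObj buildClosure (Capability *cap);   /* hypothetical helper */

static void runIt (void)
{
    Capability *cap = rts_lock();
    HaskellObj ret;

    /* cap is passed by reference and may be updated in place, so there is
     * no stale Capability* left around to misuse afterwards. */
    rts_eval(&cap, buildClosure(cap), &ret);

    rts_checkSchedStatus("runIt", cap);
    rts_unlock(cap);
}
```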
|
| |
The parallel GC was using setContextSwitches() to stop all the other
threads, which sets the context_switch flag on every Capability. That
had the side effect of causing every Capability to also switch
threads, and since GCs can be much more frequent than context
switches, this increased the context switch frequency. When context
switches are expensive (because the switch is between two bound
threads or a bound and unbound thread), the difference is quite
noticeable.
The fix is to have a separate flag to indicate that a Capability
should stop and return to the scheduler, but not switch threads. I've
called this the "interrupt" flag.
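In outline (a sketch with assumed field and function names):
```
/* Two per-capability flags with different consequences when the running
 * thread next returns to the scheduler.  Sketch only. */
typedef struct {
    volatile int context_switch;   /* stop AND pick a different thread     */
    volatile int interrupt;        /* stop, but may resume the same thread */
} Capability;

static void contextSwitchCapability (Capability *cap)
{
    cap->context_switch = 1;
    cap->interrupt = 1;            /* a context switch implies an interrupt */
}

static void interruptCapability (Capability *cap)
{
    cap->interrupt = 1;            /* e.g. what the parallel GC now uses */
}
```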
|
| |
Returns a pointer to the current cost-centre stack when profiling,
NULL otherwise.
|
| |
This means that both time and heap profiling work for parallel
programs. Main internal changes:
- CCCS is no longer a global variable; it is now another
pseudo-register in the StgRegTable struct. Thus every
Capability has its own CCCS.
- There is a new built-in CCS called "IDLE", which records ticks for
Capabilities in the idle state. If you profile a single-threaded
program with +RTS -N2, you'll see about 50% of time in "IDLE".
- There is appropriate locking in rts/Profiling.c to protect the
shared cost-centre-stack data structures.
This patch does enough to get it working; I have cut one big corner:
the cost-centre-stack data structure is still shared amongst all
Capabilities, which means that multiple Capabilities will race when
updating the "allocations" and "entries" fields of a CCS. Not only
does this give unpredictable results, but it runs very slowly due to
cache line bouncing.
It is strongly recommended that you use -fno-prof-count-entries to
disable the "entries" count when profiling parallel programs. (I shall
add a note to this effect to the docs).
|
| |
See comment for details. I've tried quite hard, but haven't been able
to make a small test case that reproduces the bug.
|
| |
being woken already has its wakeup flag set, don't bother signalling
its condition variable again.
|
| |
Replaces the existing EVENT_RUN/STEAL_SPARK events with 7 new events
covering all stages of the spark lifecycle:
create, dud, overflow, run, steal, fizzle, gc
The sampled spark events are still available. There are now two event
classes for sparks, the sampled and the fully accurate. They can be
enabled/disabled independently. By default +RTS -l includes the sampled
but not full detail spark events. Use +RTS -lf-p to enable the detailed
'f' and disable the sampled 'p' spark events.
Includes work by Mikolaj <mikolaj.konarski@gmail.com>
|
| |
A new eventlog event containing 7 spark counters/statistics: sparks
created, dud, overflowed, converted, GC'd, fizzled and remaining.
These are maintained and logged separately for each capability.
We log them at startup, on each GC (minor and major) and on shutdown.
|
| |
Rather than a separate phase of initSparkPools. It means all the spark
stuff for a capability is initialised at the same time, which then
becomes a good place to stick an initial spark trace event.
|
| |
The invariant is: created = converted + remaining + gcd + fizzled
Since sparks move between capabilities, we have to aggregate the
counters over all capabilities. This in turn means we can only check
the invariant at stable points where all but one capability is
stopped. We can do this at shutdown time and before and after a global
synchronised GC.
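Schematically, the check aggregates the per-capability counters and tests the
balance (a sketch with assumed names):
```
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t created, dud, overflowed, converted, gcd, fizzled, remaining;
} SparkCounters;

/* Only valid at stable points (shutdown, or around a globally synchronised
 * GC), because sparks migrate between capabilities.  Sketch only. */
static void checkSparkBalance (const SparkCounters caps[], int n_caps)
{
    SparkCounters t = {0};
    for (int i = 0; i < n_caps; i++) {
        t.created   += caps[i].created;
        t.converted += caps[i].converted;
        t.remaining += caps[i].remaining;
        t.gcd       += caps[i].gcd;
        t.fizzled   += caps[i].fizzled;
    }
    assert(t.created == t.converted + t.remaining + t.gcd + t.fizzled);
}
```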
|