| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When rts is forked it doesn't update toplevel handler, so UserInterrupt
exception is sent to Thread1 that doesn't exist in forked process.
We install toplevel handler when fork so signal will be delivered to the
new main thread.
Fixes #12903
Reviewers: simonmar, austin, erikd, bgamari
Reviewed By: bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2770
GHC Trac Issues: #12903
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Validate on lots of platforms
Reviewers: erikd, simonmar, austin
Reviewed By: erikd, simonmar
Subscribers: michalt, thomie
Differential Revision: https://phabricator.haskell.org/D2699
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The problem boils down to global variables: in particular gc_threads[],
which was being modified by a subsequent GC before the previous GC had
finished with it. The fix is to not use global variables.
This was causing setnumcapabilities001 to fail (again!). It's an old
bug though.
Test Plan:
Ran setnumcapabilities001 in a loop for a couple of hours. Before this
patch it had been failing after a few minutes. Not a very scientific
test, but it's the best I have.
Reviewers: bgamari, austin, fryguybob, niteria, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2654
|
|
|
|
|
|
| |
It's entirely reasonable to set +RTS -qn without setting -N, because the
program might later call setNumCapabilities. If we disallow it, there's
no way to use -qn on programs that use setNumCapabilities.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The value of enabled_capabilities can change across a call to
requestSync(), and we were erroneously using an old value, causing
things to go wrong later. It manifested as an assertion failure, I'm
not sure whether there are worse consequences or not, but we should
get this fix into 8.0.2 anyway.
The failure didn't happen for me because it only shows up on machines
with fewer than 4 processors, due to the new logic to enable -qn
automatically. I've bumped the test parameter 8 to make it more
likely to exercise that code.
Test Plan: Ran setnumcapabilities001 many times
Reviewers: niteria, austin, erikd, rwbarton, bgamari
Reviewed By: bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2617
GHC Trac Issues: #12728
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Setting a -N value that is too large has a dramatic negative effect on
performance, but the new -qn flag can mitigate the worst of the effects
by limiting the number of GC threads.
So now, if you don't explcitly set +RTS -qn, and you set -N larger than
the number of cores (or use setNumCapabilities to do the same), we'll
default -qn to the number of cores.
These are the results from nofib/parallel on my 4-core (2 cores x 2
threads) i7 laptop, comparing -N8 before and after this change.
```
------------------------------------------------------------------------
Program Size Allocs Runtime Elapsed TotalMem
------------------------------------------------------------------------
blackscholes +0.0% +0.0% -72.5% -72.0% +9.5%
coins +0.0% -0.0% -73.7% -72.2% -0.8%
mandel +0.0% +0.0% -76.4% -75.4% +3.3%
matmult +0.0% +15.5% -26.8% -33.4% +1.0%
nbody +0.0% +2.4% +0.7% 0.076 0.0%
parfib +0.0% -8.5% -33.2% -31.5% +2.0%
partree +0.0% -0.0% -60.4% -56.8% +5.7%
prsa +0.0% -0.0% -65.4% -60.4% 0.0%
queens +0.0% +0.2% -58.8% -58.8% -1.5%
ray +0.0% -1.5% -88.7% -85.6% -3.6%
sumeuler +0.0% -0.0% -47.8% -46.9% 0.0%
------------------------------------------------------------------------
Min +0.0% -8.5% -88.7% -85.6% -3.6%
Max +0.0% +15.5% +0.7% -31.5% +9.5%
Geometric Mean +0.0% +0.6% -61.4% -63.1% +1.4%
```
Test Plan: validate, nofib/parallel benchmarks
Reviewers: niteria, ezyang, nh2, austin, erikd, trofi, bgamari
Reviewed By: trofi, bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2580
GHC Trac Issues: #9221
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
This is a fast, non-blocking, asynchronous, interface to tryPutMVar that
can be called from C/C++.
It's useful for callback-based C/C++ APIs: the idea is that the callback
invokes hs_try_putmvar(), and the Haskell code waits for the callback to
run by blocking in takeMVar.
The callback doesn't block - this is often a requirement of
callback-based APIs. The callback wakes up the Haskell thread with
minimal overhead and no unnecessary context-switches.
There are a couple of benchmarks in
testsuite/tests/concurrent/should_run. Some example results comparing
hs_try_putmvar() with using a standard foreign export:
./hs_try_putmvar003 1 64 16 100 +RTS -s -N4 0.49s
./hs_try_putmvar003 2 64 16 100 +RTS -s -N4 2.30s
hs_try_putmvar() is 4x faster for this workload (see the source for
hs_try_putmvar003.hs for details of the workload).
An alternative solution is to use the IO Manager for this. We've tried
it, but there are problems with that approach:
* Need to create a new file descriptor for each callback
* The IO Manger thread(s) become a bottleneck
* More potential for things to go wrong, e.g. throwing an exception in
an IO Manager callback kills the IO Manager thread.
Test Plan: validate; new unit tests
Reviewers: niteria, erikd, ezyang, bgamari, austin, hvr
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2501
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
This is surprisingly tricky. There were linked list bugs in the
previous version (D2430) that showed up as a test failure in
setnumcapabilities001 (that's a great stress test!).
This new version uses a different strategy that doesn't suffer from
the problem that @ezyang pointed out in D2430. We now pre-calculate
how many threads to keep for this capability, and then migrate any
surplus threads off the front of the queue, taking care to account for
threads that can't be migrated.
Test Plan:
1. setnumcapabilities001 stress test with sanity checking (+RTS -DS) turned on:
```
cd testsuite/tests/concurrent/should_run
make TEST=setnumcapabilities001 WAY=threaded1 EXTRA_HC_OPTS=-with-rtsopts=-DS CLEANUP=0
while true; do ./setnumcapabilities001.run/setnumcapabilities001 4 9 2000 || break; done
```
2. The test case from #12419
Reviewers: niteria, ezyang, rwbarton, austin, bgamari, erikd
Subscribers: thomie, ezyang
Differential Revision: https://phabricator.haskell.org/D2441
GHC Trac Issues: #12419
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
If we had 2 threads on the run queue, say [A,B], and B is bound to the
current Task, then we would fail to migrate any threads. This fixes it
so that we would migrate A in that case.
This will help parallelism a bit in programs that have lots of bound
threads.
Test Plan:
Test program in #12419, which is actually not a great program but it
does behave a bit better after this change.
Reviewers: ezyang, niteria, bgamari, austin, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2430
GHC Trac Issues: #12419
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Knowing the length of the run queue in O(1) time is useful: for example
we don't have to traverse the run queue to know how many threads we have
to migrate in schedulePushWork().
Test Plan: validate
Reviewers: ezyang, erikd, bgamari, austin
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2437
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
@simonmar told me that it makes more sense this way.
Test Plan: it still builds
Reviewers: bgamari, austin, simonmar, erikd
Reviewed By: simonmar, erikd
Subscribers: thomie, simonmar
Differential Revision: https://phabricator.haskell.org/D2428
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.
Linux provides a NUMA API for doing two things:
* Allocating memory local to a particular node
* Binding a thread to a particular node
When given the +RTS --numa flag, the runtime will
* Determine the number of NUMA nodes (N) by querying the OS
* Assign capabilities to nodes, so cap C is on node C%N
* Bind worker threads on a capability to the correct node
* Keep a separate free lists in the block layer for each node
* Allocate the nursery for a capability from node-local memory
* Allocate blocks in the GC from node-local memory
For example, using nofib/parallel/queens on a 24-core 2-socket machine:
```
$ ./Main 15 +RTS -N24 -s -A64m
Total time 173.960s ( 7.467s elapsed)
$ ./Main 15 +RTS -N24 -s -A64m --numa
Total time 150.836s ( 6.423s elapsed)
```
The biggest win here is expected to be allocating from node-local
memory, so that means programs using a large -A value (as here).
According to perf, on this program the number of remote memory accesses
were reduced by more than 50% by using `--numa`.
Test Plan:
* validate
* There's a new flag --debug-numa=<n> that pretends to do NUMA without
actually making the OS calls, which is useful for testing the code
on non-NUMA systems.
* TODO: I need to add some unit tests
Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2199
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In addition to more const-correctness fixes this patch fixes an
infelicity of the previous const-correctness patch (995cf0f356) which
left `UNTAG_CLOSURE` taking a `const StgClosure` pointer parameter
but returning a non-const pointer. Here we restore the original type
signature of `UNTAG_CLOSURE` and add a new function
`UNTAG_CONST_CLOSURE` which takes and returns a const `StgClosure`
pointer and uses that wherever possible.
Test Plan: Validate on Linux, OS X and Windows
Reviewers: Phyx, hsyl20, bgamari, austin, simonmar, trofi
Reviewed By: simonmar, trofi
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2231
|
|
|
|
|
|
|
|
|
|
|
|
| |
The assertion failure was fairly benign, I think, but this fixes it.
I've been running the test repeatedly for the last 30 mins and it hasn't
triggered.
There are other problems exposed by this test (see #12038), but I've
worked around those in the test itself for now.
I also copied the relevant bits of the parallel library here so that we
don't need parallel for the test to run.
|
|
|
|
|
|
|
|
| |
It was possible for a thread to read invalid memory after a conflict
when multiple threads were synchronising.
I haven't been successful in constructing a test case that triggers
this, but we have some internal code that ran into it.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `nat` type was an alias for `unsigned int` with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it but mark it as deprecated.
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This function had some pathalogically bad behaviour: if we had 2 threads
on the current capability and 23 other idle capabilities, we would
* grab all 23 capabilities
* migrate one Haskell thread to one of them
* wake up a worker on *all* 23 other capabilities.
This lead to a lot of unnecessary wakeups when using large -N values.
Now, we
* Count how many capabilities we need to wake up
* Start from cap->no+1, so that we don't overload low-numbered capabilities
* Only wake up capabilities that we migrated a thread to (unless we have
sparks to steal)
This results in a pretty dramatic improvement in our production system.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This allows the GC to use fewer threads than the number of capabilities.
At each GC, we choose some of the capabilities to be "idle", which means
that the thread running on that capability (if any) will sleep for the
duration of the GC, and the other threads will do its work. We choose
capabilities that are already idle (if any) to be the idle capabilities.
The idea is that this helps in the following situation:
* We want to use a large -N value so as to make use of hyperthreaded
cores
* We use a large heap size, so GC is infrequent
* But we don't want to use all -N threads in the GC, because that
thrashes the memory too much.
See docs for usage.
|
| |
|
|
|
|
|
|
|
|
| |
Noticed by uselex.rb:
removeFromRunQueue: [R]: exported from:
./rts/dist/build/Schedule.o
Signed-off-by: Sergei Trofimovich <siarheit@google.com>
|
|
|
|
|
|
|
|
|
|
|
| |
On AIX, C system headers can redirect the token `stat` via
#define stat stat64
to provide large-file support. Simply avoiding the use of `stat` as an
identifier eschews macro-replacement.
Differential Revision: https://phabricator.haskell.org/D1566
|
|
|
|
| |
Differential Revision: https://phabricator.haskell.org/D1348
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
When using nursery chunks, if we failed a heap check due to
large_alloc_lim, we would pointlessly keep grabbing new nursery
chunks when we should just GC immediately.
Test Plan: validate
Reviewers: austin, bgamari, niteria
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1072
|
|
|
|
|
|
| |
Also remove 't' and 's' from ALL_WAYS; they don't exist.
Differential Revision: https://phabricator.haskell.org/D1055
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
yieldCapability() was not prepared to be called by a Task that is not
either a worker or a bound Task. This could happen if we ended up in
yieldCapability via this call stack:
performGC()
scheduleDoGC()
requestSync()
yieldCapability()
and there were a few other ways this could happen via requestSync.
The fix is to handle this case in yieldCapability(): when the Task is
not a worker or a bound Task, we put it on the returning_workers
queue, where it will be woken up again.
Summary of changes:
* `yieldCapability`: factored out subroutine waitForWorkerCapability`
* `waitForReturnCapability` renamed to `waitForCapability`, and
factored out subroutine `waitForReturnCapability`
* `releaseCapabilityAndQueue` worker renamed to `enqueueWorker`, does
not take a lock and no longer tests if `!isBoundTask()`
* `yieldCapability` adjusted for refactorings, only change in behavior
is when it is not a worker or bound task.
Test Plan:
* new test concurrent/should_run/performGC
* validate
Reviewers: niteria, austin, ezyang, bgamari
Subscribers: thomie, bgamari
Differential Revision: https://phabricator.haskell.org/D997
GHC Trac Issues: #10545
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
There's a race condition like this:
# A foreign pointer gets promoted to the last generation
# It has its finalizer called manually
# We start shutting down the runtime in `hs_exit_` from the main
thread
# A minor GC starts running (`scheduleDoGC`) on one of the threads
# The minor GC notices that we're in `SCHED_INTERRUPTING` state and
advances to `SCHED_SHUTTING_DOWN`
# The main thread tries to do major GC (with `scheduleDoGC`), but it
exits early because we're in `SCHED_SHUTTING_DOWN` state
# We end up with a `DEAD_WEAK` left on the list of weak pointers of
the last generation, because it relied on major GC removing it from
that list
This change:
* Ignores DEAD_WEAK finalizers when shutting down
* Makes the major GC on shutdown more likely
* Fixes a bogus assert
Test Plan:
before this diff https://ghc.haskell.org/trac/ghc/ticket/7170#comment:5
reproduced and after it doesn't
Reviewers: ezyang, austin, simonmar
Reviewed By: simonmar
Subscribers: bgamari, thomie
Differential Revision: https://phabricator.haskell.org/D921
GHC Trac Issues: #7170
|
|
|
|
|
|
|
|
|
|
| |
#10043)
Reviewers: austin, simonmar
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D657
|
|
|
|
| |
See the documentation for details.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
clearNursery resets all the bd->free pointers of nursery blocks to
make the blocks empty. In profiles we've seen clearNursery taking
significant amounts of time particularly with large -N and -A values.
This patch moves the work of clearNursery to the point at which we
actually need the new block, thereby introducing an invariant that
blocks to the right of the CurrentNursery pointer still need their
bd->free pointer reset. This should make things faster overall,
because we don't need to clear blocks that we don't use.
Test Plan: validate
Reviewers: AndreasVoellmy, ezyang, austin
Subscribers: thomie, carter, ezyang, simonmar
Differential Revision: https://phabricator.haskell.org/D318
|
|
|
|
| |
I think this went wrong in 2a6f193b
|
| |
|
|
|
|
|
|
|
|
| |
This reverts commit f0fcc41d755876a1b02d1c7c79f57515059f6417.
New changes: now works on 32-bit platforms too. I added some basic
support for 64-bit subtraction and comparison operations to the x86
NCG.
|
|
|
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
| |
This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: validate
Reviewers: simonmar, austin
Reviewed By: simonmar, austin
Subscribers: phaskell, simonmar, relrod, carter
Differential Revision: https://phabricator.haskell.org/D99
GHC Trac Issues: #9377
|
|
|
|
|
|
|
|
| |
This will hopefully help ensure some basic consistency in the forward by
overriding buffer variables. In particular, it sets the wrap length, the
offset to 4, and turns off tabs.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary: (for the same reason that we acquire all the other mutexes)
Test Plan: validate
Reviewers: simonmar, austin, duncan
Reviewed By: simonmar, austin, duncan
Subscribers: simonmar, relrod, carter
Differential Revision: https://phabricator.haskell.org/D60
|
|
|
|
|
|
|
|
| |
Problems were found on 32-bit platforms, I'll commit again when I have a fix.
This reverts the following commits:
54b31f744848da872c7c6366dea840748e01b5cf
b0534f78a73f972e279eed4447a5687bd6a8308e
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This tracks the amount of memory allocation by each thread in a
counter stored in the TSO. Optionally, when the counter drops below
zero (it counts down), the thread can be sent an asynchronous
exception: AllocationLimitExceeded. When this happens, given a small
additional limit so that it can handle the exception. See
documentation in GHC.Conc for more details.
Allocation limits are similar to timeouts, but
- timeouts use real time, not CPU time. Allocation limits do not
count anything while the thread is blocked or in foreign code.
- timeouts don't re-trigger if the thread catches the exception,
allocation limits do.
- timeouts can catch non-allocating loops, if you use
-fno-omit-yields. This doesn't work for allocation limits.
I couldn't measure any impact on benchmarks with these changes, even
for nofib/smp.
|
| |
|
|
|
|
| |
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
|
| |
|
|
|
|
| |
This reverts commit d85044f6b201eae0a9e453b89c0433608e0778f0.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When servicing a stack overflows, only throw an exception to the given
thread if the user explicitly set a max stack size, using +RTS -K.
Otherwise just service it normally and grow the stack.
In case we actually run out of *heap* (stack chuncks are allocated on
the heap), then we need to bail by calling the stackOverflow() hook and
exit immediately.
Authored-by: Ben Gamari <bgamari.foss@gmail.com>
Signed-off-by: Austin Seipp <aseipp@pobox.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We have various problems with reallocating the array of Capabilities,
due to threads in waitForReturnCapability that are already holding a
pointer to a Capability.
Rather than add more locking to make this safer, I decided it would be
easier to ensure that we never move the Capabilities at all. The
capabilities array is now an array of pointers to Capabaility. There
are extra indirections, but it rarely matters - we don't often access
Capabilities via the array, normally we already have a pointer to
one. I ran the parallel benchmarks and didn't see any difference.
|
|
|
|
|
|
|
|
|
| |
We add the invariant to the MVar blocked threads queue that
threads blocked on an atomic read are always at the front of
the queue. This invariant is easy to maintain, since takers
are only ever added to the end of the queue.
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
|
| |
|
|
|
|
|
|
|
| |
Again, the range of gc_type is actually 1-3, which is technically
outside the range of rtsBool.
Signed-off-by: Austin Seipp <aseipp@pobox.com>
|