| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
| |
As noted in #17606, Docker disallows the get_mempolicy syscall by
default. This caused numerous tests to fail under CI in the `debug_numa`
way. Avoid this by disabling the NUMA probing logic when --debug-numa is
in use, instead setting n_numa_nodes in RtsFlags.c.
Fixes #17606.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This extends the non-moving collector to allow concurrent collection.
The full design of the collector implemented here is described in detail
in a technical note
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
This extension involves the introduction of a capability-local
remembered set, known as the /update remembered set/, which tracks
objects which may no longer be visible to the collector due to mutation.
To maintain this remembered set we introduce a write barrier on
mutations which is enabled while a concurrent mark is underway.
The update remembered set representation is similar to that of the
nonmoving mark queue, being a chunked array of `MarkEntry`s. Each
`Capability` maintains a single accumulator chunk, which it flushed
when it (a) is filled, or (b) when the nonmoving collector enters its
post-mark synchronization phase.
While the write barrier touches a significant amount of code it is
conceptually straightforward: the mutator must ensure that the referee
of any pointer it overwrites is added to the update remembered set.
However, there are a few details:
* In the case of objects with a dirty flag (e.g. `MVar`s) we can
exploit the fact that only the *first* mutation requires a write
barrier.
* Weak references, as usual, complicate things. In particular, we must
ensure that the referee of a weak object is marked if dereferenced by
the mutator. For this we (unfortunately) must introduce a read
barrier, as described in Note [Concurrent read barrier on deRefWeak#]
(in `NonMovingMark.c`).
* Stable names are also a bit tricky as described in Note [Sweeping
stable names in the concurrent collector] (`NonMovingSweep.c`).
We take quite some pains to ensure that the high thread count often seen
in parallel Haskell applications doesn't affect pause times. To this end
we allow thread stacks to be marked either by the thread itself (when it
is executed or stack-underflows) or the concurrent mark thread (if the
thread owning the stack is never scheduled). There is a non-trivial
handshake to ensure that this happens without racing which is described
in Note [StgStack dirtiness flags and concurrent marking].
Co-Authored-by: Ömer Sinan Ağacan <omer@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This implements the core heap structure and a serial mark/sweep
collector which can be used to manage the oldest-generation heap.
This is the first step towards a concurrent mark-and-sweep collector
aimed at low-latency applications.
The full design of the collector implemented here is described in detail
in a technical note
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
The basic heap structure used in this design is heavily inspired by
K. Ueno & A. Ohori. "A fully concurrent garbage collector for
functional programs on multicore processors." /ACM SIGPLAN Notices/
Vol. 51. No. 9 (presented by ICFP 2016)
This design is intended to allow both marking and sweeping
concurrent to execution of a multi-core mutator. Unlike the Ueno design,
which requires no global synchronization pauses, the collector
introduced here requires a stop-the-world pause at the beginning and end
of the mark phase.
To avoid heap fragmentation, the allocator consists of a number of
fixed-size /sub-allocators/. Each of these sub-allocators allocators into
its own set of /segments/, themselves allocated from the block
allocator. Each segment is broken into a set of fixed-size allocation
blocks (which back allocations) in addition to a bitmap (used to track
the liveness of blocks) and some additional metadata (used also used
to track liveness).
This heap structure enables collection via mark-and-sweep, which can be
performed concurrently via a snapshot-at-the-beginning scheme (although
concurrent collection is not implemented in this patch).
The mark queue is a fairly straightforward chunked-array structure.
The representation is a bit more verbose than a typical mark queue to
accomodate a combination of two features:
* a mark FIFO, which improves the locality of marking, reducing one of
the major overheads seen in mark/sweep allocators (see [1] for
details)
* the selector optimization and indirection shortcutting, which
requires that we track where we found each reference to an object
in case we need to update the reference at a later point (e.g. when
we find that it is an indirection). See Note [Origin references in
the nonmoving collector] (in `NonMovingMark.h`) for details.
Beyond this the mark/sweep is fairly run-of-the-mill.
[1] R. Garner, S.M. Blackburn, D. Frampton. "Effective Prefetch for
Mark-Sweep Garbage Collection." ISMM 2007.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
| |
|
|
|
|
|
|
|
|
| |
These are unexploded minds as far as the linter is concerned. I don't
want to hit in my MRs by mistake!
I did this with `sed`, and then rolled back some changes in the docs,
config.guess, and the linter itself.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This feature has some very serious correctness issues (#14310),
introduces a great deal of complexity, and hasn't seen wide usage.
Consequently we are removing it, as proposed in Proposal #77 [1]. This
is heavily based on a patch from fryguybob.
Updates stm submodule.
[1] https://github.com/ghc-proposals/ghc-proposals/pull/77
Test Plan: Validate
Reviewers: erikd, simonmar, hvr
Reviewed By: simonmar
Subscribers: rwbarton, thomie, carter
GHC Trac Issues: #14310
Differential Revision: https://phabricator.haskell.org/D4760
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Validate
Reviewers: erikd, simonmar
Reviewed By: simonmar
Subscribers: rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4374
|
| |
|
|
|
|
| |
Our new CPP linter enforces this.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Validate on lots of platforms
Reviewers: erikd, simonmar, austin
Reviewed By: erikd, simonmar
Subscribers: michalt, thomie
Differential Revision: https://phabricator.haskell.org/D2699
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
This is a fast, non-blocking, asynchronous, interface to tryPutMVar that
can be called from C/C++.
It's useful for callback-based C/C++ APIs: the idea is that the callback
invokes hs_try_putmvar(), and the Haskell code waits for the callback to
run by blocking in takeMVar.
The callback doesn't block - this is often a requirement of
callback-based APIs. The callback wakes up the Haskell thread with
minimal overhead and no unnecessary context-switches.
There are a couple of benchmarks in
testsuite/tests/concurrent/should_run. Some example results comparing
hs_try_putmvar() with using a standard foreign export:
./hs_try_putmvar003 1 64 16 100 +RTS -s -N4 0.49s
./hs_try_putmvar003 2 64 16 100 +RTS -s -N4 2.30s
hs_try_putmvar() is 4x faster for this workload (see the source for
hs_try_putmvar003.hs for details of the workload).
An alternative solution is to use the IO Manager for this. We've tried
it, but there are problems with that approach:
* Need to create a new file descriptor for each callback
* The IO Manger thread(s) become a bottleneck
* More potential for things to go wrong, e.g. throwing an exception in
an IO Manager callback kills the IO Manager thread.
Test Plan: validate; new unit tests
Reviewers: niteria, erikd, ezyang, bgamari, austin, hvr
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2501
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
ASSERT_THREADED_CAPABILITY_INVARIANTS was testing properties of the
returning_tasks queue, but that requires cap->lock to access safely.
This assertion would randomly fail if stressed enough.
Instead I've removed it from the catch-all
ASSERT_PARTIAL_CAPABILITIY_INVARIANTS and made it a separate assertion
only called under cap->lock.
Test Plan:
```
cd testsuite/tests/concurrent/should_run
make TEST=setnumcapabilities001 WAY=threaded1 EXTRA_HC_OPTS=-with-rtsopts=-DS CLEANUP=0
while true; do ./setnumcapabilities001.run/setnumcapabilities001 4 9 2000 || break; done
```
Reviewers: niteria, bgamari, ezyang, austin, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2440
GHC Trac Issues: #10860
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Knowing the length of the run queue in O(1) time is useful: for example
we don't have to traverse the run queue to know how many threads we have
to migrate in schedulePushWork().
Test Plan: validate
Reviewers: ezyang, erikd, bgamari, austin
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2437
|
|
|
|
|
| |
- Move the numaMap and nNumaNodes out of RtsFlags to Capability.c
- Add a test to tests/rts
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.
Linux provides a NUMA API for doing two things:
* Allocating memory local to a particular node
* Binding a thread to a particular node
When given the +RTS --numa flag, the runtime will
* Determine the number of NUMA nodes (N) by querying the OS
* Assign capabilities to nodes, so cap C is on node C%N
* Bind worker threads on a capability to the correct node
* Keep a separate free lists in the block layer for each node
* Allocate the nursery for a capability from node-local memory
* Allocate blocks in the GC from node-local memory
For example, using nofib/parallel/queens on a 24-core 2-socket machine:
```
$ ./Main 15 +RTS -N24 -s -A64m
Total time 173.960s ( 7.467s elapsed)
$ ./Main 15 +RTS -N24 -s -A64m --numa
Total time 150.836s ( 6.423s elapsed)
```
The biggest win here is expected to be allocating from node-local
memory, so that means programs using a large -A value (as here).
According to perf, on this program the number of remote memory accesses
were reduced by more than 50% by using `--numa`.
Test Plan:
* validate
* There's a new flag --debug-numa=<n> that pretends to do NUMA without
actually making the OS calls, which is useful for testing the code
on non-NUMA systems.
* TODO: I need to add some unit tests
Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2199
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `nat` type was an alias for `unsigned int` with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it but mark it as deprecated.
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166
|
|
|
|
| |
This causes errors with some versions of gcc (4.4.7 here).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This allows the GC to use fewer threads than the number of capabilities.
At each GC, we choose some of the capabilities to be "idle", which means
that the thread running on that capability (if any) will sleep for the
duration of the GC, and the other threads will do its work. We choose
capabilities that are already idle (if any) to be the idle capabilities.
The idea is that this helps in the following situation:
* We want to use a large -N value so as to make use of hyperthreaded
cores
* We use a large heap size, so GC is infrequent
* But we don't want to use all -N threads in the GC, because that
thrashes the memory too much.
See docs for usage.
|
|
|
|
|
|
|
|
| |
This allows an OS thread to specify which capability it should run on
when it makes a call into Haskell. It is intended for a fairly
specialised use case, when the client wants to have tighter control over
the mapping between OS threads and Capabilities - perhaps 1:1
correspondence, for example.
|
|
|
|
|
|
|
|
|
|
| |
Noticed by uselex.rb:
last_free_capability: [R]: exported from:
./rts/dist/build/Capability.o
shutdownCapability: [R]: exported from:
./rts/dist/build/Capability.o
Signed-off-by: Sergei Trofimovich <siarheit@google.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
yieldCapability() was not prepared to be called by a Task that is not
either a worker or a bound Task. This could happen if we ended up in
yieldCapability via this call stack:
performGC()
scheduleDoGC()
requestSync()
yieldCapability()
and there were a few other ways this could happen via requestSync.
The fix is to handle this case in yieldCapability(): when the Task is
not a worker or a bound Task, we put it on the returning_workers
queue, where it will be woken up again.
Summary of changes:
* `yieldCapability`: factored out subroutine waitForWorkerCapability`
* `waitForReturnCapability` renamed to `waitForCapability`, and
factored out subroutine `waitForReturnCapability`
* `releaseCapabilityAndQueue` worker renamed to `enqueueWorker`, does
not take a lock and no longer tests if `!isBoundTask()`
* `yieldCapability` adjusted for refactorings, only change in behavior
is when it is not a worker or bound task.
Test Plan:
* new test concurrent/should_run/performGC
* validate
Reviewers: niteria, austin, ezyang, bgamari
Subscribers: thomie, bgamari
Differential Revision: https://phabricator.haskell.org/D997
GHC Trac Issues: #10545
|
|
|
|
|
|
|
|
| |
This reverts commit f0fcc41d755876a1b02d1c7c79f57515059f6417.
New changes: now works on 32-bit platforms too. I added some basic
support for 64-bit subtraction and comparison operations to the x86
NCG.
|
|
|
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
| |
This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
This reverts commit 4748f5936fe72d96edfa17b153dbfd84f2c4c053. The fix for #9423
was reverted because this commit introduced a C function setIOManagerControlFd()
(defined in Schedule.c) defined for all OS types, while the prototype
(in includes/rts/IOManager.h) was only included when mingw32_HOST_OS is
not defined. This broke Windows builds.
This commit reverts the original commit and resolves the problem by only defining
setIOManagerControlFd() when mingw32_HOST_OS is defined. Hence the missing prototype
error should not occur on Windows.
In addition, since the io_manager_control_wr_fd field of the Capability struct is only
usd by the setIOManagerControlFd, this commit includes the io_manager_control_wr_fd
field in the Capability struct only when mingw32_HOST_OS is not defined.
Test Plan: Try to compile successfully on all platforms.
Reviewers: austin
Reviewed By: austin
Subscribers: simonmar, ezyang, carter
Differential Revision: https://phabricator.haskell.org/D174
|
|
|
|
| |
This reverts commit 7bf49f86a20f3beda0ee5fbea2db64cfef730d74.
|
|
|
|
| |
This reverts commit 15df6d98afb8c3813013c5b97efffe0ba8020d32.
|
|
|
|
|
|
|
|
|
| |
This should fix the Windows fallout, and hopefully this will be fixed
once that's sorted out.
This reverts commit f9f89b7884ccc8ee5047cf4fffdf2b36df6832df.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Fix #9423.
The problem in #9423 is caused when code invoked by `hs_exit()` waits
on all foreign calls to return, but some IO managers are in `safe` foreign
calls and do not return. The previous design signaled to the timer manager
(via its control pipe) that it should "die" and when the timer manager
returned to Haskell-land, the Haskell code in timer manager then signalled
to the IO manager threads that they should return from foreign calls and
`die`. Unfortunately, in the shutdown sequence the timer manager is unable
to return to Haskell-land fast enough and so the code that signals to the
IO manager threads (via their control pipes) is never executed and the IO
manager threads remain out in the foreign calls.
This patch solves this problem by having the RTS signal to all the IO
manager threads (via their control pipes; and in addition to signalling
to the timer manager thread) that they should shutdown (in `ioManagerDie()`
in `rts/Signals.c`. To do this, we arrange for each IO manager thread to
register its control pipe with the RTS (in `GHC.Thread.startIOManagerThread`).
In addition, `GHC.Thread.startTimerManagerThread` registers its control pipe.
These are registered via C functions `setTimerManagerControlFd` (in
`rts/Signals.c`) and `setIOManagerControlFd` (in `rts/Capability.c`). The IO
manager control pipe file descriptors are stored in a new field of the
`Capability_ struct`.
Test Plan: See the notes on #9423 to recreate the problem and to verify that it no longer occurs with the fix.
Auditors: simonmar
Reviewers: simonmar, edsko, ezyang, austin
Reviewed By: austin
Subscribers: phaskell, simonmar, ezyang, carter, relrod
Differential Revision: https://phabricator.haskell.org/D129
GHC Trac Issues: #9423, #9284
|
|
|
|
|
|
|
|
| |
This will hopefully help ensure some basic consistency in the forward by
overriding buffer variables. In particular, it sets the wrap length, the
offset to 4, and turns off tabs.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
| |
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
UNREG mode has quite nasty invariant to maintain:
capabilities[0] == &MainCapability
and it's a non-heap memory, while other
capabilities are dynamically allocated.
Issue #8748
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We have various problems with reallocating the array of Capabilities,
due to threads in waitForReturnCapability that are already holding a
pointer to a Capability.
Rather than add more locking to make this safer, I decided it would be
easier to ensure that we never move the Capabilities at all. The
capabilities array is now an array of pointers to Capabaility. There
are extra indirections, but it rarely matters - we don't often access
Capabilities via the array, normally we already have a pointer to
one. I ran the parallel benchmarks and didn't see any difference.
|
|
|
|
|
|
|
|
|
| |
This adds some new functions: peekRunQueue, promoteInRunQueue,
singletonRunQueue and truncateRunQueue which help abstract away
manual linked list manipulation, making it easier to swap in
a new queue implementation.
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
|
| |
|
|
|
|
| |
A companion ghc-events pachakge commit displays task ids in the same format.
|
| |
|
|
|
|
|
|
| |
If we are interrupted to do a GC, then we do not immediately do another
one. This avoids a starvation situation where one Capability keeps
forcing a GC and the other Capabilities make no progress at all.
|
|
|
|
|
|
| |
Mostly this meant getting pointer<->int conversions to use the right
sizes. lnat is now size_t, rather than unsigned long, as that seems a
better match for how it's used.
|
|
|
|
|
|
| |
Will let us do final per-cap trace events from stat_exit().
Otherwise we would end up with eventlogs with events for caps
that have already been deleted.
|
|
|
|
|
|
|
| |
In addition to the existing global method. For now we just do
it both ways and assert they give the same grand total. At some
stage we can simplify the global method to just take the sum of
the per-cap counters.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that we can adjust the number of capabilities on the fly, we need
this reflected in the eventlog. Previously the eventlog had a single
startup event that declared a static number of capabilities. Obviously
that's no good anymore.
For compatability we're keeping the EVENT_STARTUP but adding new
EVENT_CAP_CREATE/DELETE. The EVENT_CAP_DELETE is actually just the old
EVENT_SHUTDOWN but renamed and extended (using the existing mechanism
to extend eventlog events in a compatible way). So we now emit both
EVENT_STARTUP and EVENT_CAP_CREATE. One day we will drop EVENT_STARTUP.
Since reducing the number of capabilities at runtime does not really
delete them, it just disables them, then we also have new events for
disable/enable.
The old EVENT_SHUTDOWN was in the scheduler class of events. The new
EVENT_CAP_* events are in the unconditional class, along with the
EVENT_CAPSET_* ones. Knowing when capabilities are created and deleted
is crucial to making sense of eventlogs, you always want those events.
In any case, they're extremely low volume.
|