| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here the following changes are introduced:
- A read barrier machine op is added to Cmm.
- The order in which a closure's fields are read and written is changed.
- Memory barriers are added to RTS code to ensure correctness on
out-or-order machines with weak memory ordering.
Cmm has a new CallishMachOp called MO_ReadBarrier. On weak memory machines, this
is lowered to an instruction that ensures memory reads that occur after said
instruction in program order are not performed before reads coming before said
instruction in program order. On machines with strong memory ordering properties
(e.g. X86, SPARC in TSO mode) no such instruction is necessary, so
MO_ReadBarrier is simply erased. However, such an instruction is necessary on
weakly ordered machines, e.g. ARM and PowerPC.
Weam memory ordering has consequences for how closures are observed and mutated.
For example, consider a closure that needs to be updated to an indirection. In
order for the indirection to be safe for concurrent observers to enter, said
observers must read the indirection's info table before they read the
indirectee. Furthermore, the entering observer makes assumptions about the
closure based on its info table contents, e.g. an INFO_TYPE of IND imples the
closure has an indirectee pointer that is safe to follow.
When a closure is updated with an indirection, both its info table and its
indirectee must be written. With weak memory ordering, these two writes can be
arbitrarily reordered, and perhaps even interleaved with other threads' reads
and writes (in the absence of memory barrier instructions). Consider this
example of a bad reordering:
- An updater writes to a closure's info table (INFO_TYPE is now IND).
- A concurrent observer branches upon reading the closure's INFO_TYPE as IND.
- A concurrent observer reads the closure's indirectee and enters it. (!!!)
- An updater writes the closure's indirectee.
Here the update to the indirectee comes too late and the concurrent observer has
jumped off into the abyss. Speculative execution can also cause us issues,
consider:
- An observer is about to case on a value in closure's info table.
- The observer speculatively reads one or more of closure's fields.
- An updater writes to closure's info table.
- The observer takes a branch based on the new info table value, but with the
old closure fields!
- The updater writes to the closure's other fields, but its too late.
Because of these effects, reads and writes to a closure's info table must be
ordered carefully with respect to reads and writes to the closure's other
fields, and memory barriers must be placed to ensure that reads and writes occur
in program order. Specifically, updates to a closure must follow the following
pattern:
- Update the closure's (non-info table) fields.
- Write barrier.
- Update the closure's info table.
Observing a closure's fields must follow the following pattern:
- Read the closure's info pointer.
- Read barrier.
- Read the closure's (non-info table) fields.
This patch updates RTS code to obey this pattern. This should fix long-standing
SMP bugs on ARM (specifically newer aarch64 microarchitectures supporting
out-of-order execution) and PowerPC. This fixes issue #15449.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Reviewers: austin, erikd, simonmar
Reviewed By: simonmar
Subscribers: pacak, rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D4068
|
|
|
|
| |
We can't tell whether the CAF is actually garbage or not.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `nat` type was an alias for `unsigned int` with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it but mark it as deprecated.
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The current OS memory allocator conflates the concepts of allocating
address space and allocating memory, which makes the HEAP_ALLOCED()
implementation excessively complicated (as the only thing it cares
about is address space layout) and slow. Instead, what we want
is to allocate a single insanely large contiguous block of address
space (to make HEAP_ALLOCED() checks fast), and then commit subportions
of that in 1MB blocks as we did before.
This is currently behind a flag, USE_LARGE_ADDRESS_SPACE, that is only enabled for
certain OSes.
Test Plan: validate
Reviewers: simonmar, ezyang, austin
Subscribers: thomie, carter
Differential Revision: https://phabricator.haskell.org/D524
GHC Trac Issues: #9706
|
|
|
|
|
|
| |
Also remove 't' and 's' from ALL_WAYS; they don't exist.
Differential Revision: https://phabricator.haskell.org/D1055
|
|
|
|
| |
This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
|
|
|
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
| |
This will hopefully help ensure some basic consistency in the forward by
overriding buffer variables. In particular, it sets the wrap length, the
offset to 4, and turns off tabs.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Replaces the existing EVENT_RUN/STEAL_SPARK events with 7 new events
covering all stages of the spark lifcycle:
create, dud, overflow, run, steal, fizzle, gc
The sampled spark events are still available. There are now two event
classes for sparks, the sampled and the fully accurate. They can be
enabled/disabled independently. By default +RTS -l includes the sampled
but not full detail spark events. Use +RTS -lf-p to enable the detailed
'f' and disable the sampled 'p' spark.
Includes work by Mikolaj <mikolaj.konarski@gmail.com>
|
|
|
|
|
|
| |
Rather than a separate phase of initSparkPools. It means all the spark
stuff for a capability is initialisaed at the same time, which is then
becomes a good place to stick an initial spark trace event.
|
|
|
|
|
|
|
|
|
|
|
| |
When you use `par` to make a spark, if the spark pool on the current
capability is full then the spark is discarded. This represents a
loss of potential parallelism and it also means there are simply a
lot of sparks around. Both are things that might be of concern to a
programmer when tuning a parallel program that uses par.
The "+RTS -s" stats command now reports overflowed sparks, e.g.
SPARKS: 100001 (15521 converted, 84480 overflowed, 0 dud, 0 GC'd, 0 fizzled)
|
| |
|
|
|
|
|
| |
We want to count fizzled sparks accurately. Now tryStealSpark returns
fizzled sparks, and the callers now update the fizzled spark count.
|
|
|
|
|
|
|
|
| |
newSpark() checks if the spark is a dud, and if so does not add it to
the spark pool. Previously, newSpark would discard the pointer tag bits
and just check closure_SHOULD_SPARK(p). We can take advantage of the
tag bits which can tell us if the pointer points to a value. If it is,
it's a dud spark and we don't need to add it to the spark pool.
|
| |
|
|
|
|
|
|
|
| |
This was a bug in 6.12.3. I think the problem no longer occurs due to
the way sparks are treated as weak pointers, but it doesn't hurt to
test for tagged pointers anyway: better to do the test than have a
subtle invariant.
|
| |
|
| |
|
| |
|
|
|
|
|
| |
The new strategies library (parallel-2.0+, preferably 2.2+) is now
required for parallel programming, otherwise parallelism will be lost.
|
|
|
|
| |
base_GHCziConc_runSparks_closure directly
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Defines a DTrace provider, called 'HaskellEvent', that provides a probe
for every event of the eventlog framework.
- In contrast to the original eventlog, the DTrace probes are available in
all flavours of the runtime system (DTrace probes have virtually no
overhead if not enabled); when -DTRACING is defined both the regular
event log as well as DTrace probes can be used.
- Currently, Mac OS X only. User-space DTrace probes are implemented
differently on Mac OS X than in the original DTrace implementation.
Nevertheless, it shouldn't be too hard to enable these probes on other
platforms, too.
- Documentation is at http://hackage.haskell.org/trac/ghc/wiki/DTrace
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- tracing facilities are now enabled with -DTRACING, and -DDEBUG
additionally enables debug-tracing. -DEVENTLOG has been
removed.
- -debug now implies -eventlog
- events can be printed to stderr instead of being sent to the
binary .eventlog file by adding +RTS -v (which is implied by the
+RTS -Dx options).
- -Dx debug messages can be sent to the binary .eventlog file
by adding +RTS -l. This should help debugging by reducing
the impact of debug tracing on execution time.
- Various debug messages that duplicated the information in events
have been removed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The first phase of this tidyup is focussed on the header files, and in
particular making sure we are exposinng publicly exactly what we need
to, and no more.
- Rts.h now includes everything that the RTS exposes publicly,
rather than a random subset of it.
- Most of the public header files have moved into subdirectories, and
many of them have been renamed. But clients should not need to
include any of the other headers directly, just #include the main
public headers: Rts.h, HsFFI.h, RtsAPI.h.
- All the headers needed for via-C compilation have moved into the
stg subdirectory, which is self-contained. Most of the headers for
the rest of the RTS APIs have moved into the rts subdirectory.
- I left MachDeps.h where it is, because it is so widely used in
Haskell code.
- I left a deprecated stub for RtsFlags.h in place. The flag
structures are now exposed by Rts.h.
- Various internal APIs are no longer exposed by public header files.
- Various bits of dead code and declarations have been removed
- More gcc warnings are turned on, and the RTS code is more
warning-clean.
- More source files #include "PosixSource.h", and hence only use
standard POSIX (1003.1c-1995) interfaces.
There is a lot more tidying up still to do, this is just the first
pass. I also intend to standardise the names for external RTS APIs
(e.g use the rts_ prefix consistently), and declare the internal APIs
as hidden for shared libraries.
|
| |
|
|
|
|
| |
Also some tidyups and renaming
|
|
|
|
| |
instead of directly from StgRegTable.
|
| |
|
| |
|
|
|
|
| |
So we can use this abstraction elsewhere in the RTS
|
|
|
|
|
| |
Fixes crashes when using reclaimSpark() (not used currently, but may
be in the future).
|
| |
|
| |
|
|
|
|
|
| |
Signficantly reduces the overhead for par, which means that we can
make use of paralellism at a much finer granularity.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Change the way we look for work in the scheduler. Previously,
checking to see whether there was anything to do was a
non-side-effecting operation, but this has changed now that we do
work-stealing. This lead to a refactoring of the inner loop of the
scheduler.
Also, lots of cleanup in the new work-stealing code, but no functional
changes.
One new statistic is added to the +RTS -s output:
SPARKS: 1430 (2 converted, 1427 pruned)
lets you know something about the use of `par` in the program.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Spark stealing support for PARALLEL_HASKELL and THREADED_RTS versions of the RTS.
Spark pools are per capability, separately allocated and held in the Capability
structure. The implementation uses Double-Ended Queues (deque) and cas-protected
access.
The write end of the queue (position bottom) can only be used with
mutual exclusion, i.e. by exactly one caller at a time.
Multiple readers can steal()/findSpark() from the read end
(position top), and are synchronised without a lock, based on a cas
of the top position. One reader wins, the others return NULL for a
failure.
Work stealing is called when Capabilities find no other work (inside yieldCapability),
and tries all capabilities 0..n-1 twice, unless a theft succeeds.
Inside schedulePushWork, all considered cap.s (those which were idle and could
be grabbed) are woken up. Future versions should wake up capabilities immediately when
putting a new spark in the local pool, from newSpark().
Patch has been re-recorded due to conflicting bugfixes in the sparks.c, also fixing a
(strange) conflict in the scheduler.
|
|
|
|
| |
Fixes crash when using compacting GC in parallel programs
|
|
|
|
|
| |
Unless sparks are roots, strategies don't work at all: all the sparks
get GC'd. We need to think about this some more.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
- GCAux.c contains code not compiled with the gct register enabled,
it is callable from outside the GC
- marking functions are moved to their relevant subsystems, outside
the GC
- mark_root needs to save the gct register, as it is called from
outside the GC
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch implements pointer tagging as per our ICFP'07 paper "Faster
laziness using dynamic pointer tagging". It improves performance by
10-15% for most workloads, including GHC itself.
The original patches were by Alexey Rodriguez Yakushev
<mrchebas@gmail.com>, with additions and improvements by me. I've
re-recorded the development as a single patch.
The basic idea is this: we use the low 2 bits of a pointer to a heap
object (3 bits on a 64-bit architecture) to encode some information
about the object pointed to. For a constructor, we encode the "tag"
of the constructor (e.g. True vs. False), for a function closure its
arity. This enables some decisions to be made without dereferencing
the pointer, which speeds up some common operations. In particular it
enables us to avoid costly indirect jumps in many cases.
More information in the commentary:
http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/HaskellExecution/PointerTagging
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In preparation for parallel GC, split up the monolithic GC.c file into
smaller parts. Also in this patch (and difficult to separate,
unfortunatley):
- Don't include Stable.h in Rts.h, instead just include it where
necessary.
- consistently use STATIC_INLINE in source files, and INLINE_HEADER
in header files. STATIC_INLINE is now turned off when DEBUG is on,
to make debugging easier.
- The GC no longer takes the get_roots function as an argument.
We weren't making use of this generalisation.
|