Significantly reduces the overhead for par, which means that we can
make use of parallelism at a much finer granularity.
|
Change the way we look for work in the scheduler. Previously,
checking to see whether there was anything to do was a
non-side-effecting operation, but this has changed now that we do
work-stealing. This led to a refactoring of the inner loop of the
scheduler.
Also, lots of cleanup in the new work-stealing code, but no functional
changes.
One new statistic is added to the +RTS -s output:
SPARKS: 1430 (2 converted, 1427 pruned)
lets you know something about the use of `par` in the program.
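As an illustration of that new shape, here is a minimal sketch in C; every name in it is a hypothetical stand-in, not the RTS's actual definition:

```c
#include <stddef.h>

typedef void *Spark;
typedef struct {
    Spark pending;              /* stand-in for the run queue and spark pool */
} Capability;

/* Looking for work may now steal a spark from another capability, so the
 * check itself has a side effect: whatever it removes must be handed back
 * to the caller, not rediscovered by a second check. */
static Spark findWork(Capability *cap)
{
    Spark s = cap->pending;
    cap->pending = NULL;
    return s;
}

static void scheduleLoop(Capability *cap)
{
    for (;;) {
        Spark s = findWork(cap);  /* keep the result of the check ...       */
        if (s == NULL)
            break;                /* ... rather than testing, then fetching */
        /* run the spark or thread here */
    }
}
```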
|
Spark stealing support for PARALLEL_HASKELL and THREADED_RTS versions of the RTS.
Spark pools are per capability, separately allocated and held in the Capability
structure. The implementation uses Double-Ended Queues (deque) and cas-protected
access.
The write end of the queue (position bottom) can only be used with
mutual exclusion, i.e. by exactly one caller at a time.
Multiple readers can steal()/findSpark() from the read end
(position top), and are synchronised without a lock, based on a cas
of the top position. One reader wins, the others return NULL for a
failure.
Work stealing is called when a Capability finds no other work (inside
yieldCapability); it tries all capabilities 0..n-1 twice, until a theft succeeds.
Inside schedulePushWork, all considered capabilities (those which were idle and could
be grabbed) are woken up. Future versions should wake up capabilities immediately when
putting a new spark in the local pool, from newSpark().
The patch has been re-recorded due to conflicting bugfixes in sparks.c, also
fixing a (strange) conflict in the scheduler.
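As a rough sketch of that cas-protected read end (layout and names are illustrative only; the real code also deals with wrap-around, sizing and memory ordering):

```c
#include <stddef.h>

typedef struct {
    void **elements;            /* ring buffer of sparks                    */
    unsigned long size;         /* capacity (a power of two in practice)    */
    volatile long top;          /* read end: thieves advance this with cas  */
    volatile long bottom;       /* write end: owner only, mutual exclusion  */
} SparkDeque;

/* Multiple thieves race on top; exactly one cas wins, the losers return
 * NULL for a failure, as described above. */
void *steal(SparkDeque *q)
{
    long t = q->top;
    long b = q->bottom;
    if (t >= b)
        return NULL;                          /* nothing to steal           */
    void *spark = q->elements[(unsigned long)t % q->size];
    if (__sync_bool_compare_and_swap(&q->top, t, t + 1))
        return spark;                         /* we won the race            */
    return NULL;                              /* another thief got it first */
}
```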
|
It's no longer needed, as base no longer #includes it
|
Fixes a long-standing bug that could in some cases cause sub-optimal
scheduling behaviour.
|
gcc has changed the meaning of "extern inline" when certain flags are
on (e.g. --std=gnu99), and this broke our use of it in the header
files.
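For reference, a hedged sketch of the semantic change (the workaround shown is only one option, not necessarily the one taken here):

```c
/* Under the traditional GNU89 rules, an "extern inline" definition in a
 * header is inline-only: no out-of-line symbol is emitted for it. */
extern inline int twice(int x) { return x + x; }

/* Under -std=gnu99 the same tokens mean the opposite: this is now the one
 * external definition of the function, so including the header from two
 * source files gives duplicate symbols at link time.  On newer gccs the
 * old behaviour can be pinned explicitly: */
extern inline __attribute__((gnu_inline)) int twice_gnu89(int x)
{
    return x + x;
}
```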
|
Check that all threads marked as dirty are really on the mutable list.
|
The macros were duplicating their arguments, which was normally
harmless, but in the parallel GC was actually wrong and caused
spurious assertion failures.
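The failure mode, in a schematic example (these are not the actual RTS macros):

```c
/* The argument is expanded twice, so any side effect, or any read of a
 * field that another GC thread may be mutating, happens twice: */
#define MAX_BAD(a, b)   ((a) > (b) ? (a) : (b))

/* Single-threaded that is merely wasteful; in the parallel GC the two
 * reads of a shared word can disagree, which is what produced the
 * spurious assertion failures.  Evaluating each argument exactly once
 * avoids the problem: */
static inline unsigned long max_once(unsigned long a, unsigned long b)
{
    return a > b ? a : b;
}
```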
|
This means S_ISSOCK gets defined on Linux
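The kind of use this enables, as a hedged sketch (which feature-test macro the build actually sets is not stated here; _GNU_SOURCE is just an illustration):

```c
#define _GNU_SOURCE          /* illustrative: exposes S_ISSOCK under strict -std modes */
#include <sys/stat.h>

int fd_is_socket(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return 0;
    return S_ISSOCK(st.st_mode);
}
```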
|
2301: Control-C now causes the new exception (AsyncException
UserInterrupt) to be raised in the main thread. The signal handler
is set up by GHC.TopHandler.runMainIO, and can be overridden in the
usual way by installing a new signal handler. The advantage is that
now all programs will get a chance to clean up on ^C.
When UserInterrupt is caught by the topmost handler, we now exit the
program via kill(getpid(),SIGINT), which tells the parent process that
we exited as a result of ^C, so the parent can take appropriate action
(it might want to exit too, for example).
One subtlety is that we have to use a weak reference to the ThreadId
for the main thread, so that the signal handler doesn't prevent the
main thread from being subject to deadlock detection.
1619: we now ignore SIGPIPE by default. Although POSIX says that a
SIGPIPE should terminate the process by default, I wonder if this
decision was made because many C applications failed to check the exit
code from write(). In Haskell a failed write due to a closed pipe
will generate an exception anyway, so the main difference is that we
now get a useful error message instead of silent program termination.
See #1619 for more discussion.
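A C-level sketch of the two mechanisms (the RTS's actual shutdown path is more involved; this only shows the system-call convention being relied on):

```c
#include <signal.h>
#include <unistd.h>

/* #1619: with SIGPIPE ignored, a write() to a closed pipe returns -1 with
 * errno == EPIPE, which surfaces as a Haskell exception instead of silent
 * program termination. */
static void ignore_sigpipe(void)
{
    signal(SIGPIPE, SIG_IGN);
}

/* #2301: once the topmost handler has caught UserInterrupt and cleaned up,
 * exit by re-raising SIGINT with the default disposition; the parent's
 * waitpid() then reports death-by-SIGINT and the parent can take
 * appropriate action. */
static void exit_interrupted(void)
{
    struct sigaction dfl;
    dfl.sa_handler = SIG_DFL;
    sigemptyset(&dfl.sa_mask);
    dfl.sa_flags = 0;
    sigaction(SIGINT, &dfl, NULL);
    kill(getpid(), SIGINT);
}
```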
|
available for linking
|
gcc 4.3 emits warnings for static inline functions that its heuristics
decide not to inline. The workaround is to either mark appropriate
functions as "hot" (a new attribute in gcc 4.3), or sometimes to use
"extern inline" instead.
With this fix I can validate with gcc 4.3 on Fedora 9.
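One way the workaround looks in a header, sketched with a made-up helper:

```c
/* gcc 4.3 warns when it declines to inline a static inline function;
 * declaring it "hot" biases the heuristics towards inlining (the other
 * option mentioned above is to switch to "extern inline"). */
static inline __attribute__((hot)) unsigned long
round_up(unsigned long x, unsigned long align)   /* hypothetical helper */
{
    return (x + align - 1) & ~(align - 1);
}
```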
|
Sometimes better than the default copying, enabled by +RTS -w
|
Instead of keeping a single list of all threads, keep one per step
and only look at the threads belonging to steps that we are
collecting.
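Schematically (types and field names are stand-ins, not the RTS's exact ones), each step now carries its own thread list and a collection only walks the steps it is collecting:

```c
#include <stddef.h>

typedef struct Thread_ {
    struct Thread_ *link;       /* next thread on the same step's list      */
    /* ... */
} Thread;

typedef struct {
    Thread *threads;            /* threads whose objects live in this step  */
    /* ... other per-step state ... */
} Step;

static void visit_threads(Step *steps, int n_collected)
{
    for (int s = 0; s < n_collected; s++)
        for (Thread *t = steps[s].threads; t != NULL; t = t->link) {
            /* only threads belonging to the collected steps are examined */
        }
}
```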
|
- GCAux.c contains code not compiled with the gct register enabled; it is
  callable from outside the GC
- marking functions are moved to their relevant subsystems, outside
the GC
- mark_root needs to save the gct register, as it is called from
outside the GC
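A sketch of the last point, with the surrounding declarations reduced to stand-ins:

```c
typedef struct gc_thread_ gc_thread;        /* opaque here                  */
typedef struct StgClosure_ StgClosure;      /* opaque here                  */

extern __thread gc_thread *gct;             /* stand-in for the register    */
void evacuate(StgClosure **p);              /* provided by the GC proper    */

/* mark_root can be entered from outside the GC, where gct may hold
 * something else (or nothing), so it saves and restores the register
 * around the call into the GC. */
void mark_root(void *user, StgClosure **root)
{
    gc_thread *saved_gct = gct;
    gct = (gc_thread *)user;
    evacuate(root);
    gct = saved_gct;
}
```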
|
- count and report number of parallel collections
- calculate bytes scanned in addition to bytes copied per thread
- calculate "work balance factor"
- tidy up the formatting a bit
|
This means we can calculate slop easily, and also improve
predictability of GC.
|
DEBUG imposes a significant performance hit in the GC, yet we often
want some of the debugging output, so -vg gives us the cheap trace
messages without the sanity checking of DEBUG, just like -vs for the
scheduler.
|
When a stack is occupying less than 1/4 of the memory it owns, and is
larger than a megablock, we release half of it. Shrinking is O(1); it
doesn't need to copy the stack.
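The policy in schematic form (the thresholds are as stated above; field names and the release step itself are illustrative):

```c
#define MEGABLOCK_SIZE   (1024 * 1024)      /* illustrative figure          */

typedef struct {
    unsigned long stack_size;      /* bytes of memory the stack owns        */
    unsigned long stack_in_use;    /* bytes currently occupied              */
    /* ... */
} StackDesc;

static void maybe_shrink(StackDesc *s)
{
    if (s->stack_size > MEGABLOCK_SIZE &&
        s->stack_in_use < s->stack_size / 4) {
        /* hand the unused upper half back to the block allocator; only
         * this bookkeeping changes and no stack words are copied, so the
         * operation is O(1) */
        s->stack_size /= 2;
    }
}
```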
|
in addition to checking for leaks
|
- major (multithreaded) GC is measured separately from minor GC
- events to measure can now be specified on the command line, e.g.
  prog +RTS -a+PAPI_TOT_CYC
|
e.g. use +RTS -g2 -RTS for 2 threads. Only major GCs are parallelised;
minor GCs are still sequential. Don't use more threads than you
have CPUs.
It works most of the time, although you won't see much speedup yet.
Tuning and more work on stability still required.
|
This patch localises the state of the GC into a gc_thread structure,
and reorganises the inner loop of the GC to scavenge one block at a
time from global work lists in each "step". The gc_thread structure
has a "workspace" for each step, in which it collects evacuated
objects until it has a full block to push out to the step's global
list. Details of the algorithm will be on the wiki in due course.
At the moment, THREADED_RTS does not compile, but the single-threaded
GC works (and is 10-20% slower than before).
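In outline (the names echo the commit's terminology but the layout is schematic, not the actual definitions):

```c
typedef struct bdescr_ bdescr;      /* block descriptor, opaque here        */

typedef struct {
    bdescr *todo_bd;        /* block this thread is currently filling with  */
                            /* objects evacuated into one step              */
    bdescr *done;           /* full blocks, ready to be pushed onto that    */
                            /* step's global work list                      */
} step_workspace;

typedef struct {
    int thread_id;
    step_workspace *steps;  /* one workspace per step                       */
    /* ... per-thread scan state and statistics ... */
} gc_thread;
```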