| Commit message | Author | Age | Files | Lines |
|
This adds some new functions: peekRunQueue, promoteInRunQueue,
singletonRunQueue and truncateRunQueue, which help abstract away
manual linked-list manipulation, making it easier to swap in
a new queue implementation.
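As a hedged illustration (a simplified model, not the actual GHC
Schedule.h code), two of these helpers might look like:

    /* Simplified model: the TSO and queue fields here are stand-ins
       for the real RTS structures. */
    #include <stddef.h>

    typedef struct TSO_ {
        struct TSO_ *link;            /* next thread in the run queue */
    } TSO;

    typedef struct {
        TSO *hd, *tl;                 /* head and tail of the queue */
    } RunQueue;

    /* Look at the front thread without dequeueing it. */
    static TSO *peekRunQueue(RunQueue *q) { return q->hd; }

    /* True iff exactly one thread is queued. */
    static int singletonRunQueue(RunQueue *q) {
        return q->hd != NULL && q->hd->link == NULL;
    }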
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
|
This improves GC performance when there are a lot of TVars in the
heap. For instance, a TChan with a lot of elements causes a massive
GC drag without this patch.
There's more to do - several other STM closure types don't have write
barriers, so GC performance when there are a lot of threads blocked on
STM isn't great. But fixing the problem for TVar is a good start.
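For context, here is a minimal sketch of the write-barrier idea, using
made-up names rather than the RTS's real TVar and mutable-list
machinery: the first write to a clean object records it on a mutable
list, so a minor GC scans only the recorded objects instead of every
TVar in the heap.

    #include <stddef.h>

    typedef struct TVarModel_ {
        void *value;
        int   dirty;                   /* already on the mutable list? */
        struct TVarModel_ *mut_link;   /* chain of recorded objects */
    } TVarModel;

    static TVarModel *mut_list = NULL; /* per-generation in a real RTS */

    static void writeTVarModel(TVarModel *tv, void *v) {
        tv->value = v;
        if (!tv->dirty) {              /* the barrier itself */
            tv->dirty = 1;
            tv->mut_link = mut_list;
            mut_list = tv;
        }
    }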
|
Improvements:
- we now turn off the timer signal in the non-threaded RTS after
idleGCDelay. This should make the xmonad users on #5991 happy.
- we now turn off the timer signal after idleGCDelay even if the
idle GC is disabled with +RTS -I0.
- we now do *not* turn off the timer when profiling.
- more comments to explain the meaning of the various ACTIVITY_*
values
|
lnat was originally "long unsigned int" but we were using it when we
wanted a 64-bit type on a 64-bit machine. This broke on Windows x64,
where long == int == 32 bits. Using types of unspecified size is bad,
but what we really wanted was a type with N bits on an N-bit machine.
StgWord is exactly that.
lnat was mentioned in some APIs that clients might be using
(e.g. StackOverflowHook()), so we leave it defined but with a comment
to say that it's deprecated.
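A minimal sketch of the result, assuming C99 types (the real headers
define StgWord differently per platform):

    #include <stdint.h>

    typedef uintptr_t StgWord;  /* N bits on an N-bit machine */
    typedef StgWord lnat;       /* DEPRECATED: kept only for client code
                                   such as StackOverflowHook() */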
|
The problem occurred when the idle GC was turned off with +RTS -I0.
Then the scheduler would go into the state ACTIVITY_DONE_GC directly
without doing a GC, and a subsequent GC would put it back to
ACTIVITY_YES but without turning the timer back on. Instead, if the GC
finds the state is ACTIVITY_DONE_GC, it should leave it there.
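A minimal sketch of the fix, using the state names above and an assumed
afterGC() hook:

    enum { ACTIVITY_YES, ACTIVITY_DONE_GC };
    static int recent_activity = ACTIVITY_YES;

    static void afterGC(void) {
        if (recent_activity != ACTIVITY_DONE_GC) {
            recent_activity = ACTIVITY_YES; /* timer stays on */
        }
        /* else: leave ACTIVITY_DONE_GC alone; the timer is already off */
    }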
|
Based on initial patches by Mikolaj Konarski <mikolaj@well-typed.com>
Use the new task tracing functions traceTaskCreate/Migrate/Delete.
There are two key places. One is for worker tasks, which have a
relatively simple life cycle: worker tasks are created and deleted by
the RTS. The other case is bound tasks, which are either created by the
RTS or appear as foreign C threads making calls into the RTS. For bound
threads we do the tracing in rts_lock/unlock, which covers both threads
coming in from outside and bound threads made by the RTS.
|
We do a final GC before shutting down the system, to clean up.
However, we were doing an ordinary GC rather than forcing a major GC,
so especially when the allocation area is large, this final GC could
be expensive. This is really just a bug - the final GC should have
virtually nothing to do, because there is nothing live.
|
If we are interrupted to do a GC, then we do not immediately do another
one. This avoids a starvation situation where one Capability keeps
forcing a GC and the other Capabilities make no progress at all.
|
There was a discrepancy between GC times reported in +RTS -s
and the timestamps of GC_START and GC_END events on the cap,
on which +RTS -s stats for the given GC are based.
This is fixed by posting the events with exactly the same timestamp
as generated for the stat calculation. The calls posting the events
are moved too, so that the events are emitted close to the time instant
they claim to be emitted at. The GC_STATS_GHC event was moved too,
ensuring it's emitted before the moved GC_END on all caps, which
simplifies the tools' code.
|
Now that we can adjust the number of capabilities on the fly, we need
this reflected in the eventlog. Previously the eventlog had a single
startup event that declared a static number of capabilities. Obviously
that's no good anymore.
For compatibility we're keeping the EVENT_STARTUP but adding new
EVENT_CAP_CREATE/DELETE. The EVENT_CAP_DELETE is actually just the old
EVENT_SHUTDOWN but renamed and extended (using the existing mechanism
to extend eventlog events in a compatible way). So we now emit both
EVENT_STARTUP and EVENT_CAP_CREATE. One day we will drop EVENT_STARTUP.
Since reducing the number of capabilities at runtime does not really
delete them, just disables them, we also have new events for
disable/enable.
The old EVENT_SHUTDOWN was in the scheduler class of events. The new
EVENT_CAP_* events are in the unconditional class, along with the
EVENT_CAPSET_* ones. Knowing when capabilities are created and deleted
is crucial to making sense of eventlogs; you always want those events.
In any case, they're extremely low volume.
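A hedged sketch of why the extension mechanism is compatible: every
event type declares its payload size up front in the eventlog header,
so an older reader can skip payload bytes it does not understand. The
record layout and names below are illustrative only:

    #include <stdint.h>
    #include <stdio.h>

    /* Write one event record: tag, timestamp, then a payload whose
       size the reader learned from the header's event-type table. */
    static void postEventModel(FILE *log, uint16_t tag, uint64_t timestamp,
                               const void *payload, uint16_t size) {
        fwrite(&tag, sizeof tag, 1, log);
        fwrite(&timestamp, sizeof timestamp, 1, log);
        if (size > 0)
            fwrite(payload, size, 1, log); /* fields appended here are
                                              skipped by old readers */
    }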
|
This patch allows setNumCapabilities to /reduce/ the number of active
capabilities as well as increase it. This is particularly tricky to
do, because a Capability is a large data structure and ties into the
rest of the system in many ways. Trying to clean it all up would be
extremely error prone.
So instead, the solution is to mark the extra capabilities as
"disabled". This has the following consequences:
  - threads on a disabled capability are migrated away by the
    scheduler loop
  - disabled capabilities do not participate in GC
    (see scheduleDoGC())
  - no spark threads are created on this capability
    (see scheduleActivateSpark())
  - we do not attempt to migrate threads *to* a disabled
    capability (see schedulePushWork()).
So a disabled capability should do no work, and does not participate
in GC, although it remains alive in other respects. For example, a
blocked thread might wake up on a disabled capability, and it will get
quickly migrated to a live capability. A disabled capability can
still initiate GC if necessary. Indeed, it turns out to be hard to
migrate bound threads, so we wait until the next GC to do this (see
comments for details).
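A minimal sketch of the check this implies, with assumed names (the
scheduleDoGC/scheduleActivateSpark/schedulePushWork mentioned above are
the real sites):

    typedef struct {
        unsigned no;       /* capability number */
        int disabled;      /* marked disabled by setNumCapabilities */
    } CapabilityModel;

    static unsigned enabled_capabilities;  /* target set at runtime */

    static int capDisabled(const CapabilityModel *cap) {
        return cap->disabled || cap->no >= enabled_capabilities;
    }

    /* e.g. the scheduler loop migrates threads away, GC skips the cap,
       and no sparks or pushed work land on it while capDisabled(cap). */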
|
This is an experimental tweak to the parallel GC that avoids waking up
a Capability to do parallel GC if we know that the capability has been
idle for a (tunable) number of GC cycles. The idea is that if you're
only using a few Capabilities, there's no point waking up the ones
that aren't busy.
e.g. +RTS -qi3
says "A Capability will participate in parallel GC if it was running
at all since the last 3 GC cycles."
Results are a bit hit and miss, and I don't completely understand why
yet. Hence, for now it is turned off by default, and also not
documented except in the +RTS -? output.
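A minimal sketch of the wake-up policy, with assumed names (+RTS -qi3
would set the threshold to 3):

    typedef struct {
        unsigned idle;     /* consecutive GC cycles spent idle */
    } CapGcModel;

    static unsigned gc_idle_threshold = 3;  /* e.g. +RTS -qi3 */

    /* Wake a capability for parallel GC only if it was running
       recently enough. */
    static int wakeForParGC(const CapGcModel *cap) {
        return cap->idle < gc_idle_threshold;
    }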
|
At present the number of capabilities can only be *increased*, not
decreased. The latter presents a few more challenges!
|
Consider this experimental for the time being. There are a lot of
things that could go wrong, but I've verified that at least it works
on the test cases we have.
I also did some API cleanups while I was here. Previously we had:
Capability * rts_eval (Capability *cap, HaskellObj p, /*out*/HaskellObj *ret);
but this API is particularly error-prone: if you forget to discard the
Capability * you passed in and use the return value instead, then
you're in for subtle bugs with +RTS -N later on. So I changed all
these functions to this form:
    void rts_eval (/* inout */ Capability **cap,
                   /* in    */ HaskellObj p,
                   /* out   */ HaskellObj *ret)
It's much harder to use this version incorrectly, because you have to
pass the Capability in by reference.
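A hedged usage sketch of the new style (the include path and the
"closure" argument are assumptions; rts_lock/rts_unlock are the
existing RtsAPI entry points):

    #include "RtsAPI.h"  /* assumed header providing rts_* and types */

    static void evalExample(HaskellObj closure)
    {
        Capability *cap = rts_lock();   /* acquire a Capability */
        HaskellObj ret;
        rts_eval(&cap, closure, &ret);  /* cap is updated in place... */
        rts_unlock(cap);                /* ...so we unlock the right one */
    }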
|
The parallel GC was using setContextSwitches() to stop all the other
threads, which sets the context_switch flag on every Capability. That
had the side effect of causing every Capability to also switch
threads, and since GCs can be much more frequent than context
switches, this increased the context switch frequency. When context
switches are expensive (because the switch is between two bound
threads or a bound and unbound thread), the difference is quite
noticeable.
The fix is to have a separate flag to indicate that a Capability
should stop and return to the scheduler, but not switch threads. I've
called this the "interrupt" flag.
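A simplified model of the two flags and how the scheduler might react
(field names assumed):

    typedef struct {
        int context_switch;  /* stop and pick a different thread */
        int interrupt;       /* stop, but keep the current thread */
    } CapFlagsModel;

    /* Returns nonzero if we should return to the scheduler without
       switching threads (e.g. to take part in a GC). */
    static int stopWithoutSwitch(CapFlagsModel *f) {
        if (f->interrupt && !f->context_switch) {
            f->interrupt = 0;
            return 1;
        }
        return 0;            /* context switch requested: pick another */
    }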
|
This means that both time and heap profiling work for parallel
programs. Main internal changes:
- CCCS is no longer a global variable; it is now another
pseudo-register in the StgRegTable struct. Thus every
Capability has its own CCCS.
- There is a new built-in CCS called "IDLE", which records ticks for
Capabilities in the idle state. If you profile a single-threaded
program with +RTS -N2, you'll see about 50% of time in "IDLE".
- There is appropriate locking in rts/Profiling.c to protect the
shared cost-centre-stack data structures.
This patch does enough to get it working; I have cut one big corner:
the cost-centre-stack data structure is still shared amongst all
Capabilities, which means that multiple Capabilities will race when
updating the "allocations" and "entries" fields of a CCS. Not only
does this give unpredictable results, but it runs very slowly due to
cache line bouncing.
It is strongly recommended that you use -fno-prof-count-entries to
disable the "entries" count when profiling parallel programs. (I shall
add a note to this effect to the docs).
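A minimal sketch of the data-structure change (the layout is
illustrative; the real StgRegTable has many more registers):

    typedef struct CostCentreStack_ CostCentreStack;

    typedef struct {
        CostCentreStack *rCCCS;  /* current cost-centre stack */
        /* ...the other STG pseudo-registers... */
    } StgRegTableModel;

    typedef struct {
        StgRegTableModel r;      /* so each Capability reads cap->r.rCCCS
                                    instead of a global CCCS */
    } CapabilityProfModel;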
|
Terminology cleanup: the type "Ticks" has been renamed "Time", which
is an StgWord64 in units of TIME_RESOLUTION (currently nanoseconds).
The terminology "tick" is now used consistently to mean the interval
between timer signals.
The ticker now always ticks in realtime (actually CLOCK_MONOTONIC if
we have it). Before it used CPU time in the non-threaded RTS and
realtime in the threaded RTS, but I've discovered that the CPU timer
has terrible resolution (at least on Linux) and isn't much use for
profiling. So now we always use realtime. This should also fix
The default tick interval is now 10ms, except when profiling where we
drop it to 1ms. This gives more accurate profiles without affecting
runtime too much (<1%).
Lots of cleanups - the resolution of Time is now in one place
only (Rts.h) rather than having calculations that depend on the
resolution scattered all over the RTS. I hope I found them all.
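A minimal sketch of the units, assuming the constants described above
(the helper name is illustrative):

    #include <stdint.h>

    typedef uint64_t Time;                 /* StgWord64 in the RTS */
    #define TIME_RESOLUTION 1000000000ULL  /* units per second: ns */

    #define MSToTime(ms) ((Time)(ms) * (TIME_RESOLUTION / 1000))

    /* e.g. default tick = MSToTime(10), or MSToTime(1) when profiling */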
|
discard all the sparks from each Capability, but we were forgetting to
account for the discarded sparks in the stats, leading to a failure of
the assertion that tests the spark invariant.
I've moved the discarding of sparks to just before the GC, to avoid
race conditions, and counted the discarded sparks as GC'd.
|
calling resurrectThreads() (fixes #5314).
This avoids a lot of problems, because resurrectThreads() may
overwrite some closures in the heap, leaving slop behind. The bug in
instances, this fix avoids them all in one go.
|
A new eventlog event containing 7 spark counters/statistics: sparks
created, dud, overflowed, converted, GC'd, fizzled and remaining.
These are maintained and logged separately for each capability.
We log them at startup, on each GC (minor and major) and on shutdown.
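A minimal sketch of the per-capability counters (field names assumed to
match the seven statistics listed):

    #include <stdint.h>

    typedef struct {
        uint64_t created, dud, overflowed,
                 converted, gcd, fizzled, remaining;
    } SparkCountersModel;

    /* one instance per capability; sampled into the eventlog at
       startup, after each GC, and at shutdown */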
|
Rather than a separate phase of initSparkPools. It means all the spark
stuff for a capability is initialised at the same time, which then
becomes a good place to stick an initial spark trace event.
|
The invariant is: created = converted + remaining + gcd + fizzled
Since sparks move between capabilities, we have to aggregate the
counters over all capabilities. This in turn means we can only check
the invariant at stable points where all but one of the capabilities are
stopped. We can do this at shutdown time and before and after a global
synchronised GC.
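A minimal sketch of the check at such a stable point, aggregating over
all capabilities (the struct and names are assumed):

    #include <assert.h>
    #include <stdint.h>

    typedef struct {
        uint64_t created, converted, remaining, gcd, fizzled;
    } SparkStatsModel;

    static void checkSparkInvariant(const SparkStatsModel *cap, int n) {
        uint64_t created = 0, accounted = 0;
        for (int i = 0; i < n; i++) {
            created   += cap[i].created;
            accounted += cap[i].converted + cap[i].remaining
                       + cap[i].gcd + cap[i].fizzled;
        }
        assert(created == accounted); /* only meaningful when all other
                                         capabilities are stopped */
    }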
|
We want to count fizzled sparks accurately. Now tryStealSpark returns
fizzled sparks, and its callers update the fizzled-spark count.
|
assembly version as part of the fix for #5250, we inadvertently lost
the Windows magic for extending the stack. Win32 requires that the
stack is extended a page at a time, otherwise you get a segfault. The
C compiler knows how to do this, so we now call a C stub to ensure
there's enough stack space at each invocation of the scheduler.
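A hedged sketch of the idea behind such a stub (the name and page size
are assumptions): touching the stack one page at a time lets Windows'
guard-page mechanism extend it.

    #include <stddef.h>

    static void ensureStackSpace(void)      /* hypothetical name */
    {
        volatile char buf[4 * 4096];        /* reserve a few pages */
        size_t i;
        for (i = 0; i < sizeof buf; i += 4096)
            buf[i] = 0;                     /* touch each page in turn */
    }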
|
Based on a patch from David Terei.
Some parts are a little ugly (e.g. defining things that only ASSERTs
use, and doing so only when DEBUG is defined), so we might want to
tweak things a little.
I've also turned off -Werror for didn't-inline warnings, as we now
get a few such warnings.
|
This is mostly for the benefit of having sensible places to put tracing
code later. We want a code path that has somewhere to trace (in order):
(1) starting up all capabilities;
(2) N * starting up an individual capability;
(3) N * shutting down an individual capability;
(4) shutting down all capabilities.
This has to work in both threaded and non-threaded modes.
Locations (1) and (2) are provided by initCapabilities and
initCapability respectively. Previously, there was no location for (4),
and while shutdownCapability should be usable for (3), it was only
called in the !THREADED_RTS case.
Now, shutdownCapability is called unconditionally (and the body is
conditional on THREADED_RTS) and there is a new shutdownCapabilities
that calls shutdownCapability in a loop.
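A minimal sketch of the new shape (modelled on the description above,
not the exact code):

    typedef struct CapabilityShutModel_ CapabilityShutModel;

    void shutdownCapabilityModel(CapabilityShutModel *cap); /* site (3) */

    extern CapabilityShutModel *caps_model[];
    extern unsigned n_caps_model;

    void shutdownCapabilitiesModel(void)                    /* site (4) */
    {
        for (unsigned i = 0; i < n_caps_model; i++)
            shutdownCapabilityModel(caps_model[i]);
    }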
|
Coutts."
This reverts commit 58532eb46041aec8d4cbb48b054cb5b001edb43c.
Turns out it didn't work on Windows, and it'll need some non-trivial
changes before it does. We'll get it in later once that's sorted out.
|
the other mutator threads (#5127)
|
This is a port of some of the changes from my private local-GC branch
(which is still in darcs, I haven't converted it to git yet). There
are a couple of small functional differences in the GC stats: first,
per-thread GC timings should now be more accurate, and secondly we now
report average and maximum pause times. e.g. from minimax +RTS -N8 -s:
                                       Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2755 colls,  2754 par     13.16s     0.93s     0.0003s    0.0150s
  Gen  1       769 colls,   769 par      3.71s     0.26s     0.0003s    0.0059s
|
This is an improvement from my GC branch, that helps performance for
intensive message-passing communication between Capabilities.
|
So we can now get these in ThreadScope:
19487000: cap 1: stopping thread 6 (blocked on black hole owned by thread 4)
Note: needs an update to ghc-events. Older ThreadScopes will just
ignore the new information.
|
threadStackOverflow (#4845)
|
This patch makes two changes to the way stacks are managed:
1. The stack is now stored in a separate object from the TSO.
This means that it is easier to replace the stack object for a thread
when the stack overflows or underflows; we don't have to leave behind
the old TSO as an indirection any more. Consequently, we can remove
ThreadRelocated and deRefTSO(), which were a pain.
This is obviously the right thing, but the last time I tried to do it
it made performance worse. This time I seem to have cracked it.
2. Stacks are now represented as a chain of chunks, rather than
a single monolithic object.
The big advantage here is that individual chunks are marked clean or
dirty according to whether they contain pointers to the young
generation, and the GC can avoid traversing clean stack chunks during
a young-generation collection. This means that programs with deep
stacks will see a big saving in GC overhead when using the default GC
settings.
A secondary advantage is that there is much less copying involved as
the stack grows. Programs that quickly grow a deep stack will see big
improvements.
In some ways the implementation is simpler, as nothing special needs
to be done to reclaim stack as the stack shrinks (the GC just recovers
the dead stack chunks). On the other hand, we have to manage stack
underflow between chunks, so there's a new stack frame
(UNDERFLOW_FRAME), and we now have separate TSO and STACK objects.
The total amount of code is probably about the same as before.
There are new RTS flags:
   -ki<size>  Sets the initial thread stack size (default 1k).
              E.g.: -ki4k, -ki2m
   -kc<size>  Sets the stack chunk size (default 32k)
   -kb<size>  Sets the stack chunk buffer size (default 1k)
-ki was previously called just -k, and the old name is still accepted
for backwards compatibility. These new options are documented.
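A simplified model of the separated STACK object and chunk chaining
(field names are illustrative, not GHC's exact layout):

    #include <stdint.h>

    typedef struct StackChunkModel_ {
        uint32_t  stack_size;           /* words in this chunk */
        uint32_t  dirty;                /* pointers into young gen? */
        uintptr_t *sp;                  /* current stack pointer */
        struct StackChunkModel_ *next;  /* older chunk, reached via an
                                           UNDERFLOW_FRAME at the base */
        uintptr_t stack[];              /* the chunk's stack words */
    } StackChunkModel;

    /* The TSO points at the newest chunk; a young-generation GC skips
       chunks whose dirty flag is clear, and underflow pops to the next
       chunk instead of copying one monolithic stack. */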
|
This is a temporary measure until we fix the bug properly (which is
somewhat tricky, and we think might be easier in the new code
generator).
For now we get:
   ghc-stage2: sorry! (unimplemented feature or known bug)
     (GHC version 7.1 for i386-unknown-linux):
        Trying to allocate more than 1040384 bytes.

   See: http://hackage.haskell.org/trac/ghc/ticket/4550
   Suggestion: read data from a file instead of having large static data
   structures in the code.
|