This patch also fixes ullong_format_string (renamed to showStgWord64)
so that it works with values outside the 32-bit range (trac #3979), and
simplifies the without-commas case.
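
For illustration, a minimal standalone sketch of the with-commas case
(names and layout are illustrative, not the RTS code itself):

    #include <inttypes.h>
    #include <stdio.h>

    /* Format a 64-bit value with comma separators.  buf needs room
     * for 20 digits, 6 commas and a NUL (27 bytes). */
    static char *show_word64_commas(uint64_t x, char *buf)
    {
        char tmp[24];
        int n = snprintf(tmp, sizeof(tmp), "%" PRIu64, x);
        int out = n + (n - 1) / 3;            /* digits plus commas */
        buf[out] = '\0';
        for (int i = n - 1, group = 0; i >= 0; i--) {
            buf[--out] = tmp[i];
            if (++group == 3 && i > 0) { buf[--out] = ','; group = 0; }
        }
        return buf;
    }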
These are no longer used: once upon a time they had different
layout from IND and IND_PERM respectively, but that is no longer the
case since we changed the remembered set to be an array of addresses
instead of a linked list of closures.
The list of threads blocked on an MVar is now represented as a list of
separately allocated objects rather than being linked through the TSOs
themselves. This lets us remove a TSO from the list in O(1) time
rather than O(n) time, by marking the list object. Removing this
linear component fixes some pathological performance cases where many
threads were blocked on an MVar and became unreachable simultaneously
(nofib/smp/threads007), or when sending an asynchronous exception to a
TSO in a long list of threads blocked on an MVar.
MVar performance has actually improved by a few percent as a result of
this change, slightly to my surprise.
This is the final cleanup in the sequence, which let me remove the old
way of waking up threads (unblockOne(), MSG_WAKEUP) in favour of the
new way (tryWakeupThread and MSG_TRY_WAKEUP, which is idempotent). It
is now the case that only the Capability that owns a TSO may modify
its state (well, almost), and this simplifies various things. More of
the RTS is based on message-passing between Capabilities now.
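
A toy sketch of the list representation (field names are illustrative,
not the exact RTS definitions): each blocked thread gets its own
heap-allocated cell, so removal marks the cell rather than walking the
list.

    typedef struct MVarQueueCell_ {
        struct MVarQueueCell_ *link;   /* next cell in the MVar's queue */
        struct TSO_           *tso;    /* the blocked thread, or NULL */
    } MVarQueueCell;

    /* O(1) removal: mark the cell dead instead of unlinking it; dead
     * cells are dropped the next time the queue is walked. */
    static void markCellRemoved(MVarQueueCell *cell)
    {
        cell->tso = NULL;   /* assumption: NULL marks a dead cell */
    }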
This fixes #3838, and was made possible by the new BLACKHOLE
infrastructure. To allow reordering of the run queue I had to make it
doubly-linked, which entails some extra trickiness with regard to
GC write barriers and suchlike.
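
Roughly what the doubly-linked queue buys (a toy model, with
illustrative field names), removal in O(1):

    typedef struct TSO_ {
        struct TSO_ *link, *prev;      /* forward and back pointers */
    } TSO;

    typedef struct {
        TSO *run_queue_hd, *run_queue_tl;
    } Cap;

    static void removeFromRunQueue(Cap *cap, TSO *tso)
    {
        if (tso->prev) tso->prev->link = tso->link;
        else           cap->run_queue_hd = tso->link;
        if (tso->link) tso->link->prev = tso->prev;
        else           cap->run_queue_tl = tso->prev;
        /* in the real RTS, each pointer write into an old-generation
         * TSO must be paired with a GC write barrier: that is the
         * trickiness mentioned above */
    }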
This replaces the global blackhole_queue with a clever scheme that
enables us to queue up blocked threads on the closure that they are
blocked on, while still avoiding atomic instructions in the common
case.
Advantages:
- gets rid of a locked global data structure and some tricky GC code
(replacing it with some per-thread data structures and different
tricky GC code :)
- wakeups are more prompt: parallel/concurrent performance should
benefit. I haven't seen anything dramatic in the parallel
benchmarks so far, but a couple of threading benchmarks do improve
a bit.
- waking up a thread blocked on a blackhole is now O(1) (e.g. if
it is the target of throwTo).
- less sharing and better separation of Capabilities: communication
is done with messages; the data structures are strictly owned by a
Capability and cannot be modified except by sending messages.
- this change will ultimately enable us to do more intelligent
scheduling when threads block on each other. This is what started
off the whole thing, but it isn't done yet (#3838).
I'll be documenting all this on the wiki in due course.
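
A loose sketch of the shape of the scheme (all names illustrative):
blocking becomes a message to the Capability that owns the blackhole,
and only that Capability touches the queue.

    typedef struct MessageBlackHole_ {
        struct MessageBlackHole_ *link;   /* inbox chaining */
        struct TSO_              *tso;    /* the thread that blocked */
        void                     *bh;     /* the blackhole it entered */
    } MessageBlackHole;

    typedef struct {
        MessageBlackHole *inbox;   /* messages from other Capabilities */
    } Cap;

    /* "wake me when this thunk is updated": the owner attaches the TSO
     * to the closure's queue itself, so the common single-owner path
     * needs no atomic instructions. */
    static void sendBlockMessage(Cap *owner, MessageBlackHole *msg)
    {
        msg->link = owner->inbox;     /* real code pushes with a CAS */
        owner->inbox = msg;
    }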
mingw doesn't understand %llu/%lld: it treats them as 32-bit rather
than 64-bit. We use %I64u/%I64d instead.
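
The usual fix is a pair of format macros in a shared header, roughly
(macro names are illustrative):

    #if defined(__MINGW32__)
    #define FMT_Word64 "I64u"      /* mingw's 64-bit conversions */
    #define FMT_Int64  "I64d"
    #else
    #define FMT_Word64 "llu"
    #define FMT_Int64  "lld"
    #endif

    /* usage: printf("allocated %" FMT_Word64 " bytes\n", alloc); */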
This replaces some complicated locking schemes with message-passing
in the implementation of throwTo. The benefits are
- previously it was impossible to guarantee that a throwTo from
a thread running on one CPU to a thread running on another CPU
would be noticed, and we had to rely on the GC to pick up these
forgotten exceptions. This no longer happens.
- the locking regime is simpler (though the code is about the same
size)
- threads can now be unblocked from a blocked_exceptions queue
without having to traverse the whole queue. It's a rare case, but it
replaces an O(n) operation with an O(1) one.
- generally we move in the direction of sharing less between
Capabilities (aka HECs), which will become important with other
changes we have planned.
Also in this patch I replaced several STM-specific closure types with
a generic MUT_PRIM closure type, which allowed a lot of code in the GC
and other places to go away, hence the line-count reduction. The
message-passing changes resulted in about a net zero line-count
difference.
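
Sketched very roughly (field names illustrative), a throwTo between
Capabilities now travels as a message that cannot be lost:

    typedef struct MessageThrowTo_ {
        struct MessageThrowTo_ *link;    /* inbox chaining */
        struct TSO_ *source;             /* thrower; blocks until done */
        struct TSO_ *target;             /* throwee */
        void        *exception;          /* the exception closure */
    } MessageThrowTo;

    /* The target's Capability acts on the message at a safe point, so
     * delivery is guaranteed rather than left for the GC to notice. */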
The idea is that this leaves Tasks and OSThreads in one-to-one
correspondence. The part of a Task that represents a call into
Haskell from C is split into a separate struct InCall, pointed to by
the Task and the TSO bound to it. A given OSThread/Task thus always
uses the same mutex and condition variable, rather than getting a new
one for each callback. Conceptually it is simpler, although there are
more types and indirections in a few places now.
This improves callback performance by removing some of the locks that
we had to take when making in-calls. Now we also keep the current Task
in a thread-local variable if supported by the OS and gcc (currently
only Linux).
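
A condensed sketch of the thread-local fast path (the real code sits
behind configure checks; gcc's __thread is assumed here):

    typedef struct Task_ {
        struct InCall_ *incall;    /* current in-call, if any */
        /* mutex, condition variable, ... */
    } Task;

    #if defined(__GNUC__) && defined(__linux__)
    static __thread Task *my_task;     /* no pthread_getspecific cost */
    #define myTask()      (my_task)
    #define setMyTask(t)  (my_task = (t))
    #else
    /* elsewhere: fall back to pthread_getspecific / TlsGetValue */
    #endif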
This helps when the thread holding the lock has been descheduled,
which is the main cause of the "last-core slowdown" problem. With
this patch, I get much better results with -N8 on an 8-core box,
although some benchmarks are still worse than with 7 cores.
I also added a yieldThread() into the any_work() loop of the parallel
GC when it has no work to do. Oddly, this seems to improve performance
on the parallel GC benchmarks even when all the cores are busy.
Perhaps it is due to reducing contention on the memory bus.
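
The shape of the change in the GC's idle loop, as a sketch (helper
names are approximate):

    extern int  gc_finished(void);
    extern int  any_work(void);
    extern void scavenge_until_no_work(void);
    extern void yieldThread(void);     /* wraps sched_yield() etc. */

    static void gc_worker(void)
    {
        while (!gc_finished()) {
            if (any_work()) scavenge_until_no_work();
            else            yieldThread();   /* let the descheduled
                                                lock-holder run instead
                                                of spinning */
        }
    }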
The card table is an array of bytes, placed directly following the
actual array data. This means that array reading is unaffected, but
array writing needs to read the array size from the header in order to
find the card table.
We use a bytemap rather than a bitmap, because updating the card table
must be multi-thread safe. Each byte refers to 128 entries of the
array, but this is tunable by changing the constant
MUT_ARR_PTRS_CARD_BITS in includes/Constants.h.
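
In outline (toy types; the real constant is MUT_ARR_PTRS_CARD_BITS), a
write marks the card covering its element like so:

    #define CARD_BITS 7                 /* 2^7 = 128 elements per card */

    typedef struct {
        unsigned long ptrs;             /* element count, from the header */
        void *payload[];                /* elements, then the card bytes */
    } MutArrPtrs;

    static void writeArrayPtr(MutArrPtrs *arr, unsigned long i, void *v)
    {
        arr->payload[i] = v;
        /* the card table follows the elements, so a write (unlike a
         * read) must fetch arr->ptrs to locate it */
        unsigned char *cards = (unsigned char *)&arr->payload[arr->ptrs];
        cards[i >> CARD_BITS] = 1;      /* byte store: no atomics needed */
    }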
- Defines a DTrace provider, called 'HaskellEvent', that provides a probe
for every event of the eventlog framework.
- In contrast to the original eventlog, the DTrace probes are available in
all flavours of the runtime system (DTrace probes have virtually no
overhead if not enabled); when -DTRACING is defined both the regular
event log as well as DTrace probes can be used.
- Currently, Mac OS X only. User-space DTrace probes are implemented
differently on Mac OS X than in the original DTrace implementation.
Nevertheless, it shouldn't be too hard to enable these probes on other
platforms, too.
- Documentation is at http://hackage.haskell.org/trac/ghc/wiki/DTrace
We now just call gcc to get the dependencies directly.
The GC had a two-level structure, G generations each of T steps.
Steps are for aging within a generation, mostly to avoid premature
promotion.
Measurements show that more than 2 steps is almost never worthwhile,
and 1 step is usually worse than 2. In theory fractional steps are
possible, so the ideal number of steps is somewhere between 1 and 3.
GHC's default has always been 2.
We can implement 2 steps quite straightforwardly by having each block
point to the generation to which objects in that block should be
promoted, so blocks in the nursery point to generation 0, and blocks
in gen 0 point to gen 1, and so on.
This commit removes the explicit step structures, merging generations
with steps, thus simplifying a lot of code. Performance is
unaffected. The tunable number of steps is now gone, although it may
be replaced in the future by a way to tune the aging in generation 0.
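
Schematically (field names approximate), the two-step behaviour now
comes from a per-block destination pointer:

    typedef struct generation_ generation;

    typedef struct bdescr_ {
        generation *gen;     /* the generation this block belongs to */
        generation *dest;    /* where its live objects are copied during
                                GC: nursery blocks point at gen 0, gen 0
                                blocks at gen 1, and so on */
    } bdescr;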
This is a batch of refactoring to remove some of the GC's global
state, as we move towards CPU-local GC.
- allocateLocal() now allocates large objects into the local
nursery, rather than taking a global lock and allocating
them in gen 0 step 0.
- allocatePinned() was still allocating from global storage and
taking a lock each time, now it uses local storage.
(mallocForeignPtrBytes should be faster with -threaded).
- We had a gen 0 step 0, distinct from the nurseries, which are
stored in a separate nurseries[] array. This is slightly strange.
I removed the g0s0 global that pointed to gen 0 step 0, and
removed all uses of it. I think now we don't use gen 0 step 0 at
all, except possibly when there is only one generation. Possibly
more tidying up is needed here.
- I removed the global allocate() function, and renamed
allocateLocal() to allocate().
- the alloc_blocks global is gone. MAYBE_GC() and
doYouWantToGC() now check the local nursery only.
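
The common path is now a lock-free bump allocation from the
Capability's own nursery; roughly (toy types, illustrative helper):

    typedef struct { unsigned long *start, *free; } Block;
    typedef struct { Block *current_nursery; } Cap;

    #define BLOCK_SIZE_W 1024                 /* illustrative */

    extern Block *getNewNurseryBlock(Cap *);  /* rare path, may lock */

    static unsigned long *allocate(Cap *cap, unsigned long n_words)
    {
        Block *bd = cap->current_nursery;
        if (bd->free + n_words > bd->start + BLOCK_SIZE_W)
            bd = getNewNurseryBlock(cap);
        unsigned long *p = bd->free;
        bd->free += n_words;                  /* no global lock taken */
        return p;
    }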
-H alone causes the RTS to use a larger nursery, but without exceeding
the amount of memory that the application is already using. It trades
off GC time against locality: the default setting is to use a
fixed-size 512k nursery, but this is sometimes worse than using a very
large nursery despite the worse locality.
Not all programs get faster, but some programs that use large heaps do
much better with -H. e.g. this helps a lot with #3061 (binary-trees),
though not as much as specifying -H<large>. Typically using -H<large>
is better than plain -H, because the runtime doesn't know ahead of
time how much memory you want to use.
Should -H be on by default? I'm not sure: it makes some programs go
slower, but others go faster.
At the moment, this just saves a memory reference in the GC inner loop
(worth a percent or two of GC time). Later, it will hopefully let me
experiment with partial steps and with simplifying the generation/step
infrastructure.
In a stack overflow situation, stack squeezing may reduce the stack
size, but we don't know whether it has been reduced enough for the
stack check to succeed if we try again. Fortunately stack squeezing
is idempotent, so all we need to do is record whether *any* squeezing
happened. If we are at the stack's absolute -K limit, and stack
squeezing happened, then we try running the thread again.
We also want to avoid enlarging the stack if squeezing has already
released some of it. However, we don't want to get into a
pathological situation where a thread has a nearly full stack (near
its current limit, but not near the absolute -K limit), keeps
allocating a little bit, squeezing removes a little bit, and then it
runs again. So to avoid this, if we squeezed *and* there is still
less than BLOCK_SIZE_W words free, then we enlarge the stack anyway.
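
The resulting logic, roughly (helper names are invented for the
sketch):

    typedef struct TSO_ TSO;
    #define BLOCK_SIZE_W 1024   /* illustrative */

    extern int  stackSqueeze(TSO *);   /* idempotent; 1 if it freed anything */
    extern long stackWordsFree(TSO *);
    extern void enlargeStack(TSO *);
    extern void pushOnRunQueue(TSO *);

    static void handleStackOverflow(TSO *tso)
    {
        if (stackSqueeze(tso) && stackWordsFree(tso) >= BLOCK_SIZE_W) {
            pushOnRunQueue(tso);   /* squeezing freed enough: retry */
        } else {
            enlargeStack(tso);     /* nothing squeezed, or still nearly
                                      full: grow, avoiding the squeeze-
                                      a-little/overflow-again treadmill */
        }
    }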
The log file format was still using 32 bits; this just updates the
header file to match; there should be no functional changes.
Patch 1/2: second part of the patch is to libraries/base
This time without dynamic linker hacks; instead I've expanded the
existing rts/Globals.c to cache more CAFs, specifically those in
GHC.Conc. We were already using this trick for signal handlers, I
should have realised before.
It's still quite unsavoury, but we can do away with rts/Globals.c in
the future when we switch to a dynamically-linked GHCi.
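
The trick itself is small; a sketch of one cached CAF (names are
illustrative, and the real code uses an atomic CAS):

    static void *conc_store = NULL;

    /* Called from base via a foreign import: the first caller's
     * closure wins and every later caller gets the same closure back,
     * so the CAF behaves like a process-wide global. */
    void *getOrSetConcStore(void *closure)
    {
        if (conc_store == NULL)
            conc_store = closure;     /* real code: compare-and-swap */
        return conc_store;
    }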
This means we can remove some conditional stuff from the Makefiles,
and means the testsuite doesn't have to work out whether or not it's
on Windows.
While fixing #3578 I noticed that this function was just a field
access to StgTRecHeader, so I inlined it manually.
This patch eliminates a couple of places where we were assuming that
the host word size is the same as the target word size.
Also a little refactoring: Constants now exports the types TargetInt
and TargetWord corresponding to the Int/Word type on the target
platform, and I moved the definitions of tARGET_INT_MAX and friends
from Literal to Constants.
Thanks to Barney Stratford <barney_stratford@fastmail.fm> for helping
track down the problem and fix it. We now know that GHC can
successfully cross-compile from 32-bit to 64-bit.
This is a follow-up to the patch that fixes Trac #3439.
We had forgotten the dynamic linker, which needs to
know all these ticky symbols too.
This helps on a hyperthreaded CPU by yielding to the other thread in a
spinlock loop.
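
Concretely, this is the x86 pause hint inside the spin; a sketch:

    static inline void busy_wait_nop(void)
    {
    #if defined(__i386__) || defined(__x86_64__)
        __asm__ __volatile__ ("pause");   /* frees execution resources
                                             for the sibling hyperthread */
    #endif
    }

    static void spin_until_zero(volatile int *w)
    {
        while (*w) busy_wait_nop();
    }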
added:
primop TraceEventOp "traceEvent#" GenPrimOp
Addr# -> State# s -> State# s
{ Emits an event via the RTS tracing framework. The contents
of the event is the zero-terminated byte string passed as the first
argument. The event will be emitted either to the .eventlog file,
or to stderr, depending on the runtime RTS flags. }
and added the required RTS functionality to support it. Also a bit of
refactoring in the RTS tracing code.
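
On the RTS side the shape is roughly this (function names are
illustrative):

    typedef struct Capability_ Capability;
    extern int  eventlog_enabled;
    extern void postUserEvent(Capability *, char *);
    extern void debugBelch(const char *fmt, ...);

    /* Called by the traceEvent# primop with the user's NUL-terminated
     * string; tracing routes it to the binary log or to stderr. */
    void traceUserEvent(Capability *cap, char *msg)
    {
        if (eventlog_enabled)
            postUserEvent(cap, msg);   /* append to the .eventlog file */
        else
            debugBelch("%s\n", msg);   /* fall back to stderr */
    }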
This makes events smaller and tracing quicker, and speeds up reading
and sorting the trace file.
HEADS UP: this changes the format of event log files. Corresponding
changes to the ghc-events package are required (and will be pushed
soon). Normally we would make backwards-compatible changes, but this
changes the format of every event (to remove the capability) so I'm
breaking the rules this time. This will be the only time we can do
this, since the format becomes public in 6.12.1.
These indicate the size and time span of a sequence of events in the
event log, to make it easier to sort and navigate a large event log.
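
Sketched as a struct (the real on-disk layout is defined by the event
log format and the ghc-events package; these fields are illustrative):

    #include <stdint.h>

    typedef struct {
        uint16_t tag;         /* EVENT_BLOCK_MARKER */
        uint64_t time;        /* timestamp of the first event inside */
        uint32_t block_size;  /* bytes of event data that follow */
        uint64_t end_time;    /* timestamp of the last event inside */
        uint16_t cap;         /* capability the block belongs to */
    } EventBlockMarker;

    /* A reader can skip block_size bytes at a time, or compare
     * [time, end_time] against the window it cares about, without
     * decoding each individual event. */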