| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
Now, the threaded RTS also includes SMP support. The -smp flag is a
synonym for -threaded. The performance implications of this are small
to negligible, and it results in a code cleanup and reduces the number
of combinations we have to test.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Big re-hash of the threaded/SMP runtime
This is a significant reworking of the threaded and SMP parts of
the runtime. There are two overall goals here:
- To push down the scheduler lock, reducing contention and allowing
more parts of the system to run without locks. In particular,
the scheduler does not require a lock any more in the common case.
- To improve affinity, so that running Haskell threads stick to the
same OS threads as much as possible.
At this point we have the basic structure working, but there are some
pieces missing. I believe it's reasonably stable - the important
parts of the testsuite pass in all the (normal,threaded,SMP) ways.
In more detail:
- Each capability now has a run queue, instead of one global run
queue. The Capability and Task APIs have been completely
rewritten; see Capability.h and Task.h for the details.
- Each capability has its own pool of worker Tasks. Hence, Haskell
threads on a Capability's run queue will run on the same worker
Task(s). As long as the OS is doing something reasonable, this
should mean they usually stick to the same CPU. Another way to
look at this is that we're assuming each Capability is associated
with a fixed CPU.
- What used to be StgMainThread is now part of the Task structure.
Every OS thread in the runtime has an associated Task, and it
can ask for its current Task at any time with myTask().
- removed RTS_SUPPORTS_THREADS symbol, use THREADED_RTS instead
(it is now defined for SMP too).
- The RtsAPI has had to change; we must explicitly pass a Capability
around now. The previous interface assumed some global state.
SchedAPI has also changed a lot.
- The OSThreads API now supports thread-local storage, used to
implement myTask(), although it could be done more efficiently
using gcc's __thread extension when available.
- I've moved some POSIX-specific stuff into the posix subdirectory,
moving in the direction of separating out platform-specific
implementations.
- lots of lock-debugging and assertions in the runtime. In particular,
when DEBUG is on, we catch multiple ACQUIRE_LOCK()s, and there is
also an ASSERT_LOCK_HELD() call.
What's missing so far:
- I have almost certainly broken the Win32 build, will fix soon.
- any kind of thread migration or load balancing. This is high up
the agenda, though.
- various performance tweaks to do
- throwTo and forkProcess still do not work in SMP mode
|
|
|
|
|
| |
back out revision 1.22; it led to very bad memory fragmentation. A
rethink is in order.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Block allocator performance fix: instead of keeping the free list
ordered, keep it doubly-linked, and introduce a new flag BF_FREE so we
can tell when a block is free. We can still coalesce blocks on the
free list because block descriptors are kept consecutively in memory,
so we can tell based on the BF_FREE flag whether to coalesce with the
next higher/lower blocks when freeing a block.
This (almost) make freeChain O(n) rather than O(n^2), and has been
reported to help a lot when dealing with very large heaps.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GC changes: instead of threading old-generation mutable lists
through objects in the heap, keep it in a separate flat array.
This has some advantages:
- the IND_OLDGEN object is now only 2 words, so the minimum
size of a THUNK is now 2 words instead of 3. This saves
some amount of allocation (about 2% on average according to
my measurements), and is more friendly to the cache by
squashing objects together more.
- keeping the mutable list separate from the IND object
will be necessary for our multiprocessor implementation.
- removing the mut_link field makes the layout of some objects
more uniform, leading to less complexity and special cases.
- I also unified the two mutable lists (mut_once_list and mut_list)
into a single mutable list, which lead to more simplifications
in the GC.
|
|
|
|
|
| |
Removed the annoying "Id" CVS keywords, they're a real PITA when it
comes to merging...
|
|
|
|
| |
eliminate some more gcc 3.4 warnings
|
|
|
|
|
|
|
|
|
|
|
|
| |
Cleanup: all (well, most) messages from the RTS now go through the
functions in RtsUtils: barf(), debugBelch() and errorBelch(). The
latter two were previously called belch() and prog_belch()
respectively. See the comments for the right usage of these message
functions.
One reason for doing this is so that we can avoid spurious uses of
stdout/stderr by Haskell apps on platforms where we shouldn't be using
them (eg. non-console apps on Windows).
|
|
|
|
|
|
| |
Tweaks to have RTS (C) sources compile with MSVC. Apart from wibbles
related to the handling of 'inline', changed Schedule.h:POP_RUN_QUEUE()
not to use expression-level statement blocks.
|
|
|
|
| |
make use of MBLOCK_ROUND_DOWN()
|
|
|
|
| |
Make it multi-init-safe
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Remove most #includes of system headers from Stg.h, and instead
#include any required headers directly in each RTS source file.
The idea is to (a) reduce namespace pollution from system headers that
we don't need, (c) be clearer about dependencies on system things in
the RTS, and (c) improve via-C compilation times (maybe).
In practice though, HsBase.h #includes everything anyway, so the
difference from the point of view of .hc source is minimal. However,
this makes it easier to move to zero-includes if we wanted to (see
discussion on the FFI list; I'm still not sure that's possible but
at least this is a step in the right direction).
|
|
|
|
| |
Fix a bug in the previous commit, and add some more sanity checking.
|
|
|
|
|
| |
(addendum to the previous commit) also set bd->blocks to zero in
coalesce().
|
|
|
|
|
| |
For each non-head block in a block group, set its 'blocks' field to
zero (as per comments elsewhere).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Change the story about POSIX headers in C compilation.
Until now, all C code in the RTS and library cbits has by default been
compiled with settings for POSIXness enabled, that is:
#define _POSIX_SOURCE 1
#define _POSIX_C_SOURCE 199309L
#define _ISOC9X_SOURCE
If you wanted to negate this, you'd have to define NON_POSIX_SOURCE
before including headers.
This scheme has some bad effects:
* It means that ccall-unfoldings exported via interfaces from a
module compiled with -DNON_POSIX_SOURCE may not compile when
imported into a module which does not -DNON_POSIX_SOURCE.
* It overlaps with the feature tests we do with autoconf.
* It seems to have caused borkage in the Solaris builds for some
considerable period of time.
The New Way is:
* The default changes to not-being-in-Posix mode.
* If you want to force a C file into Posix mode, #include as
the **first** include the new file ghc/includes/PosixSource.h.
Most of the RTS C sources have this include now.
* NON_POSIX_SOURCE is almost totally expunged. Unfortunately
we have to retain some vestiges of it in ghc/compiler so that
modules compiled via C on Solaris using older compilers don't
break.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a compacting garbage collector.
It isn't enabled by default, as there are still a couple of problems:
there's a fallback case I haven't implemented yet which means it will
occasionally bomb out, and speed-wise it's quite a bit slower than the
copying collector (about 1.8x slower).
Until I can make it go faster, it'll only be useful when you're
actually running low on real memory.
'+RTS -c' to enable it.
Oh, and I cleaned up a few things in the RTS while I was there, and
fixed one or two possibly real bugs in the existing GC.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Small changes to improve GC performance slightly:
- store the generation *number* in the block descriptor rather
than a pointer to the generation structure, since the most
common operation is to pull out the generation number, and
it's one less indirection this way.
- cache the generation number in the step structure too, which
avoids an extra indirection in several places.
|
|
|
|
|
| |
The bd->free field of a block descriptor is supposed to be set to -1
for free blocks, if we're #ifdef DEBUGging. It wasn't sometimes.
|
|
|
|
|
|
|
| |
The allocator for mega groups now checks if consecutive megablocks on
the free list are contiguous in memory. The omission of this check
caused all kinds of funny runtime errors and took away at least five
happy years of my life... :-{
|
|
|
|
| |
Fix bug in allocGroup() when allocating an entire megablock in one go.
|
|
|
|
| |
Copyright police.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added a generational garbage collector.
The collector is reliable but fairly untuned as yet. It works with an
arbitrary number of generations: use +RTS -G<gens> to change the
number of generations used (default 2).
Stats: +RTS -Sstderr is quite useful, but to really see what's going
on compile the RTS with -DDEBUG and use +RTS -D32.
ARR_PTRS removed - it wasn't used anywhere.
Sanity checking improved:
- free blocks are now spammed when sanity checking is turned on
- a check for leaking blocks is performed after each GC.
|
|
Move 4.01 onto the main trunk.
|