summaryrefslogtreecommitdiff
path: root/rts/Capability.c
Commit message (Collapse)AuthorAgeFilesLines
* count fizzled and GC'd sparks separatelySimon Marlow2010-11-111-1/+2
|
* count "dud" sparks (expressions that were already evaluated when sparked)Simon Marlow2010-11-011-0/+1
|
* releaseCapabilityAndQueueWorker: task->stopped should be false (#4850)Simon Marlow2010-12-211-1/+5
|
* Keep a maximum of 6 spare worker threads per Capability (#4262)Simon Marlow2010-11-251-10/+28
|
* Make sparks into weak pointers (#2185)Simon Marlow2010-05-251-4/+2
| | | | | The new strategies library (parallel-2.0+, preferably 2.2+) is now required for parallel programming, otherwise parallelism will be lost.
* New implementation of BLACKHOLEsSimon Marlow2010-03-291-61/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces the global blackhole_queue with a clever scheme that enables us to queue up blocked threads on the closure that they are blocked on, while still avoiding atomic instructions in the common case. Advantages: - gets rid of a locked global data structure and some tricky GC code (replacing it with some per-thread data structures and different tricky GC code :) - wakeups are more prompt: parallel/concurrent performance should benefit. I haven't seen anything dramatic in the parallel benchmarks so far, but a couple of threading benchmarks do improve a bit. - waking up a thread blocked on a blackhole is now O(1) (e.g. if it is the target of throwTo). - less sharing and better separation of Capabilities: communication is done with messages, the data structures are strictly owned by a Capability and cannot be modified except by sending messages. - this change will utlimately enable us to do more intelligent scheduling when threads block on each other. This is what started off the whole thing, but it isn't done yet (#3838). I'll be documenting all this on the wiki in due course.
* Use message-passing to implement throwTo in the RTSSimon Marlow2010-03-111-25/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces some complicated locking schemes with message-passing in the implementation of throwTo. The benefits are - previously it was impossible to guarantee that a throwTo from a thread running on one CPU to a thread running on another CPU would be noticed, and we had to rely on the GC to pick up these forgotten exceptions. This no longer happens. - the locking regime is simpler (though the code is about the same size) - threads can be unblocked from a blocked_exceptions queue without having to traverse the whole queue now. It's a rare case, but replaces an O(n) operation with an O(1). - generally we move in the direction of sharing less between Capabilities (aka HECs), which will become important with other changes we have planned. Also in this patch I replaced several STM-specific closure types with a generic MUT_PRIM closure type, which allowed a lot of code in the GC and other places to go away, hence the line-count reduction. The message-passing changes resulted in about a net zero line-count difference.
* Split part of the Task struct into a separate struct InCallSimon Marlow2010-03-091-21/+19
| | | | | | | | | | | | | | | The idea is that this leaves Tasks and OSThread in one-to-one correspondence. The part of a Task that represents a call into Haskell from C is split into a separate struct InCall, pointed to by the Task and the TSO bound to it. A given OSThread/Task thus always uses the same mutex and condition variable, rather than getting a new one for each callback. Conceptually it is simpler, although there are more types and indirections in a few places now. This improves callback performance by removing some of the locks that we had to take when making in-calls. Now we also keep the current Task in a thread-local variable if supported by the OS and gcc (currently only Linux).
* Fix a rare deadlock when the IO manager thread is slow to start upSimon Marlow2010-03-091-1/+9
| | | | | This fixes occasional failures of ffi002(threaded1) on a loaded machine.
* comment-out an incorrect assertionSimon Marlow2010-01-261-1/+4
|
* Expose all EventLog events as DTrace probesManuel M T Chakravarty2009-12-121-6/+5
| | | | | | | | | | | | | | - Defines a DTrace provider, called 'HaskellEvent', that provides a probe for every event of the eventlog framework. - In contrast to the original eventlog, the DTrace probes are available in all flavours of the runtime system (DTrace probes have virtually no overhead if not enabled); when -DTRACING is defined both the regular event log as well as DTrace probes can be used. - Currently, Mac OS X only. User-space DTrace probes are implemented differently on Mac OS X than in the original DTrace implementation. Nevertheless, it shouldn't be too hard to enable these probes on other platforms, too. - Documentation is at http://hackage.haskell.org/trac/ghc/wiki/DTrace
* remove unused cap->in_gc flagSimon Marlow2009-12-021-1/+0
|
* Make allocatePinned use local storage, and other refactoringsSimon Marlow2009-12-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | This is a batch of refactoring to remove some of the GC's global state, as we move towards CPU-local GC. - allocateLocal() now allocates large objects into the local nursery, rather than taking a global lock and allocating then in gen 0 step 0. - allocatePinned() was still allocating from global storage and taking a lock each time, now it uses local storage. (mallocForeignPtrBytes should be faster with -threaded). - We had a gen 0 step 0, distinct from the nurseries, which are stored in a separate nurseries[] array. This is slightly strange. I removed the g0s0 global that pointed to gen 0 step 0, and removed all uses of it. I think now we don't use gen 0 step 0 at all, except possibly when there is only one generation. Possibly more tidying up is needed here. - I removed the global allocate() function, and renamed allocateLocal() to allocate(). - the alloc_blocks global is gone. MAYBE_GC() and doYouWantToGC() now check the local nursery only.
* free cap->saved_mut_lists tooSimon Marlow2009-12-011-0/+1
| | | | fixes some memory leakage at shutdown
* findSpark: exit if there's a returning foreign callSimon Marlow2009-10-091-1/+1
|
* Retry pulling from our own spark pool if there was a collisionSimon Marlow2009-10-071-21/+24
|
* Unify event logging and debug tracing.Simon Marlow2009-08-291-14/+6
| | | | | | | | | | | | | | | | | | | - tracing facilities are now enabled with -DTRACING, and -DDEBUG additionally enables debug-tracing. -DEVENTLOG has been removed. - -debug now implies -eventlog - events can be printed to stderr instead of being sent to the binary .eventlog file by adding +RTS -v (which is implied by the +RTS -Dx options). - -Dx debug messages can be sent to the binary .eventlog file by adding +RTS -l. This should help debugging by reducing the impact of debug tracing on execution time. - Various debug messages that duplicated the information in events have been removed.
* waitForReturnCapability: fix logic bugSimon Marlow2009-08-311-1/+1
| | | | | The check for whether a Capability was free was inverted, which harmed performance for callbacks.
* RTS tidyup sweep, first phaseSimon Marlow2009-08-021-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The first phase of this tidyup is focussed on the header files, and in particular making sure we are exposinng publicly exactly what we need to, and no more. - Rts.h now includes everything that the RTS exposes publicly, rather than a random subset of it. - Most of the public header files have moved into subdirectories, and many of them have been renamed. But clients should not need to include any of the other headers directly, just #include the main public headers: Rts.h, HsFFI.h, RtsAPI.h. - All the headers needed for via-C compilation have moved into the stg subdirectory, which is self-contained. Most of the headers for the rest of the RTS APIs have moved into the rts subdirectory. - I left MachDeps.h where it is, because it is so widely used in Haskell code. - I left a deprecated stub for RtsFlags.h in place. The flag structures are now exposed by Rts.h. - Various internal APIs are no longer exposed by public header files. - Various bits of dead code and declarations have been removed - More gcc warnings are turned on, and the RTS code is more warning-clean. - More source files #include "PosixSource.h", and hence only use standard POSIX (1003.1c-1995) interfaces. There is a lot more tidying up still to do, this is just the first pass. I also intend to standardise the names for external RTS APIs (e.g use the rts_ prefix consistently), and declare the internal APIs as hidden for shared libraries.
* Add and export rts_unsafeGetMyCapability from rtsDuncan Coutts2009-06-121-0/+15
| | | | | | | | | | | | | | | We need this, or something equivalent, to be able to implement stgAllocForGMP outside of the rts. That's because we want to use allocateLocal which allocates from the given capability without having to take any locks. In the gmp primops we're basically in an unsafe foreign call, that is a context where we hold a current capability. So it's safe for us to use allocateLocal. We just need a way to get the current capability. The method to get the current capability varies depends on whether we're using the threaded rts or not. When stgAllocForGMP is built inside the rts that's ok because we can do it conditionally on THREADED_RTS. Outside the rts we need a single api we can call without knowing if we're talking to a threaded rts or not, hence this addition.
* Remove old GUM/GranSim codeSimon Marlow2009-06-021-1/+1
|
* Eventlog support for new event type: create spark.donnie@darthik.com2009-04-031-0/+9
|
* Add fast event loggingSimon Marlow2009-03-171-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | Generate binary log files from the RTS containing a log of runtime events with timestamps. The log file can be visualised in various ways, for investigating runtime behaviour and debugging performance problems. See for example the forthcoming ThreadScope viewer. New GHC option: -eventlog (link-time option) Enables event logging. +RTS -l (runtime option) Generates <prog>.eventlog with the binary event information. This replaces some of the tracing machinery we already had in the RTS: e.g. +RTS -vg for GC tracing (we should do this using the new event logging instead). Event logging has almost no runtime cost when it isn't enabled, though in the future we might add more fine-grained events and this might change; hence having a link-time option and compiling a separate version of the RTS for event logging. There's a small runtime cost for enabling event-logging, for most programs it shouldn't make much difference. (Todo: docs)
* Instead of a separate context-switch flag, set HpLim to zeroSimon Marlow2009-03-131-8/+11
| | | | | | | | | | | | This reduces the latency between a context-switch being triggered and the thread returning to the scheduler, which in turn should reduce the cost of the GC barrier when there are many cores. We still retain the old context_switch flag which is checked at the end of each block of allocation. The idea is that setting HpLim may fail if the the target thread is modifying HpLim at the same time; the context_switch flag is a fallback. It also allows us to "context switch soon" without forcing an immediate switch, which can be costly.
* Keep the remembered sets local to each thread during parallel GCSimon Marlow2009-01-121-0/+3
| | | | | | | | | | | | | | | | | | | | | This turns out to be quite vital for parallel programs: - The way we discover which threads to traverse is by finding dirty threads via the remembered sets (aka mutable lists). - A dirty thread will be on the remembered set of the capability that was running it, and we really want to traverse that thread's stack using the GC thread for the capability, because it is in that CPU's cache. If we get this wrong, we get penalised badly by the memory system. Previously we had per-capability mutable lists but they were aggregated before GC and traversed by just one of the GC threads. This resulted in very poor performance particularly for parallel programs with deep stacks. Now we keep per-capability remembered sets throughout GC, which also removes a lock (recordMutableGen_sync).
* Use mutator threads to do GC, instead of having a separate pool of GC threadsSimon Marlow2008-11-211-55/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the GC had its own pool of threads to use as workers when doing parallel GC. There was a "leader", which was the mutator thread that initiated the GC, and the other threads were taken from the pool. This was simple and worked fine for sequential programs, where we did most of the benchmarking for the parallel GC, but falls down for parallel programs. When we have N mutator threads and N cores, at GC time we would have to stop N-1 mutator threads and start up N-1 GC threads, and hope that the OS schedules them all onto separate cores. It practice it doesn't, as you might expect. Now we use the mutator threads to do GC. This works quite nicely, particularly for parallel programs, where each mutator thread scans its own spark pool, which is probably in its cache anyway. There are some flag changes: -g<n> is removed (-g1 is still accepted for backwards compat). There's no way to have a different number of GC threads than mutator threads now. -q1 Use one OS thread for GC (turns off parallel GC) -qg<n> Use parallel GC for generations >= <n> (default: 1) Using parallel GC only for generations >=1 works well for sequential programs. Compiling an ordinary sequential program with -threaded and running it with -N2 or more should help if you do a lot of GC. I've found that adding -qg0 (do parallel GC for generation 0 too) speeds up some parallel programs, but slows down some sequential programs. Being conservative, I left the threshold at 1. ToDo: document the new options.
* don't run sparks if there are other threads on this CapabilitySimon Marlow2008-11-141-4/+7
|
* Add optional eager black-holing, with new flag -feager-blackholingSimon Marlow2008-11-181-0/+1
| | | | | | | | | | | | | | | Eager blackholing can improve parallel performance by reducing the chances that two threads perform the same computation. However, it has a cost: one extra memory write per thunk entry. To get the best results, any code which may be executed in parallel should be compiled with eager blackholing turned on. But since there's a cost for sequential code, we make it optional and turn it on for the parallel package only. It might be a good idea to compile applications (or modules) with parallel code in with -feager-blackholing. ToDo: document -feager-blackholing.
* move an assertionSimon Marlow2008-11-131-2/+2
|
* re-instate counting of sparks convertedSimon Marlow2008-11-061-3/+16
| | | | lost in patch "Run sparks in batches"
* Run sparks in batches, instead of creating a new thread for each oneSimon Marlow2008-11-061-7/+22
| | | | | Signficantly reduces the overhead for par, which means that we can make use of paralellism at a much finer granularity.
* Move the freeing of Capabilities later in the shutdown sequenceSimon Marlow2008-10-241-3/+16
| | | | | Fixes a bug whereby the Capability has been freed but other Capabilities are still trying to steal sparks from its pool.
* fix a warningSimon Marlow2008-10-231-1/+0
|
* traverse the spark pools only once during GC rather than twiceSimon Marlow2008-10-221-17/+8
|
* Refactoring and reorganisation of the schedulerSimon Marlow2008-10-221-72/+53
| | | | | | | | | | | | | | | | | Change the way we look for work in the scheduler. Previously, checking to see whether there was anything to do was a non-side-effecting operation, but this has changed now that we do work-stealing. This lead to a refactoring of the inner loop of the scheduler. Also, lots of cleanup in the new work-stealing code, but no functional changes. One new statistic is added to the +RTS -s output: SPARKS: 1430 (2 converted, 1427 pruned) lets you know something about the use of `par` in the program.
* Work stealing for sparksberthold@mathematik.uni-marburg.de2008-09-151-3/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | Spark stealing support for PARALLEL_HASKELL and THREADED_RTS versions of the RTS. Spark pools are per capability, separately allocated and held in the Capability structure. The implementation uses Double-Ended Queues (deque) and cas-protected access. The write end of the queue (position bottom) can only be used with mutual exclusion, i.e. by exactly one caller at a time. Multiple readers can steal()/findSpark() from the read end (position top), and are synchronised without a lock, based on a cas of the top position. One reader wins, the others return NULL for a failure. Work stealing is called when Capabilities find no other work (inside yieldCapability), and tries all capabilities 0..n-1 twice, unless a theft succeeds. Inside schedulePushWork, all considered cap.s (those which were idle and could be grabbed) are woken up. Future versions should wake up capabilities immediately when putting a new spark in the local pool, from newSpark(). Patch has been re-recorded due to conflicting bugfixes in the sparks.c, also fixing a (strange) conflict in the scheduler.
* Move the context_switch flag into the CapabilitySimon Marlow2008-09-191-0/+15
| | | | | Fixes a long-standing bug that could in some cases cause sub-optimal scheduling behaviour.
* Fix race condition in wakeupThreadOnCapability() (#2574)Simon Marlow2008-09-091-40/+23
| | | | | | | | | | | wakeupThreadOnCapbility() is used to signal another capability that there is a thread waiting to be added to its run queue. It adds the thread to the (locked) wakeup queue on the remote capability. In order to do this, it has to modify the TSO's link field, which has a write barrier. The write barrier might put the TSO on the mutable list, and the bug was that it was using the mutable list of the *target* capability, which we do not have exclusive access to. We should be using the current Capabilty's mutable list in this case.
* Capability stopping when waiting for GCberthold@mathematik.uni-marburg.de2008-08-191-1/+26
|
* Undo fix for #2185: sparks really should be treated as rootsSimon Marlow2008-07-231-15/+5
| | | | | Unless sparks are roots, strategies don't work at all: all the sparks get GC'd. We need to think about this some more.
* FIX #2185: sparks should not be treated as roots by the GCSimon Marlow2008-04-241-4/+28
|
* Reorganisation to fix problems related to the gct register variableSimon Marlow2008-04-161-0/+50
| | | | | | | | | - GCAux.c contains code not compiled with the gct register enabled, it is callable from outside the GC - marking functions are moved to their relevant subsystems, outside the GC - mark_root needs to save the gct register, as it is called from outside the GC
* hs_exit()/shutdownHaskell(): wait for outstanding foreign calls to complete ↵Simon Marlow2007-07-241-2/+18
| | | | | | | | | | | | | | | | before returning This is pertinent to #1177. When shutting down a DLL, we need to be sure that there are no OS threads that can return to the code that we are about to unload, and previously the threaded RTS was unsafe in this respect. When exiting a standalone program we don't have to be quite so paranoid: all the code will disappear at the same time as any running threads. Happily exiting a program happens via a different path: shutdownHaskellAndExit(). If we're about to exit(), then there's no need to wait for foreign calls to complete.
* Warning police: "%p" format expects a void*sven.panne@aedion.de2007-02-031-1/+1
|
* Partial fix for #926Simon Marlow2007-02-011-0/+24
| | | | | | | | | | | | | | It seems that when a program exits with open DLLs on Windows, the system attempts to shut down the DLLs, but it also terminates (some of?) the running threads. The RTS isn't prepared for threads to die unexpectedly, so it sits around waiting for its workers to finish. This bites in two places: ShutdownIOManager() in the the unthreaded RTS, and shutdownCapability() in the threaded RTS. So far I've modified the latter to notice when worker threads have died unexpectedly and continue shutting down. It seems a bit trickier to fix the unthreaded RTS, so for now the workaround for #926 is to use the threaded RTS.
* Free more things that we allocate2006-12-16Ian Lynagh2006-12-151-2/+8
|
* Free various things we allocateIan Lynagh2006-12-151-0/+2
|
* remove unused includes, now that Storage.h & Stable.h are included by Rts.hSimon Marlow2006-11-151-1/+0
|
* Split GC.c, and move storage manager into sm/ directorySimon Marlow2006-10-241-0/+1
| | | | | | | | | | | | | | | | | In preparation for parallel GC, split up the monolithic GC.c file into smaller parts. Also in this patch (and difficult to separate, unfortunatley): - Don't include Stable.h in Rts.h, instead just include it where necessary. - consistently use STATIC_INLINE in source files, and INLINE_HEADER in header files. STATIC_INLINE is now turned off when DEBUG is on, to make debugging easier. - The GC no longer takes the get_roots function as an argument. We weren't making use of this generalisation.
* STM invariantstharris@microsoft.com2006-10-071-1/+2
|