summaryrefslogtreecommitdiff
path: root/rts/sm
Commit message (Collapse)AuthorAgeFilesLines
* Working towards fixing DLLs on Win64Ian Lynagh2012-05-062-2/+2
|
* Fix maintenance of n_blocks in the RTSIan Lynagh2012-05-011-1/+1
| | | | | | | It was causing assertion failures of ASSERT(countBlocks(nursery->blocks) == nursery->n_blocks) at ghc-stage2: internal error: ASSERTION FAILED: file rts/sm/Sanity.c, line 878
* Fix warnings on Win64Ian Lynagh2012-04-262-11/+11
| | | | | | Mostly this meant getting pointer<->int conversions to use the right sizes. lnat is now size_t, rather than unsigned long, as that seems a better match for how it's used.
* Fix the timestamps in GC_START and GC_END events on the GC-initiating capMikolaj2012-04-041-1/+1
| | | | | | | | | | | There was a discrepancy between GC times reported in +RTS -s and the timestamps of GC_START and GC_END events on the cap, on which +RTS -s stats for the given GC are based. This is fixed by posting the events with exactly the same timestamp as generated for the stat calculation. The calls posting the events are moved too, so that the events are emitted close to the time instant they claim to be emitted at. The GC_STATS_GHC was moved, too, ensuring it's emitted before the moved GC_END on all caps, which simplifies tools code.
* Emit final heap alloc events and rearrange code to calculate alloc totalsDuncan Coutts2012-04-043-24/+28
| | | | | | | | | | | | | | | | | In stat_exit we want to emit a final EVENT_HEAP_ALLOCATED for each cap so that we get the same total allocation count as reported via +RTS -s. To do so we need to update the per-cap total_allocated counts. Previously we had a single calcAllocated(rtsBool) function that counted the large allocations and optionally the nurseries for all caps. The GC would always call it with false, and the stat_exit always with true. The reason for these two modes is that the GC counts the nurseries via clearNurseries() (which also updates the per-cap total_allocated counts), so it's only the stat_exit() path that needs to count them. We now split the calcAllocated() function into two: countLargeAllocated and updateNurseriesStats. As the name suggests, the latter now updates the per-cap total_allocated counts, in additon to returning a total.
* Add new eventlog events for various heap and GC statisticsDuncan Coutts2012-04-042-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | They cover much the same info as is available via the GHC.Stats module or via the '+RTS -s' textual output, but via the eventlog and with a better sampling frequency. We have three new generic heap info events and two very GHC-specific ones. (The hope is the general ones are usable by other implementations that use the same eventlog system, or indeed not so sensitive to changes in GHC itself.) The general ones are: * total heap mem allocated since prog start, on a per-HEC basis * current size of the heap (MBlocks reserved from OS for the heap) * current size of live data in the heap Currently these are all emitted by GHC at GC time (live data only at major GC). The GHC specific ones are: * an event giving various static heap paramaters: * number of generations (usually 2) * max size if any * nursary size * MBlock and block sizes * a event emitted on each GC containing: * GC generation (usually just 0,1) * total bytes copied * bytes lost to heap slop and fragmentation * the number of threads in the parallel GC (1 for serial) * the maximum number of bytes copied by any par GC thread * the total number of bytes copied by all par GC threads (these last three can be used to calculate an estimate of the work balance in parallel GCs)
* Change the presentation of parallel GC work balance in +RTS -sDuncan Coutts2012-04-041-9/+8
| | | | | | | | | | | | | | | | | | | | Also rename internal variables to make the names match what they hold. The parallel GC work balance is calculated using the total amount of memory copied by all GC threads, and the maximum copied by any individual thread. You have serial GC when the max is the same as copied, and perfectly balanced GC when total/max == n_caps. Previously we presented this as the ratio total/max and told users that the serial value was 1 and the ideal value N, for N caps, e.g. Parallel GC work balance: 1.05 (4045071 / 3846774, ideal 2) The downside of this is that the user always has to keep in mind the number of cores being used. Our new presentation uses a normalised scale 0--1 as a percentage. The 0% means completely serial and 100% is perfect balance, e.g. Parallel GC work balance: 4.56% (serial 0%, perfect 100%)
* Calculate the total memory allocated on a per-capability basisDuncan Coutts2012-04-041-1/+3
| | | | | | | In addition to the existing global method. For now we just do it both ways and assert they give the same grand total. At some stage we can simplify the global method to just take the sum of the per-cap counters.
* Drop the per-task timing stats, give a summary only (#5897)Simon Marlow2012-03-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | We were keeping around the Task struct (216 bytes) for every worker we ever created, even though we only keep a maximum of 6 workers per Capability. These Task structs accumulate and cause a space leak in programs that do lots of safe FFI calls; this patch frees the Task struct as soon as a worker exits. One reason we were keeping the Task structs around is because we print out per-Task timing stats in +RTS -s, but that isn't terribly useful. What is sometimes useful is knowing how *many* Tasks there were. So now I'm printing a single-line summary, this is for the program in TASKS: 2001 (1 bound, 31 peak workers (2000 total), using -N1) So although we created 2k tasks overall, there were only 31 workers active at any one time (which is exactly what we expect: the program makes 30 safe FFI calls concurrently). This also gives an indication of how many capabilities were being used, which is handy if you use +RTS -N without an explicit number.
* formatting tweaksGabor Greif2012-02-271-8/+8
|
* tabs -> spacesGabor Greif2012-02-271-53/+53
|
* Allocate pinned object blocks from the nursery, not the globalSimon Marlow2012-02-133-13/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | allocator. Prompted by a benchmark posted to parallel-haskell@haskell.org by Andreas Voellmy <andreas.voellmy@gmail.com>. This program exhibits contention for the block allocator when run with -N2 and greater without the fix: {-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-} module Main where import Control.Monad import Control.Concurrent import System.Environment import GHC.IO import GHC.Exts import GHC.Conc main = do [m] <- fmap (fmap read) getArgs n <- getNumCapabilities ms <- replicateM n newEmptyMVar sequence [ forkIO $ busyWorkerB (m `quot` n) >> putMVar mv () | mv <- ms ] mapM takeMVar ms busyWorkerB :: Int -> IO () busyWorkerB n_loops = go 0 where go !n | n >= n_loops = return () | otherwise = do p <- (IO $ \s -> case newPinnedByteArray# 1024# s of { (# s', mbarr# #) -> (# s', () #) } ) go (n+1)
* Fix for a bug in setNumCapabilitiesSimon Marlow2011-12-131-3/+8
|
* Fix for a bug in +RTS -qi (crash in zero_static_object_list)Simon Marlow2011-12-131-1/+3
|
* Add a comment about oddity with yieldThread() and timing results on LinuxSimon Marlow2011-12-131-0/+5
|
* Avoid integer overflow when calling allocGroup() (#5071)Simon Marlow2011-12-131-2/+5
|
* New flag +RTS -qi<n>, avoid waking up idle Capabilities to do parallel GCSimon Marlow2011-12-132-8/+21
| | | | | | | | | | | | | | | | | This is an experimental tweak to the parallel GC that avoids waking up a Capability to do parallel GC if we know that the capability has been idle for a (tunable) number of GC cycles. The idea is that if you're only using a few Capabilities, there's no point waking up the ones that aren't busy. e.g. +RTS -qi3 says "A Capability will participate in parallel GC if it was running at all since the last 3 GC cycles." Results are a bit hit and miss, and I don't completely understand why yet. Hence, for now it is turned off by default, and also not documented except in the +RTS -? output.
* waitForGcThreads: should be calling interruptCapability(), not ↵Simon Marlow2011-12-131-1/+1
| | | | interruptAllCapabilities()
* Allow the number of capabilities to be increased at runtime (#3729)Simon Marlow2011-12-064-50/+76
| | | | | At present the number of capabilities can only be *increased*, not decreased. The latter presents a few more challenges!
* Make forkProcess work with +RTS -NSimon Marlow2011-12-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | Consider this experimental for the time being. There are a lot of things that could go wrong, but I've verified that at least it works on the test cases we have. I also did some API cleanups while I was here. Previously we had: Capability * rts_eval (Capability *cap, HaskellObj p, /*out*/HaskellObj *ret); but this API is particularly error-prone: if you forget to discard the Capability * you passed in and use the return value instead, then you're in for subtle bugs with +RTS -N later on. So I changed all these functions to this form: void rts_eval (/* inout */ Capability **cap, /* in */ HaskellObj p, /* out */ HaskellObj *ret) It's much harder to use this version incorrectly, because you have to pass the Capability in by reference.
* Fix a scheduling bug in the threaded RTSSimon Marlow2011-12-011-1/+1
| | | | | | | | | | | | | | | The parallel GC was using setContextSwitches() to stop all the other threads, which sets the context_switch flag on every Capability. That had the side effect of causing every Capability to also switch threads, and since GCs can be much more frequent than context switches, this increased the context switch frequency. When context switches are expensive (because the switch is between two bound threads or a bound and unbound thread), the difference is quite noticeable. The fix is to have a separate flag to indicate that a Capability should stop and return to the scheduler, but not switch threads. I've called this the "interrupt" flag.
* Make profiling work with multiple capabilities (+RTS -N)Simon Marlow2011-11-292-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | This means that both time and heap profiling work for parallel programs. Main internal changes: - CCCS is no longer a global variable; it is now another pseudo-register in the StgRegTable struct. Thus every Capability has its own CCCS. - There is a new built-in CCS called "IDLE", which records ticks for Capabilities in the idle state. If you profile a single-threaded program with +RTS -N2, you'll see about 50% of time in "IDLE". - There is appropriate locking in rts/Profiling.c to protect the shared cost-centre-stack data structures. This patch does enough to get it working, I have cut one big corner: the cost-centre-stack data structure is still shared amongst all Capabilities, which means that multiple Capabilities will race when updating the "allocations" and "entries" fields of a CCS. Not only does this give unpredictable results, but it runs very slowly due to cache line bouncing. It is strongly recommended that you use -fno-prof-count-entries to disable the "entries" count when profiling parallel programs. (I shall add a note to this effect to the docs).
* Time handling overhaulSimon Marlow2011-11-251-3/+3
| | | | | | | | | | | | | | | | | | | | | Terminology cleanup: the type "Ticks" has been renamed "Time", which is an StgWord64 in units of TIME_RESOLUTION (currently nanoseconds). The terminology "tick" is now used consistently to mean the interval between timer signals. The ticker now always ticks in realtime (actually CLOCK_MONOTONIC if we have it). Before it used CPU time in the non-threaded RTS and realtime in the threaded RTS, but I've discovered that the CPU timer has terrible resolution (at least on Linux) and isn't much use for profiling. So now we always use realtime. This should also fix The default tick interval is now 10ms, except when profiling where we drop it to 1ms. This gives more accurate profiles without affecting runtime too much (<1%). Lots of cleanups - the resolution of Time is now in one place only (Rts.h) rather than having calculations that depend on the resolution scattered all over the RTS. I hope I found them all.
* mergeSimon Marlow2011-11-222-3/+14
|\
| * Enable pthread_getspecific() tls for LLVM compilerDavid M Peixotto2011-10-072-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | LLVM does not support the __thread attribute for thread local storage and may generate incorrect code for global register variables. We want to allow building the runtime with LLVM-based compilers such as llvm-gcc and clang, particularly for MacOS. This patch changes the gct variable used by the garbage collector to use pthread_getspecific() for thread local storage when an llvm based compiler is used to build the runtime.
* | Fix bug in the handling of TSOs in the compacting GC (#5644)Simon Marlow2011-11-211-1/+2
| |
* | Overhaul of infrastructure for profiling, coverage (HPC) and breakpointsSimon Marlow2011-11-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | User visible changes ==================== Profilng -------- Flags renamed (the old ones are still accepted for now): OLD NEW --------- ------------ -auto-all -fprof-auto -auto -fprof-exported -caf-all -fprof-cafs New flags: -fprof-auto Annotates all bindings (not just top-level ones) with SCCs -fprof-top Annotates just top-level bindings with SCCs -fprof-exported Annotates just exported bindings with SCCs -fprof-no-count-entries Do not maintain entry counts when profiling (can make profiled code go faster; useful with heap profiling where entry counts are not used) Cost-centre stacks have a new semantics, which should in most cases result in more useful and intuitive profiles. If you find this not to be the case, please let me know. This is the area where I have been experimenting most, and the current solution is probably not the final version, however it does address all the outstanding bugs and seems to be better than GHC 7.2. Stack traces ------------ +RTS -xc now gives more information. If the exception originates from a CAF (as is common, because GHC tends to lift exceptions out to the top-level), then the RTS walks up the stack and reports the stack in the enclosing update frame(s). Result: +RTS -xc is much more useful now - but you still have to compile for profiling to get it. I've played around a little with adding 'head []' to GHC itself, and +RTS -xc does pinpoint the problem quite accurately. I plan to add more facilities for stack tracing (e.g. in GHCi) in the future. Coverage (HPC) -------------- * derived instances are now coloured yellow if they weren't used * likewise record field names * entry counts are more accurate (hpc --fun-entry-count) * tab width is now correct (markup was previously off in source with tabs) Internal changes ================ In Core, the Note constructor has been replaced by Tick (Tickish b) (Expr b) which is used to represent all the kinds of source annotation we support: profiling SCCs, HPC ticks, and GHCi breakpoints. Depending on the properties of the Tickish, different transformations apply to Tick. See CoreUtils.mkTick for details. Tickets ======= This commit closes the following tickets, test cases to follow: - Close #2552: not a bug, but the behaviour is now more intuitive (test is T2552) - Close #680 (test is T680) - Close #1531 (test is result001) - Close #949 (test is T949) - Close #2466: test case has bitrotted (doesn't compile against current version of vector-space package)
* | fix for a deadlock when using +RTS -hb with -prof -threadedSimon Marlow2011-11-021-2/+5
| |
* | make CAFs atomic, to fix #5558Simon Marlow2011-10-171-37/+106
|/ | | | See Note [atomic CAFs] in rts/sm/Storage.c
* Fix heap profiling timesIan Lynagh2011-07-241-1/+1
| | | | | | | | | | Now that the heap census runs in the middle of garbage collections, the "CPU time" it was calculating included any CPU time used so far in the current GC. This could cause CPU time to appear to go down, which means hp2ps complained about "samples out of sequence". I'm not sure if this is the nicest way to solve this (maybe resurrecting mut_user_time_during_GC would be better?) but it gets things working again.
* need to release the SM lock around heapCensus() to avoid deadlock withSimon Marlow2011-07-211-0/+2
| | | | +RTS -hT and -threaded.
* Move the call to heapCensus() into GarbageCollect(), just beforeSimon Marlow2011-07-202-1/+13
| | | | | | | | calling resurrectThreads() (fixes #5314). This avoids a lot of problems, because resurrectThreads() may overwrite some closures in the heap, leaving slop behind. The bug in instances, this fix avoids them all in one go.
* Fix gcc 4.6 warnings; fixes #5176Ian Lynagh2011-06-252-4/+15
| | | | | | | | | | | Based on a patch from David Terei. Some parts are a little ugly (e.g. defining things that only ASSERTs use only when DEBUG is defined), so we might want to tweak things a little. I've also turned off -Werror for didn't-inline warnings, as we now get a few such warnings.
* Remove unused variableIan Lynagh2011-06-241-3/+0
|
* Remove a couple of unused bindingsIan Lynagh2011-06-241-4/+0
|
* fix an integer overflow (#5086), and pre-emptively avoid more of theseSimon Marlow2011-05-251-1/+1
| | | | in the future.
* Fix +RTS -G1 (by deleting code, yay!) (#5026)Simon Marlow2011-05-241-10/+0
|
* Avoid accumulating slop in the pinned_object_block.Simon Marlow2011-04-143-10/+24
| | | | | | | | | | | | | | | | | The pinned_object_block is where we allocate small pinned ByteArray# objects. At a GC the pinned_object_block was being treated like other large objects and promoted to the next step/generation, even if it was only partly full. Under some ByteString-heavy workloads this would accumulate on average 2k of slop per GC, and this memory is never released until the ByteArray# objects in the block are freed. So now, we keep allocating into the pinned_object_block until it is completely full, at which point it is handed over to the GC as before. The pinned_object_block might therefore contain objects which a large range of ages, but I don't think this is any worse than the situation before. We still have the fragmentation issue in general, but the new scheme can improve the memory overhead for some workloads dramatically.
* fix a bug introduced in 1fb38442d3a55ac92795aa6c5ed4df82011df724,Simon Marlow2011-04-131-2/+6
| | | | symptom was 2047(threaded2) was crashing.
* isAlive: re-apply the tag if we find a forwarding pointer. This is aSimon Marlow2011-04-121-1/+1
| | | | | real bug, spotted by Marcin Orczyk (thanks!). I'm not sure if it lead to any actual crashes.
* Refactoring and tidy upSimon Marlow2011-04-119-129/+174
| | | | | | | | | | | | This is a port of some of the changes from my private local-GC branch (which is still in darcs, I haven't converted it to git yet). There are a couple of small functional differences in the GC stats: first, per-thread GC timings should now be more accurate, and secondly we now report average and maximum pause times. e.g. from minimax +RTS -N8 -s: Tot time (elapsed) Avg pause Max pause Gen 0 2755 colls, 2754 par 13.16s 0.93s 0.0003s 0.0150s Gen 1 769 colls, 769 par 3.71s 0.26s 0.0003s 0.0059s
* add missing initialisation of ws->todo_large_objectsSimon Marlow2011-02-041-1/+2
| | | | Found-by: Valgrind. Thanks Julian!
* fix compacting GCSimon Marlow2011-02-021-2/+6
|
* fix warningSimon Marlow2011-02-021-2/+1
|
* GC refactoring and cleanupSimon Marlow2011-02-027-276/+325
| | | | | | | | | Now we keep any partially-full blocks in the gc_thread[] structs after each GC, rather than moving them to the generation. This should give us slightly better locality (though I wasn't able to measure any difference). Also in this patch: better sanity checking with THREADED.
* A small GC optimisationSimon Marlow2011-02-025-65/+70
| | | | | | Store the *number* of the destination generation in the Bdescr struct, so that in evacuate() we don't have to deref gen to get it. This is another improvement ported over from my GC branch.
* Remove the per-generation mutable listsSimon Marlow2011-02-027-94/+6
| | | | Now that we use the per-capability mutable lists exclusively.
* update debugging code for fragmentationSimon Marlow2011-01-251-2/+3
|
* fix warningSimon Marlow2011-01-311-2/+1
|
* 32-bit fixSimon Marlow2010-10-131-0/+2
|