path: root/rts/sm
Commit message [Author, Age, Files, Lines]
* Hopefully fix breakage on OS X w/ LLVM [Simon Marlow, 2013-01-17, 3 files, -1/+12]
|   Reordering of includes in GC.c broke on OS X because gctKey is declared
|   in Task.h and is needed in the storage manager. This is really the
|   wrong place for it anyway, so I've moved the gctKey pieces to where
|   they should be.
* Rearrange includes to avoid a clash on ARM/Linux [Simon Marlow, 2013-01-17, 1 file, -12/+13]
|
* Add a write barrier for TVAR closures [Simon Marlow, 2012-11-16, 8 files, -15/+114]
|   This improves GC performance when there are a lot of TVars in the
|   heap. For instance, a TChan with a lot of elements causes a massive
|   GC drag without this patch.
|
|   There's more to do - several other STM closure types don't have write
|   barriers, so GC performance when there are a lot of threads blocked on
|   STM isn't great. But fixing the problem for TVar is a good start.
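As a rough illustration of what a write barrier buys here, the sketch below records an old-generation TVar in a remembered set the first time it is written, so a minor GC only needs to visit the recorded ones. All names (`tvar_sketch`, `remembered_set`, `write_tvar`) are hypothetical and the layout is an assumption; this is not the RTS implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REMEMBERED_MAX 16

/* Hypothetical, simplified TVar: a value, the generation it lives in,
 * and a dirty flag set on the first write since the last GC. */
typedef struct {
    int value;
    unsigned gen;
    bool dirty;
} tvar_sketch;

/* Hypothetical remembered set: old-generation objects written since
 * the last GC. */
typedef struct {
    tvar_sketch *entries[REMEMBERED_MAX];
    size_t count;
} remembered_set;

/* The write barrier: store the value, and record the TVar once if it
 * lives in an old generation and is not already marked dirty. */
void write_tvar(remembered_set *rs, tvar_sketch *tv, int new_value)
{
    tv->value = new_value;
    if (tv->gen > 0 && !tv->dirty) {
        assert(rs->count < REMEMBERED_MAX);
        tv->dirty = true;
        rs->entries[rs->count++] = tv;
    }
}
```

Repeated writes to the same TVar add only one entry, and writes to young TVars add none, which is the property that keeps the per-GC work proportional to the number of TVars actually modified.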
* fix bug in previous commit, 65e46f144f3d8b18de7264b0b099086153c68d6c [Simon Marlow, 2012-11-16, 1 file, -1/+1]
|
* a fix for checkTSO(): the TSO could be a WHITEHOLE [Simon Marlow, 2012-11-12, 1 file, -3/+10]
|
* Don't clearNurseries() in parallel with -debug [Simon Marlow, 2012-11-01, 1 file, -3/+5]
|   It makes sanity-checking fail.
* Produce new-style Cmm from the Cmm parser [Simon Marlow, 2012-10-08, 4 files, -85/+1]
|   The main change here is that the Cmm parser now allows high-level cmm
|   code with argument-passing and function calls. For example:
|
|     foo ( gcptr a, bits32 b )
|     {
|       if (b > 0) {
|          // we can make tail calls passing arguments:
|          jump stg_ap_0_fast(a);
|       }
|
|       return (x,y);
|     }
|
|   More details on the new cmm syntax are in Note [Syntax of .cmm files]
|   in CmmParse.y.
|
|   The old syntax is still more-or-less supported for those occasional
|   code fragments that really need to explicitly manipulate the stack.
|   However there are a couple of differences: it is now obligatory to
|   give a list of live GlobalRegs on every jump, e.g.
|
|     jump %ENTRY_CODE(Sp(0)) [R1];
|
|   Again, more details in Note [Syntax of .cmm files].
|
|   I have rewritten most of the .cmm files in the RTS into the new
|   syntax, except for AutoApply.cmm which is generated by the genapply
|   program: this file could be generated in the new syntax instead and
|   would probably be better off for it, but I ran out of enthusiasm.
|
|   Some other changes in this batch:
|
|    - The PrimOp calling convention is gone, primops now use the ordinary
|      NativeNodeCall convention. This means that primops and "foreign
|      import prim" code must be written in high-level cmm, but they can
|      now take more than 10 arguments.
|
|    - CmmSink now does constant-folding (should fix #7219)
|
|    - .cmm files now go through the cmmPipeline, and as a result we
|      generate better code in many cases. All the object files generated
|      for the RTS .cmm files are now smaller. Performance should be
|      better too, but I haven't measured it yet.
|
|    - RET_DYN frames are removed from the RTS, lots of code goes away
|
|    - we now have some more canned GC points to cover unboxed-tuples with
|      2-4 pointers, which will reduce code size a little.
* Fix the profiling build [Ian Lynagh, 2012-09-21, 1 file, -2/+2]
|
* Convert more RTS macros to functions [Ian Lynagh, 2012-09-21, 2 files, -6/+6]
|   No size changes in the non-debug object files
* Include pinned memory in the stats for allocated memory [Simon Marlow, 2012-09-21, 2 files, -1/+2]
|   This broke with the changes to the pinned object handling in
|   67f4ab7e6b7705a9d617c6109a8c5434ede13cae.
* Cache the result of countOccupied(gen->large_objects) as gen->n_large_words (#7257) [Simon Marlow, 2012-09-21, 2 files, -2/+6]
|   The program in #7257 was spending 90% of its time counting the live
|   data in gen->large_objects. We already avoid doing this for small
|   objects, but in this example the old generation was full of large
|   objects (actually pinned ByteStrings).
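The caching idea can be sketched as follows: keep a running total next to the list and update it whenever the list changes, so that asking for the total no longer walks every block. The structures below are simplified, hypothetical stand-ins (only the field name `n_large_words` comes from the commit subject), not the RTS definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified block descriptor. */
typedef struct block_ {
    struct block_ *link;   /* next block in the list */
    size_t words;          /* occupied words in this block */
} block;

/* Hypothetical, simplified generation. */
typedef struct {
    block *large_objects;  /* list of large-object blocks */
    size_t n_large_words;  /* cached sum of words over large_objects */
} generation;

/* O(n) walk over the list: the work the cache avoids. */
size_t count_occupied(const block *bd)
{
    size_t words = 0;
    for (; bd != NULL; bd = bd->link) {
        words += bd->words;
    }
    return words;
}

/* Push a block and keep the cached total in step with the list. */
void add_large_object(generation *gen, block *bd)
{
    bd->link = gen->large_objects;
    gen->large_objects = bd;
    gen->n_large_words += bd->words;
}

/* O(1) query; the walk survives only as a debug cross-check. */
size_t large_words(const generation *gen)
{
    assert(gen->n_large_words == count_occupied(gen->large_objects));
    return gen->n_large_words;
}
```

Every operation that adds or removes a block has to maintain the cached field, which is the usual cost of this kind of optimisation.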
* Allow allocNursery() to allocate single blocks (#7257) [Simon Marlow, 2012-09-21, 2 files, -11/+13]
|   Forcing large allocations here can create serious fragmentation in
|   some cases, and since the large allocations are only a small
|   optimisation we should allow the nursery to hoover up small blocks
|   before allocating large chunks.
* Small parallel GC improvement [Simon Marlow, 2012-09-18, 1 file, -2/+12]
|   Overlap the main thread's clearNursery() with the other threads.
* More OS X build fixes [Ian Lynagh, 2012-09-14, 1 file, -8/+8]
|
* Lots of nat -> StgWord changes [Simon Marlow, 2012-09-07, 7 files, -49/+49]
|
* Deprecate lnat, and use StgWord instead [Simon Marlow, 2012-09-07, 12 files, -83/+83]
|   lnat was originally "long unsigned int" but we were using it when we
|   wanted a 64-bit type on a 64-bit machine. This broke on Windows x64,
|   where long == int == 32 bits. Using types of unspecified size is bad,
|   but what we really wanted was a type with N bits on an N-bit machine.
|   StgWord is exactly that.
|
|   lnat was mentioned in some APIs that clients might be using
|   (e.g. StackOverflowHook()), so we leave it defined but with a comment
|   to say that it's deprecated.
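The size mismatch can be made concrete with a small sketch. `word_t` below is a hypothetical stand-in built on `uintptr_t`; the real StgWord definition lives in the GHC headers and is not reproduced here. The sketch assumes a platform where `uintptr_t` has the same size as a pointer, which holds on mainstream targets.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical word-sized unsigned type (a stand-in, not StgWord itself). */
typedef uintptr_t word_t;

/* Bits in word_t: N on an N-bit machine, under the assumption above. */
unsigned word_bits(void)
{
    assert(sizeof(word_t) == sizeof(void *));
    return (unsigned)(sizeof(word_t) * CHAR_BIT);
}

/* Bits in unsigned long: 32 on Windows x64 (LLP64), 64 on typical
 * 64-bit Unix systems (LP64). */
unsigned ulong_bits(void)
{
    return (unsigned)(sizeof(unsigned long) * CHAR_BIT);
}
```

On an LLP64 platform `word_bits()` and `ulong_bits()` differ, which is the situation the commit describes for Windows x64.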
* Some further tweaks to reduce fragmentation when allocating the nursery [Simon Marlow, 2012-09-07, 3 files, -19/+37]
|
* some nats should be lnats [Simon Marlow, 2012-09-07, 1 file, -1/+1]
|
* When using -H with -M<size>, don't exceed the maximum heap size [Simon Marlow, 2012-09-07, 1 file, -1/+5]
|
* memInventory(): tweak pretty-printing [Simon Marlow, 2012-09-07, 1 file, -8/+8]
|
* More CPP macros -> inline functions [Ian Lynagh, 2012-08-25, 2 files, -6/+6]
|   All the wibbles seem to have cancelled out, and (non-debug) object
|   sizes are back to where they started.
|
|   I'm not 100% sure that the types are optimal, but at least now the
|   functions have types and we can fix them if necessary.
* Make a function for get_itbl, rather than using a CPP macro [Ian Lynagh, 2012-08-25, 2 files, -6/+6]
|   This has several advantages:
|
|    * It can be called from gdb
|    * There is more type information for the user, and type checking
|      for the compiler
|    * Less opportunity for things to go wrong, e.g. due to missing
|      parentheses or repeated execution
|
|   The sizes of the non-debug .o files haven't changed (other than
|   Inlines.o), so I'm pretty sure the compiled code is identical.
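The "repeated execution" hazard in the list above can be shown with a small sketch. The types and names here (`closure`, `info_table`, `INFO_TYPE_MACRO`, `info_type`, `fetch_closure`) are hypothetical and deliberately simpler than the real get_itbl; the point is only that a macro may evaluate its argument more than once, while an inline function evaluates it exactly once and has a checked type.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int type; } info_table;            /* hypothetical */
typedef struct { const info_table *info; } closure; /* hypothetical */

/* Macro version: the argument expression appears twice in the expansion,
 * so any side effect in it happens twice. */
#define INFO_TYPE_MACRO(c) ((c)->info != NULL ? (c)->info->type : -1)

/* Function version: the argument is evaluated once, and the compiler
 * checks that it really is a closure pointer. */
static inline int info_type(const closure *c)
{
    assert(c != NULL);
    return c->info != NULL ? c->info->type : -1;
}

/* A side-effecting argument, used to make the difference observable. */
static int fetches = 0;
static closure the_closure;

closure *fetch_closure(void)
{
    fetches++;
    return &the_closure;
}
```

Calling `INFO_TYPE_MACRO(fetch_closure())` bumps `fetches` twice when the info pointer is non-NULL, whereas `info_type(fetch_closure())` bumps it once.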
* tidy up [Simon Marlow, 2012-08-21, 1 file, -5/+4]
|
* Reduce fragmentation when using +RTS -H (with or without a size) [Simon Marlow, 2012-08-21, 3 files, -2/+45]
|
* improve debug output [Simon Marlow, 2012-08-21, 1 file, -1/+1]
|
* Fix a discrepancy between two calculations of which generation to collect [Simon Marlow, 2012-08-21, 3 files, -59/+39]
|   The calculation should be done in one place, of course.
* Retain ordering of finalizers during GC (#7160) [Simon Marlow, 2012-08-21, 1 file, -5/+14]
|   This came up since the addition of C finalizers, since Haskell
|   finalizers are already stored in an explicit list. C finalizers on
|   the other hand get a WEAK object each, so in order to run them in the
|   right order we have to make sure that list stays in the correct
|   order. I hate adding new invariants, but this is the quickest way to
|   fix the bug for now.
|
|   A better way to fix it would be to have a single WEAK object with a
|   list of finalizers attached to it, and a primop for adding finalizers
|   to the list.
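One common way to keep a linked list in order while filtering it is to append survivors through a tail pointer instead of pushing them on the front (which would reverse them). The sketch below illustrates that technique with hypothetical types; it is an assumption about the general approach, not the code in this commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified weak object. */
typedef struct weak_ {
    struct weak_ *link;
    int id;
    int live;      /* non-zero if the object survived this GC */
} weak;

/* Move the live entries of a list onto a new list, preserving their
 * relative order by always appending at the tail. */
weak *filter_live_in_order(weak *list)
{
    weak *head = NULL;
    weak **tail = &head;
    while (list != NULL) {
        weak *next = list->link;
        if (list->live) {
            list->link = NULL;
            *tail = list;          /* append, do not prepend */
            tail = &list->link;
        }
        list = next;
    }
    return head;
}
```

With a front-push the survivors would come out reversed on every traversal, so finalizers would run in a different order after each GC.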
* Parallelise clearNurseries() in the parallel GC [Simon Marlow, 2012-07-10, 4 files, -15/+25]
|   The clearNurseries() operation resets the free pointer in each nursery
|   block to the start of the block, emptying the nursery. In the parallel
|   GC this was done on the main GC thread, but that's bad because it
|   accesses the bdescr of every nursery block, and moves all those cache
|   lines onto the CPU of the main GC thread. With large nurseries, this
|   can be especially bad. So instead we want to clear each nursery in its
|   local GC thread.
|
|   Thanks to Andreas Voellmy <andreas.voellmy@gmail.com> for identifying
|   the issue.
|
|   After this change and the previous patch to make the last GC a major
|   one, I see these results for nofib/parallel on 8 cores:
|
|       blackscholes      +0.0%    +0.0%    -3.7%    -3.3%    +0.3%
|       coins             +0.0%    +0.0%    -5.1%    -5.0%    +0.4%
|       gray              +0.0%    +0.0%    -4.5%    -2.1%    +0.8%
|       mandel            +0.0%    -0.0%    -7.6%    -5.1%    -2.3%
|       matmult           +0.0%    +5.5%    -2.8%    -1.9%    -5.8%
|       minimax           +0.0%    +0.0%   -10.6%   -10.5%    +0.0%
|       nbody             +0.0%    -4.4%    +0.0%     0.07    +0.0%
|       parfib            +0.0%    +1.0%    +0.5%    +0.9%    +0.0%
|       partree           +0.0%    +0.0%    -2.4%    -2.5%    +1.7%
|       prsa              +0.0%    -0.2%    +1.8%    +4.2%    +0.0%
|       queens            +0.0%    -0.0%    -1.8%    -1.4%    -4.8%
|       ray               +0.0%    -0.6%   -18.5%   -17.8%    +0.0%
|       sumeuler          +0.0%    -0.0%    -3.7%    -3.7%    +0.0%
|       transclos         +0.0%    -0.0%   -25.7%   -26.6%    +0.0%
|       ------------------------------------------------------------
|       Min               +0.0%    -4.4%   -25.7%   -26.6%    -5.8%
|       Max               +0.0%    +5.5%    +1.8%    +4.2%    +1.7%
|       Geometric Mean    +0.0%    +0.1%    -6.3%    -6.1%    -0.7%
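A minimal sketch of the operation described above, with hypothetical simplified structures (`bdescr_sketch`, `nursery_sketch`): clearing a nursery resets each block's free pointer to the block start, and in the parallel variant each GC thread clears only the nursery at its own index. Returning the number of bytes that had been allocated is an extra added for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified block descriptor. */
typedef struct bdescr_ {
    struct bdescr_ *link;
    char *start;   /* first byte of the block */
    char *free;    /* next free byte; equals start when the block is empty */
} bdescr_sketch;

/* Hypothetical nursery: just a list of blocks. */
typedef struct {
    bdescr_sketch *blocks;
} nursery_sketch;

/* Reset every block of one nursery and report how many bytes had been
 * allocated in it. */
size_t clear_nursery(nursery_sketch *n)
{
    size_t allocated = 0;
    for (bdescr_sketch *bd = n->blocks; bd != NULL; bd = bd->link) {
        assert(bd->free >= bd->start);
        allocated += (size_t)(bd->free - bd->start);
        bd->free = bd->start;
    }
    return allocated;
}

/* Parallel variant: each GC thread touches only its own nursery, so the
 * block descriptors stay in that thread's cache. */
size_t gc_thread_clear_own(nursery_sketch *nurseries, unsigned my_index)
{
    return clear_nursery(&nurseries[my_index]);
}
```

The serial variant would be a loop over all nurseries on one thread, which is exactly the access pattern the commit message identifies as the problem.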
* Working towards fixing DLLs on Win64 [Ian Lynagh, 2012-05-06, 2 files, -2/+2]
|
* Fix maintenance of n_blocks in the RTS [Ian Lynagh, 2012-05-01, 1 file, -1/+1]
|   It was causing assertion failures of
|     ASSERT(countBlocks(nursery->blocks) == nursery->n_blocks)
|   at
|     ghc-stage2: internal error: ASSERTION FAILED: file rts/sm/Sanity.c, line 878
* Fix warnings on Win64 [Ian Lynagh, 2012-04-26, 2 files, -11/+11]
|   Mostly this meant getting pointer<->int conversions to use the right
|   sizes. lnat is now size_t, rather than unsigned long, as that seems a
|   better match for how it's used.
* Fix the timestamps in GC_START and GC_END events on the GC-initiating cap [Mikolaj, 2012-04-04, 1 file, -1/+1]
|   There was a discrepancy between GC times reported in +RTS -s and the
|   timestamps of GC_START and GC_END events on the cap, on which +RTS -s
|   stats for the given GC are based. This is fixed by posting the events
|   with exactly the same timestamp as generated for the stat calculation.
|   The calls posting the events are moved too, so that the events are
|   emitted close to the time instant they claim to be emitted at. The
|   GC_STATS_GHC was moved, too, ensuring it's emitted before the moved
|   GC_END on all caps, which simplifies tools code.
* Emit final heap alloc events and rearrange code to calculate alloc totals [Duncan Coutts, 2012-04-04, 3 files, -24/+28]
|   In stat_exit we want to emit a final EVENT_HEAP_ALLOCATED for each cap
|   so that we get the same total allocation count as reported via
|   +RTS -s. To do so we need to update the per-cap total_allocated
|   counts.
|
|   Previously we had a single calcAllocated(rtsBool) function that
|   counted the large allocations and optionally the nurseries for all
|   caps. The GC would always call it with false, and the stat_exit always
|   with true. The reason for these two modes is that the GC counts the
|   nurseries via clearNurseries() (which also updates the per-cap
|   total_allocated counts), so it's only the stat_exit() path that needs
|   to count them.
|
|   We now split the calcAllocated() function into two:
|   countLargeAllocated and updateNurseriesStats. As the name suggests,
|   the latter now updates the per-cap total_allocated counts, in addition
|   to returning a total.
* Add new eventlog events for various heap and GC statistics [Duncan Coutts, 2012-04-04, 2 files, -3/+10]
|   They cover much the same info as is available via the GHC.Stats module
|   or via the '+RTS -s' textual output, but via the eventlog and with a
|   better sampling frequency.
|
|   We have three new generic heap info events and two very GHC-specific
|   ones. (The hope is the general ones are usable by other
|   implementations that use the same eventlog system, or indeed not so
|   sensitive to changes in GHC itself.)
|
|   The general ones are:
|
|    * total heap mem allocated since prog start, on a per-HEC basis
|    * current size of the heap (MBlocks reserved from OS for the heap)
|    * current size of live data in the heap
|
|   Currently these are all emitted by GHC at GC time (live data only at
|   major GC).
|
|   The GHC specific ones are:
|
|    * an event giving various static heap parameters:
|      * number of generations (usually 2)
|      * max size if any
|      * nursery size
|      * MBlock and block sizes
|    * an event emitted on each GC containing:
|      * GC generation (usually just 0,1)
|      * total bytes copied
|      * bytes lost to heap slop and fragmentation
|      * the number of threads in the parallel GC (1 for serial)
|      * the maximum number of bytes copied by any par GC thread
|      * the total number of bytes copied by all par GC threads
|        (these last three can be used to calculate an estimate of the
|        work balance in parallel GCs)
* Change the presentation of parallel GC work balance in +RTS -s [Duncan Coutts, 2012-04-04, 1 file, -9/+8]
|   Also rename internal variables to make the names match what they hold.
|   The parallel GC work balance is calculated using the total amount of
|   memory copied by all GC threads, and the maximum copied by any
|   individual thread. You have serial GC when the max is the same as
|   copied, and perfectly balanced GC when total/max == n_caps.
|
|   Previously we presented this as the ratio total/max and told users
|   that the serial value was 1 and the ideal value N, for N caps, e.g.
|
|     Parallel GC work balance: 1.05 (4045071 / 3846774, ideal 2)
|
|   The downside of this is that the user always has to keep in mind the
|   number of cores being used. Our new presentation uses a normalised
|   scale 0--1 as a percentage. The 0% means completely serial and 100%
|   is perfect balance, e.g.
|
|     Parallel GC work balance: 4.56% (serial 0%, perfect 100%)
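The commit message fixes the two endpoints of the new scale (ratio 1 maps to 0%, ratio n_caps maps to 100%) but does not spell out the formula. The sketch below assumes the simplest mapping consistent with those endpoints, a linear one: 100 * (total/max - 1) / (n_caps - 1). The function name and the handling of a single cap are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map the old ratio total/max onto a 0..100 scale,
 * assuming a linear normalisation between the serial case (ratio 1) and
 * the perfectly balanced case (ratio n_caps). */
double work_balance_percent(uint64_t total_copied, uint64_t max_copied,
                            unsigned n_caps)
{
    assert(max_copied > 0 && max_copied <= total_copied);
    if (n_caps < 2) {
        return 0.0;   /* assumption: one cap is serial by definition */
    }
    double ratio = (double)total_copied / (double)max_copied;
    return 100.0 * (ratio - 1.0) / (double)(n_caps - 1);
}
```

Under this assumed formula, the old-style example above (4045071 / 3846774 on 2 caps) comes out at roughly 5.15%; the 4.56% figure quoted for the new format is a separate example.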
* Calculate the total memory allocated on a per-capability basis [Duncan Coutts, 2012-04-04, 1 file, -1/+3]
|   In addition to the existing global method. For now we just do it both
|   ways and assert they give the same grand total. At some stage we can
|   simplify the global method to just take the sum of the per-cap
|   counters.
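The "do it both ways and assert" approach can be sketched like this. The names (`cap_counter`, `record_allocation`, `checked_grand_total`) are hypothetical; only the field name `total_allocated` appears elsewhere in this log.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-capability counter. */
typedef struct {
    uint64_t total_allocated;
} cap_counter;

/* Account an allocation both per-cap and in the global total. */
void record_allocation(cap_counter *cap, uint64_t *global_total,
                       uint64_t words)
{
    cap->total_allocated += words;
    *global_total += words;
}

/* Sum the per-cap counters and check them against the global method. */
uint64_t checked_grand_total(const cap_counter *caps, unsigned n_caps,
                             uint64_t global_total)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < n_caps; i++) {
        sum += caps[i].total_allocated;
    }
    assert(sum == global_total);
    return sum;
}
```

Once the assertion has held for a while, the global counter becomes redundant and can be replaced by the sum, as the commit message anticipates.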
* Drop the per-task timing stats, give a summary only (#5897) [Simon Marlow, 2012-03-02, 1 file, -1/+1]
|   We were keeping around the Task struct (216 bytes) for every worker we
|   ever created, even though we only keep a maximum of 6 workers per
|   Capability. These Task structs accumulate and cause a space leak in
|   programs that do lots of safe FFI calls; this patch frees the Task
|   struct as soon as a worker exits.
|
|   One reason we were keeping the Task structs around is because we print
|   out per-Task timing stats in +RTS -s, but that isn't terribly useful.
|   What is sometimes useful is knowing how *many* Tasks there were. So
|   now I'm printing a single-line summary. For the program in the ticket
|   it reads:
|
|     TASKS: 2001 (1 bound, 31 peak workers (2000 total), using -N1)
|
|   So although we created 2k tasks overall, there were only 31 workers
|   active at any one time (which is exactly what we expect: the program
|   makes 30 safe FFI calls concurrently).
|
|   This also gives an indication of how many capabilities were being
|   used, which is handy if you use +RTS -N without an explicit number.
* formatting tweaks [Gabor Greif, 2012-02-27, 1 file, -8/+8]
|
* tabs -> spaces [Gabor Greif, 2012-02-27, 1 file, -53/+53]
|
* Allocate pinned object blocks from the nursery, not the global allocator. [Simon Marlow, 2012-02-13, 3 files, -13/+91]
|   Prompted by a benchmark posted to parallel-haskell@haskell.org by
|   Andreas Voellmy <andreas.voellmy@gmail.com>. This program exhibits
|   contention for the block allocator when run with -N2 and greater
|   without the fix:
|
|   {-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-}
|   module Main where
|
|   import Control.Monad
|   import Control.Concurrent
|   import System.Environment
|   import GHC.IO
|   import GHC.Exts
|   import GHC.Conc
|
|   main = do
|    [m] <- fmap (fmap read) getArgs
|    n <- getNumCapabilities
|    ms <- replicateM n newEmptyMVar
|    sequence [ forkIO $ busyWorkerB (m `quot` n) >> putMVar mv () | mv <- ms ]
|    mapM takeMVar ms
|
|   busyWorkerB :: Int -> IO ()
|   busyWorkerB n_loops = go 0
|     where go !n | n >= n_loops = return ()
|                 | otherwise    =
|             do p <- (IO $ \s ->
|                       case newPinnedByteArray# 1024# s of
|                         { (# s', mbarr# #) ->
|                              (# s', () #)
|                         }
|                     )
|                go (n+1)
* Fix for a bug in setNumCapabilities [Simon Marlow, 2011-12-13, 1 file, -3/+8]
|
* Fix for a bug in +RTS -qi (crash in zero_static_object_list) [Simon Marlow, 2011-12-13, 1 file, -1/+3]
|
* Add a comment about oddity with yieldThread() and timing results on Linux [Simon Marlow, 2011-12-13, 1 file, -0/+5]
|
* Avoid integer overflow when calling allocGroup() (#5071) [Simon Marlow, 2011-12-13, 1 file, -2/+5]
|
* New flag +RTS -qi<n>, avoid waking up idle Capabilities to do parallel GC [Simon Marlow, 2011-12-13, 2 files, -8/+21]
|   This is an experimental tweak to the parallel GC that avoids waking up
|   a Capability to do parallel GC if we know that the capability has been
|   idle for a (tunable) number of GC cycles. The idea is that if you're
|   only using a few Capabilities, there's no point waking up the ones
|   that aren't busy.
|
|   e.g. +RTS -qi3
|
|   says "A Capability will participate in parallel GC if it was running
|   at all since the last 3 GC cycles."
|
|   Results are a bit hit and miss, and I don't completely understand why
|   yet. Hence, for now it is turned off by default, and also not
|   documented except in the +RTS -? output.
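The participation rule quoted above can be sketched with a per-capability counter of consecutive idle GC cycles. The names (`cap_idle`, `note_gc`, `participates`) are hypothetical, and treating a threshold of 0 as "feature off, everyone participates" is an assumption made for this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-capability state: how many consecutive GC cycles
 * have passed without this capability running any Haskell code. */
typedef struct {
    unsigned idle_gcs;
} cap_idle;

/* Called once per GC cycle for each capability. */
void note_gc(cap_idle *cap, bool ran_since_last_gc)
{
    if (ran_since_last_gc) {
        cap->idle_gcs = 0;
    } else {
        cap->idle_gcs++;
    }
}

/* With threshold qi, a capability joins the parallel GC if it ran at
 * some point within the last qi cycles. Assumption: qi == 0 disables
 * the check. */
bool participates(const cap_idle *cap, unsigned qi)
{
    return qi == 0 || cap->idle_gcs < qi;
}
```

With a threshold of 3, a capability that has sat out three consecutive cycles is left asleep until it runs again.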
* waitForGcThreads: should be calling interruptCapability(), not interruptAllCapabilities() [Simon Marlow, 2011-12-13, 1 file, -1/+1]
* Allow the number of capabilities to be increased at runtime (#3729) [Simon Marlow, 2011-12-06, 4 files, -50/+76]
|   At present the number of capabilities can only be *increased*, not
|   decreased. The latter presents a few more challenges!
* Make forkProcess work with +RTS -N [Simon Marlow, 2011-12-06, 1 file, -1/+1]
|   Consider this experimental for the time being. There are a lot of
|   things that could go wrong, but I've verified that at least it works
|   on the test cases we have.
|
|   I also did some API cleanups while I was here. Previously we had:
|
|     Capability * rts_eval (Capability *cap,
|                            HaskellObj p,
|                            /*out*/HaskellObj *ret);
|
|   but this API is particularly error-prone: if you forget to discard the
|   Capability * you passed in and use the return value instead, then
|   you're in for subtle bugs with +RTS -N later on. So I changed all
|   these functions to this form:
|
|     void rts_eval (/* inout */ Capability **cap,
|                    /* in    */ HaskellObj p,
|                    /* out   */ HaskellObj *ret)
|
|   It's much harder to use this version incorrectly, because you have to
|   pass the Capability in by reference.
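The difference between the two API shapes can be illustrated with stand-in types. `Cap`, `eval_old` and `eval_new` below are hypothetical and do no real evaluation; they only model a call after which the caller must continue with a different capability.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Capability. */
typedef struct { int no; } Cap;

static Cap caps[2] = { { 0 }, { 1 } };

/* Old shape: the capability to use afterwards comes back as the result.
 * Nothing stops the caller from ignoring it and reusing the stale one. */
Cap *eval_old(Cap *cap, int input, int *ret)
{
    assert(cap != NULL && ret != NULL);
    *ret = input * 2;
    return &caps[(cap->no + 1) % 2];   /* pretend we migrated */
}

/* New shape: the caller's own variable is updated in place, so there is
 * no stale pointer left to misuse. */
void eval_new(Cap **cap, int input, int *ret)
{
    assert(cap != NULL && *cap != NULL && ret != NULL);
    *ret = input * 2;
    *cap = &caps[((*cap)->no + 1) % 2];
}
```

With `eval_old`, writing `eval_old(c, 21, &r);` and carrying on with `c` compiles without complaint; with `eval_new(&c, 21, &r);` the variable `c` already holds the right capability when the call returns.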
* Fix a scheduling bug in the threaded RTS [Simon Marlow, 2011-12-01, 1 file, -1/+1]
|   The parallel GC was using setContextSwitches() to stop all the other
|   threads, which sets the context_switch flag on every Capability. That
|   had the side effect of causing every Capability to also switch
|   threads, and since GCs can be much more frequent than context
|   switches, this increased the context switch frequency. When context
|   switches are expensive (because the switch is between two bound
|   threads or a bound and unbound thread), the difference is quite
|   noticeable.
|
|   The fix is to have a separate flag to indicate that a Capability
|   should stop and return to the scheduler, but not switch threads. I've
|   called this the "interrupt" flag.
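A minimal sketch of the two-flag idea, using a hypothetical `cap_sketch` structure: a GC request sets only `interrupt`, so the capability returns to the scheduler without being told to pick a different thread. The function names and the `switches` counter are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified capability state. */
typedef struct {
    bool context_switch;  /* running thread should yield to another thread */
    bool interrupt;       /* return to the scheduler, but keep the thread */
    unsigned switches;    /* thread switches performed, for illustration */
} cap_sketch;

/* What a GC request does under the new scheme: interrupt every cap
 * without touching context_switch. */
void request_gc_stop(cap_sketch *caps, unsigned n_caps)
{
    for (unsigned i = 0; i < n_caps; i++) {
        caps[i].interrupt = true;
    }
}

/* Called when the running thread has returned to the scheduler.
 * Returns true if the scheduler should switch to another thread. */
bool scheduler_should_switch(cap_sketch *cap)
{
    bool do_switch = cap->context_switch;
    cap->context_switch = false;
    cap->interrupt = false;
    if (do_switch) {
        cap->switches++;
    }
    return do_switch;
}
```

Before the fix, the GC request would have set `context_switch` instead, so every GC counted as a thread switch on every capability.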
* Make profiling work with multiple capabilities (+RTS -N) [Simon Marlow, 2011-11-29, 2 files, -10/+12]
|   This means that both time and heap profiling work for parallel
|   programs. Main internal changes:
|
|    - CCCS is no longer a global variable; it is now another
|      pseudo-register in the StgRegTable struct. Thus every Capability
|      has its own CCCS.
|
|    - There is a new built-in CCS called "IDLE", which records ticks for
|      Capabilities in the idle state. If you profile a single-threaded
|      program with +RTS -N2, you'll see about 50% of time in "IDLE".
|
|    - There is appropriate locking in rts/Profiling.c to protect the
|      shared cost-centre-stack data structures.
|
|   This patch does enough to get it working, I have cut one big corner:
|   the cost-centre-stack data structure is still shared amongst all
|   Capabilities, which means that multiple Capabilities will race when
|   updating the "allocations" and "entries" fields of a CCS. Not only
|   does this give unpredictable results, but it runs very slowly due to
|   cache line bouncing.
|
|   It is strongly recommended that you use -fno-prof-count-entries to
|   disable the "entries" count when profiling parallel programs. (I shall
|   add a note to this effect to the docs).