delta/haskell.git - gitlab.haskell.org: ghc/ghc.git

	Commit message	Author	Age	Files	Lines
*	rts: gc: use mutex+condvar instead of spinlooks in gc entry/exit	Douglas Wilson	2021-01-17	1	-3/+1
	Use a timed wait on the condition variable in waitForGcThreads.
	Fix dodgy timespec calculation.
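The kind of timespec arithmetic a timed condition-variable wait needs can be sketched as follows. `pthread_cond_timedwait` takes an absolute deadline, so the caller has to add the timeout to the current time and keep `tv_nsec` below one second. The helper name `deadline_after` is hypothetical and this is not the code from the patch, only a minimal illustration of the normalisation step that is easy to get wrong.

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical helper (not from the GHC sources): return the absolute
 * deadline that lies `timeout_ns` nanoseconds after `now`.
 * tv_nsec must stay in [0, 1e9), so whole seconds are carried into tv_sec. */
struct timespec deadline_after(struct timespec now, long long timeout_ns)
{
    assert(timeout_ns >= 0);

    struct timespec ts = now;
    ts.tv_sec  += (time_t)(timeout_ns / NSEC_PER_SEC);
    ts.tv_nsec += (long)(timeout_ns % NSEC_PER_SEC);

    /* Both addends were below one second, so at most one carry is needed. */
    if (ts.tv_nsec >= NSEC_PER_SEC) {
        ts.tv_sec  += 1;
        ts.tv_nsec -= NSEC_PER_SEC;
    }
    return ts;
}
```

The result is what would be passed as the `abstime` argument of `pthread_cond_timedwait`, with `now` read from the clock the condition variable uses.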
*	rts: add max_n_todo_overflow internal counter	Douglas Wilson	2021-01-17	1	-0/+1
	I've never observed this counter taking a non-zero value, however I do think its existence is justified by the comment in grab_local_todo_block.
	I've not added it to RTSStats in GHC.Stats, as it doesn't seem worth the API churn.
*	rts: remove no_work counter	Douglas Wilson	2021-01-17	1	-1/+0
	We are no longer busy-ish waiting, so this counter is no longer meaningful.
*	rts: Use SEQ_CST accesses when touching `wakeup`	Ben Gamari	2021-01-09	1	-1/+1
	These are the two remaining non-atomic accesses to `wakeup` which were missed by the original TSAN patch.
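As an illustration of what sequentially consistent accesses to such a flag look like, here is a minimal sketch using C11 `<stdatomic.h>`. The struct, the state names and the accessor functions are hypothetical; the log only establishes that `wakeup` is a word-sized field that takes four values, and the RTS may well use its own macros rather than these calls.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical state names, chosen for illustration only. */
enum { WAKEUP_INACTIVE, WAKEUP_STANDING_BY, WAKEUP_RUNNING, WAKEUP_WAITING };

/* Hypothetical stand-in for the per-thread GC structure. */
struct gc_worker {
    _Atomic uintptr_t wakeup;
};

/* Every write goes through a seq_cst store ... */
void set_wakeup(struct gc_worker *w, uintptr_t state)
{
    atomic_store_explicit(&w->wakeup, state, memory_order_seq_cst);
}

/* ... and every read through a seq_cst load, so no plain access remains
 * for a race detector such as TSAN to flag. */
uintptr_t get_wakeup(struct gc_worker *w)
{
    return atomic_load_explicit(&w->wakeup, memory_order_seq_cst);
}
```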
*	nonmoving-gc: Track time usage of nonmoving marking	Ben Gamari	2020-03-05	1	-3/+5
*	rts: Non-concurrent mark and sweep	Ömer Sinan Ağacan	2019-10-20	1	-3/+1
	This implements the core heap structure and a serial mark/sweep collector which can be used to manage the oldest-generation heap. This is the first step towards a concurrent mark-and-sweep collector aimed at low-latency applications.

	The full design of the collector implemented here is described in detail in a technical note:

	    B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell Compiler" (2018)

	The basic heap structure used in this design is heavily inspired by:

	    K. Ueno & A. Ohori. "A fully concurrent garbage collector for functional programs on multicore processors." /ACM SIGPLAN Notices/ Vol. 51. No. 9 (presented at ICFP 2016)

	This design is intended to allow both marking and sweeping concurrent to execution of a multi-core mutator. Unlike the Ueno design, which requires no global synchronization pauses, the collector introduced here requires a stop-the-world pause at the beginning and end of the mark phase.

	To avoid heap fragmentation, the allocator consists of a number of fixed-size /sub-allocators/. Each of these sub-allocators allocates into its own set of /segments/, themselves allocated from the block allocator. Each segment is broken into a set of fixed-size allocation blocks (which back allocations) in addition to a bitmap (used to track the liveness of blocks) and some additional metadata (also used to track liveness).

	This heap structure enables collection via mark-and-sweep, which can be performed concurrently via a snapshot-at-the-beginning scheme (although concurrent collection is not implemented in this patch).

	The mark queue is a fairly straightforward chunked-array structure. The representation is a bit more verbose than a typical mark queue to accommodate a combination of two features:

	 * a mark FIFO, which improves the locality of marking, reducing one of the major overheads seen in mark/sweep allocators (see [1] for details)

	 * the selector optimization and indirection shortcutting, which requires that we track where we found each reference to an object in case we need to update the reference at a later point (e.g. when we find that it is an indirection). See Note [Origin references in the nonmoving collector] (in `NonMovingMark.h`) for details.

	Beyond this the mark/sweep is fairly run-of-the-mill.

	[1] R. Garner, S.M. Blackburn, D. Frampton. "Effective Prefetch for Mark-Sweep Garbage Collection." ISMM 2007.

	Co-Authored-By: Ben Gamari <ben@well-typed.com>
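The segment layout described in this message (fixed-size blocks plus a liveness bitmap) can be sketched roughly as below. All names and sizes are hypothetical, the bitmap uses one byte per block for simplicity, and the sweep only counts reclaimable blocks; the real collector tracks additional metadata that this sketch leaves out.

```c
#include <stdint.h>
#include <string.h>

#define BLOCKS_PER_SEGMENT 32   /* illustrative value, not GHC's */
#define BLOCK_SIZE         64   /* illustrative value, not GHC's */

/* Hypothetical segment: a liveness bitmap followed by fixed-size blocks. */
struct segment {
    uint8_t bitmap[BLOCKS_PER_SEGMENT];              /* nonzero = marked live */
    uint8_t blocks[BLOCKS_PER_SEGMENT][BLOCK_SIZE];  /* backing storage */
};

void segment_init(struct segment *seg)
{
    memset(seg->bitmap, 0, sizeof seg->bitmap);
}

/* Mark phase: record that block `i` was reached from a root. */
void mark_block(struct segment *seg, unsigned i)
{
    seg->bitmap[i] = 1;
}

/* Sweep phase: every unmarked block is free. Marks are cleared so the
 * next collection starts from a clean bitmap. Returns the free count. */
unsigned sweep_segment(struct segment *seg)
{
    unsigned free_blocks = 0;
    for (unsigned i = 0; i < BLOCKS_PER_SEGMENT; i++) {
        if (seg->bitmap[i]) {
            seg->bitmap[i] = 0;
        } else {
            free_blocks++;
        }
    }
    return free_blocks;
}
```

Because every block in a segment has the same size, a freed block can be reused for any later allocation of that size class, which is how this layout avoids fragmentation without moving objects.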
*	Simplify Configure in a few ways	John Ericson	2019-10-12	1	-1/+1
	- No need to distinguish between gcc-llvm and clang. First of all, gcc-llvm is quite old and surely unmaintained by now. Second of all, none of the code actually cares about that distinction!

	  Now, it does make sense to consider multiple C frontends for LLVM in the form of clang vs clang-cl (same clang, yes, but tweaked interface). But this is better handled in terms of "gccish vs msvcish" and "is LLVM", yielding 4 combinations. Therefore, I don't think it is useful saving the existing code for that.

	- Get the remaining CC_LLVM_BACKEND, and also TABLES_NEXT_TO_CODE in mk/config.h the normal way, rather than hacking it post-hoc. No point keeping these special cases around for no reason.

	- Get rid of hand-rolled `die` function and just use `AC_MSG_ERROR`.

	- Abstract check + flag override for unregisterised and tables next to code.

	Oh, and as part of the above I also renamed/combined some variables where it felt appropriate.

	- GccIsClang -> CcLlvmBackend. This is for `AC_SUBST`, like the other camel case ones. It was never about gcc-llvm, or Apple's renamed clang, to be clear.

	- llvm_CC_FLAVOR -> CC_LLVM_BACKEND. This is for `AC_DEFINE`, like the other all-caps snake case ones. llvm_CC_FLAVOR was just silly indirection and an odd name to boot.
*	Update Wiki URLs to point to GitLab	Takenobu Tani	2019-03-25	1	-1/+1
	This moves all URL references to Trac Wiki to their corresponding GitLab counterparts.

	This substitution is classified as follows:

	1. Automated substitution using sed with Ben's mapping rule [1]
	    Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...
	    New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...

	2. Manual substitution for URLs containing `#` index
	    Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz
	    New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz

	3. Manual substitution for strings starting with `Commentary`
	    Old: Commentary/XxxYyy...
	    New: commentary/xxx-yyy...

	See also !539

	[1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json
*	Detect overly long GC sync	Simon Marlow	2017-11-16	1	-0/+6
	Summary:
	GC sync is the time between a GC being initiated and all the mutator threads finally stopping so that the GC can start. Problems that cause the GC sync to be delayed are hard to find and can cause dramatic slowdowns for heavily parallel programs.

	The new flag --long-gc-sync=<time> helps by emitting a warning and calling a user-overridable hook when the GC sync time exceeds the specified threshold. A debugger can be used to set a breakpoint when this happens and inspect the stacks of threads to find the culprit.

	Test Plan:
	```
	$ ./inplace/bin/ghc-stage2 +RTS --long-gc-sync=0.0000001 -S
	    Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
	    bytes     bytes     bytes   user   elap     user     elap
	  1135856     51144    153736  0.000  0.000    0.002    0.002    0    0  (Gen:  0)
	  1034760     94704    188752  0.000  0.000    0.002    0.002    0    0  (Gen:  0)
	  1038888    134832    228888  0.009  0.009    0.011    0.011    0    0  (Gen:  1)
	  1025288     90128    235184  0.000  0.000    0.012    0.012    0    0  (Gen:  0)
	  1049088    130080    333984  0.000  0.000    0.013    0.013    0    0  (Gen:  0)
	Warning: waited 0us for GC sync
	  1034424     73360    331976  0.000  0.000    0.013    0.013    0    0  (Gen:  0)
	```

	Also tested on a real production problem.

	Reviewers: niteria, bgamari, erikd
	Subscribers: rwbarton, thomie
	Differential Revision: https://phabricator.haskell.org/D4193
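The mechanism described here (compare the sync time against a threshold, then call an overridable hook) might look roughly like the sketch below. Every name in it is hypothetical, including the hook signature, and treating a zero threshold as "disabled" is an assumption made for the example rather than documented behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook type: receives the capability number and the time waited. */
typedef void (*long_sync_hook)(uint32_t cap_no, uint64_t waited_ns);

/* Example hook that just records the last reported wait. */
static uint64_t last_reported_ns = 0;

void example_hook(uint32_t cap_no, uint64_t waited_ns)
{
    (void)cap_no;
    last_reported_ns = waited_ns;
}

/* Returns 1 if the sync took longer than the threshold (and the hook, if
 * any, was called), 0 otherwise. A threshold of 0 is assumed to mean that
 * the check is switched off. */
int check_gc_sync(uint64_t start_ns, uint64_t now_ns, uint64_t threshold_ns,
                  uint32_t cap_no, long_sync_hook hook)
{
    uint64_t waited = now_ns - start_ns;
    if (threshold_ns == 0 || waited <= threshold_ns) {
        return 0;
    }
    if (hook != NULL) {
        hook(cap_no, waited);
    }
    return 1;
}
```

A function like `example_hook` is also a natural place to put the debugger breakpoint the message mentions.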
*	Prefer #if defined to #ifdef	Ben Gamari	2017-04-28	1	-1/+1
	Our new CPP linter enforces this.
*	cpp: Use #pragma once instead of #ifndef guards	Ben Gamari	2017-04-23	1	-4/+1
	This both says what we mean and silences a bunch of spurious CPP linting warnings. This pragma is supported by all CPP implementations which we support.

	Reviewers: austin, erikd, simonmar, hvr
	Reviewed By: simonmar
	Subscribers: rwbarton, thomie
	Differential Revision: https://phabricator.haskell.org/D3482
*	Use C99's bool	Ben Gamari	2016-11-29	1	-4/+4
	Test Plan: Validate on lots of platforms

	Reviewers: erikd, simonmar, austin
	Reviewed By: erikd, simonmar
	Subscribers: michalt, thomie
	Differential Revision: https://phabricator.haskell.org/D2699
*	Fix a bug in parallel GC synchronisation	Simon Marlow	2016-10-29	1	-6/+4
	Summary:
	The problem boils down to global variables: in particular gc_threads[], which was being modified by a subsequent GC before the previous GC had finished with it. The fix is to not use global variables.

	This was causing setnumcapabilities001 to fail (again!). It's an old bug though.

	Test Plan:
	Ran setnumcapabilities001 in a loop for a couple of hours. Before this patch it had been failing after a few minutes. Not a very scientific test, but it's the best I have.

	Reviewers: bgamari, austin, fryguybob, niteria, erikd
	Subscribers: thomie
	Differential Revision: https://phabricator.haskell.org/D2654
*	rts: Replace `nat` with `uint32_t`	Erik de Castro Lopo	2016-05-05	1	-4/+4
	The `nat` type was an alias for `unsigned int` with a comment saying it was at least 32 bits. We keep the typedef in case client code is using it but mark it as deprecated.

	Test Plan: Validated on Linux, OS X and Windows

	Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
	Differential Revision: https://phabricator.haskell.org/D2166
*	Cache the size of part_list/scavd_list (#11783)	Simon Marlow	2016-04-12	1	-3/+5
	After a parallel GC, it is possible to have a long list of blocks in ws->part_list, if we did a lot of work stealing but didn't fill up the blocks we stole. These blocks persist until the next load-balanced GC, which might be a long time, and during every GC we were traversing this list to find its size.

	The fix is to maintain the size all the time, so we don't have to compute it.
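The fix described here is the usual "keep a running count next to the list head" pattern. A minimal sketch, with hypothetical type and function names rather than the RTS's own:

```c
#include <stddef.h>

/* Hypothetical block descriptor: only the link field matters here. */
struct block {
    struct block *link;
};

/* List head plus a cached length that is kept in sync on every update. */
struct block_list {
    struct block *head;
    size_t        n_blocks;
};

void list_push(struct block_list *l, struct block *b)
{
    b->link = l->head;
    l->head = b;
    l->n_blocks++;          /* O(1) bookkeeping at insertion time */
}

struct block *list_pop(struct block_list *l)
{
    struct block *b = l->head;
    if (b != NULL) {
        l->head = b->link;
        l->n_blocks--;
    }
    return b;
}

/* The O(n) traversal that the cached count makes unnecessary. */
size_t list_count_slow(const struct block_list *l)
{
    size_t n = 0;
    for (const struct block *b = l->head; b != NULL; b = b->link) {
        n++;
    }
    return n;
}
```

Reading `n_blocks` costs the same regardless of how long the list has grown between load-balanced collections.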
*	rts: Kill PAPI support	Ben Gamari	2015-11-18	1	-8/+4
	This hasn't been used for a very long time and will soon be superseded by perf_events support.

	Test Plan: validate

	Reviewers: austin, simonmar
	Reviewed By: austin, simonmar
	Subscribers: thomie, erikd
	Differential Revision: https://phabricator.haskell.org/D1493
*	Eliminate zero_static_objects_list()	Simon Marlow	2015-07-28	1	-2/+5
	Summary:
	[Revised version of D1076 that was committed and then backed out]

	In a workload with a large amount of code, zero_static_objects_list() takes a significant amount of time, and furthermore it is in the single-threaded part of the GC.

	This patch uses a slightly fiddly scheme for marking objects on the static object lists, using a flag in the low 2 bits that flips between two states to indicate whether an object has been visited during this GC or not. We also have to take into account objects that have not been visited yet, which might appear at any time due to runtime linking.

	Test Plan: validate

	Reviewers: austin, ezyang, rwbarton, bgamari, thomie
	Reviewed By: bgamari, thomie
	Subscribers: thomie
	Differential Revision: https://phabricator.haskell.org/D1106
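The flag-flipping idea can be sketched as follows. The constants, the field name and the functions are all hypothetical; the message only states that a flag in the low 2 bits flips between two states and that objects never seen before (for example, freshly linked code) must also be handled.

```c
#include <stdint.h>

#define FLAG_MASK ((uintptr_t)3)   /* the low 2 bits of the link word */

/* Hypothetical flag values: 0 means "never visited by any GC",
 * 1 and 2 are the two states the collector alternates between. */
enum { FLAG_NEW = 0, FLAG_A = 1, FLAG_B = 2 };

static uintptr_t current_flag = FLAG_A;

/* Hypothetical static object: an aligned link pointer with the flag
 * packed into its low bits. */
struct static_obj {
    uintptr_t link_and_flag;
};

/* At the start of each GC, flip between the two states (1 ^ 3 == 2 and
 * 2 ^ 3 == 1). No pass over the objects is needed to reset them. */
void begin_gc(void)
{
    current_flag ^= FLAG_MASK;
}

/* Returns 1 if the object had not been visited in this GC, and tags it.
 * An object carrying the previous GC's flag, or no flag at all, counts
 * as unvisited. */
int visit(struct static_obj *o)
{
    if ((o->link_and_flag & FLAG_MASK) == current_flag) {
        return 0;
    }
    o->link_and_flag = (o->link_and_flag & ~FLAG_MASK) | current_flag;
    return 1;
}
```

The point of the scheme is that `begin_gc` is constant-time, whereas zeroing the static object list is proportional to the amount of code in the program.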
*	Revert "Eliminate zero_static_objects_list()"	Simon Marlow	2015-07-27	1	-5/+2
	This reverts commit b949c96b4960168a3b399fe14485b24a2167b982.
*	Eliminate zero_static_objects_list()	Simon Marlow	2015-07-22	1	-2/+5
	Summary:
	In a workload with a large amount of code, zero_static_objects_list() takes a significant amount of time, and furthermore it is in the single-threaded part of the GC.

	This patch uses a slightly fiddly scheme for marking objects on the static object lists, using a flag in the low 2 bits that flips between two states to indicate whether an object has been visited during this GC or not. We also have to take into account objects that have not been visited yet, which might appear at any time due to runtime linking.

	Test Plan: validate

	Reviewers: austin, bgamari, ezyang, rwbarton
	Subscribers: thomie
	Differential Revision: https://phabricator.haskell.org/D1076
*	Replace hooks by callbacks in RtsConfig (#8785)	Simon Marlow	2015-04-07	1	-0/+1
	Summary:
	Hooks rely on static linking semantics, and are broken by -Bsymbolic which we need when using dynamic linking.

	Test Plan: Built it

	Reviewers: austin, hvr, tibbe
	Differential Revision: https://phabricator.haskell.org/D8
*	Revert "rts: add Emacs 'Local Variables' to every .c file"	Simon Marlow	2014-09-29	1	-8/+0
	This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
*	rts: add Emacs 'Local Variables' to every .c file	Austin Seipp	2014-07-28	1	-0/+8
	This will hopefully help ensure some basic consistency going forward by overriding buffer variables. In particular, it sets the wrap length, the offset to 4, and turns off tabs.

	Signed-off-by: Austin Seipp <austin@well-typed.com>
*	Avoid unnecessary clock_gettime() syscalls in GC stats.	Brian Brooks	2014-07-10	1	-2/+1
	Summary: Avoid unnecessary clock_gettime() syscalls in GC stats.

	Test Plan: Use strace.

	Reviewers: simonmar, austin
	Reviewed By: simonmar, austin
	Subscribers: simonmar, relrod, carter
	Differential Revision: https://phabricator.haskell.org/D39
*	Tiny comment on the change from StgWord8 to StgWord	Simon Peyton Jones	2013-10-03	1	-1/+1
	c.f. commit 0b0fec536e35769b64b8bc5397c84138fa512155
*	Globally replace "hackage.haskell.org" with "ghc.haskell.org"	Simon Marlow	2013-10-01	1	-1/+1
*	use StgWord not StgWord8 for wakeup	Simon Marlow	2013-10-01	1	-1/+1
	volatile StgWord8 is not guaranteed to be atomic.
*	Ensure gc_thread->wakeup is of type StgWord8.	Austin Seipp	2013-06-21	1	-1/+1
	rtsBool is defined to only have two inhabitants, which are true (1) and false (0). But the wakeup flag is set to 4 possible values, outside the range of rtsBool. This leads Clang to warn about tautological comparisons.

	Signed-off-by: Austin Seipp <aseipp@pobox.com>
*	Simplify the allocation stats accounting	Simon Marlow	2013-02-14	1	-1/+0
	We were doing it in two different ways and asserting that the results were the same. In most cases they were, but I found one case where they weren't: the GC itself allocates some memory for running finalizers, and this memory was accounted for one way but not the other.

	It was simpler to remove the old way of counting allocation than to try to fix it up, so I did that.
*	Hopefully fix breakage on OS X w/ LLVM	Simon Marlow	2013-01-17	1	-0/+4
	Reordering of includes in GC.c broke on OS X because gctKey is declared in Task.h and is needed in the storage manager. This is really the wrong place for it anyway, so I've moved the gctKey pieces to where they should be.
*	Deprecate lnat, and use StgWord instead	Simon Marlow	2012-09-07	1	-9/+9
	lnat was originally "long unsigned int" but we were using it when we wanted a 64-bit type on a 64-bit machine. This broke on Windows x64, where long == int == 32 bits. Using types of unspecified size is bad, but what we really wanted was a type with N bits on an N-bit machine. StgWord is exactly that.

	lnat was mentioned in some APIs that clients might be using (e.g. StackOverflowHook()), so we leave it defined but with a comment to say that it's deprecated.
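To make the size problem concrete, the sketch below defines a word type that is pointer-sized by construction. Using `uintptr_t` here is an assumption made for illustration; how StgWord itself is defined is not shown in this log. The name `word_t` is hypothetical.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical word type: as wide as a pointer on every platform,
 * unlike `unsigned long`, which is 32 bits on 64-bit Windows. */
typedef uintptr_t word_t;

/* Compile-time check that the type really is machine-word sized. */
_Static_assert(sizeof(word_t) == sizeof(void *),
               "word_t must have the same width as a pointer");

/* Number of bits in a machine word: N on an N-bit machine. */
unsigned word_bits(void)
{
    return (unsigned)(sizeof(word_t) * CHAR_BIT);
}
```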
*	Parallelise clearNurseries() in the parallel GC	Simon Marlow	2012-07-10	1	-0/+1
	The clearNurseries() operation resets the free pointer in each nursery block to the start of the block, emptying the nursery. In the parallel GC this was done on the main GC thread, but that's bad because it accesses the bdescr of every nursery block, and moves all those cache lines onto the CPU of the main GC thread. With large nurseries, this can be especially bad. So instead we want to clear each nursery in its local GC thread.

	Thanks to Andreas Voellmy <andreas.voellmy@gmail.com> for identifying the issue.

	After this change and the previous patch to make the last GC a major one, I see these results for nofib/parallel on 8 cores:

	   blackscholes          +0.0%     +0.0%     -3.7%     -3.3%     +0.3%
	          coins          +0.0%     +0.0%     -5.1%     -5.0%     +0.4%
	           gray          +0.0%     +0.0%     -4.5%     -2.1%     +0.8%
	         mandel          +0.0%     -0.0%     -7.6%     -5.1%     -2.3%
	        matmult          +0.0%     +5.5%     -2.8%     -1.9%     -5.8%
	        minimax          +0.0%     +0.0%    -10.6%    -10.5%     +0.0%
	          nbody          +0.0%     -4.4%     +0.0%      0.07     +0.0%
	         parfib          +0.0%     +1.0%     +0.5%     +0.9%     +0.0%
	        partree          +0.0%     +0.0%     -2.4%     -2.5%     +1.7%
	           prsa          +0.0%     -0.2%     +1.8%     +4.2%     +0.0%
	         queens          +0.0%     -0.0%     -1.8%     -1.4%     -4.8%
	            ray          +0.0%     -0.6%    -18.5%    -17.8%     +0.0%
	       sumeuler          +0.0%     -0.0%     -3.7%     -3.7%     +0.0%
	      transclos          +0.0%     -0.0%    -25.7%    -26.6%     +0.0%
	--------------------------------------------------------------------------------
	            Min          +0.0%     -4.4%    -25.7%    -26.6%     -5.8%
	            Max          +0.0%     +5.5%     +1.8%     +4.2%     +1.7%
	 Geometric Mean          +0.0%     +0.1%     -6.3%     -6.1%     -0.7%
*	New flag +RTS -qi<n>, avoid waking up idle Capabilities to do parallel GC	Simon Marlow	2011-12-13	1	-0/+1
	This is an experimental tweak to the parallel GC that avoids waking up a Capability to do parallel GC if we know that the capability has been idle for a (tunable) number of GC cycles. The idea is that if you're only using a few Capabilities, there's no point waking up the ones that aren't busy.

	e.g. +RTS -qi3 says "A Capability will participate in parallel GC if it was running at all since the last 3 GC cycles."

	Results are a bit hit and miss, and I don't completely understand why yet. Hence, for now it is turned off by default, and also not documented except in the +RTS -? output.
*	Time handling overhaul	Simon Marlow	2011-11-25	1	-3/+3
	Terminology cleanup: the type "Ticks" has been renamed "Time", which is an StgWord64 in units of TIME_RESOLUTION (currently nanoseconds). The terminology "tick" is now used consistently to mean the interval between timer signals.

	The ticker now always ticks in realtime (actually CLOCK_MONOTONIC if we have it). Before it used CPU time in the non-threaded RTS and realtime in the threaded RTS, but I've discovered that the CPU timer has terrible resolution (at least on Linux) and isn't much use for profiling. So now we always use realtime. This should also fix

	The default tick interval is now 10ms, except when profiling where we drop it to 1ms. This gives more accurate profiles without affecting runtime too much (<1%).

	Lots of cleanups - the resolution of Time is now in one place only (Rts.h) rather than having calculations that depend on the resolution scattered all over the RTS. I hope I found them all.
*	Refactoring and tidy up	Simon Marlow	2011-04-11	1	-85/+11
	This is a port of some of the changes from my private local-GC branch (which is still in darcs, I haven't converted it to git yet). There are a couple of small functional differences in the GC stats: first, per-thread GC timings should now be more accurate, and secondly we now report average and maximum pause times. e.g. from minimax +RTS -N8 -s:

	                                    Tot time (elapsed)  Avg pause  Max pause
	  Gen  0      2755 colls,  2754 par   13.16s    0.93s     0.0003s    0.0150s
	  Gen  1       769 colls,   769 par    3.71s    0.26s     0.0003s    0.0059s
*	A small GC optimisation	Simon Marlow	2011-02-02	1	-1/+1
	Store the number of the destination generation in the Bdescr struct, so that in evacuate() we don't have to deref gen to get it. This is another improvement ported over from my GC branch.
*	Change some TARGET tests to HOST tests in the RTS	Ian Lynagh	2010-07-13	1	-1/+1
	Which was being used seemed to be random.
*	Fix the symbol visibility pragmas	Simon Marlow	2010-06-17	1	-2/+2
*	GC refactoring, remove "steps"	Simon Marlow	2009-12-03	1	-22/+23
	The GC had a two-level structure, G generations each of T steps. Steps are for aging within a generation, mostly to avoid premature promotion.

	Measurements show that more than 2 steps is almost never worthwhile, and 1 step is usually worse than 2. In theory fractional steps are possible, so the ideal number of steps is somewhere between 1 and 3. GHC's default has always been 2.

	We can implement 2 steps quite straightforwardly by having each block point to the generation to which objects in that block should be promoted, so blocks in the nursery point to generation 0, and blocks in gen 0 point to gen 1, and so on.

	This commit removes the explicit step structures, merging generations with steps, thus simplifying a lot of code. Performance is unaffected. The tunable number of steps is now gone, although it may be replaced in the future by a way to tune the aging in generation 0.
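The "each block points to its promotion target" scheme can be sketched as below. The struct and function names are hypothetical, and having blocks of the oldest generation promote into that same generation is an assumption; the message does not say what happens at the top. The cached `dest_no` field mirrors the later "A small GC optimisation" entry in this log.

```c
/* Hypothetical generation and block descriptor types. */
struct generation {
    unsigned no;
};

struct block_desc {
    struct generation *dest;     /* where live objects in this block go */
    unsigned           dest_no;  /* cached number of `dest`, saves a deref */
};

/* Nursery blocks promote into generation 0. */
void init_nursery_block(struct block_desc *bd, struct generation *gens)
{
    bd->dest    = &gens[0];
    bd->dest_no = 0;
}

/* Blocks of generation `g` promote into generation g + 1.
 * Assumption: the oldest generation promotes into itself. */
void init_gen_block(struct block_desc *bd, struct generation *gens,
                    unsigned n_gens, unsigned g)
{
    unsigned d = (g + 1 < n_gens) ? g + 1 : g;
    bd->dest    = &gens[d];
    bd->dest_no = d;
}
```

With this layout an object survives one collection in the nursery and one in generation 0 before reaching generation 1, which gives the two-step aging behaviour without a separate step structure.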
*	add comment: __thread is not supported by gcc on OS X yet	Simon Marlow	2009-09-10	1	-0/+3
*	Omit visibility pragmas on Windows (fixes warnings/validate failures)	Simon Marlow	2009-09-09	1	-2/+2
*	Declare RTS-private prototypes with __attribute__((visibility("hidden")))	Simon Marlow	2009-08-05	1	-0/+4
	This has no effect with static libraries, but when the RTS is in a shared library it does two things:

	- it prevents the function from being exposed by the shared library

	- internal calls to the function can use the faster non-PLT calls, because the function cannot be overridden at link time.
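A minimal sketch of the pattern, using the GCC/Clang `visibility("hidden")` attribute named in the commit subject. The macro name `RTS_PRIVATE_SKETCH` and both functions are hypothetical; the empty definition on Windows follows the later "Omit visibility pragmas on Windows" entry.

```c
/* Hypothetical macro: hide the symbol from the shared library's export
 * table where the attribute is supported, expand to nothing on Windows. */
#if defined(_WIN32)
#define RTS_PRIVATE_SKETCH
#else
#define RTS_PRIVATE_SKETCH __attribute__((visibility("hidden")))
#endif

/* Private prototype: callable anywhere inside the library, but not
 * exported, so the linker can bind calls to it directly (no PLT). */
RTS_PRIVATE_SKETCH int internal_helper(int x);

/* Public entry point with default visibility. */
int public_entry(int x);

int internal_helper(int x)
{
    return x + 1;
}

int public_entry(int x)
{
    return internal_helper(x) * 2;
}
```

When this is compiled into a shared object, a tool such as `nm -D` should list `public_entry` among the dynamic symbols but not `internal_helper`.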
*	RTS tidyup sweep, first phase	Simon Marlow	2009-08-02	1	-4/+3
	The first phase of this tidyup is focussed on the header files, and in particular making sure we are exposing publicly exactly what we need to, and no more.

	- Rts.h now includes everything that the RTS exposes publicly, rather than a random subset of it.

	- Most of the public header files have moved into subdirectories, and many of them have been renamed. But clients should not need to include any of the other headers directly, just #include the main public headers: Rts.h, HsFFI.h, RtsAPI.h.

	- All the headers needed for via-C compilation have moved into the stg subdirectory, which is self-contained. Most of the headers for the rest of the RTS APIs have moved into the rts subdirectory.

	- I left MachDeps.h where it is, because it is so widely used in Haskell code.

	- I left a deprecated stub for RtsFlags.h in place. The flag structures are now exposed by Rts.h.

	- Various internal APIs are no longer exposed by public header files.

	- Various bits of dead code and declarations have been removed.

	- More gcc warnings are turned on, and the RTS code is more warning-clean.

	- More source files #include "PosixSource.h", and hence only use standard POSIX (1003.1c-1995) interfaces.

	There is a lot more tidying up still to do, this is just the first pass. I also intend to standardise the names for external RTS APIs (e.g. use the rts_ prefix consistently), and declare the internal APIs as hidden for shared libraries.
*	SPARC NCG: Add a comment explaining why we can't used a pinned reg for gct	Ben.Lippmeier@anu.edu.au	2009-04-20	1	-3/+20
	Can't use windowed regs because the window moves during a function call.
	Can't use the global regs because they're reserved for other purposes.
*	Don't use thread local storage on x86/not-Linux	Ian Lynagh	2009-04-04	1	-2/+2
	With the "On x86, use thread-local storage instead of stealing a reg for gct" patch, on Windows and OS X:

	    error: thread-local storage not supported for this target
*	On x86, use thread-local storage instead of stealing a reg for gct	Simon Marlow	2009-04-03	1	-1/+6
	Benchmarks show that using TLS instead of stealing a register is better by a few percent on x86, due to the lack of registers.

	This only affects -threaded; without -threaded we're (now) using static storage for the GC data.
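For readers unfamiliar with the TLS alternative, a minimal sketch follows. It uses C11 `_Thread_local` (GCC also spells this `__thread`); the struct and accessor names are hypothetical, and the register-pinning variant is not shown because it depends on compiler- and architecture-specific syntax.

```c
#include <stddef.h>

/* Hypothetical per-thread GC state. */
struct gc_thread_state {
    unsigned thread_index;
};

/* One pointer per OS thread: each GC thread sees its own value, so the
 * current thread's state needs neither a lock nor a reserved register. */
static _Thread_local struct gc_thread_state *gct_sketch = NULL;

void set_gct(struct gc_thread_state *st)
{
    gct_sketch = st;
}

struct gc_thread_state *get_gct(void)
{
    return gct_sketch;
}
```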
*	in the non-threaded RTS, use a static gc_thread structure	Simon Marlow	2009-04-03	1	-3/+17
*	Use work-stealing for load-balancing in the GC	Simon Marlow	2009-03-13	1	-4/+6
	New flag: "+RTS -qb" disables load-balancing in the parallel GC (though this is subject to change, I think we will probably want to do something more automatic before releasing this).

	To get the "PARGC3" configuration described in the "Runtime support for Multicore Haskell" paper, use "+RTS -qg0 -qb -RTS".

	The main advantage of this is that it allows us to easily disable load-balancing altogether, which turns out to be important in parallel programs. Maintaining locality is sometimes more important than spreading the work out in parallel GC. There is a side benefit in that the parallel GC should have improved locality even when load-balancing, because each processor prefers to take work from its own queue before stealing from others.
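The policy described above (own queue first, steal only when load-balancing is enabled) can be sketched in single-threaded form as below. This is an illustration of the ordering only: the names are hypothetical, the queues are plain arrays, and a real parallel collector would need properly synchronised work-stealing queues instead.

```c
#include <stddef.h>

#define MAX_WORK 16

/* Hypothetical per-worker queue of pending work items. */
struct work_queue {
    int    items[MAX_WORK];
    size_t count;
};

/* Fetch one item for worker `self`. The worker's own queue is always
 * tried first; other queues are only inspected when `allow_steal` is
 * nonzero (think of it as load-balancing being switched on).
 * Returns 1 and stores the item in *out on success, 0 if none found. */
int get_work(struct work_queue *queues, size_t n_queues, size_t self,
             int allow_steal, int *out)
{
    if (queues[self].count > 0) {
        *out = queues[self].items[--queues[self].count];
        return 1;
    }
    if (!allow_steal) {
        return 0;
    }
    for (size_t i = 0; i < n_queues; i++) {
        if (i != self && queues[i].count > 0) {
            *out = queues[i].items[--queues[i].count];
            return 1;
        }
    }
    return 0;
}
```

Calling `get_work` with `allow_steal` set to 0 models the "-qb" behaviour: an idle worker stays idle rather than pulling data that lives in another processor's cache.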
*	Keep the remembered sets local to each thread during parallel GC	Simon Marlow	2009-01-12	1	-0/+8
	This turns out to be quite vital for parallel programs:

	- The way we discover which threads to traverse is by finding dirty threads via the remembered sets (aka mutable lists).

	- A dirty thread will be on the remembered set of the capability that was running it, and we really want to traverse that thread's stack using the GC thread for the capability, because it is in that CPU's cache. If we get this wrong, we get penalised badly by the memory system.

	Previously we had per-capability mutable lists but they were aggregated before GC and traversed by just one of the GC threads. This resulted in very poor performance particularly for parallel programs with deep stacks.

	Now we keep per-capability remembered sets throughout GC, which also removes a lock (recordMutableGen_sync).
*	Don't pin a register for gc_thread on SPARC.	Ben.Lippmeier@anu.edu.au	2009-01-05	1	-1/+8
	This makes the build work again.
*	Use mutator threads to do GC, instead of having a separate pool of GC threads	Simon Marlow	2008-11-21	1	-4/+3
	Previously, the GC had its own pool of threads to use as workers when doing parallel GC. There was a "leader", which was the mutator thread that initiated the GC, and the other threads were taken from the pool.

	This was simple and worked fine for sequential programs, where we did most of the benchmarking for the parallel GC, but falls down for parallel programs. When we have N mutator threads and N cores, at GC time we would have to stop N-1 mutator threads and start up N-1 GC threads, and hope that the OS schedules them all onto separate cores. In practice it doesn't, as you might expect.

	Now we use the mutator threads to do GC. This works quite nicely, particularly for parallel programs, where each mutator thread scans its own spark pool, which is probably in its cache anyway.

	There are some flag changes:

	  -g<n> is removed (-g1 is still accepted for backwards compat). There's no way to have a different number of GC threads than mutator threads now.

	  -q1       Use one OS thread for GC (turns off parallel GC)
	  -qg<n>    Use parallel GC for generations >= <n> (default: 1)

	Using parallel GC only for generations >=1 works well for sequential programs. Compiling an ordinary sequential program with -threaded and running it with -N2 or more should help if you do a lot of GC. I've found that adding -qg0 (do parallel GC for generation 0 too) speeds up some parallel programs, but slows down some sequential programs. Being conservative, I left the threshold at 1.

	ToDo: document the new options.