delta/haskell.git - gitlab.haskell.org: ghc/ghc.git

	Commit message (Collapse)	Author	Age	Files	Lines
*	Nonmoving: Ensure write barrier vanishes in non-threaded RTS	Ben Gamari	2019-10-21	1	-8/+12
\|
*	Don't cleanup until we've stopped the collector	Ben Gamari	2019-10-20	1	-0/+1
\| \| \| \| \| \| \|	This requires that we break nonmovingExit into two pieces since we need to first stop the collector to relinquish any capabilities, then we need to shutdown the scheduler, then we need to free the nonmoving allocators.
*	rts: Implement concurrent collection in the nonmoving collector	Ben Gamari	2019-10-20	1	-7/+104
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This extends the non-moving collector to allow concurrent collection. The full design of the collector implemented here is described in detail in a technical note B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell Compiler" (2018) This extension involves the introduction of a capability-local remembered set, known as the /update remembered set/, which tracks objects which may no longer be visible to the collector due to mutation. To maintain this remembered set we introduce a write barrier on mutations which is enabled while a concurrent mark is underway. The update remembered set representation is similar to that of the nonmoving mark queue, being a chunked array of `MarkEntry`s. Each `Capability` maintains a single accumulator chunk, which it flushed when it (a) is filled, or (b) when the nonmoving collector enters its post-mark synchronization phase. While the write barrier touches a significant amount of code it is conceptually straightforward: the mutator must ensure that the referee of any pointer it overwrites is added to the update remembered set. However, there are a few details: * In the case of objects with a dirty flag (e.g. `MVar`s) we can exploit the fact that only the first mutation requires a write barrier. * Weak references, as usual, complicate things. In particular, we must ensure that the referee of a weak object is marked if dereferenced by the mutator. For this we (unfortunately) must introduce a read barrier, as described in Note [Concurrent read barrier on deRefWeak#] (in `NonMovingMark.c`). * Stable names are also a bit tricky as described in Note [Sweeping stable names in the concurrent collector] (`NonMovingSweep.c`). We take quite some pains to ensure that the high thread count often seen in parallel Haskell applications doesn't affect pause times. To this end we allow thread stacks to be marked either by the thread itself (when it is executed or stack-underflows) or the concurrent mark thread (if the thread owning the stack is never scheduled). There is a non-trivial handshake to ensure that this happens without racing which is described in Note [StgStack dirtiness flags and concurrent marking]. Co-Authored-by: Ömer Sinan Ağacan <omer@well-typed.com>
*	rts: Non-concurrent mark and sweep	Ömer Sinan Ağacan	2019-10-20	1	-10/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This implements the core heap structure and a serial mark/sweep collector which can be used to manage the oldest-generation heap. This is the first step towards a concurrent mark-and-sweep collector aimed at low-latency applications. The full design of the collector implemented here is described in detail in a technical note B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell Compiler" (2018) The basic heap structure used in this design is heavily inspired by K. Ueno & A. Ohori. "A fully concurrent garbage collector for functional programs on multicore processors." /ACM SIGPLAN Notices/ Vol. 51. No. 9 (presented by ICFP 2016) This design is intended to allow both marking and sweeping concurrent to execution of a multi-core mutator. Unlike the Ueno design, which requires no global synchronization pauses, the collector introduced here requires a stop-the-world pause at the beginning and end of the mark phase. To avoid heap fragmentation, the allocator consists of a number of fixed-size /sub-allocators/. Each of these sub-allocators allocators into its own set of /segments/, themselves allocated from the block allocator. Each segment is broken into a set of fixed-size allocation blocks (which back allocations) in addition to a bitmap (used to track the liveness of blocks) and some additional metadata (used also used to track liveness). This heap structure enables collection via mark-and-sweep, which can be performed concurrently via a snapshot-at-the-beginning scheme (although concurrent collection is not implemented in this patch). The mark queue is a fairly straightforward chunked-array structure. The representation is a bit more verbose than a typical mark queue to accomodate a combination of two features: * a mark FIFO, which improves the locality of marking, reducing one of the major overheads seen in mark/sweep allocators (see [1] for details) * the selector optimization and indirection shortcutting, which requires that we track where we found each reference to an object in case we need to update the reference at a later point (e.g. when we find that it is an indirection). See Note [Origin references in the nonmoving collector] (in `NonMovingMark.h`) for details. Beyond this the mark/sweep is fairly run-of-the-mill. [1] R. Garner, S.M. Blackburn, D. Frampton. "Effective Prefetch for Mark-Sweep Garbage Collection." ISMM 2007. Co-Authored-By: Ben Gamari <ben@well-typed.com>
*	rts: Give stack flags proper macros	Ben Gamari	2019-10-18	1	-2/+2
\| \| \| \| \|	This were previously quite unclear and will change a bit under the non-moving collector so let's clear this up now.
*	Simplify Configure in a few ways	John Ericson	2019-10-12	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- No need to distinguish between gcc-llvm and clang. First of all, gcc-llvm is quite old and surely unmaintained by now. Second of all, none of the code actually care about that distinction! Now, it does make sense to consider C multiple frontends for LLVMs in the form of clang vs clang-cl (same clang, yes, but tweaked interface). But this is better handled in terms of "gccish vs mvscish" and "is LLVM", yielding 4 combinations. Therefore, I don't think it is useful saving the existing code for that. - Get the remaining CC_LLVM_BACKEND, and also TABLES_NEXT_TO_CODE in mk/config.h the normal way, rather than hacking it post-hoc. No point keeping these special cases around for now reason. - Get rid of hand-rolled `die` function and just use `AC_MSG_ERROR`. - Abstract check + flag override for unregisterised and tables next to code. Oh, and as part of the above I also renamed/combined some variables where it felt appropriate. - GccIsClang -> CcLlvmBackend. This is for `AC_SUBST`, like the other Camal case ones. It was never about gcc-llvm, or Apple's renamed clang, to be clear. - llvm_CC_FLAVOR -> CC_LLVM_BACKEND. This is for `AC_DEFINE`, like the other all-caps snake case ones. llvm_CC_FLAVOR was just silly indirection and an odd name to boot.
*	Add new debug flag -DZ	Tobias Guggenmos	2019-10-03	1	-1/+1
\| \| \| \|	Zeros heap memory after gc freed it.
*	Correct closure observation, construction, and mutation on weak memory machines.	Travis Whitaker	2019-06-28	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Here the following changes are introduced: - A read barrier machine op is added to Cmm. - The order in which a closure's fields are read and written is changed. - Memory barriers are added to RTS code to ensure correctness on out-or-order machines with weak memory ordering. Cmm has a new CallishMachOp called MO_ReadBarrier. On weak memory machines, this is lowered to an instruction that ensures memory reads that occur after said instruction in program order are not performed before reads coming before said instruction in program order. On machines with strong memory ordering properties (e.g. X86, SPARC in TSO mode) no such instruction is necessary, so MO_ReadBarrier is simply erased. However, such an instruction is necessary on weakly ordered machines, e.g. ARM and PowerPC. Weam memory ordering has consequences for how closures are observed and mutated. For example, consider a closure that needs to be updated to an indirection. In order for the indirection to be safe for concurrent observers to enter, said observers must read the indirection's info table before they read the indirectee. Furthermore, the entering observer makes assumptions about the closure based on its info table contents, e.g. an INFO_TYPE of IND imples the closure has an indirectee pointer that is safe to follow. When a closure is updated with an indirection, both its info table and its indirectee must be written. With weak memory ordering, these two writes can be arbitrarily reordered, and perhaps even interleaved with other threads' reads and writes (in the absence of memory barrier instructions). Consider this example of a bad reordering: - An updater writes to a closure's info table (INFO_TYPE is now IND). - A concurrent observer branches upon reading the closure's INFO_TYPE as IND. - A concurrent observer reads the closure's indirectee and enters it. (!!!) - An updater writes the closure's indirectee. Here the update to the indirectee comes too late and the concurrent observer has jumped off into the abyss. Speculative execution can also cause us issues, consider: - An observer is about to case on a value in closure's info table. - The observer speculatively reads one or more of closure's fields. - An updater writes to closure's info table. - The observer takes a branch based on the new info table value, but with the old closure fields! - The updater writes to the closure's other fields, but its too late. Because of these effects, reads and writes to a closure's info table must be ordered carefully with respect to reads and writes to the closure's other fields, and memory barriers must be placed to ensure that reads and writes occur in program order. Specifically, updates to a closure must follow the following pattern: - Update the closure's (non-info table) fields. - Write barrier. - Update the closure's info table. Observing a closure's fields must follow the following pattern: - Read the closure's info pointer. - Read barrier. - Read the closure's (non-info table) fields. This patch updates RTS code to obey this pattern. This should fix long-standing SMP bugs on ARM (specifically newer aarch64 microarchitectures supporting out-of-order execution) and PowerPC. This fixes issue #15449. Co-Authored-By: Ben Gamari <ben@well-typed.com>
*	Fix GCC warnings with __clear_cache builtin (#16867)	Sylvain Henry	2019-06-27	1	-6/+8
\|
*	Improve performance of newSmallArray#	Michal Terepeta	2019-04-01	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This: - Hoists part of the condition outside of the initialization loop in `stg_newSmallArrayzh`. - Annotates one of the unlikely branches as unlikely, also in `stg_newSmallArrayzh`. - Adds a couple of annotations to `allocateMightFail` indicating which branches are likely to be taken. Together this gives about 5% improvement. Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
*	Update Wiki URLs to point to GitLab	Takenobu Tani	2019-03-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This moves all URL references to Trac Wiki to their corresponding GitLab counterparts. This substitution is classified as follows: 1. Automated substitution using sed with Ben's mapping rule [1] Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy... New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy... 2. Manual substitution for URLs containing `#` index Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz 3. Manual substitution for strings starting with `Commentary` Old: Commentary/XxxYyy... New: commentary/xxx-yyy... See also !539 [1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json
*	Update Trac ticket URLs to point to GitLab	Ryan Scott	2019-03-15	1	-1/+1
\| \| \| \| \|	This moves all URL references to Trac tickets to their corresponding GitLab counterparts.
*	rts: Update some comments, minor refactoring	Ömer Sinan Ağacan	2018-06-27	1	-9/+8
\|
*	Typo fix in rts [skip ci]	Ömer Sinan Ağacan	2018-06-26	1	-1/+1
\|
*	storageAddCapabilities: fix bug in updating nursery pointers	Simon Marlow	2018-05-02	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: We were unconditionally updating the nursery pointers to be `nurseries[cap->no]`, but when using nursery chunks this might be wrong. This manifested as a later assertion failure in allocate(). Test Plan: new test case Reviewers: bgamari, niteria, erikd Subscribers: thomie, carter Differential Revision: https://phabricator.haskell.org/D4649
*	Update a few comments regarding CAF lists	Ömer Sinan Ağacan	2018-03-30	1	-5/+5
\| \| \| \|	[skip ci]
*	[rts] Adjust whitehole_spin	Douglas Wilson	2018-01-21	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rename to whitehole_gc_spin, in preparation for adding stats for the whitehole busy-loop in SMPClosureOps. Make whitehole_gc_spin volatile, and move it to be defined and statically initialised in GC.c. This saves some #ifs, and I'm pretty sure it should be volatile. Test Plan: ./validate Reviewers: bgamari, erikd, simonmar Reviewed By: bgamari Subscribers: rwbarton, thomie, carter Differential Revision: https://phabricator.haskell.org/D4300
*	rts: Throw proper HeapOverflow exception on allocating large array	Ben Gamari	2017-09-26	1	-30/+54
\| \| \| \| \| \| \| \| \| \| \| \|	Test Plan: Validate, add tests Reviewers: simonmar, austin, erikd Reviewed By: simonmar Subscribers: rwbarton, thomie Differential Revision: https://phabricator.haskell.org/D4021
*	A bunch of typofixes	Gabor Greif	2017-09-26	1	-1/+1
\|
*	lowercase clang	Moritz Angermann	2017-07-06	1	-2/+2
\|
*	rts/sm/Storage.c: tweak __clear_cache proto for clang	Sergei Trofimovich	2017-07-05	1	-2/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	clang defines '__clear_cache' slightly differently from gcc: rts/sm/Storage.c:1349:13: error: error: conflicting types for '__clear_cache' \| 1349 \| extern void __clear_cache(char * begin, char * end); \| ^ extern void __clear_cache(char * begin, char * end); ^ note: '__clear_cache' is a builtin with type 'void (void , void )' Reported by Moritz Angermann. While at it used '__builtin___clear_cache' if advertised by clang. Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
*	Revert "rts/sm/Storage.c: tweak __clear_cache proto for clang"	Sergei Trofimovich	2017-07-05	1	-13/+2
\| \| \| \| \| \|	This reverts commit 9492703a5862ee8623455209e50344cf8c4de077. Incomplete patch (missing begin, end assignments).
*	rts/sm/Storage.c: tweak __clear_cache proto for clang	Sergei Trofimovich	2017-07-05	1	-2/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	clang defines '__clear_cache' slightly differently from gcc: rts/sm/Storage.c:1349:13: error: error: conflicting types for '__clear_cache' \| 1349 \| extern void __clear_cache(char * begin, char * end); \| ^ extern void __clear_cache(char * begin, char * end); ^ note: '__clear_cache' is a builtin with type 'void (void , void )' Reported by Moritz Angermann. While at it used '__builtin___clear_cache' if advertised by clang. Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
*	UNREG: use __builtin___clear_cache where available	Sergei Trofimovich	2017-06-22	1	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Noticed when was building UNREG ghc with -optc{-Wall,-Werror}: rts/sm/Storage.c:1359:3: error: error: implicit declaration of function '__clear_cache' [-Werror=implicit-function-declaration] __clear_cache((void)begin, (void)end); ^~~~~~~~~~~~~ \| 1359 \| __clear_cache((void)begin, (void)end); \| ^ Left direct '__clear_cache' usage gcc toolchain before 4.4. Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
*	Revert "rts: Suppress unused gcc_clear_cache warning"	Ben Gamari	2017-06-21	1	-2/+0
\| \| \| \|	This reverts commit d1d3e98443cf263ef09253e2478e3e638e174e0d.
*	rts: Suppress unused gcc_clear_cache warning	Ben Gamari	2017-06-21	1	-0/+2
\|
*	Revert "UNREG: use __builtin___clear_cache where available"	Sergei Trofimovich	2017-06-21	1	-21/+1
\| \| \| \| \| \| \| \| \|	This reverts commit 6dd1257fdd4d18e84d32e89bf0ec664b3c8f7b93. Change fails vaildation: rts/sm/Storage.c:1351:20: error: error: ‘gcc_clear_cache’ defined but not used [-Werror=unused-function] STATIC_INLINE void gcc_clear_cache(void * begin, void * end)
*	UNREG: use __builtin___clear_cache where available	Sergei Trofimovich	2017-06-21	1	-1/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Noticed when was building UNREG ghc with -optc{-Wall,-Werror}: rts/sm/Storage.c:1359:3: error: error: implicit declaration of function '__clear_cache' [-Werror=implicit-function-declaration] __clear_cache((void)begin, (void)end); ^~~~~~~~~~~~~ \| 1359 \| __clear_cache((void)begin, (void)end); \| ^ Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
*	Prefer #if defined to #ifdef	Ben Gamari	2017-04-28	1	-8/+8
\| \| \| \|	Our new CPP linter enforces this.
*	Report heap overflow in the same way as stack overflow	Simon Marlow	2017-04-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we throw an exception for heap overflow, we should only print the heap overflow message in the main thread when the HeapOverflow exception is caught, rather than as a side effect in the GC. Stack overflows were already done this way, I just made heap overflow consistent with stack overflow, and did some related cleanup. Fixes broken T2592(profasm) which was reporting the heap overflow message twice (you would only notice when building with profiling libs enabled). Test Plan: validate Reviewers: bgamari, niteria, austin, DemiMarie, hvr, erikd Reviewed By: bgamari Subscribers: rwbarton, thomie Differential Revision: https://phabricator.haskell.org/D3394
*	Overhaul of Compact Regions (#12455)	Simon Marlow	2016-12-07	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: This commit makes various improvements and addresses some issues with Compact Regions (aka Compact Normal Forms). This was the most important thing I wanted to fix. Compaction previously prevented GC from running until it was complete, which would be a problem in a multicore setting. Now, we compact using a hand-written Cmm routine that can be interrupted at any point. When a GC is triggered during a sharing-enabled compaction, the GC has to traverse and update the hash table, so this hash table is now stored in the StgCompactNFData object. Previously, compaction consisted of a deepseq using the NFData class, followed by a traversal in C code to copy the data. This is now done in a single pass with hand-written Cmm (see rts/Compact.cmm). We no longer use the NFData instances, instead the Cmm routine evaluates components directly as it compacts. The new compaction is about 50% faster than the old one with no sharing, and a little faster on average with sharing (the cost of the hash table dominates when we're doing sharing). Static objects that don't (transitively) refer to any CAFs don't need to be copied into the compact region. In particular this means we often avoid copying Char values and small Int values, because these are static closures in the runtime. Each Compact# object can support a single compactAdd# operation at any given time, so the Data.Compact library now enforces mutual exclusion using an MVar stored in the Compact object. We now get exceptions rather than killing everything with a barf() when we encounter an object that cannot be compacted (a function, or a mutable object). We now also detect pinned objects, which can't be compacted either. The Data.Compact API has been refactored and cleaned up. A new compactSize operation returns the size (in bytes) of the compact object. Most of the documentation is in the Haddock docs for the compact library, which I've expanded and improved here. Various comments in the code have been improved, especially the main Note [Compact Normal Forms] in rts/sm/CNF.c. I've added a few tests, and expanded a few of the tests that were there. We now also run the tests with GHCi, and in a new test way that enables sanity checking (+RTS -DS). There's a benchmark in libraries/compact/tests/compact_bench.hs for measuring compaction speed and comparing sharing vs. no sharing. The field totalDataW in StgCompactNFData was unnecessary. Test Plan: * new unit tests * validate * tested manually that we can compact Data.Aeson data Reviewers: gcampax, bgamari, ezyang, austin, niteria, hvr, erikd Subscribers: thomie, simonpj Differential Revision: https://phabricator.haskell.org/D2751 GHC Trac Issues: #12455
*	Overhaul GC stats	Simon Marlow	2016-12-06	1	-0/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: Visible API changes: * The C struct `GCDetails` gives the stats about a single GC. This is passed to the `gcDone()` callback if one is set via the RtsConfig. (previously we just passed a collection of values, so this is more extensible, at the expense of breaking the existing API) * `RTSStats` gives cumulative stats since the start of the program, and includes the `GCDetails` for the most recent GC. This struct can be obtained via `getRTSStats()` (the old `getGCStats()` has been removed, and `getGCStatsEnabled()` has been renamed to `getRTSStatsEnabled()`) Improvements: * The per-GC stats and cumulative stats are now cleanly separated. * Inside the RTS we have a top-level `RTSStats` struct to keep all our stats in, previously this was just a collection of strangely-named variables. This struct is mostly just copied in `getRTSStats()`, so the implementation of that function is a lot shorter. * Types are more consistent. We use a uint64_t byte count for all memory values, and Time for all time values. * Names are more consistent. We use a suffix `_bytes` for all byte counts and `_ns` for all time values. * We now collect information about the amount of memory in large objects and compact objects in `GCDetails`. (the latter was the reason I started doing this patch but it seems to have ballooned a bit!) * I fixed a bug in the calculation of the elapsed MUT time, and added an ASSERT to stop the calculations going wrong in the future. For now I kept the Haskell API in `GHC.Stats` the same, by impedence-matching with the new API. We could either break that API and make it match the C API more closely, or we could add a new API and deprecate the old one. Opinions welcome. This stuff is very easy to get wrong, and it's hard to test. Reviews welcome! Test Plan: manual testing validate Reviewers: bgamari, niteria, austin, ezyang, hvr, erikd, rwbarton, Phyx Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D2756
*	Use C99's bool	Ben Gamari	2016-11-29	1	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \|	Test Plan: Validate on lots of platforms Reviewers: erikd, simonmar, austin Reviewed By: erikd, simonmar Subscribers: michalt, thomie Differential Revision: https://phabricator.haskell.org/D2699
*	Storage.c: Pass a size to sys_icache_invalidate	Shea Levy	2016-11-15	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The previous code passed an end pointer, but the interface takes a size instead. Fixes #12838. Reviewers: austin, erikd, simonmar, bgamari Reviewed By: simonmar, bgamari Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D2711 GHC Trac Issues: #12838
*	Turn on -n4m with -A16m or greater	Simon Marlow	2016-10-09	1	-13/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Nursery chunks help reduce the cost of GC when capabilities are unevenly loaded, by ensuring that we use more of the available nursery. The rationale for enabling this at -A16m is that any negative effects due to loss of cache locality are less likely to be an issue at -A16m and above. It's a conservative guess. If we had a lot of benchmark data we could probably do better. Results for nofib/parallel at -N4 -A32m with and without -n4m: ``` ------------------------------------------------------------------------ Program Size Allocs Runtime Elapsed TotalMem ------------------------------------------------------------------------ blackscholes 0.0% -9.5% -9.0% -15.0% -2.2% coins 0.0% -4.7% -3.6% -0.6% -13.6% mandel 0.0% -0.3% +7.7% +13.1% +0.1% matmult 0.0% +1.5% +10.0% +7.7% +0.1% nbody 0.0% -4.1% -2.9% 0.085 0.0% parfib 0.0% -1.4% +1.0% +1.5% +0.2% partree 0.0% -0.3% +0.8% +2.9% -0.8% prsa 0.0% -0.5% -2.1% -7.6% 0.0% queens 0.0% -3.2% -1.4% +2.2% +1.3% ray 0.0% -5.6% -14.5% -7.6% +0.8% sumeuler 0.0% -0.4% +2.4% +1.1% 0.0% ------------------------------------------------------------------------ Min 0.0% -9.5% -14.5% -15.0% -13.6% Max 0.0% +1.5% +10.0% +13.1% +1.3% Geometric Mean +0.0% -2.6% -1.3% -0.5% -1.4% ``` Not conclusive, but slightly better. This matters a lot more when you have more cores. Test Plan: validate, nofib/paralel Reviewers: niteria, ezyang, nh2, trofi, austin, erikd, bgamari Reviewed By: bgamari Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D2581 GHC Trac Issues: #9221
*	When in sanity mode, un-zero malloc'd memory; fix uninitialized memory bugs.	Edward Z. Yang	2016-08-15	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	malloc'd memory is not guaranteed to be zeroed. On Linux, however, it is often zeroed, leading to latent bugs. In fact, with this patch I fix two uninitialized memory bugs stemming from this. Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu> Test Plan: validate Reviewers: simonmar, austin, Phyx, bgamari, erikd Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D2455
*	Compact Regions	Giovanni Campagna	2016-07-20	1	-3/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This brings in initial support for compact regions, as described in the ICFP 2015 paper "Efficient Communication and Collection with Compact Normal Forms" (Edward Z. Yang et.al.) and implemented by Giovanni Campagna. Some things may change before the 8.2 release, but I (Simon M.) wanted to get the main patch committed so that we can iterate. What documentation there is is in the Data.Compact module in the new compact package. We'll need to extend and polish the documentation before the release. Test Plan: validate (new test cases included) Reviewers: ezyang, simonmar, hvr, bgamari, austin Subscribers: vikraman, Yuras, RyanGlScott, qnikst, mboes, facundominguez, rrnewton, thomie, erikd Differential Revision: https://phabricator.haskell.org/D1264 GHC Trac Issues: #11493
*	NUMA cleanups	Simon Marlow	2016-06-17	1	-10/+8
\| \| \| \| \|	- Move the numaMap and nNumaNodes out of RtsFlags to Capability.c - Add a test to tests/rts
*	NUMA support	Simon Marlow	2016-06-10	1	-99/+133
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: The aim here is to reduce the number of remote memory accesses on systems with a NUMA memory architecture, typically multi-socket servers. Linux provides a NUMA API for doing two things: * Allocating memory local to a particular node * Binding a thread to a particular node When given the +RTS --numa flag, the runtime will * Determine the number of NUMA nodes (N) by querying the OS * Assign capabilities to nodes, so cap C is on node C%N * Bind worker threads on a capability to the correct node * Keep a separate free lists in the block layer for each node * Allocate the nursery for a capability from node-local memory * Allocate blocks in the GC from node-local memory For example, using nofib/parallel/queens on a 24-core 2-socket machine: ``` $ ./Main 15 +RTS -N24 -s -A64m Total time 173.960s ( 7.467s elapsed) $ ./Main 15 +RTS -N24 -s -A64m --numa Total time 150.836s ( 6.423s elapsed) ``` The biggest win here is expected to be allocating from node-local memory, so that means programs using a large -A value (as here). According to perf, on this program the number of remote memory accesses were reduced by more than 50% by using `--numa`. Test Plan: * validate * There's a new flag --debug-numa=<n> that pretends to do NUMA without actually making the OS calls, which is useful for testing the code on non-NUMA systems. * TODO: I need to add some unit tests Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D2199
*	{,M}BLOCK_SIZE_W * sizeof(W_) -> {,M}BLOCK_SIZE	Tomas Carnecky	2016-05-19	1	-4/+4
\| \| \| \| \| \| \| \| \| \|	Reviewers: austin, erikd, simonmar, bgamari Reviewed By: bgamari Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D2241
*	rts: Replace `nat` with `uint32_t`	Erik de Castro Lopo	2016-05-05	1	-19/+19
\| \| \| \| \| \| \| \| \| \| \| \|	The `nat` type was an alias for `unsigned int` with a comment saying it was at least 32 bits. We keep the typedef in case client code is using it but mark it as deprecated. Test Plan: Validated on Linux, OS X and Windows Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20 Differential Revision: https://phabricator.haskell.org/D2166
*	Add +RTS -AL<size>	Simon Marlow	2016-05-04	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	+RTS -AL<size> controls the total size of large objects that can be allocated before a GC is triggered. Previously this was always just the value of -A, and the limit mainly existed to prevent runaway allocation in pathalogical programs that allocate a lot of large objects. However, since the limit is shared between all cores, on a large multicore the default becomes more restrictive, and can end up triggering GC well before it would normally have been. Arguably a better default would be A*N, but this is probably excessive. Adding a flag lets you choose, and I've left the default as it was. See docs for usage.
*	Cache the size of part_list/scavd_list (#11783)	Simon Marlow	2016-04-12	1	-5/+6
\| \| \| \| \| \| \| \| \|	After a parallel GC, it is possible to have a long list of blocks in ws->part_list, if we did a lot of work stealing but didn't fill up the blocks we stole. These blocks persist until the next load-balanced GC, which might be a long time, and during every GC we were traversing this list to find its size. The fix is to maintain the size all the time, so we don't have to compute it.
*	rts: drop unused global 'blackhole_queue'	Sergei Trofimovich	2016-02-27	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 5d52d9b64c21dcf77849866584744722f8121389 removed global 'blackhole_queue' in favour of new mechanism: when TSO hits blackhole TSO blocks waiting for 'MessgaeBlackhole' delivery. Patch removed unused global and updates stale comments. Noticed by Yuras Shumovich. Signed-off-by: Sergei Trofimovich <siarheit@google.com> Test Plan: build test Reviewers: simonmar, austin, Yuras, bgamari Reviewed By: Yuras, bgamari Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D1953
*	rts: drop unused calcLiveBlocks, calcLiveWords	Sergei Trofimovich	2016-02-07	1	-26/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use of these helper functions was removed by commit 18896fa2b06844407fd1e0d3f85cd3db97a96ff4 Author: Simon Marlow <marlowsd@gmail.com> Date: Wed Feb 2 15:49:55 2011 +0000 Noticed by uselex.rb: calcLiveBlocks: [R]: exported from: ./rts/dist/build/sm/Storage.o calcLiveWords: [R]: exported from: ./rts/dist/build/sm/Storage.o Signed-off-by: Sergei Trofimovich <siarheit@google.com>
*	Eliminate zero_static_objects_list()	Simon Marlow	2015-07-28	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: [Revised version of D1076 that was committed and then backed out] In a workload with a large amount of code, zero_static_objects_list() takes a significant amount of time, and furthermore it is in the single-threaded part of the GC. This patch uses a slightly fiddly scheme for marking objects on the static object lists, using a flag in the low 2 bits that flips between two states to indicate whether an object has been visited during this GC or not. We also have to take into account objects that have not been visited yet, which might appear at any time due to runtime linking. Test Plan: validate Reviewers: austin, ezyang, rwbarton, bgamari, thomie Reviewed By: bgamari, thomie Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D1106
*	Revert "Eliminate zero_static_objects_list()"	Simon Marlow	2015-07-27	1	-5/+5
\| \| \| \|	This reverts commit b949c96b4960168a3b399fe14485b24a2167b982.
*	Eliminate zero_static_objects_list()	Simon Marlow	2015-07-22	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: In a workload with a large amount of code, zero_static_objects_list() takes a significant amount of time, and furthermore it is in the single-threaded part of the GC. This patch uses a slightly fiddly scheme for marking objects on the static object lists, using a flag in the low 2 bits that flips between two states to indicate whether an object has been visited during this GC or not. We also have to take into account objects that have not been visited yet, which might appear at any time due to runtime linking. Test Plan: validate Reviewers: austin, bgamari, ezyang, rwbarton Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D1076
*	Fix for crash in setnumcapabilities001	Simon Marlow	2015-06-26	1	-6/+12
\| \| \| \| \| \| \|	getNewNursery() was unconditionally incrementing next_nursery, which is normally fine but it broke an assumption in storageAddCapabilities(). This manifested as an occasional crash in the setnumcapabilities001 test.
*	Fix for CAF retention when dynamically loading & unloading code	Simon Marlow	2015-06-08	1	-7/+34
\| \| \| \| \| \| \| \| \| \| \| \| \|	In a situaion where we have some statically-linked code and we want to load and unload a series of objects, we need the CAFs in the statically-linked code to be retained indefinitely, while the CAFs in the dynamically-linked code should be GC'd as normal, so that we can detect when the code is unloadable. This was wrong before - we GC'd CAFs in the static code, leading to a crash in the rare case where we use a CAF, GC it, and then load a new object that uses it again. I also did some tidy up: RtsConfig now has a field keep_cafs to indicate whether we want CAFs to be retained in static code.