This patch also fixes ullong_format_string (renamed to showStgWord64)
so that it works with values outside the 32-bit range (trac #3979), and
simplifies the without-commas case.
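
For illustration, a minimal standalone sketch of the with-commas case
(names and layout are illustrative, not the RTS code itself):

    #include <inttypes.h>
    #include <stdio.h>

    /* Format a 64-bit value with comma separators.  buf needs room
     * for 20 digits, 6 commas and a NUL (27 bytes). */
    static char *show_word64_commas(uint64_t x, char *buf)
    {
        char tmp[24];
        int n = snprintf(tmp, sizeof(tmp), "%" PRIu64, x);
        int out = n + (n - 1) / 3;            /* digits plus commas */
        buf[out] = '\0';
        for (int i = n - 1, group = 0; i >= 0; i--) {
            buf[--out] = tmp[i];
            if (++group == 3 && i > 0) { buf[--out] = ','; group = 0; }
        }
        return buf;
    }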
These are no longer used: once upon a time they had different
layout from IND and IND_PERM respectively, but that is no longer the
case since we changed the remembered set to be an array of addresses
instead of a linked list of closures.
The list of threads blocked on an MVar is now represented as a list of
separately allocated objects rather than being linked through the TSOs
themselves. This lets us remove a TSO from the list in O(1) time
rather than O(n) time, by marking the list object. Removing this
linear component fixes some pathological performance cases where many
threads were blocked on an MVar and became unreachable simultaneously
(nofib/smp/threads007), or when sending an asynchronous exception to a
TSO in a long list of threads blocked on an MVar.
MVar performance has actually improved by a few percent as a result of
this change, slightly to my surprise.
This is the final cleanup in the sequence, which let me remove the old
way of waking up threads (unblockOne(), MSG_WAKEUP) in favour of the
new way (tryWakeupThread and MSG_TRY_WAKEUP, which is idempotent). It
is now the case that only the Capability that owns a TSO may modify
its state (well, almost), and this simplifies various things. More of
the RTS is based on message-passing between Capabilities now.
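
A toy sketch of the list representation (field names are illustrative,
not the exact RTS definitions): each blocked thread gets its own
heap-allocated cell, so removal marks the cell rather than walking the
list.

    typedef struct MVarQueueCell_ {
        struct MVarQueueCell_ *link;   /* next cell in the MVar's queue */
        struct TSO_           *tso;    /* the blocked thread, or NULL */
    } MVarQueueCell;

    /* O(1) removal: mark the cell dead instead of unlinking it; dead
     * cells are dropped the next time the queue is walked. */
    static void markCellRemoved(MVarQueueCell *cell)
    {
        cell->tso = NULL;   /* assumption: NULL marks a dead cell */
    }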
This fixes #3838, and was made possible by the new BLACKHOLE
infrastructure. To allow reordering of the run queue I had to make it
doubly-linked, which entails some extra trickiness with regard to
GC write barriers and suchlike.
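
Roughly what the doubly-linked queue buys (a toy model, with
illustrative field names), removal in O(1):

    typedef struct TSO_ {
        struct TSO_ *link, *prev;      /* forward and back pointers */
    } TSO;

    typedef struct {
        TSO *run_queue_hd, *run_queue_tl;
    } Cap;

    static void removeFromRunQueue(Cap *cap, TSO *tso)
    {
        if (tso->prev) tso->prev->link = tso->link;
        else           cap->run_queue_hd = tso->link;
        if (tso->link) tso->link->prev = tso->prev;
        else           cap->run_queue_tl = tso->prev;
        /* in the real RTS, each pointer write into an old-generation
         * TSO must be paired with a GC write barrier: that is the
         * trickiness mentioned above */
    }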
This replaces the global blackhole_queue with a clever scheme that
enables us to queue up blocked threads on the closure that they are
blocked on, while still avoiding atomic instructions in the common
case.
Advantages:
- gets rid of a locked global data structure and some tricky GC code
(replacing it with some per-thread data structures and different
tricky GC code :)
- wakeups are more prompt: parallel/concurrent performance should
benefit. I haven't seen anything dramatic in the parallel
benchmarks so far, but a couple of threading benchmarks do improve
a bit.
- waking up a thread blocked on a blackhole is now O(1) (e.g. if
it is the target of throwTo).
- less sharing and better separation of Capabilities: communication
is done with messages; the data structures are strictly owned by a
Capability and cannot be modified except by sending messages.
- this change will ultimately enable us to do more intelligent
scheduling when threads block on each other. This is what started
off the whole thing, but it isn't done yet (#3838).
I'll be documenting all this on the wiki in due course.
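
A loose sketch of the shape of the scheme (all names illustrative):
blocking becomes a message to the Capability that owns the blackhole,
and only that Capability touches the queue.

    typedef struct MessageBlackHole_ {
        struct MessageBlackHole_ *link;   /* inbox chaining */
        struct TSO_              *tso;    /* the thread that blocked */
        void                     *bh;     /* the blackhole it entered */
    } MessageBlackHole;

    typedef struct {
        MessageBlackHole *inbox;   /* messages from other Capabilities */
    } Cap;

    /* "wake me when this thunk is updated": the owner attaches the TSO
     * to the closure's queue itself, so the common single-owner path
     * needs no atomic instructions. */
    static void sendBlockMessage(Cap *owner, MessageBlackHole *msg)
    {
        msg->link = owner->inbox;     /* real code pushes with a CAS */
        owner->inbox = msg;
    }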
mingw doesn't understand %llu/%lld: it treats them as 32-bit rather
than 64-bit. We use %I64u/%I64d instead.
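
The usual fix is a pair of format macros in a shared header, roughly
(macro names are illustrative):

    #if defined(__MINGW32__)
    #define FMT_Word64 "I64u"      /* mingw's 64-bit conversions */
    #define FMT_Int64  "I64d"
    #else
    #define FMT_Word64 "llu"
    #define FMT_Int64  "lld"
    #endif

    /* usage: printf("allocated %" FMT_Word64 " bytes\n", alloc); */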
This replaces some complicated locking schemes with message-passing
in the implementation of throwTo. The benefits are
- previously it was impossible to guarantee that a throwTo from
a thread running on one CPU to a thread running on another CPU
would be noticed, and we had to rely on the GC to pick up these
forgotten exceptions. This no longer happens.
- the locking regime is simpler (though the code is about the same
size)
- threads can now be unblocked from a blocked_exceptions queue
without having to traverse the whole queue. It's a rare case, but it
replaces an O(n) operation with an O(1) one.
- generally we move in the direction of sharing less between
Capabilities (aka HECs), which will become important with other
changes we have planned.
Also in this patch I replaced several STM-specific closure types with
a generic MUT_PRIM closure type, which allowed a lot of code in the GC
and other places to go away, hence the line-count reduction. The
message-passing changes resulted in about a net zero line-count
difference.
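
Sketched very roughly (field names illustrative), a throwTo between
Capabilities now travels as a message that cannot be lost:

    typedef struct MessageThrowTo_ {
        struct MessageThrowTo_ *link;    /* inbox chaining */
        struct TSO_ *source;             /* thrower; blocks until done */
        struct TSO_ *target;             /* throwee */
        void        *exception;          /* the exception closure */
    } MessageThrowTo;

    /* The target's Capability acts on the message at a safe point, so
     * delivery is guaranteed rather than left for the GC to notice. */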
The idea is that this leaves Tasks and OSThreads in one-to-one
correspondence. The part of a Task that represents a call into
Haskell from C is split into a separate struct InCall, pointed to by
the Task and the TSO bound to it. A given OSThread/Task thus always
uses the same mutex and condition variable, rather than getting a new
one for each callback. Conceptually it is simpler, although there are
more types and indirections in a few places now.
This improves callback performance by removing some of the locks that
we had to take when making in-calls. Now we also keep the current Task
in a thread-local variable if supported by the OS and gcc (currently
only Linux).
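
A condensed sketch of the thread-local fast path (the real code sits
behind configure checks; gcc's __thread is assumed here):

    typedef struct Task_ {
        struct InCall_ *incall;    /* current in-call, if any */
        /* mutex, condition variable, ... */
    } Task;

    #if defined(__GNUC__) && defined(__linux__)
    static __thread Task *my_task;     /* no pthread_getspecific cost */
    #define myTask()      (my_task)
    #define setMyTask(t)  (my_task = (t))
    #else
    /* elsewhere: fall back to pthread_getspecific / TlsGetValue */
    #endif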
This helps when the thread holding the lock has been descheduled,
which is the main cause of the "last-core slowdown" problem. With
this patch, I get much better results with -N8 on an 8-core box,
although some benchmarks are still worse than with 7 cores.
I also added a yieldThread() into the any_work() loop of the parallel
GC when it has no work to do. Oddly, this seems to improve performance
on the parallel GC benchmarks even when all the cores are busy.
Perhaps it is due to reducing contention on the memory bus.
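
The shape of the change in the GC's idle loop, as a sketch (helper
names are approximate):

    extern int  gc_finished(void);
    extern int  any_work(void);
    extern void scavenge_until_no_work(void);
    extern void yieldThread(void);     /* wraps sched_yield() etc. */

    static void gc_worker(void)
    {
        while (!gc_finished()) {
            if (any_work()) scavenge_until_no_work();
            else            yieldThread();   /* let the descheduled
                                                lock-holder run instead
                                                of spinning */
        }
    }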
The card table is an array of bytes, placed directly following the
actual array data. This means that array reading is unaffected, but
array writing needs to read the array size from the header in order to
find the card table.
We use a bytemap rather than a bitmap, because updating the card table
must be multi-thread safe. Each byte refers to 128 entries of the
array, but this is tunable by changing the constant
MUT_ARR_PTRS_CARD_BITS in includes/Constants.h.
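
In outline (toy types; the real constant is MUT_ARR_PTRS_CARD_BITS), a
write marks the card covering its element like so:

    #define CARD_BITS 7                 /* 2^7 = 128 elements per card */

    typedef struct {
        unsigned long ptrs;             /* element count, from the header */
        void *payload[];                /* elements, then the card bytes */
    } MutArrPtrs;

    static void writeArrayPtr(MutArrPtrs *arr, unsigned long i, void *v)
    {
        arr->payload[i] = v;
        /* the card table follows the elements, so a write (unlike a
         * read) must fetch arr->ptrs to locate it */
        unsigned char *cards = (unsigned char *)&arr->payload[arr->ptrs];
        cards[i >> CARD_BITS] = 1;      /* byte store: no atomics needed */
    }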
- Defines a DTrace provider, called 'HaskellEvent', that provides a probe
for every event of the eventlog framework.
- In contrast to the original eventlog, the DTrace probes are available in
all flavours of the runtime system (DTrace probes have virtually no
overhead if not enabled); when -DTRACING is defined both the regular
event log as well as DTrace probes can be used.
- Currently, Mac OS X only. User-space DTrace probes are implemented
differently on Mac OS X than in the original DTrace implementation.
Nevertheless, it shouldn't be too hard to enable these probes on other
platforms, too.
- Documentation is at http://hackage.haskell.org/trac/ghc/wiki/DTrace
We now just call gcc to get the dependencies directly.
The GC had a two-level structure, G generations each of T steps.
Steps are for aging within a generation, mostly to avoid premature
promotion.
Measurements show that more than 2 steps is almost never worthwhile,
and 1 step is usually worse than 2. In theory fractional steps are
possible, so the ideal number of steps is somewhere between 1 and 3.
GHC's default has always been 2.
We can implement 2 steps quite straightforwardly by having each block
point to the generation to which objects in that block should be
promoted, so blocks in the nursery point to generation 0, and blocks
in gen 0 point to gen 1, and so on.
This commit removes the explicit step structures, merging generations
with steps, thus simplifying a lot of code. Performance is
unaffected. The tunable number of steps is now gone, although it may
be replaced in the future by a way to tune the aging in generation 0.
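
Schematically (field names approximate), the two-step behaviour now
comes from a per-block destination pointer:

    typedef struct generation_ generation;

    typedef struct bdescr_ {
        generation *gen;     /* the generation this block belongs to */
        generation *dest;    /* where its live objects are copied during
                                GC: nursery blocks point at gen 0, gen 0
                                blocks at gen 1, and so on */
    } bdescr;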
This is a batch of refactoring to remove some of the GC's global
state, as we move towards CPU-local GC.
- allocateLocal() now allocates large objects into the local
nursery, rather than taking a global lock and allocating
them in gen 0 step 0.
- allocatePinned() was still allocating from global storage and
taking a lock each time, now it uses local storage.
(mallocForeignPtrBytes should be faster with -threaded).
- We had a gen 0 step 0, distinct from the nurseries, which are
stored in a separate nurseries[] array. This is slightly strange.
I removed the g0s0 global that pointed to gen 0 step 0, and
removed all uses of it. I think now we don't use gen 0 step 0 at
all, except possibly when there is only one generation. Possibly
more tidying up is needed here.
- I removed the global allocate() function, and renamed
allocateLocal() to allocate().
- the alloc_blocks global is gone. MAYBE_GC() and
doYouWantToGC() now check the local nursery only.
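
The common path is now a lock-free bump allocation from the
Capability's own nursery; roughly (toy types, illustrative helper):

    typedef struct { unsigned long *start, *free; } Block;
    typedef struct { Block *current_nursery; } Cap;

    #define BLOCK_SIZE_W 1024                 /* illustrative */

    extern Block *getNewNurseryBlock(Cap *);  /* rare path, may lock */

    static unsigned long *allocate(Cap *cap, unsigned long n_words)
    {
        Block *bd = cap->current_nursery;
        if (bd->free + n_words > bd->start + BLOCK_SIZE_W)
            bd = getNewNurseryBlock(cap);
        unsigned long *p = bd->free;
        bd->free += n_words;                  /* no global lock taken */
        return p;
    }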
-H alone causes the RTS to use a larger nursery, but without exceeding
the amount of memory that the application is already using. It trades
off GC time against locality: the default setting is to use a
fixed-size 512k nursery, but this is sometimes worse than using a very
large nursery despite the worse locality.
Not all programs get faster, but some programs that use large heaps do
much better with -H. e.g. this helps a lot with #3061 (binary-trees),
though not as much as specifying -H<large>. Typically using -H<large>
is better than plain -H, because the runtime doesn't know ahead of
time how much memory you want to use.
Should -H be on by default? I'm not sure: it makes some programs go
slower, but others go faster.
At the moment, this just saves a memory reference in the GC inner loop
(worth a percent or two of GC time). Later, it will hopefully let me
experiment with partial steps and with simplifying the generation/step
infrastructure.
In a stack overflow situation, stack squeezing may reduce the stack
size, but we don't know whether it has been reduced enough for the
stack check to succeed if we try again. Fortunately stack squeezing
is idempotent, so all we need to do is record whether *any* squeezing
happened. If we are at the stack's absolute -K limit, and stack
squeezing happened, then we try running the thread again.
We also want to avoid enlarging the stack if squeezing has already
released some of it. However, we don't want to get into a
pathological situation where a thread has a nearly full stack (near
its current limit, but not near the absolute -K limit), keeps
allocating a little bit, squeezing removes a little bit, and then it
runs again. So to avoid this, if we squeezed *and* there is still
less than BLOCK_SIZE_W words free, then we enlarge the stack anyway.
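
The resulting logic, roughly (helper names are invented for the
sketch):

    typedef struct TSO_ TSO;
    #define BLOCK_SIZE_W 1024   /* illustrative */

    extern int  stackSqueeze(TSO *);   /* idempotent; 1 if it freed anything */
    extern long stackWordsFree(TSO *);
    extern void enlargeStack(TSO *);
    extern void pushOnRunQueue(TSO *);

    static void handleStackOverflow(TSO *tso)
    {
        if (stackSqueeze(tso) && stackWordsFree(tso) >= BLOCK_SIZE_W) {
            pushOnRunQueue(tso);   /* squeezing freed enough: retry */
        } else {
            enlargeStack(tso);     /* nothing squeezed, or still nearly
                                      full: grow, avoiding the squeeze-
                                      a-little/overflow-again treadmill */
        }
    }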
The log file format was still using 32 bits; this just updates the
header file to match; there should be no functional changes.
Patch 1/2: second part of the patch is to libraries/base
This time without dynamic linker hacks; instead I've expanded the
existing rts/Globals.c to cache more CAFs, specifically those in
GHC.Conc. We were already using this trick for signal handlers, I
should have realised before.
It's still quite unsavoury, but we can do away with rts/Globals.c in
the future when we switch to a dynamically-linked GHCi.
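
The trick itself is small; a sketch of one cached CAF (names are
illustrative, and the real code uses an atomic CAS):

    static void *conc_store = NULL;

    /* Called from base via a foreign import: the first caller's
     * closure wins and every later caller gets the same closure back,
     * so the CAF behaves like a process-wide global. */
    void *getOrSetConcStore(void *closure)
    {
        if (conc_store == NULL)
            conc_store = closure;     /* real code: compare-and-swap */
        return conc_store;
    }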
This means we can remove some conditional stuff from the Makefiles,
and means the testsuite doesn't have to work out whether or not it's
on Windows.
While fixing #3578 I noticed that this function was just a field
access to StgTRecHeader, so I inlined it manually.
This patch eliminates a couple of places where we were assuming that
the host word size is the same as the target word size.
Also a little refactoring: Constants now exports the types TargetInt
and TargetWord corresponding to the Int/Word type on the target
platform, and I moved the definitions of tARGET_INT_MAX and friends
from Literal to Constants.
Thanks to Barney Stratford <barney_stratford@fastmail.fm> for helping
track down the problem and fix it. We now know that GHC can
successfully cross-compile from 32-bit to 64-bit.
This is a follow-up to the patch that fixes Trac #3439.
We had forgotten the dynamic linker, which needs to
know all these ticky symbols too.
This helps on a hyperthreaded CPU by yielding to the other thread in a
spinlock loop.
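
Concretely, this is the x86 pause hint inside the spin; a sketch:

    static inline void busy_wait_nop(void)
    {
    #if defined(__i386__) || defined(__x86_64__)
        __asm__ __volatile__ ("pause");   /* frees execution resources
                                             for the sibling hyperthread */
    #endif
    }

    static void spin_until_zero(volatile int *w)
    {
        while (*w) busy_wait_nop();
    }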
added:
primop TraceEventOp "traceEvent#" GenPrimOp
Addr# -> State# s -> State# s
{ Emits an event via the RTS tracing framework. The contents
of the event is the zero-terminated byte string passed as the first
argument. The event will be emitted either to the .eventlog file,
or to stderr, depending on the runtime RTS flags. }
and added the required RTS functionality to support it. Also a bit of
refactoring in the RTS tracing code.
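
On the RTS side the shape is roughly this (function names are
illustrative):

    typedef struct Capability_ Capability;
    extern int  eventlog_enabled;
    extern void postUserEvent(Capability *, char *);
    extern void debugBelch(const char *fmt, ...);

    /* Called by the traceEvent# primop with the user's NUL-terminated
     * string; tracing routes it to the binary log or to stderr. */
    void traceUserEvent(Capability *cap, char *msg)
    {
        if (eventlog_enabled)
            postUserEvent(cap, msg);   /* append to the .eventlog file */
        else
            debugBelch("%s\n", msg);   /* fall back to stderr */
    }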
This makes events smaller and tracing quicker, and speeds up reading
and sorting the trace file.
HEADS UP: this changes the format of event log files. Corresponding
changes to the ghc-events package are required (and will be pushed
soon). Normally we would make backwards-compatible changes, but this
changes the format of every event (to remove the capability) so I'm
breaking the rules this time. This will be the only time we can do
this, since the format becomes public in 6.12.1.
These indicate the size and time span of a sequence of events in the
event log, to make it easier to sort and navigate a large event log.
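
Sketched as a struct (the real on-disk layout is defined by the event
log format and the ghc-events package; these fields are illustrative):

    #include <stdint.h>

    typedef struct {
        uint16_t tag;         /* EVENT_BLOCK_MARKER */
        uint64_t time;        /* timestamp of the first event inside */
        uint32_t block_size;  /* bytes of event data that follow */
        uint64_t end_time;    /* timestamp of the last event inside */
        uint16_t cap;         /* capability the block belongs to */
    } EventBlockMarker;

    /* A reader can skip block_size bytes at a time, or compare
     * [time, end_time] against the window it cares about, without
     * decoding each individual event. */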