summaryrefslogtreecommitdiff
path: root/rts/HeapStackCheck.cmm
Commit message (Collapse)AuthorAgeFilesLines
* NUMA supportSimon Marlow2016-06-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: The aim here is to reduce the number of remote memory accesses on systems with a NUMA memory architecture, typically multi-socket servers. Linux provides a NUMA API for doing two things: * Allocating memory local to a particular node * Binding a thread to a particular node When given the +RTS --numa flag, the runtime will * Determine the number of NUMA nodes (N) by querying the OS * Assign capabilities to nodes, so cap C is on node C%N * Bind worker threads on a capability to the correct node * Keep a separate free lists in the block layer for each node * Allocate the nursery for a capability from node-local memory * Allocate blocks in the GC from node-local memory For example, using nofib/parallel/queens on a 24-core 2-socket machine: ``` $ ./Main 15 +RTS -N24 -s -A64m Total time 173.960s ( 7.467s elapsed) $ ./Main 15 +RTS -N24 -s -A64m --numa Total time 150.836s ( 6.423s elapsed) ``` The biggest win here is expected to be allocating from node-local memory, so that means programs using a large -A value (as here). According to perf, on this program the number of remote memory accesses were reduced by more than 50% by using `--numa`. Test Plan: * validate * There's a new flag --debug-numa=<n> that pretends to do NUMA without actually making the OS calls, which is useful for testing the code on non-NUMA systems. * TODO: I need to add some unit tests Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D2199
* HeapStackCheck: Small refactoringBen Gamari2015-09-081-2/+2
| | | | | | | | Use modern Cmm argument syntax in stg_block_blackhole definition. Reviewed By: simonmar, austin Differential Revision: https://phabricator.haskell.org/D1210
* Fix offset calculation in __stg_gc_funSimon Marlow2015-07-061-2/+3
| | | | | | | | | | | | | | Summary: We were not treating the offset as a signed field in this rare case, so it would blow up if the offset was negative. Test Plan: Looked at the assembly Reviewers: austin, bgamari, rwbarton Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D1042
* Make clearNursery freeSimon Marlow2014-11-251-0/+5
| | | | | | | | | | | | | | | | | | | | | Summary: clearNursery resets all the bd->free pointers of nursery blocks to make the blocks empty. In profiles we've seen clearNursery taking significant amounts of time particularly with large -N and -A values. This patch moves the work of clearNursery to the point at which we actually need the new block, thereby introducing an invariant that blocks to the right of the CurrentNursery pointer still need their bd->free pointer reset. This should make things faster overall, because we don't need to clear blocks that we don't use. Test Plan: validate Reviewers: AndreasVoellmy, ezyang, austin Subscribers: thomie, carter, ezyang, simonmar Differential Revision: https://phabricator.haskell.org/D318
* Per-thread allocation counters and limitsSimon Marlow2014-11-121-1/+3
| | | | | | | | This reverts commit f0fcc41d755876a1b02d1c7c79f57515059f6417. New changes: now works on 32-bit platforms too. I added some basic support for 64-bit subtraction and comparison operations to the x86 NCG.
* [skip ci] rts: Detabify HeapStackCheck.cmmAustin Seipp2014-10-211-37/+37
| | | | Signed-off-by: Austin Seipp <austin@well-typed.com>
* interruptible() was not returning true for BlockedOnSTM (#9379)Simon Marlow2014-08-011-7/+18
| | | | | | | | | | | | | | | | | | | Summary: There's an knock-on fix in HeapStackCheck.c which is potentially scary, but I'm pretty confident is OK. See comment for details. Test Plan: I've run all the STM tests I can find, including libraries/stm/tests/stm049 with +RTS -N8 and some of the constants bumped to make it more of a stress test. Reviewers: hvr, rwbarton, austin Subscribers: simonmar, relrod, ezyang, carter Differential Revision: https://phabricator.haskell.org/D104 GHC Trac Issues: #9379
* Revert "Per-thread allocation counters and limits"Simon Marlow2014-05-041-3/+1
| | | | | | | | Problems were found on 32-bit platforms, I'll commit again when I have a fix. This reverts the following commits: 54b31f744848da872c7c6366dea840748e01b5cf b0534f78a73f972e279eed4447a5687bd6a8308e
* Per-thread allocation counters and limitsSimon Marlow2014-05-021-1/+3
| | | | | | | | | | | | | | | | | | | | | | | This tracks the amount of memory allocation by each thread in a counter stored in the TSO. Optionally, when the counter drops below zero (it counts down), the thread can be sent an asynchronous exception: AllocationLimitExceeded. When this happens, given a small additional limit so that it can handle the exception. See documentation in GHC.Conc for more details. Allocation limits are similar to timeouts, but - timeouts use real time, not CPU time. Allocation limits do not count anything while the thread is blocked or in foreign code. - timeouts don't re-trigger if the thread catches the exception, allocation limits do. - timeouts can catch non-allocating loops, if you use -fno-omit-yields. This doesn't work for allocation limits. I couldn't measure any impact on benchmarks with these changes, even for nofib/smp.
* Fix scavenge_stack crash (#9045)Simon Marlow2014-04-291-2/+3
| | | | | | The new stg_gc_prim_p_ll stack frame was missing an info table. This is a regression since 7.6, because this stuff was part of a cleanup that happened in 7.7.
* Remove use of R9, and fix associated bugsSimon Marlow2013-10-011-12/+28
| | | | | | | | | | We were passing the function address to stg_gc_prim_p in R9, which was wrong because the call was a high-level call and didn't declare R9 as a parameter. Passing R9 as an argument is the right way, but unfortunately that exposed another bug: we were using the same macro in some low-level Cmm, where it is illegal to call functions with arguments (see Note [syntax of cmm files]). So we now have low-level variants of STK_CHK() and STK_CHK_P() for use in low-level Cmm code.
* Rename atomicReadMVar and friends to readMVar.Edward Z. Yang2013-07-121-8/+8
| | | | Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
* Implement atomicReadMVar, fixing #4001.Edward Z. Yang2013-07-091-2/+29
| | | | | | | | | We add the invariant to the MVar blocked threads queue that threads blocked on an atomic read are always at the front of the queue. This invariant is easy to maintain, since takers are only ever added to the end of the queue. Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
* Fix bug in stg_enter_checkbhSimon Marlow2012-11-011-1/+5
| | | | | This was causing crashes in stm050(ghci), throwto001(ghci), and possibly more.
* Save and restore registers across calls to unlockClosure.Geoffrey Mainland2012-10-301-0/+13
| | | | | We may not assume that registers are saved across calls to unlockClosure because it could call a C function on some platforms.
* profiling fixesSimon Marlow2012-10-191-1/+1
|
* profiling fixesSimon Marlow2012-10-091-1/+1
|
* Produce new-style Cmm from the Cmm parserSimon Marlow2012-10-081-303/+207
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The main change here is that the Cmm parser now allows high-level cmm code with argument-passing and function calls. For example: foo ( gcptr a, bits32 b ) { if (b > 0) { // we can make tail calls passing arguments: jump stg_ap_0_fast(a); } return (x,y); } More details on the new cmm syntax are in Note [Syntax of .cmm files] in CmmParse.y. The old syntax is still more-or-less supported for those occasional code fragments that really need to explicitly manipulate the stack. However there are a couple of differences: it is now obligatory to give a list of live GlobalRegs on every jump, e.g. jump %ENTRY_CODE(Sp(0)) [R1]; Again, more details in Note [Syntax of .cmm files]. I have rewritten most of the .cmm files in the RTS into the new syntax, except for AutoApply.cmm which is generated by the genapply program: this file could be generated in the new syntax instead and would probably be better off for it, but I ran out of enthusiasm. Some other changes in this batch: - The PrimOp calling convention is gone, primops now use the ordinary NativeNodeCall convention. This means that primops and "foreign import prim" code must be written in high-level cmm, but they can now take more than 10 arguments. - CmmSink now does constant-folding (should fix #7219) - .cmm files now go through the cmmPipeline, and as a result we generate better code in many cases. All the object files generated for the RTS .cmm files are now smaller. Performance should be better too, but I haven't measured it yet. - RET_DYN frames are removed from the RTS, lots of code goes away - we now have some more canned GC points to cover unboxed-tuples with 2-4 pointers, which will reduce code size a little.
* Code tidy-up: Use RET_NN in stg_block_asyncIan Lynagh2012-03-201-16/+1
|
* Fix stg_block_async on registerised Win64Ian Lynagh2012-03-201-2/+9
|
* Some Win64 fixesIan Lynagh2012-03-151-2/+2
| | | | Convert some sizes, as CLong is a different size to pointers
* Fix stg_block_async on unreg compilersIan Lynagh2012-03-151-0/+6
| | | | | This is only defined on Windows, so hadn't come up in our Linux unreg builds.
* (some) tabs -> spacesGabor Greif2012-02-271-1/+1
|
* Fix a scheduling bug in the threaded RTSSimon Marlow2011-12-011-1/+2
| | | | | | | | | | | | | | | The parallel GC was using setContextSwitches() to stop all the other threads, which sets the context_switch flag on every Capability. That had the side effect of causing every Capability to also switch threads, and since GCs can be much more frequent than context switches, this increased the context switch frequency. When context switches are expensive (because the switch is between two bound threads or a bound and unbound thread), the difference is quite noticeable. The fix is to have a separate flag to indicate that a Capability should stop and return to the scheduler, but not switch threads. I've called this the "interrupt" flag.
* Another fix to the stg_enter_checkbh frameSimon Marlow2011-11-291-1/+8
|
* stg_enter_checkbh: fix offsets for profilingSimon Marlow2011-11-291-2/+2
|
* Fix crash in non-threaded RTS on WindowsSimon Marlow2010-04-201-9/+8
| | | | | | | The tso->block_info field is now overwritten by pushOnRunQueue(), but stg_block_async_info was assuming that it still held a pointer to the StgAsyncIOResult. We must therefore save this value somewhere safe before putting the TSO on the run queue.
* Change the representation of the MVar blocked queueSimon Marlow2010-04-011-5/+5
| | | | | | | | | | | | | | | | | | | | | The list of threads blocked on an MVar is now represented as a list of separately allocated objects rather than being linked through the TSOs themselves. This lets us remove a TSO from the list in O(1) time rather than O(n) time, by marking the list object. Removing this linear component fixes some pathalogical performance cases where many threads were blocked on an MVar and became unreachable simultaneously (nofib/smp/threads007), or when sending an asynchronous exception to a TSO in a long list of thread blocked on an MVar. MVar performance has actually improved by a few percent as a result of this change, slightly to my surprise. This is the final cleanup in the sequence, which let me remove the old way of waking up threads (unblockOne(), MSG_WAKEUP) in favour of the new way (tryWakeupThread and MSG_TRY_WAKEUP, which is idempotent). It is now the case that only the Capability that owns a TSO may modify its state (well, almost), and this simplifies various things. More of the RTS is based on message-passing between Capabilities now.
* New implementation of BLACKHOLEsSimon Marlow2010-03-291-17/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces the global blackhole_queue with a clever scheme that enables us to queue up blocked threads on the closure that they are blocked on, while still avoiding atomic instructions in the common case. Advantages: - gets rid of a locked global data structure and some tricky GC code (replacing it with some per-thread data structures and different tricky GC code :) - wakeups are more prompt: parallel/concurrent performance should benefit. I haven't seen anything dramatic in the parallel benchmarks so far, but a couple of threading benchmarks do improve a bit. - waking up a thread blocked on a blackhole is now O(1) (e.g. if it is the target of throwTo). - less sharing and better separation of Capabilities: communication is done with messages, the data structures are strictly owned by a Capability and cannot be modified except by sending messages. - this change will utlimately enable us to do more intelligent scheduling when threads block on each other. This is what started off the whole thing, but it isn't done yet (#3838). I'll be documenting all this on the wiki in due course.
* comments and formatting onlySimon Marlow2010-03-251-61/+71
|
* Fix a couple of bugs in the throwTo handling, exposed by conc016(threaded2)Simon Marlow2010-03-111-2/+6
|
* Use message-passing to implement throwTo in the RTSSimon Marlow2010-03-111-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces some complicated locking schemes with message-passing in the implementation of throwTo. The benefits are - previously it was impossible to guarantee that a throwTo from a thread running on one CPU to a thread running on another CPU would be noticed, and we had to rely on the GC to pick up these forgotten exceptions. This no longer happens. - the locking regime is simpler (though the code is about the same size) - threads can be unblocked from a blocked_exceptions queue without having to traverse the whole queue now. It's a rare case, but replaces an O(n) operation with an O(1). - generally we move in the direction of sharing less between Capabilities (aka HECs), which will become important with other changes we have planned. Also in this patch I replaced several STM-specific closure types with a generic MUT_PRIM closure type, which allowed a lot of code in the GC and other places to go away, hence the line-count reduction. The message-passing changes resulted in about a net zero line-count difference.
* Rename primops from foozh_fast to stg_foozhSimon Marlow2009-08-031-3/+3
| | | | For consistency with other RTS exported symbols
* Remove old GUM/GranSim codeSimon Marlow2009-06-021-290/+0
|
* Fixes to "Retract Hp *before* checking for HpLim==0"Simon Marlow2009-03-181-0/+12
|
* Retract Hp *before* checking for HpLim==0Simon Marlow2009-03-161-1/+1
| | | | | Fixes heapprof001(prof_hp) following the recent HpLim patch, which depended on the lack of slop in the heap.
* Instead of a separate context-switch flag, set HpLim to zeroSimon Marlow2009-03-131-2/+9
| | | | | | | | | | | | This reduces the latency between a context-switch being triggered and the thread returning to the scheduler, which in turn should reduce the cost of the GC barrier when there are many cores. We still retain the old context_switch flag which is checked at the end of each block of allocation. The idea is that setting HpLim may fail if the the target thread is modifying HpLim at the same time; the context_switch flag is a fallback. It also allows us to "context switch soon" without forcing an immediate switch, which can be costly.
* Merging in the new codegen branchdias@eecs.harvard.edu2008-08-141-6/+6
| | | | | | | | | | | | | | | | | | This merge does not turn on the new codegen (which only compiles a select few programs at this point), but it does introduce some changes to the old code generator. The high bits: 1. The Rep Swamp patch is finally here. The highlight is that the representation of types at the machine level has changed. Consequently, this patch contains updates across several back ends. 2. The new Stg -> Cmm path is here, although it appears to have a fair number of bugs lurking. 3. Many improvements along the CmmCPSZ path, including: o stack layout o some code for infotables, half of which is right and half wrong o proc-point splitting
* Move the context_switch flag into the CapabilitySimon Marlow2008-09-191-1/+1
| | | | | Fixes a long-standing bug that could in some cases cause sub-optimal scheduling behaviour.
* Add a proper write barrier for MVarsSimon Marlow2007-10-111-2/+6
| | | | | | | | | | | | Previously MVars were always on the mutable list of the old generation, which meant every MVar was visited during every minor GC. With lots of MVars hanging around, this gets expensive. We addressed this problem for MUT_VARs (aka IORefs) a while ago, the solution is to use a traditional GC write-barrier when the object is modified. This patch does the same thing for MVars. TVars are still done the old way, they could probably benefit from the same treatment too.
* {Enter,Leave}CriticalSection imports should be outside #ifdef __PIC__Simon Marlow2007-09-051-2/+2
|
* FIX: Correct Leave/EnterCriticalSection importsManuel M T Chakravarty2007-09-051-2/+2
|
* put the @N suffix on stdcall foreign calls in .cmm codeSimon Marlow2007-09-041-0/+2
| | | | This applies to EnterCriticalSection and LeaveCriticalSection in the RTS
* Windows: remove the {Enter,Leave}CricialSection wrappersSimon Marlow2007-08-291-1/+1
| | | | | | The C-- parser was missing the "stdcall" calling convention for foreign calls, but once added we can call {Enter,Leave}CricialSection directly.
* Properly guard imports because they have to be precise on Windows and Darwin ↵Clemens Fruhwirth2007-08-101-0/+2
| | | | sets __PIC__ automatically
* Add explicit imports for RTS-external variablesClemens Fruhwirth2007-08-061-0/+2
|
* FIX recent PPC crashes introduced by the pointer-tagging patch (I hope)Simon Marlow2007-08-011-19/+5
| | | | | | | | There was an accidental endian-dependency in changes related to RET_FUN. The changes in question weren't strictly necessary - they were left over from the original workaround for the compacting GC problems, so I've just reverted those changes in this patch, which should hopefully fix the PPC problems.
* Pointer TaggingSimon Marlow2007-07-271-9/+23
| | | | | | | | | | | | | | | | | | | | | | This patch implements pointer tagging as per our ICFP'07 paper "Faster laziness using dynamic pointer tagging". It improves performance by 10-15% for most workloads, including GHC itself. The original patches were by Alexey Rodriguez Yakushev <mrchebas@gmail.com>, with additions and improvements by me. I've re-recorded the development as a single patch. The basic idea is this: we use the low 2 bits of a pointer to a heap object (3 bits on a 64-bit architecture) to encode some information about the object pointed to. For a constructor, we encode the "tag" of the constructor (e.g. True vs. False), for a function closure its arity. This enables some decisions to be made without dereferencing the pointer, which speeds up some common operations. In particular it enables us to avoid costly indirect jumps in many cases. More information in the commentary: http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/HaskellExecution/PointerTagging
* FIX BUILD (Windows): catch up with changes to .cmm syntaxSimon Marlow2007-07-031-1/+1
|
* Implemented and fixed bugs in CmmInfo handlingMichael D. Adams2007-06-271-34/+14
|