path: root/rts
Commit message (Author, Date; Files changed, Lines -/+)
* Fix overflow. (Tamar Christina, 2020-01-06; 1 file, -1/+1)
* Fix typos, via a Levenshtein-style corrector (Brian Wignall, 2020-01-04; 15 files, -15/+15)
* Fix some sloppy indentation (Kevin Buhr, 2019-12-31; 1 file, -3/+3)
* Add additional Note explaining the -Iw flag (Kevin Buhr, 2019-12-31; 1 file, -2/+49)
* Add "-Iw" RTS flag for minimum wait between idle GCs (#11134) (Kevin Buhr, 2019-12-31; 2 files, -20/+42)
* rts: Fix --debug-numa mode under Docker (Ben Gamari, 2019-12-30; 2 files, -0/+3)
    As noted in #17606, Docker disallows the get_mempolicy syscall by default. This caused numerous tests to fail under CI in the `debug_numa` way. Avoid this by disabling the NUMA probing logic when --debug-numa is in use, instead setting n_numa_nodes in RtsFlags.c. Fixes #17606.
* rts: Error on invalid --numa flags (Ben Gamari, 2019-12-30; 1 file, -1/+6)
    Previously things like `+RTS --numa-debug` would enable NUMA support, despite being an invalid flag.
* rts: Ensure that nonmoving gc isn't used with profiling (Ben Gamari, 2019-12-30; 1 file, -0/+5)
* Remove outdated comment (Sylvain Henry, 2019-12-24; 1 file, -4/+2)
* Handle large ARR_WORDS in heap census (fix #17572) (Sylvain Henry, 2019-12-19; 1 file, -0/+16)
    We can do a heap census with a non-profiling RTS. With a non-profiling RTS we don't zero the superfluous bytes of shrunk arrays, hence the need to handle this case specifically to avoid a crash. Reverts part of a586b33f8e8ad60b5c5ef3501c89e9b71794bbed.
* Revert "rts: Drop redundant flags for libffi"Ben Gamari2019-12-121-3/+8
    This seems to have regressed builds using `--with-system-libffi` (#17520). This reverts commit 3ce18700f80a12c48a029b49c6201ad2410071bb.
* rts: Specialize hashing at call site rather than in struct. (Crazycolorz5, 2019-12-11; 8 files, -84/+153)
    Separate word and string hash tables at the type level, and do not store the hashing function. When a different hash function is desired, it is provided upon accessing the table. In the worst case this is the same as before the change, and in the majority of cases it is better. Also mark the functions for aggressive inlining to improve performance.

    Reviewers: bgamari, erikd, simonmar
    Subscribers: rwbarton, thomie, carter
    GHC Trac Issues: #13165
    Differential Revision: https://phabricator.haskell.org/D4889
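A toy rendering of the idea (single-slot buckets for brevity; this is not GHC's Hash.c): the hash function is named at each call site instead of being stored in the table and called through a function pointer, so the compiler can inline it.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t key; void *value; } Entry;
typedef struct { Entry buckets[256]; } WordTable;

/* Knuth multiplicative hash, folded to the bucket count. */
static inline uint32_t hashWord(uintptr_t w) {
    return (uint32_t)(w * 2654435761u) & 255;
}

/* The hash is resolved statically here rather than loaded from the
   struct and invoked indirectly; in the worst case this costs the
   same as the indirect call, usually less. */
static void *lookupWord(const WordTable *t, uintptr_t key) {
    const Entry *e = &t->buckets[hashWord(key)];
    return (e->key == key) ? e->value : NULL;
}
/* A separate string-keyed table type with its own lookup would play
   the same role for string keys, keeping the two cases apart at the
   type level as the commit describes. */
```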
* rts: Add a long form flag to enable the non-moving GC (Ben Gamari, 2019-12-10; 1 file, -0/+5)
    The old flag, `-xn`, was quite cryptic. Here we add `--nonmoving-gc` in addition.
* Fix comment typos (Gabor Greif, 2019-12-09; 1 file, -1/+1)
    The below is only necessary to fix the CI perf fluke that happened in 9897e8c8ef0b19a9571ef97a1d9bb050c1ee9121:

        -------------------------
        Metric Decrease:
            T5837
            T6048
            T9020
            T12425
            T12234
            T13035
            T12150
            Naperian
        -------------------------
* rts/NonMovingSweep: Fix locking of new mutable list allocation (Ben Gamari, 2019-12-05; 1 file, -1/+1)
    Previously we used allocBlockOnNode_sync in nonmovingSweepMutLists despite the fact that we aren't in the GC and therefore the allocation spinlock isn't in use. This meant that sweep would end up spinning until the next minor GC, when the SM lock was moved away from the SM_MUTEX to the spinlock. This isn't a correctness issue but it sure isn't good for performance. Found thanks to Ward. Fixes #17539.
* nonmoving: Clear segment bitmaps during sweep (Ben Gamari, 2019-12-05; 3 files, -7/+4)
    Previously we would clear the bitmaps of segments which we are going to sweep during the preparatory pause. However, this is unnecessary: the existence of the mark epoch ensures that the sweep will correctly identify non-reachable objects, even if we do not clear the bitmap. We now defer clearing the bitmap to sweep, which happens concurrently with mutation.
* Fix more typos (Brian Wignall, 2019-12-02; 5 files, -5/+5)
* Fix typos, using Wikipedia list of common typos (Brian Wignall, 2019-11-28; 8 files, -9/+9)
* Fix typos (Brian Wignall, 2019-11-23; 2 files, -2/+2)
* rts: Expose interface for configuring EventLogWriters (Ben Gamari, 2019-11-23; 4 files, -48/+88)
    This exposes a set of interfaces from the GHC API for configuring EventLogWriters. These can be used by consumers like [ghc-eventlog-socket](https://github.com/bgamari/ghc-eventlog-socket).
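A sketch of what plugging in a custom writer can look like, assuming the interface is a struct of four callbacks (init/write/flush/stop) as exposed in rts/EventLogWriter.h, and that the RTS is pointed at it via a start-logging entry point (startEventLogging in the exposed API); treat the callback names below as illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include "rts/EventLogWriter.h"   /* the header this commit exposes */

/* Callbacks for a writer that dumps the binary eventlog to stderr. */
static void myInit(void)  { /* open a socket, file, ... */ }
static bool myWrite(void *eventlog, size_t size) {
    return fwrite(eventlog, 1, size, stderr) == size;
}
static void myFlush(void) { fflush(stderr); }
static void myStop(void)  { /* release any resources */ }

static const EventLogWriter stderrWriter = {
    .initEventLogWriter = myInit,
    .writeEventLog      = myWrite,
    .flushEventLog      = myFlush,
    .stopEventLogWriter = myStop,
};
```

A consumer such as ghc-eventlog-socket registers a struct like this with the RTS to stream events somewhere other than the default `.eventlog` file.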
* Use pointer equality in Eq/Ord for ThreadId (Roland Zumkeller, 2019-11-19; 2 files, -5/+21)
    Changes (==) to use only pointer equality. This is safe because two threads are the same iff they have the same id. Changes `compare` to check pointer equality first and fall back on ids only in case of inequality. See discussion in #16761.
* Changing Thread IDs from 32 bits to 64 bits. (Roland Zumkeller, 2019-11-19; 1 file, -3/+3)
* nonmoving: Drop redundant write barrier on stack underflow (Ben Gamari, 2019-11-19; 1 file, -10/+0)
    Previously we would push stack-carried return values to the new stack on a stack underflow. While the precise reasoning for this barrier is unfortunately lost to history, in hindsight I suspect it was prompted by a missing barrier elsewhere (that has since been fixed). Moreover, the redundant barrier is actively harmful: the stack may contain non-pointer values; blindly pushing these to the mark queue will result in a crash. This is precisely what happened in the `stack003` test. However, because of a (now fixed) deficiency in the test this crash did not trigger on amd64.
* nonmoving: Fix handling of large object marking on 32-bit (Ben Gamari, 2019-11-19; 1 file, -4/+7)
    Previously we would reset the pointer pointing to the object to be marked to the beginning of the block when marking a large object. This did no harm on 64-bit but broke on 32-bit, e.g. `arr020`, since we align pinned ByteArray allocations such that the payload is 8-byte aligned. This means that the object might not begin at the beginning of the block.
* nonmoving: Rework mark queue representation (Ben Gamari, 2019-11-19; 2 files, -23/+18)
    The previous representation needlessly limited the array length to 16 bits on 32-bit platforms.
* nonmoving: Fix incorrect masking in mark queue type test (Ben Gamari, 2019-11-19; 1 file, -2/+2)
    We were using TAG_BITS instead of TAG_MASK. This happened to work on 64-bit platforms, where TAG_BITS==3, since we only use tag values 0 and 3. However, this broke on 32-bit platforms, where TAG_BITS==2.
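For illustration, the relationship between the two macros and why the mix-up only bit on 32-bit (the definitions below mirror the usual `TAG_MASK = (1 << TAG_BITS) - 1` pattern rather than being copied from GHC's headers):

```c
#include <stdint.h>

#define TAG_BITS 2                        /* 32-bit platforms */
#define TAG_MASK ((1U << TAG_BITS) - 1)   /* == 0x3 */

/* Correct: extract the pointer-tag bits with the mask. */
static unsigned tagOf(uintptr_t p) {
    return p & TAG_MASK;
}

/* The bug was the equivalent of `p & TAG_BITS`, i.e. `p & 0x2`.
   On 64-bit (TAG_BITS == 3) the tags in use, 0 and 3, happen to
   survive `p & 3`, hiding the mistake; on 32-bit, tag bit 0 is
   silently dropped. */
```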
* nonmoving: Use correct info table pointer accessor (Ben Gamari, 2019-11-19; 1 file, -5/+7)
    Previously we used INFO_PTR_TO_STRUCT instead of THUNK_INFO_PTR_TO_STRUCT when looking at a thunk. These two happen to be equivalent on 64-bit architectures due to alignment considerations; however, they are different on 32-bit platforms. This led to #17487. To fix this we also employ a small optimization: there is only one thunk of type WHITEHOLE (namely stg_WHITEHOLE_info). Consequently, we can just use a plain pointer comparison instead of testing against info->type.
* rts: Add missing include of SymbolExtras.h (Ben Gamari, 2019-11-19; 1 file, -0/+1)
    This broke the Windows build.
* Properly account for libdw paths in make build system (Ben Gamari, 2019-11-19; 2 files, -1/+8)
    Should finally fix #17255.
* Enable USE_PTHREAD_FOR_ITIMER also on FreeBSD (Viktor Dukhovni, 2019-11-19; 1 file, -0/+3)
    If using a pthread instead of a timer signal is more reliable, and has no known drawbacks, then FreeBSD is also capable of supporting this mode of operation (tested on FreeBSD 12 with GHC 8.8.1, but no reason why it would not also work on FreeBSD 11 or GHC 8.6).

    Proposed by Kevin Zhang in: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=241849
* rts/nonmoving: Catch failure of createOSThread (Ben Gamari, 2019-11-08; 1 file, -2/+4)
* rts/NonMoving: Fix various Windows build issues (Ben Gamari, 2019-11-07; 2 files, -5/+6)
    The Windows build seems to be stricter about not providing threading primitives in the non-threaded RTS.
* rts: Remove undesirable inline specifier (Ben Gamari, 2019-11-07; 1 file, -1/+1)
    I have no idea why I marked this as inline originally, but clearly it shouldn't be inlined.
* rts: Ensure that Rts.h is always included first (Ben Gamari, 2019-11-07; 7 files, -6/+18)
    In general this is the convention that we use in the RTS. On Windows things actually fail if we break it. For instance, you see things like:

        includes\stg\Types.h:26:9: error:
            warning: #warning "Mismatch between __USE_MINGW_ANSI_STDIO definitions.
            If using Rts.h make sure it is the first header included." [-Wcpp]
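The convention, sketched for an arbitrary RTS source file (RtsUtils.h stands in for whatever else the file happens to need):

```c
/* Rts.h must come before every other include so that feature-test
   macros (e.g. __USE_MINGW_ANSI_STDIO on Windows) are defined
   consistently across the translation unit. */
#include "Rts.h"        /* always first */

#include "RtsUtils.h"   /* other RTS headers follow */
#include <string.h>     /* then system headers */
```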
* rts: Fix m32 allocator build on Windows (Ben Gamari, 2019-11-07; 3 files, -5/+9)
    An inconsistency in the name of m32_allocator_flush caused the build to fail with a missing prototype error.
* configure: Add --with-libdw-{includes,libraries} flags (Ben Gamari, 2019-11-06; 3 files, -2/+9)
    Fixing #17255.
* rts: Drop redundant flags for libffi (Ben Gamari, 2019-11-06; 1 file, -8/+3)
    These are now handled in the cabal file's include-dirs field.
* rts: Add missing const in HEAP_ALLOCED_GC (Ben Gamari, 2019-11-05; 1 file, -1/+1)
    This was previously unnoticed as this code-path is hit on very few platforms (e.g. OpenBSD).
* rts/linker: Ensure that code isn't writable (Ben Gamari, 2019-11-04; 8 files, -289/+368)
    For many years the linker would simply map all of its memory with PROT_READ|PROT_WRITE|PROT_EXEC. However operating systems have been becoming increasingly reluctant to accept this practice (e.g. #17353 and #12657) and for good reason: writable code is ripe for exploitation.

    Consequently mmapForLinker now maps its memory with PROT_READ|PROT_WRITE. After the linker has finished filling/relocating the mapping it must then call mmapForLinkerMarkExecutable on the sections of the mapping which contain executable code.

    Moreover, to make all of this possible it was necessary to redesign the m32 allocator. First, we gave (in an earlier commit) each ObjectCode its own m32_allocator. This was necessary since code loading and symbol resolution/relocation are currently interleaved, meaning that it is not possible to enforce W^X when symbols from different objects reside in the same page.

    We then redesigned the m32 allocator to take advantage of the fact that all of the pages allocated with the allocator die at the same time (namely, when the owning ObjectCode is unloaded). This makes a number of things simpler (e.g. no more page reference counting; the interface provided by the allocator for freeing is simpler). See Note [M32 Allocator] for details.
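A minimal sketch of the map-fill-protect sequence described above, using plain POSIX mmap/mprotect; the real logic lives in mmapForLinker and mmapForLinkerMarkExecutable and handles far more (alignment, per-section granularity, platform quirks):

```c
#include <string.h>
#include <sys/mman.h>

/* Map writable (but not executable), fill and relocate, then flip
   the protection to read+execute before any code in the region runs. */
static void *loadCode(const void *src, size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memcpy(p, src, len);                   /* fill/relocate while writable */
    if (mprotect(p, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(p, len);                    /* never leave it both W and X */
        return NULL;
    }
    return p;
}
```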
* Add +RTS --disable-delayed-os-memory-return. Fixes #17411. (Niklas Hambüchen, 2019-11-01; 2 files, -13/+43)
    Sets `MiscFlags.disableDelayedOsMemoryReturn`. See the added `Note [MADV_FREE and MADV_DONTNEED]` for details.
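A sketch of the underlying trade-off with POSIX madvise; `useDelayed` is a hypothetical stand-in for the RTS consulting `MiscFlags.disableDelayedOsMemoryReturn`:

```c
#include <sys/mman.h>

/* MADV_FREE marks pages reclaimable but they stay counted against
   the process until memory pressure actually reclaims them;
   MADV_DONTNEED returns them immediately, which keeps resident-memory
   numbers in tools like top meaningful. */
static void returnMemory(void *addr, size_t len, int useDelayed) {
#if defined(MADV_FREE)
    madvise(addr, len, useDelayed ? MADV_FREE : MADV_DONTNEED);
#else
    madvise(addr, len, MADV_DONTNEED);   /* MADV_FREE not available */
#endif
}
```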
* rts: Make m32 allocator per-ObjectCode (Ben Gamari, 2019-11-01; 7 files, -53/+83)
    MacOS Catalina is finally going to force our hand in forbidding writable executable mappings. Unfortunately, this is quite incompatible with the current global m32 allocator, which mixes symbols from various objects in a single page. The problem here is that some of these symbols may not yet be resolved (e.g. had relocations performed) as this happens lazily (and therefore we can't yet make the section read-only and therefore executable).

    The easiest way around this is to simply create one m32 allocator per ObjectCode. This may slightly increase fragmentation for short-running programs, but I suspect it will actually improve fragmentation for programs doing lots of loading/unloading, since we can always free all of the pages allocated to an object when it is unloaded (although this ability will only be implemented in a later patch).
* mmap: Factor out protection flags (Ben Gamari, 2019-11-01; 1 file, -4/+3)
* rts: More aarch64 header fixes (Ben Gamari, 2019-10-30; 5 files, -7/+10)
* Interpreter: initialize arity fields of AP_NOUPDs (Ömer Sinan Ağacan, 2019-10-29; 1 file, -4/+4)
    AP_NOUPD entry code doesn't use the arity field, but not initializing this field confuses printers/debuggers, and also makes testing harder as the field's value changes randomly.
* rts: Fix ARM linker includes (Ben Gamari, 2019-10-26; 6 files, -17/+7)
    * Prefer #pragma once over guard macros
    * Drop redundant #includes
    * Fix order to ensure that necessary macros are defined when we condition on them
* Implement shrinkSmallMutableArray# and resizeSmallMutableArray#. (Andrew Martin, 2019-10-26; 2 files, -1/+19)
    This is a part of GHC Proposal #25: "Offer more array resizing primitives". Resources related to the proposal:

    - Discussion: https://github.com/ghc-proposals/ghc-proposals/pull/121
    - Proposal: https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0025-resize-boxed.rst

    Only shrinkSmallMutableArray# is implemented as a primop since a library-space implementation of resizeSmallMutableArray# (in GHC.Exts) is no less efficient than a primop would be. This may be replaced by a primop in the future if someone devises a strategy for growing arrays in-place. The library-space implementation always copies the array when growing it.

    This commit also tweaks the documentation of the deprecated sizeofMutableByteArray#, removing the mention of concurrency. That primop is unsound even in single-threaded applications. Additionally, the non-negativity assertion on the existing shrinkMutableByteArray# primop has been removed since this predicate is trivially always true.
* configure: Drop GccLT46 (Ben Gamari, 2019-10-25; 1 file, -3/+0)
    GCC 4.6 was released 7 years ago. I think we can finally assume that it's available. This is a simplification prompted by #15742.
* Merge non-moving garbage collector (Ben Gamari, 2019-10-23; 50 files, -330/+6737)
    This introduces a concurrent mark & sweep garbage collector to manage the old generation. The concurrent nature of this collector typically results in significantly reduced maximum and mean pause times in applications with large working sets.

    Due to the large and intricate nature of the change I have opted to preserve the fully-buildable history, including merge commits, which is described in the "Branch overview" section below.

    Collector design
    ================

    The full design of the collector implemented here is described in detail in a technical note

    > B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
    > Compiler" (2018)

    This document can be requested from @bgamari.

    The basic heap structure used in this design is heavily inspired by

    > K. Ueno & A. Ohori. "A fully concurrent garbage collector for
    > functional programs on multicore processors." /ACM SIGPLAN Notices/
    > Vol. 51. No. 9 (presented at ICFP 2016)

    This design is intended to allow both marking and sweeping concurrent with execution of a multi-core mutator. Unlike the Ueno design, which requires no global synchronization pauses, the collector introduced here requires a stop-the-world pause at the beginning and end of the mark phase.

    To avoid heap fragmentation, the allocator consists of a number of fixed-size /sub-allocators/. Each of these sub-allocators allocates into its own set of /segments/, themselves allocated from the block allocator. Each segment is broken into a set of fixed-size allocation blocks (which back allocations) in addition to a bitmap (used to track the liveness of blocks) and some additional metadata (also used to track liveness).

    This heap structure enables collection via mark-and-sweep, which can be performed concurrently via a snapshot-at-the-beginning scheme (although concurrent collection is not implemented in this patch).

    Implementation structure
    ========================

    The majority of the collector is implemented in a handful of files:

    * `rts/Nonmoving.c` is the heart of the beast. It implements the entry-point to the nonmoving collector (`nonmoving_collect`), as well as the allocator (`nonmoving_allocate`) and a number of utilities for manipulating the heap.

    * `rts/NonmovingMark.c` implements the mark queue functionality, update remembered set, and mark loop.

    * `rts/NonmovingSweep.c` implements the sweep loop.

    * `rts/NonmovingScav.c` implements the logic necessary to scavenge the nonmoving heap.

    Branch overview
    ===============

    ```
    * wip/gc/opt-pause:
    |      A variety of small optimisations to further reduce pause times.
    |
    * wip/gc/compact-nfdata:
    |\     Introduce support for compact regions into the non-moving
    | \    collector
    |  \
    | | * wip/gc/segment-header-to-bdescr:
    | | |      Another optimization that we are considering, pushing
    | | |      some segment metadata into the segment descriptor for
    | | |      the sake of locality during mark
    | | |
    | * | wip/gc/shortcutting:
    | | |      Support for indirection shortcutting and the selector
    | | |      optimization in the non-moving heap.
    | | |
    * | | wip/gc/docs:
    | |/       Work on implementation documentation.
    | /
    |/
    * wip/gc/everything:
    |\     A roll-up of everything below.
    | \
    | |\
    | | \
    | | * wip/gc/optimize:
    | | |      A variety of optimizations, primarily to the mark loop.
    | | |      Some of these are microoptimizations but a few are quite
    | | |      significant. In particular, the prefetch patches have
    | | |      produced a nontrivial improvement in mark performance.
    | | |
    | | * wip/gc/aging:
    | | |      Enable support for aging in major collections.
    | | |
    | * | wip/gc/test:
    | | |      Fix up the testsuite to more or less pass.
    | | |
    * | | wip/gc/instrumentation:
    | | /      A variety of runtime instrumentation including statistics
    | |/       support, the nonmoving census, and eventlog support.
    | /
    |/
    * wip/gc/nonmoving-concurrent:
    |      The concurrent write barriers.
    |
    * wip/gc/nonmoving-nonconcurrent:
    |      The nonmoving collector without the write barriers necessary
    |      for concurrent collection.
    |
    * wip/gc/preparation:
    |      A merge of the various preparatory patches that aren't directly
    |      implementing the GC.
    |
    * GHC HEAD . . .
    ```
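A rough sketch of the segment shape the design section above describes (names and field sizes invented for illustration, not GHC's actual definition): per-segment metadata, a liveness bitmap, and fixed-size blocks backing allocations.

```c
#include <stdint.h>

/* One segment of a fixed-size sub-allocator: list linkage and the
   index of the next free block, followed by one mark/liveness byte
   per allocation block; the blocks themselves occupy the remainder
   of the segment's aligned region. */
struct SegmentSketch {
    struct SegmentSketch *link;  /* segment-list linkage */
    uint16_t next_free;          /* next unallocated block index */
    uint8_t  bitmap[];           /* one liveness byte per block */
};
```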
  * nonmoving: Upper-bound time we hold SM_MUTEX for during sweep [wip/gc/opt-pause] (Ben Gamari, 2019-10-22; 1 file, -1/+25)
  * nonmoving: Don't do two passes over large and compact object lists (Ben Gamari, 2019-10-22; 1 file, -10/+14)
      Previously we would first move the new objects to their appropriate non-moving GC list, then do another pass over that list to clear their mark bits. This is needlessly expensive. Instead, first clear the mark bits of the existing objects, then add the newly evacuated objects and, at the same time, clear their mark bits. This cuts the preparatory GC time in half for the Pusher benchmark with a large queue size.