path: root/compiler/cmm
Commit message [Author, Date; files changed, lines -/+]
* Module hierarchy: Cmm (cf #13009) [Sylvain Henry, 2020-01-25; 40 files, -18915/+0]
* Fix more typos, via an improved Levenshtein-style corrector [Brian Wignall, 2020-01-12; 2 files, -4/+4]
* Module hierarchy: Iface (cf #13009) [Sylvain Henry, 2020-01-06; 1 file, -2/+1]
* Fix typos, via a Levenshtein-style corrector [Brian Wignall, 2020-01-04; 6 files, -190/+190]
* Simplify mrStr [Gabor Greif, 2020-01-03; 1 file, -8/+1]
* Module hierarchy (#13009): Stg [Sylvain Henry, 2019-12-31; 1 file, -1/+1]
* Add GHC-API logging hooks [Sylvain Henry, 2019-12-18; 1 file, -12/+13]

  * Add a 'dumpAction' hook to DynFlags. It allows GHC API users to catch
    dumped intermediate code and information. The format of the dump (Core,
    Stg, raw text, etc.) is now reported, allowing easier automatic handling.

  * Add a 'traceAction' hook to DynFlags. Some dumps go through the trace
    mechanism (for instance unfoldings that have been considered for
    inlining). This is problematic because:
      1) dumps aren't written into files even with -ddump-to-file on
      2) dumps are written to stdout even with the GHC API
      3) in this specific case, dumping depends on unsafe, globally stored
         DynFlags, which is bad for GHC API users
    We introduce the 'traceAction' hook, which allows the GHC API to catch
    those traces and to avoid using globally stored DynFlags.

  * Avoid dumping empty logs via dumpAction/traceAction (but still write
    empty files to keep the existing behaviour).

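To make the hook idea concrete, here is a minimal sketch of the pattern in plain Haskell. The types are simplified and hypothetical (the real hooks live in DynFlags and work with SDoc and a richer DumpFormat), so treat it as an illustration of how an API user could capture dumps, not as the actual GHC interface.

```
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Hypothetical, simplified stand-ins for the real types in DynFlags.
data DumpFormat = FormatCore | FormatSTG | FormatCMM | FormatText
  deriving Show

type DumpAction = DumpFormat -> String -> String -> IO ()
--                ^ format      ^ header    ^ payload

-- Default behaviour: print to stdout (or a file with -ddump-to-file).
defaultDumpAction :: DumpAction
defaultDumpAction fmt header payload =
  putStrLn (header ++ " [" ++ show fmt ++ "]\n" ++ payload)

-- An API user could instead collect dumps programmatically:
collectingDumpAction :: IORef [(DumpFormat, String)] -> DumpAction
collectingDumpAction ref fmt _header payload =
  modifyIORef ref ((fmt, payload) :)

main :: IO ()
main = do
  ref <- newIORef []
  collectingDumpAction ref FormatCMM "==== Cmm ====" "stg_example() { ... }"
  readIORef ref >>= mapM_ print
```
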
* Implement pointer tagging for big families (#14373) [Gabor Greif, 2019-12-06; 1 file, -2/+2]

  Formerly we punted on these and evaluated constructors always got a tag
  of 1. We now cascade switches, because we have to check the tag first and,
  when it is MAX_PTR_TAG, get the precise tag from the info table and switch
  on that. The only technically tricky part is that the default case needs
  (logical) duplication. To do this we emit an extra label for it and branch
  to that from the second switch. This avoids duplicated codegen.

  Here's a simple example of the new code gen:

    data D = D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8

  On a 64-bit system previously all constructors would be tagged 1. With the
  new code gen D7 and D8 are tagged 7:

    [Lib.D7_con_entry() {
         ...
         {offset
           c1eu: // global
               R1 = R1 + 7;
               call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
         }
     }]

    [Lib.D8_con_entry() {
         ...
         {offset
           c1ez: // global
               R1 = R1 + 7;
               call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
         }
     }]

  When switching we now look at the info table only when the tag is 7. For
  example, if we derive Enum for the type above, the Cmm looks like this:

    c2Le:
        _s2Js::P64 = R1;
        _c2Lq::P64 = _s2Js::P64 & 7;
        switch [1 .. 7] _c2Lq::P64 {
            case 1 : goto c2Lk;
            case 2 : goto c2Ll;
            case 3 : goto c2Lm;
            case 4 : goto c2Ln;
            case 5 : goto c2Lo;
            case 6 : goto c2Lp;
            case 7 : goto c2Lj;
        }
    // Read info table for tag
    c2Lj:
        _c2Lv::I64 = %MO_UU_Conv_W32_W64(I32[I64[_s2Js::P64 & (-8)] - 4]);
        if (_c2Lv::I64 != 6) goto c2Lu; else goto c2Lt;

  Generated Cmm sizes do not change too much, but binaries are very slightly
  larger, due to the fact that the new instructions are longer in encoded
  form. E.g. previously entry code for D8 above would be

    00000000000001c0 <Lib_D8_con_info>:
      1c0: 48 ff c3       inc %rbx
      1c3: ff 65 00       jmpq *0x0(%rbp)

  With this patch

    00000000000001d0 <Lib_D8_con_info>:
      1d0: 48 83 c3 07    add $0x7,%rbx
      1d4: ff 65 00       jmpq *0x0(%rbp)

  This is one byte longer.

  Secondly, reading the info table directly and then switching is shorter

    _c1co:
        movq -1(%rbx),%rax
        movl -4(%rax),%eax
        // Switch on info table tag
        jmp *_n1d5(,%rax,8)

  than doing the same switch, and then for the tag 7 doing another switch:

    // When tag is 7
    _c1ct:
        andq $-8,%rbx
        movq (%rbx),%rax
        movl -4(%rax),%eax
        // Switch on info table tag
        ...

  Some changes of binary sizes in actual programs:

  - In NoFib the worst case is a 0.1% increase in benchmark "parser" (see
    NoFib results below). All programs get slightly larger.
  - Stage 2 compiler size does not change.
  - In "containers" (the library) the size of all object files increases
    0.0005%. The size of the test program "bitqueue-properties" increases
    0.03%.

  nofib benchmarks kindly provided by Ömer (@osa1):

  NoFib Results
  =============

  Program  Size  Allocs  Instrs  Reads  Writes
  --------------------------------------------
  CS  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  CSD  +0.0%  0.0%  0.0%  +0.0%  +0.0%
  FS  +0.0%  0.0%  0.0%  +0.0%  0.0%
  S  +0.0%  0.0%  -0.0%  0.0%  0.0%
  VS  +0.0%  0.0%  -0.0%  +0.0%  +0.0%
  VSD  +0.0%  0.0%  -0.0%  +0.0%  -0.0%
  VSM  +0.0%  0.0%  0.0%  0.0%  0.0%
  anna  +0.0%  0.0%  +0.1%  -0.9%  -0.0%
  ansi  +0.0%  0.0%  -0.0%  +0.0%  +0.0%
  atom  +0.0%  0.0%  0.0%  0.0%  0.0%
  awards  +0.0%  0.0%  -0.0%  +0.0%  0.0%
  banner  +0.0%  0.0%  -0.0%  +0.0%  0.0%
  bernouilli  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  binary-trees  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  boyer  +0.0%  0.0%  +0.0%  0.0%  -0.0%
  boyer2  +0.0%  0.0%  +0.0%  0.0%  -0.0%
  bspt  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  cacheprof  +0.0%  0.0%  +0.1%  -0.8%  0.0%
  calendar  +0.0%  0.0%  -0.0%  +0.0%  -0.0%
  cichelli  +0.0%  0.0%  +0.0%  0.0%  0.0%
  circsim  +0.0%  0.0%  -0.0%  -0.1%  -0.0%
  clausify  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  comp_lab_zift  +0.0%  0.0%  +0.0%  0.0%  -0.0%
  compress  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  compress2  +0.0%  0.0%  0.0%  0.0%  0.0%
  constraints  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  cryptarithm1  +0.0%  0.0%  +0.0%  0.0%  0.0%
  cryptarithm2  +0.0%  0.0%  +0.0%  -0.0%  0.0%
  cse  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  digits-of-e1  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  digits-of-e2  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  dom-lt  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  eliza  +0.0%  0.0%  -0.0%  +0.0%  0.0%
  event  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  exact-reals  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  exp3_8  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  expert  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  fannkuch-redux  +0.0%  0.0%  +0.0%  0.0%  0.0%
  fasta  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  fem  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  fft  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  fft2  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  fibheaps  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  fish  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  fluid  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  fulsom  +0.0%  0.0%  +0.0%  -0.0%  +0.0%
  gamteb  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  gcd  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  gen_regexps  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  genfft  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  gg  +0.0%  0.0%  0.0%  -0.0%  0.0%
  grep  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  hidden  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  hpg  +0.0%  0.0%  +0.0%  -0.1%  -0.0%
  ida  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  infer  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  integer  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  integrate  +0.0%  0.0%  0.0%  +0.0%  0.0%
  k-nucleotide  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  kahan  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  knights  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  lambda  +0.0%  0.0%  +1.2%  -6.1%  -0.0%
  last-piece  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  lcss  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  life  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  lift  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  linear  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  listcompr  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  listcopy  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  maillist  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  mandel  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  mandel2  +0.0%  0.0%  +0.0%  +0.0%  -0.0%
  mate  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  minimax  +0.0%  0.0%  -0.0%  +0.0%  -0.0%
  mkhprog  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  multiplier  +0.0%  0.0%  0.0%  +0.0%  -0.0%
  n-body  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  nucleic2  +0.0%  0.0%  +0.0%  +0.0%  -0.0%
  para  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  paraffins  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  parser  +0.1%  0.0%  +0.4%  -1.7%  -0.0%
  parstof  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  pic  +0.0%  0.0%  +0.0%  0.0%  -0.0%
  pidigits  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  power  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  pretty  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  primes  +0.0%  0.0%  +0.0%  0.0%  0.0%
  primetest  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  prolog  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  puzzle  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  queens  +0.0%  0.0%  0.0%  +0.0%  +0.0%
  reptile  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  reverse-complem  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  rewrite  +0.0%  0.0%  +0.0%  0.0%  -0.0%
  rfib  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  rsa  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  scc  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  sched  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  scs  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  simple  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  solid  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  sorting  +0.0%  0.0%  +0.0%  -0.0%  0.0%
  spectral-norm  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  sphere  +0.0%  0.0%  +0.0%  -1.0%  0.0%
  symalg  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  tak  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  transform  +0.0%  0.0%  +0.4%  -1.3%  +0.0%
  treejoin  +0.0%  0.0%  +0.0%  -0.0%  0.0%
  typecheck  +0.0%  0.0%  -0.0%  +0.0%  0.0%
  veritas  +0.0%  0.0%  +0.0%  -0.1%  +0.0%
  wang  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  wave4main  +0.0%  0.0%  +0.0%  0.0%  -0.0%
  wheel-sieve1  +0.0%  0.0%  +0.0%  +0.0%  +0.0%
  wheel-sieve2  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  x2n1  +0.0%  0.0%  +0.0%  +0.0%  0.0%
  --------------------------------------------
  Min  +0.0%  0.0%  -0.0%  -6.1%  -0.0%
  Max  +0.1%  0.0%  +1.2%  +0.0%  +0.0%
  Geometric Mean  +0.0%  -0.0%  +0.0%  -0.1%  -0.0%

  NoFib GC Results
  ================

  Program  Size  Allocs  Instrs  Reads  Writes
  --------------------------------------------
  circsim  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  constraints  +0.0%  0.0%  -0.0%  0.0%  -0.0%
  fibheaps  +0.0%  0.0%  0.0%  -0.0%  -0.0%
  fulsom  +0.0%  0.0%  0.0%  -0.6%  -0.0%
  gc_bench  +0.0%  0.0%  0.0%  0.0%  -0.0%
  hash  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  lcss  +0.0%  0.0%  0.0%  -0.0%  0.0%
  mutstore1  +0.0%  0.0%  0.0%  -0.0%  -0.0%
  mutstore2  +0.0%  0.0%  +0.0%  -0.0%  -0.0%
  power  +0.0%  0.0%  -0.0%  0.0%  -0.0%
  spellcheck  +0.0%  0.0%  -0.0%  -0.0%  -0.0%
  --------------------------------------------
  Min  +0.0%  0.0%  -0.0%  -0.6%  -0.0%
  Max  +0.0%  0.0%  +0.0%  0.0%  0.0%
  Geometric Mean  +0.0%  +0.0%  +0.0%  -0.1%  +0.0%

  Fixes #14373

  These performance regressions appear to be a fluke in CI. See the
  discussion in !1742 for details.

  Metric Increase:
      T6048
      T12234
      T12425
      Naperian
      T12150
      T5837
      T13035

* Add `timesInt2#` primop [Sylvain Henry, 2019-12-02; 2 files, -0/+2]
* Fix typos, using Wikipedia list of common typos [Brian Wignall, 2019-11-28; 3 files, -3/+3]
* Fix typos [Brian Wignall, 2019-11-23; 1 file, -1/+1]
* Fix random typos [skip ci] [nineonine, 2019-11-17; 2 files, -2/+2]
* Document CmmTopInfo type [Ömer Sinan Ağacan, 2019-11-13; 1 file, -0/+2]

  [ci skip]

* Merge non-moving garbage collector [Ben Gamari, 2019-10-23; 1 file, -1/+6]

  This introduces a concurrent mark & sweep garbage collector to manage the
  old generation. The concurrent nature of this collector typically results
  in significantly reduced maximum and mean pause times in applications with
  large working sets.

  Due to the large and intricate nature of the change I have opted to
  preserve the fully-buildable history, including merge commits, which is
  described in the "Branch overview" section below.

  Collector design
  ================

  The full design of the collector implemented here is described in detail
  in a technical note

  > B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
  > Compiler" (2018)

  This document can be requested from @bgamari.

  The basic heap structure used in this design is heavily inspired by

  > K. Ueno & A. Ohori. "A fully concurrent garbage collector for
  > functional programs on multicore processors." /ACM SIGPLAN Notices/
  > Vol. 51. No. 9 (presented at ICFP 2016)

  This design is intended to allow both marking and sweeping concurrent to
  execution of a multi-core mutator. Unlike the Ueno design, which requires
  no global synchronization pauses, the collector introduced here requires
  a stop-the-world pause at the beginning and end of the mark phase.

  To avoid heap fragmentation, the allocator consists of a number of
  fixed-size /sub-allocators/. Each of these sub-allocators allocates into
  its own set of /segments/, themselves allocated from the block allocator.
  Each segment is broken into a set of fixed-size allocation blocks (which
  back allocations), in addition to a bitmap (used to track the liveness of
  blocks) and some additional metadata (also used to track liveness).

  This heap structure enables collection via mark-and-sweep, which can be
  performed concurrently via a snapshot-at-the-beginning scheme (although
  concurrent collection is not implemented in this patch).

  Implementation structure
  ========================

  The majority of the collector is implemented in a handful of files:

  * `rts/Nonmoving.c` is the heart of the beast. It implements the
    entry-point to the nonmoving collector (`nonmoving_collect`), as well
    as the allocator (`nonmoving_allocate`) and a number of utilities for
    manipulating the heap.
  * `rts/NonmovingMark.c` implements the mark queue functionality, update
    remembered set, and mark loop.
  * `rts/NonmovingSweep.c` implements the sweep loop.
  * `rts/NonmovingScav.c` implements the logic necessary to scavenge the
    nonmoving heap.

  Branch overview
  ===============

  The contributing branches, from the most recent tips down to GHC HEAD:

  * wip/gc/opt-pause: a variety of small optimisations to further reduce
    pause times.
  * wip/gc/compact-nfdata: introduce support for compact regions into the
    non-moving collector.
  * wip/gc/segment-header-to-bdescr: another optimization that we are
    considering, pushing some segment metadata into the segment descriptor
    for the sake of locality during mark.
  * wip/gc/shortcutting: support for indirection shortcutting and the
    selector optimization in the non-moving heap.
  * wip/gc/docs: work on implementation documentation.
  * wip/gc/everything: a roll-up of everything below.
  * wip/gc/optimize: a variety of optimizations, primarily to the mark
    loop. Some of these are microoptimizations but a few are quite
    significant. In particular, the prefetch patches have produced a
    nontrivial improvement in mark performance.
  * wip/gc/aging: enable support for aging in major collections.
  * wip/gc/test: fix up the testsuite to more or less pass.
  * wip/gc/instrumentation: a variety of runtime instrumentation including
    statistics support, the nonmoving census, and eventlog support.
  * wip/gc/nonmoving-concurrent: the concurrent write barriers.
  * wip/gc/nonmoving-nonconcurrent: the nonmoving collector without the
    write barriers necessary for concurrent collection.
  * wip/gc/preparation: a merge of the various preparatory patches that
    aren't directly implementing the GC.
  * GHC HEAD

* rts: Implement concurrent collection in the nonmoving collector [Ben Gamari, 2019-10-20; 1 file, -1/+6]

  This extends the non-moving collector to allow concurrent collection.

  The full design of the collector implemented here is described in detail
  in a technical note

    B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
    Compiler" (2018)

  This extension involves the introduction of a capability-local remembered
  set, known as the /update remembered set/, which tracks objects which may
  no longer be visible to the collector due to mutation. To maintain this
  remembered set we introduce a write barrier on mutations which is enabled
  while a concurrent mark is underway.

  The update remembered set representation is similar to that of the
  nonmoving mark queue, being a chunked array of `MarkEntry`s. Each
  `Capability` maintains a single accumulator chunk, which it flushes when
  (a) it is filled, or (b) the nonmoving collector enters its post-mark
  synchronization phase.

  While the write barrier touches a significant amount of code it is
  conceptually straightforward: the mutator must ensure that the referee of
  any pointer it overwrites is added to the update remembered set. However,
  there are a few details:

  * In the case of objects with a dirty flag (e.g. `MVar`s) we can exploit
    the fact that only the *first* mutation requires a write barrier.
  * Weak references, as usual, complicate things. In particular, we must
    ensure that the referee of a weak object is marked if dereferenced by
    the mutator. For this we (unfortunately) must introduce a read barrier,
    as described in Note [Concurrent read barrier on deRefWeak#]
    (in `NonMovingMark.c`).
  * Stable names are also a bit tricky, as described in Note [Sweeping
    stable names in the concurrent collector] (`NonMovingSweep.c`).

  We take quite some pains to ensure that the high thread count often seen
  in parallel Haskell applications doesn't affect pause times. To this end
  we allow thread stacks to be marked either by the thread itself (when it
  is executed or its stack underflows) or by the concurrent mark thread (if
  the thread owning the stack is never scheduled). There is a non-trivial
  handshake to ensure that this happens without racing, which is described
  in Note [StgStack dirtiness flags and concurrent marking].

  Co-Authored-by: Ömer Sinan Ağacan <omer@well-typed.com>

* Make dynflag argument for withTiming pure. [Andreas Klebinger, 2019-10-23; 3 files, -22/+22]

  19 times out of 20 we already have dynflags in scope.

  We could just always use `return dflags`. But this is in fact not free.
  When looking at some STG code I noticed that we always allocate a closure
  for this expression on the heap. Clearly a waste in these cases.

  For the other cases we can either just modify the call site to get
  dynflags, or use the _D variants of withTiming I added, which will use
  getDynFlags under the hood.

* Add loop level analysis to the NCG backend. [klebinger.andreas@gmx.at, 2019-10-16; 1 file, -3/+2]

  For backends maintaining the CFG during codegen we can now find loops and
  their nesting level. This is based on the Cmm CFG and dominator analysis.

  As a result we can estimate edge frequencies a lot better for methods,
  resulting in far better code layout. Speedup on nofib: ~1.5%. Increase in
  compile times: ~1.9%.

  To make this feasible this commit adds:

  * Dominator analysis based on the Lengauer-Tarjan algorithm.
  * An algorithm estimating global edge frequencies from branch
    probabilities, in CFG.hs.

  A few static branch prediction heuristics:

  * Expect to take the backedge in loops.
  * Expect to take the branch NOT exiting a loop.
  * Expect integer vs constant comparisons to be false.

  We also treat heap/stack checks specially for branch prediction to avoid
  them being treated as loops.

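As a rough illustration of how such static heuristics combine, here is a small sketch in Haskell; the edge categories and weights are invented for the example and are not the numbers used in CFG.hs.

```
-- Invented edge categories and weights, for illustration only.
data EdgeKind
  = LoopBackedge   -- branch back to a loop header
  | LoopExit       -- branch leaving a loop
  | EqConstCmp     -- an (x == constant) comparison
  | OtherEdge

-- Estimated probability that the branch is taken.
takenProbability :: EdgeKind -> Double
takenProbability LoopBackedge = 0.9   -- expect to stay in the loop
takenProbability LoopExit     = 0.1   -- expect NOT to take the loop exit
takenProbability EqConstCmp   = 0.3   -- expect equality with a constant to fail
takenProbability OtherEdge    = 0.5

main :: IO ()
main = mapM_ (print . takenProbability)
             [LoopBackedge, LoopExit, EqConstCmp, OtherEdge]
```
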
* PmCheck: Only ever check constantly many models against a single pattern [Sebastian Graf, 2019-09-25; 1 file, -4/+0]

  Introduces a new flag `-fmax-pmcheck-deltas` to achieve that. Deprecates
  the old `-fmax-pmcheck-iter` mechanism in favor of this new flag.

  From the user's guide:

    Pattern match checking can be exponential in some cases. This limit
    makes sure we scale polynomially in the number of patterns, by
    forgetting refined information gained from a partially successful
    match. For example, when matching `x` against `Just 4`, we split each
    incoming matching model into two sub-models: One where `x` is `Nothing`
    and one where `x` is `Just y` but `y` is not `4`. When the number of
    incoming models exceeds the limit, we continue checking the next clause
    with the original, unrefined model.

  This also retires the incredibly hard to understand "maximum number of
  refinements" mechanism, because the current mechanism is more general
  and should catch the same exponential cases like PrelRules at the same
  time.

  -------------------------
  Metric Decrease:
      T11822
  -------------------------

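A tiny example of the splitting described in the quoted paragraph (the function is illustrative, not from the patch): after the first clause, the checker carries the uncovered sub-models `x = Nothing` and `x = Just y, y /= 4` into the next clause, and the new flag caps how many such refined models are kept.

```
f :: Maybe Int -> String
f (Just 4) = "four"   -- splits the incoming model into two uncovered sub-models
f _        = "other"  -- checked against those sub-models (or against the
                      -- original, unrefined model once the limit is exceeded)
```
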
* ErrUtils: split withTiming into withTiming and withTimingSilent [Alp Mestanogullari, 2019-09-19; 2 files, -3/+4]

  'withTiming' becomes a function that, when passed '-vN' (N >= 2) or
  '-ddump-timings', will print timing (and possibly allocation) related
  information. When additionally built with '-eventlog' and executed with
  '+RTS -l', 'withTiming' will also emit both 'traceMarker' and 'traceEvent'
  events to the eventlog.

  'withTimingSilent', on the other hand, will never print any timing
  information, under any circumstance, and will only emit 'traceEvent'
  events to the eventlog. As pointed out in !1672, 'traceMarker' is better
  suited for things that we might want to visualize in tools like
  eventlog2html, while 'traceEvent' is better suited for internal events
  that occur a lot more often and that we don't necessarily want to
  visualize.

  This addresses #17138 by using 'withTimingSilent' for all the codegen
  bits that are expressed as a bunch of small computations over streams of
  codegen ASTs.

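For intuition, a self-contained sketch of the loud/silent split; this is simplified and hypothetical (the real functions live in ErrUtils, take DynFlags and an allocation-forcing argument, and emit eventlog markers), so read it as the shape of the idea rather than the actual API.

```
import System.CPUTime (getCPUTime)

-- Times an action; only the "loud" variant prints, mirroring the idea that
-- withTimingSilent never writes timing output to the console.
withTiming' :: Bool -> String -> IO a -> IO a
withTiming' loud what act = do
  t0 <- getCPUTime
  r  <- act
  t1 <- getCPUTime
  if loud
    then putStrLn (what ++ ": "
                   ++ show (fromIntegral (t1 - t0) / 1e9 :: Double) ++ " ms")
    else pure ()   -- silent variant: would only emit a traceEvent
  pure r

withTiming, withTimingSilent :: String -> IO a -> IO a
withTiming       = withTiming' True
withTimingSilent = withTiming' False
```
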
* Module hierarchy: StgToCmm (#13009) [Sylvain Henry, 2019-09-10; 11 files, -131/+30]

  Add the StgToCmm module hierarchy. Platform modules that are used in
  several other places (NCG, LLVM codegen, Cmm transformations) are put
  into GHC.Platform.

* Make the C-- O and C types constructors with DataKinds [John Ericson, 2019-09-05; 3 files, -15/+23]

  This tightens up the kinds a bit. I use type synonyms to avoid adding
  promotion ticks everywhere.

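A sketch of the general technique (the names are illustrative rather than the exact Cmm/Hoopl ones): with DataKinds the open/closed markers become a closed kind, so a block's shape on entry and exit is tracked in its type instead of by convention.

```
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

-- The promoted kind: is a node open or closed at a given end?
data Extensibility = O | C

data Node (e :: Extensibility) (x :: Extensibility) where
  Label  :: String -> Node 'C 'O   -- closed on entry, open on exit
  Assign :: String -> Node 'O 'O   -- falls through on both ends
  Branch :: String -> Node 'O 'C   -- open on entry, closed on exit

-- Only well-shaped nodes inhabit this type:
middle :: Node 'O 'O
middle = Assign "x = x + 1"
```
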
* Few tweaks in -ddump-debug output, minor refactoring [Ömer Sinan Ağacan, 2019-09-02; 1 file, -14/+11]

  - Fixes crazy indentation in -ddump-debug output
  - We no longer dump empty sections in -ddump-debug when a code block does
    not have any generated debug info
  - Minor refactoring in Debug.hs and AsmCodeGen.hs

* Small optimization in the SRT algorithm [Ömer Sinan Ağacan, 2019-08-29; 1 file, -1/+1]

  Noticed by @simonmar in !1362:

    If the srtEntry is Nothing, then it should be safe to omit references
    to this SRT from other SRTs, even if it is a static function.

  When updating the SRT map we don't omit references to static functions
  (see Note [Invalid optimisation: shortcutting]), but there's no reason to
  add an SRT entry for a static function if the function is not CAFFY.
  (Previously we'd add SRT entries for static functions even when they're
  not CAFFY.)

  Using 9151b99e I checked the sizes of all SRTs when building GHC and
  containers:

  - GHC: 583736 (HEAD), 581695 (this patch). 2041 fewer SRT entries.
  - containers: 2457 (HEAD), 2381 (this patch). 76 fewer SRT entries.

* Return results of Cmm streams in backends [Ömer Sinan Ağacan, 2019-08-28; 1 file, -4/+5]

  This generalizes the code generators (outputAsm, outputLlvm, outputC, and
  the call site codeOutput) so that they'll return the return values of the
  passed Cmm streams. This allows accumulating data during Cmm generation
  and returning it to the call site in HscMain.

  Previously the Cmm streams were assumed to return (), so the code
  generators returned () as well.

  This change is required by !1304 and !1530.

  Skipping CI as this was tested before and I only updated the commit
  message. [skip ci]

* Make non-streaming LLVM and C backends streaming [Ömer Sinan Ağacan, 2019-08-23; 1 file, -9/+3]

  This adds a Stream.consume function, uses it in the LLVM and C code
  generators, and removes the use of the Stream.collect function, which was
  used to collect streaming Cmm generation results into a list. The LLVM
  and C backends now properly use streamed Cmm generation, instead of
  collecting Cmm groups into a list before generating LLVM/C code.

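A minimal sketch of the streaming idea (this is not GHC's actual Stream module, just the shape of it, with invented names): elements are interleaved with a monadic computation, and a consumer processes each element as it is produced instead of collecting the whole list first.

```
newtype Stream m a r = Stream { runStream :: m (Either r (a, Stream m a r)) }

-- Consume each element as it is yielded; return the stream's final result.
consume :: Monad m => Stream m a r -> (a -> m ()) -> m r
consume str f = do
  step <- runStream str
  case step of
    Left r          -> pure r
    Right (a, rest) -> f a >> consume rest f

-- A toy producer yielding the elements of a list, returning their count.
fromList :: Monad m => [a] -> Stream m a Int
fromList = go 0
  where
    go n []       = Stream (pure (Left n))
    go n (x : xs) = Stream (pure (Right (x, go (n + 1) xs)))

main :: IO ()
main = do
  n <- consume (fromList ["cmm group 1", "cmm group 2"]) putStrLn
  print n
```
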
* Remove unused imports of the form 'import foo ()' (Fixes #17065) [James Foster, 2019-08-15; 10 files, -12/+5]

  These kinds of imports are necessary in some cases, such as importing
  instances of typeclasses or intentionally creating dependencies in the
  build system, but '-Wunused-imports' can't detect when they are no longer
  needed. This commit removes the unused ones currently in the code base
  (not including test files or submodules), with the hope that doing so may
  increase parallelism in the build system by removing unnecessary
  dependencies.

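For context, an instance-only import looks like the line below (the module name is hypothetical); because it names nothing, -Wunused-imports has no way to tell whether the instances or build dependency it exists for are still needed.

```
-- Imported solely for its orphan instances (or to force a build dependency);
-- nothing is brought into scope by name.
import Data.Orphans ()
```
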
* Introduce a type for "platform word size", use it instead of Int [Ömer Sinan Ağacan, 2019-08-06; 3 files, -15/+15]

  We introduce a PlatformWordSize type and use it in the platformWordSize
  field. This removes the panic/error calls made when the platform word
  size is not 32 or 64. We now check for this when reading the platform
  config.

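A sketch of the idea (constructor names assumed for the example, not necessarily those in the patch): a dedicated sum type makes the 32-vs-64-bit choice total, so consumers pattern match instead of panicking on impossible Int values.

```
data PlatformWordSize = PW4 | PW8   -- 4-byte or 8-byte words
  deriving (Eq, Show)

platformWordSizeInBytes :: PlatformWordSize -> Int
platformWordSizeInBytes PW4 = 4
platformWordSizeInBytes PW8 = 8

platformWordSizeInBits :: PlatformWordSize -> Int
platformWordSizeInBits = (* 8) . platformWordSizeInBytes
```
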
* compiler: emit finer grained codegen events to eventlog [Alp Mestanogullari, 2019-08-02; 2 files, -5/+14]
* Change behaviour of -ddump-cmm-verbose to dump each Cmm pass output to a
  separate file and add -ddump-cmm-verbose-by-proc to keep old behaviour
  (#16930) [nineonine, 2019-07-26; 1 file, -6/+7]
* Create {Int,Word}32Rep [John Ericson, 2019-07-17; 1 file, -0/+4]

  This prepares the way for making Int32# and Word32# the actual size they
  claim to be. Updates the binary submodule for (de)serializing the new
  runtime reps.

* Revert "Add support for SIMD operations in the NCG"Ben Gamari2019-07-167-122/+52
| | | | | | | Unfortunately this will require more work; register allocation is quite broken. This reverts commit acd795583625401c5554f8e04ec7efca18814011.
* Minor refactoring in CmmBuildInfoTables [Ömer Sinan Ağacan, 2019-07-13; 1 file, -5/+3]

  - Replace `catMaybes (map ...)` with `mapMaybe ...`
  - Remove a list->set->list conversion

* Add two CmmSwitch optimizations. [Andreas Klebinger, 2019-07-13; 4 files, -4/+47]

  Move switch expressions into a local variable when generating switches.
  This avoids duplicating the expression if we translate the switch to a
  tree search. This fixes #16933.

  Further we now check if all branches of a switch have the same
  destination, replacing the switch with a direct branch if that is the
  case.

  Both of these patterns appear in the ENTER macro used by the RTS but are
  unlikely to occur in intermediate Cmm generated by GHC.

  Nofib result summary:

    --------------------------------------------------------------------
    Program          Size    Allocs   Runtime   Elapsed   TotalMem
    --------------------------------------------------------------------
    Min             -0.0%    -0.0%    -15.7%    -15.6%      0.0%
    Max             -0.0%     0.0%     +5.4%     +5.5%      0.0%
    Geometric Mean  -0.0%    -0.0%     -1.0%     -1.0%     -0.0%

  Compiler allocations go up slightly: +0.2%

  Example output before and after the change, taken from the RTS code
  below. All but one of the memory loads `I32[_c3::I64 - 8]` are
  eliminated. Instead the data is loaded once from memory in block c6.
  Also the switch in block `ud` in the original code has been eliminated
  completely.

  Cmm without this commit:

  ```
  stg_ap_0_fast() { // [R1]
          { [] }
      {offset
        ca: _c1::P64 = R1;   // CmmAssign
            goto c2;   // CmmBranch
        c2: if (_c1::P64 & 7 != 0) goto c4; else goto c6;
        c6: _c3::I64 = I64[_c1::P64];
            if (I32[_c3::I64 - 8] < 26 :: W32) goto ub; else goto ug;
        ub: if (I32[_c3::I64 - 8] < 15 :: W32) goto uc; else goto ue;
        uc: if (I32[_c3::I64 - 8] < 8 :: W32) goto c7; else goto ud;
        ud: switch [8 .. 14] (%MO_SS_Conv_W32_W64(I32[_c3::I64 - 8])) {
                case 8, 9, 10, 11, 12, 13, 14 : goto c4;
            }
        ue: if (I32[_c3::I64 - 8] >= 25 :: W32) goto c4; else goto uf;
        uf: if (%MO_SS_Conv_W32_W64(I32[_c3::I64 - 8]) != 23) goto c7; else goto c4;
        c4: R1 = _c1::P64;
            call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
        ug: if (I32[_c3::I64 - 8] < 28 :: W32) goto uh; else goto ui;
        uh: if (I32[_c3::I64 - 8] < 27 :: W32) goto c7; else goto c8;
        ui: if (I32[_c3::I64 - 8] < 29 :: W32) goto c8; else goto c7;
        c8: _c1::P64 = P64[_c1::P64 + 8];
            goto c2;
        c7: R1 = _c1::P64;
            call (_c3::I64)(R1) args: 8, res: 0, upd: 8;
      }
  }
  ```

  Cmm with this commit:

  ```
  stg_ap_0_fast() { // [R1]
          { [] }
      {offset
        ca: _c1::P64 = R1;
            goto c2;
        c2: if (_c1::P64 & 7 != 0) goto c4; else goto c6;
        c6: _c3::I64 = I64[_c1::P64];
            _ub::I64 = %MO_SS_Conv_W32_W64(I32[_c3::I64 - 8]);
            if (_ub::I64 < 26) goto uc; else goto uh;
        uc: if (_ub::I64 < 15) goto ud; else goto uf;
        ud: if (_ub::I64 < 8) goto c7; else goto c4;
        uf: if (_ub::I64 >= 25) goto c4; else goto ug;
        ug: if (_ub::I64 != 23) goto c7; else goto c4;
        c4: R1 = _c1::P64;
            call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
        uh: if (_ub::I64 < 28) goto ui; else goto uj;
        ui: if (_ub::I64 < 27) goto c7; else goto c8;
        uj: if (_ub::I64 < 29) goto c8; else goto c7;
        c8: _c1::P64 = P64[_c1::P64 + 8];
            goto c2;
        c7: R1 = _c1::P64;
            call (_c3::I64)(R1) args: 8, res: 0, upd: 8;
      }
  }
  ```

* Add support for SIMD operations in the NCG [Abhiroop Sarkar, 2019-07-03; 7 files, -52/+122]

  This adds support for constructing vector types from Float#, Double#,
  etc. and performing arithmetic operations on them.

  Cleaned-Up-By: Ben Gamari <ben@well-typed.com>

* Correct closure observation, construction, and mutation on weak memory machines. [Travis Whitaker, 2019-06-28; 3 files, -0/+3]

  Here the following changes are introduced:

  - A read barrier machine op is added to Cmm.
  - The order in which a closure's fields are read and written is changed.
  - Memory barriers are added to RTS code to ensure correctness on
    out-of-order machines with weak memory ordering.

  Cmm has a new CallishMachOp called MO_ReadBarrier. On weak memory
  machines, this is lowered to an instruction that ensures memory reads
  that occur after said instruction in program order are not performed
  before reads coming before said instruction in program order. On machines
  with strong memory ordering properties (e.g. X86, SPARC in TSO mode) no
  such instruction is necessary, so MO_ReadBarrier is simply erased.
  However, such an instruction is necessary on weakly ordered machines,
  e.g. ARM and PowerPC.

  Weak memory ordering has consequences for how closures are observed and
  mutated. For example, consider a closure that needs to be updated to an
  indirection. In order for the indirection to be safe for concurrent
  observers to enter, said observers must read the indirection's info table
  before they read the indirectee. Furthermore, the entering observer makes
  assumptions about the closure based on its info table contents, e.g. an
  INFO_TYPE of IND implies the closure has an indirectee pointer that is
  safe to follow.

  When a closure is updated with an indirection, both its info table and
  its indirectee must be written. With weak memory ordering, these two
  writes can be arbitrarily reordered, and perhaps even interleaved with
  other threads' reads and writes (in the absence of memory barrier
  instructions). Consider this example of a bad reordering:

  - An updater writes to a closure's info table (INFO_TYPE is now IND).
  - A concurrent observer branches upon reading the closure's INFO_TYPE
    as IND.
  - A concurrent observer reads the closure's indirectee and enters it.
    (!!!)
  - An updater writes the closure's indirectee.

  Here the update to the indirectee comes too late and the concurrent
  observer has jumped off into the abyss. Speculative execution can also
  cause us issues; consider:

  - An observer is about to case on a value in the closure's info table.
  - The observer speculatively reads one or more of the closure's fields.
  - An updater writes to the closure's info table.
  - The observer takes a branch based on the new info table value, but
    with the old closure fields!
  - The updater writes to the closure's other fields, but it's too late.

  Because of these effects, reads and writes to a closure's info table must
  be ordered carefully with respect to reads and writes to the closure's
  other fields, and memory barriers must be placed to ensure that reads and
  writes occur in program order. Specifically, updates to a closure must
  follow the following pattern:

  - Update the closure's (non-info table) fields.
  - Write barrier.
  - Update the closure's info table.

  Observing a closure's fields must follow the following pattern:

  - Read the closure's info pointer.
  - Read barrier.
  - Read the closure's (non-info table) fields.

  This patch updates RTS code to obey this pattern. This should fix
  long-standing SMP bugs on ARM (specifically newer aarch64
  microarchitectures supporting out-of-order execution) and PowerPC. This
  fixes issue #15449.

  Co-Authored-By: Ben Gamari <ben@well-typed.com>

* Simplify link_caf and mkForeignLabel functions [Ömer Sinan Ağacan, 2019-06-25; 1 file, -2/+1]
* Move 'Platform' to ghc-boot [John Ericson, 2019-06-19; 11 files, -11/+11]

  ghc-pkg needs to be aware of platforms so it can figure out which
  subdirectory within the user package db to use. This is admittedly
  roundabout, but maybe Cabal could use the same notion of a platform as
  GHC to good effect too.

* Fix a Note name in CmmNode [Ömer Sinan Ağacan, 2019-06-19; 1 file, -1/+1]

  ("Continuation BlockIds" is referenced in CmmProcPoint)

  [skip ci]

* Use TupleSections in CmmParse.y, simplify a few exprs [Ömer Sinan Ağacan, 2019-06-16; 1 file, -26/+28]
* Use DeriveFunctor throughout the codebase (#15654) [Krzysztof Gogolewski, 2019-06-12; 3 files, -18/+10]
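An example of the kind of boilerplate this sweep removes (the type is illustrative, not one of the changed modules): a hand-written Functor instance becomes a deriving clause.

```
{-# LANGUAGE DeriveFunctor #-}

-- Before: instance Functor Pair where fmap f (Pair a b) = Pair (f a) (f b)
-- After:
data Pair a = Pair a a
  deriving (Functor, Show)

main :: IO ()
main = print (fmap (+ 1) (Pair 1 2))   -- Pair 2 3
```
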
* Introduce log1p and expm1 primops [chessai, 2019-06-09; 2 files, -0/+8]

  Previously log and exp were primitives, yet log1p and expm1 were FFI
  calls. Fix this non-uniformity.

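The numerical motivation, as a small example using the log1p exposed for Floating (available from Numeric in recent base versions): for tiny x, 1 + x rounds to 1 in Double, so log (1 + x) collapses to zero while log1p x keeps the answer.

```
import Numeric (log1p)   -- Floating method, exported by Numeric in recent base

main :: IO ()
main = do
  let x = 1e-18 :: Double
  print (log (1 + x))   -- 0.0: 1 + x has already rounded to 1
  print (log1p x)       -- ~1.0e-18: accurate result
```
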
* Remove trailing whitespace [Matthew Pickering, 2019-06-08; 1 file, -2/+2]

  [skip ci]

  This should really be caught by the linters! (#16711)

* Inline `Settings` into `DynFlags` [John Ericson, 2019-05-29; 3 files, -12/+12]

  After the previous commit, `Settings` is just a thin wrapper around other
  groups of settings. While `Settings` is used by GHC-the-executable to
  initialize `DynFlags`, in principle another consumer of GHC-the-library
  could initialize `DynFlags` a different way. It therefore doesn't make
  sense for `DynFlags` itself (library code) to separate the settings that
  typically come from `Settings` from the settings that typically don't.

* Add missing opening braces in Cmm dumps [Ömer Sinan Ağacan, 2019-05-27; 1 file, -1/+1]

  Previously -ddump-cmm was generating code with unbalanced curly braces:

    stg_atomically_entry() //  [R1]
            { info_tbls: [(cfl,
                           label: stg_atomically_info
                           rep: tag:16 HeapRep 1 ptrs { Thunk }
                           srt: Nothing)]
              stack_info: arg_space: 8 updfr_space: Just 8
            }
        {offset
          cfl: // cfk
              unwind Sp = Just Sp + 0;
              _cfk::P64 = R1;
              //tick src<rts/PrimOps.cmm:(1243,1)-(1245,1)>
              R1 = I64[_cfk::P64 + 8 + 8 + 0 * 8];
              call stg_atomicallyzh(R1) args: 8, res: 0, upd: 8;
        }
    },                                  <---- OPENING BRACE MISSING

  After this patch:

    stg_atomically_entry() { //  [R1]   <---- MISSING OPENING BRACE HERE
            { info_tbls: [(cfl,
                           label: stg_atomically_info
                           rep: tag:16 HeapRep 1 ptrs { Thunk }
                           srt: Nothing)]
              stack_info: arg_space: 8 updfr_space: Just 8
            }
        {offset
          cfl: // cfk
              unwind Sp = Just Sp + 0;
              _cfk::P64 = R1;
              //tick src<rts/PrimOps.cmm:(1243,1)-(1245,1)>
              R1 = I64[_cfk::P64 + 8 + 8 + 0 * 8];
              call stg_atomicallyzh(R1) args: 8, res: 0, upd: 8;
        }
    },

* Remove all target-specific portions of Config.hs [John Ericson, 2019-05-14; 1 file, -30/+27]

  1. If GHC is to be multi-target, these cannot be baked in at compile
     time.

  2. Compile-time flags have a higher maintenance cost than run-time flags.

  3. The old way mixes build system implementation (various bootstrapping
     details) with the thing being built. E.g. GHC doesn't need to care
     about which integer library *will* be used---this is purely a crutch
     so the build system doesn't need to pass flags later when using that
     library.

  4. Experience with cross compilation in Nixpkgs has shown things work
     nicer when compilers can *optionally* delegate bootstrapping to the
     package manager. The package manager knows the entire end-goal build
     plan, and thus can make top-down decisions on bootstrapping. GHC can
     just worry about GHC, not even core libraries like base and ghc-prim!

* asm-emit-time IND_STATIC elimination [Gabor Greif, 2019-04-15; 1 file, -1/+137]

  When a new closure identifier is being established to a local or exported
  closure already emitted into the same module, refrain from adding an
  IND_STATIC closure, and instead emit an assembly-language alias.

  Inter-module IND_STATIC objects still remain, and need to be addressed by
  other measures.

  Binary-size savings on nofib are around 0.1%.

* codegen: unroll memcpy calls for small bytearrays [Artem Pyanykh, 2019-04-14; 1 file, -1/+10]
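A sketch of the codegen decision this enables (the threshold and names are invented for illustration): when the number of bytes is a known small constant, emit a few word-sized moves inline instead of calling memcpy.

```
-- Illustration only: how a backend might choose between a memcpy call and
-- straight-line word moves for a statically known, small copy.
data CopyPlan
  = CallMemcpy
  | UnrollWords Int          -- number of word-sized loads/stores to emit
  deriving Show

planCopy :: Int   -- bytes to copy (statically known)
         -> Int   -- platform word size in bytes
         -> CopyPlan
planCopy bytes wordSize
  | bytes `mod` wordSize == 0 && moves <= maxUnroll = UnrollWords moves
  | otherwise                                       = CallMemcpy
  where
    moves     = bytes `div` wordSize
    maxUnroll = 4   -- invented threshold

main :: IO ()
main = mapM_ (print . flip planCopy 8) [16, 24, 256]
```
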
* removing x87 register support from native code gen [Carter Schonwald, 2019-04-10; 3 files, -9/+12]

  * simplifies registers to have GPR, Float and Double, by removing the
    SSE2 and X87 constructors
  * makes -msse2 assumed/default for x86 platforms, fixing a long-standing
    nondeterminism in rounding behavior in 32-bit Haskell code
  * removes the 80-bit floating point representation from the supported
    float sizes
  * there's still one tiny bit of x87 support needed, for handling float
    and double return values in FFI calls wrt the C ABI on x86_32, but this
    one piece does not leak into the rest of the NCG
  * lots of code that's not been touched in a long time got deleted as a
    consequence of all of this

  All in all, this change paves the way towards a lot of future further
  improvements in how GHC handles floating point computations, along with
  making the native code gen more accessible to a larger pool of
  contributors.

* Add support for bitreverse primop [Alexandre, 2019-04-01; 2 files, -0/+2]

  This commit includes the necessary changes in code and documentation to
  support a primop that reverses a word's bits. It also includes a test.

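A naive reference for what the new primop computes, written with Data.Bits for illustration (the primop itself operates on unboxed words):

```
import Data.Bits (finiteBitSize, setBit, testBit)
import Data.Word (Word8)

-- Mirror the bits of a word: bit i moves to bit (n - 1 - i).
bitReverse :: Word8 -> Word8
bitReverse w = foldl set 0 [0 .. n - 1]
  where
    n = finiteBitSize w
    set acc i = if testBit w i then setBit acc (n - 1 - i) else acc

main :: IO ()
main = print (bitReverse 0x01)   -- 128, i.e. 0x80
```
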
* Update Wiki URLs to point to GitLab [Takenobu Tani, 2019-03-25; 3 files, -5/+5]

  This moves all URL references to the Trac Wiki to their corresponding
  GitLab counterparts. This substitution is classified as follows:

  1. Automated substitution using sed with Ben's mapping rule [1]
       Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...
       New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...

  2. Manual substitution for URLs containing `#` index
       Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz
       New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz

  3. Manual substitution for strings starting with `Commentary`
       Old: Commentary/XxxYyy...
       New: commentary/xxx-yyy...

  See also !539

  [1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json