delta/haskell.git - gitlab.haskell.org: ghc/ghc.git

	Commit message	Author	Age	Files	Lines
*	Demand: Format Call SubDemands `Cn(sd)` as `C(n,sd)` (#22231) wip/T22231	Sebastian Graf	2022-09-29	10	-14/+14
	Justification in #22231. Short form: In a demand like `1C1(C1(L))` it was too easy to confuse which `1` belongs to which `C`. Now that should be more obvious.

	Fixes #22231
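The notation change can be illustrated with a small pretty-printer. This is a minimal sketch over a toy data type: `SubDemand`, `Demand`, `oldDmd` and `newDmd` are hypothetical names invented for illustration and are not GHC's actual demand types or printing code.

```haskell
module Main (main) where

-- Toy model for illustration only; these are NOT GHC's real Demand types.
-- A cardinality is kept as the single character that is printed for it.
data SubDemand
  = Poly Char            -- a polymorphic sub-demand such as L
  | Call Char SubDemand  -- a call sub-demand: cardinality n, nested sub-demand sd

data Demand = Char :* SubDemand

-- Old rendering: Cn(sd)
oldSub :: SubDemand -> String
oldSub (Poly c)    = [c]
oldSub (Call n sd) = "C" ++ [n] ++ "(" ++ oldSub sd ++ ")"

-- New rendering: C(n,sd)
newSub :: SubDemand -> String
newSub (Poly c)    = [c]
newSub (Call n sd) = "C(" ++ [n] ++ "," ++ newSub sd ++ ")"

oldDmd, newDmd :: Demand -> String
oldDmd (n :* sd) = n : oldSub sd
newDmd (n :* sd) = n : newSub sd

-- The demand quoted in the commit message.
example :: Demand
example = '1' :* Call '1' (Call '1' (Poly 'L'))

main :: IO ()
main = do
  putStrLn (oldDmd example)  -- 1C1(C1(L))
  putStrLn (newDmd example)  -- 1C(1,C(1,L))
```

In the new rendering each cardinality sits inside the parentheses of the `C` it belongs to, which is the ambiguity the commit message describes.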
*	Improve aggressive specialisation	Simon Peyton Jones	2022-09-28	3	-52/+21
	This patch fixes #21286, by not unboxing dictionaries in worker/wrapper (ever). The main payload is tiny:

	* In `GHC.Core.Opt.DmdAnal.finaliseArgBoxities`, do not unbox dictionaries in `get_dmd`. See Note [Do not unbox class dictionaries] in that module

	* I also found that imported wrappers were being fruitlessly specialised, so I fixed that too, in canSpecImport. See Note [Specialising imported functions] point (2).

	In doing due diligence in the testsuite I fixed a number of other things:

	* Improve Note [Specialising unfoldings] in GHC.Core.Unfold.Make, and Note [Inline specialisations] in GHC.Core.Opt.Specialise, and remove duplication between the two. The new Note describes how we specialise functions with an INLINABLE pragma.

	  And simplify the defn of `spec_unf` in `GHC.Core.Opt.Specialise.specCalls`.

	* Improve Note [Worker/wrapper for INLINABLE functions] in GHC.Core.Opt.WorkWrap.

	  And (critically) make an actual change which is to propagate the user-written pragma from the original function to the wrapper; see `mkStrWrapperInlinePrag`.

	* Write new Note [Specialising imported functions] in GHC.Core.Opt.Specialise

	All this has a big effect on some compile times. This is compiler/perf, showing only changes over 1%:

	Metrics: compile_time/bytes allocated
	-------------------------------------
	LargeRecord(normal)                  -50.2% GOOD
	ManyConstructors(normal)              +1.0%
	MultiLayerModulesTH_OneShot(normal)   +2.6%
	PmSeriesG(normal)                     -1.1%
	T10547(normal)                        -1.2%
	T11195(normal)                        -1.2%
	T11276(normal)                        -1.0%
	T11303b(normal)                       -1.6%
	T11545(normal)                        -1.4%
	T11822(normal)                        -1.3%
	T12150(optasm)                        -1.0%
	T12234(optasm)                        -1.2%
	T13056(optasm)                        -9.3% GOOD
	T13253(normal)                        -3.8% GOOD
	T15164(normal)                        -3.6% GOOD
	T16190(normal)                        -2.1%
	T16577(normal)                        -2.8% GOOD
	T16875(normal)                        -1.6%
	T17836(normal)                        +2.2%
	T17977b(normal)                       -1.0%
	T18223(normal)                       -33.3% GOOD
	T18282(normal)                        -3.4% GOOD
	T18304(normal)                        -1.4%
	T18698a(normal)                       -1.4% GOOD
	T18698b(normal)                       -1.3% GOOD
	T19695(normal)                        -2.5% GOOD
	T5837(normal)                         -2.3%
	T9630(normal)                        -33.0% GOOD
	WWRec(normal)                         -9.7% GOOD
	hard_hole_fits(normal)                -2.1% GOOD
	hie002(normal)                        +1.6%

	geo. mean                             -2.2%
	minimum                              -50.2%
	maximum                               +2.6%

	I diligently investigated some of the big drops.

	* Caused by not doing w/w for dictionaries: T13056, T15164, WWRec, T18223

	* Caused by not fruitlessly specialising wrappers: LargeRecord, T9630

	For runtimes, here is perf/should_run:

	Metrics: runtime/bytes allocated
	--------------------------------
	T12990(normal)                        -3.8%
	T5205(normal)                         -1.3%
	T9203(normal)                        -10.7% GOOD
	haddock.Cabal(normal)                 +0.1%
	haddock.base(normal)                  -1.1%
	haddock.compiler(normal)              -0.3%
	lazy-bs-alloc(normal)                 -0.2%
	------------------------------------------
	geo. mean                             -0.3%
	minimum                              -10.7%
	maximum                               +0.1%

	I did not investigate exactly what happens in T9203.

	Nofib is a wash:

	+-------------------------------++--+-----------+-----------+
	|                               ||  | tsv (rel) | std. err. |
	+===============================++==+===========+===========+
	| real/anna                     ||  |    -0.13% |      0.0% |
	| real/fem                      ||  |    +0.13% |      0.0% |
	| real/fulsom                   ||  |    -0.16% |      0.0% |
	| real/lift                     ||  |    -1.55% |      0.0% |
	| real/reptile                  ||  |    -0.11% |      0.0% |
	| real/smallpt                  ||  |    +0.51% |      0.0% |
	| spectral/constraints          ||  |    +0.20% |      0.0% |
	| spectral/dom-lt               ||  |    +1.80% |      0.0% |
	| spectral/expert               ||  |    +0.33% |      0.0% |
	+===============================++==+===========+===========+
	| geom mean                     ||  |           |           |
	+-------------------------------++--+-----------+-----------+

	I spent quite some time investigating dom-lt, but it's pretty complicated. See my note on !7847. Conclusion: it's just a delicate inlining interaction, and we have plenty of those.

	Metric Decrease:
	    LargeRecord
	    T13056
	    T13253
	    T15164
	    T16577
	    T18223
	    T18282
	    T18698a
	    T18698b
	    T19695
	    T9630
	    WWRec
	    hard_hole_fits
	    T9203
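For readers less familiar with the pragmas mentioned above, the following is a minimal sketch of an overloaded function marked INLINABLE together with an explicit specialisation. The function `sumSquares` is a made-up example, not code from GHC or its testsuite; whether GHC specialises a given call automatically depends on optimisation level and flags.

```haskell
module Main (main) where

-- An overloaded function. At runtime it receives a 'Num' class dictionary
-- as an extra argument unless the call is specialised to a concrete type.
sumSquares :: Num a => [a] -> a
sumSquares xs = foldr (\x acc -> x * x + acc) 0 xs
-- INLINABLE keeps the unoptimised unfolding in the interface file so that
-- importing modules can create their own specialisations.
{-# INLINABLE sumSquares #-}

-- An explicit request for a dictionary-free copy at type Int.
{-# SPECIALISE sumSquares :: [Int] -> Int #-}

main :: IO ()
main = print (sumSquares [1, 2, 3 :: Int])  -- 14
```

The dictionary argument is what the commit stops worker/wrapper from unboxing: an unboxed dictionary can no longer be matched by the specialiser.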
*	Demand: Clear distinction between Call SubDmd and eval Dmd (#21717) wip/T21717	Sebastian Graf	2022-09-27	3	-6/+9
	In #21717 we saw a reportedly unsound strictness signature due to an unsound definition of plusSubDmd on Calls. This patch contains a description and the fix to the unsoundness as outlined in `Note [Call SubDemand vs. evaluation Demand]`.

	This fix means we also get rid of the special handling of `-fpedantic-bottoms` in eta-reduction. Thanks to less strict and actually sound strictness results, we will no longer eta-reduce the problematic cases in the first place, even without `-fpedantic-bottoms`.

	So fixing the unsoundness also makes our eta-reduction code simpler with fewer hacks to explain. But there is another, more unfortunate side-effect: We unfix #21085, but fortunately we have a new fix ready: See `Note [mkCall and plusSubDmd]`.

	There's another change: I decided to make `Note [SubDemand denotes at least one evaluation]` a lot simpler by using `plusSubDmd` (instead of `lubPlusSubDmd`) even if both argument demands are lazy. That leads to less precise results, but in turn rids us of the need for 4 different `OpMode`s and the complication of `Note [Manual specialisation of lubDmd/plusDmd]`. The result is simpler code that is in line with the paper draft on Demand Analysis.

	I left the abandoned idea in `Note [Unrealised opportunity in plusDmd]` for posterity. The fallout in terms of regressions is negligible, as the testsuite and NoFib show.

	```
	        Program         Allocs    Instrs
	--------------------------------------------------------------------------------
	         hidden          +0.2%     -0.2%
	         linear          -0.0%     -0.7%
	--------------------------------------------------------------------------------
	            Min          -0.0%     -0.7%
	            Max          +0.2%     +0.0%
	 Geometric Mean          +0.0%     -0.0%
	```

	Fixes #21717.
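Why eta-reduction interacts with bottoms at all can be seen from the language semantics alone. The sketch below is an illustration written for this page, not code from the patch; `etaExpanded`, `etaReduced` and `isDefined` are hypothetical names. It shows that `\x -> f x` and `f` differ observably under `seq` when `f` is bottom.

```haskell
module Main (main) where

import Control.Exception (SomeException, evaluate, try)

-- A lambda is already in weak head normal form, even if its body diverges.
etaExpanded :: Int -> Int
etaExpanded = \x -> (undefined :: Int -> Int) x

-- The eta-reduced form is bottom itself.
etaReduced :: Int -> Int
etaReduced = undefined

-- Force a value to weak head normal form and report whether that succeeded.
isDefined :: a -> IO Bool
isDefined v = do
  r <- try (evaluate (v `seq` ())) :: IO (Either SomeException ())
  pure (either (const False) (const True) r)

main :: IO ()
main = do
  isDefined etaExpanded >>= print
  isDefined etaReduced  >>= print
```

Interpreted or compiled without optimisation, the first check reports `True` and the second `False`. An optimiser that rewrites the first definition into the second changes that observation, which is the kind of transformation `-fpedantic-bottoms` is meant to rule out.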
*	Fix arityType: -fpedantic-bottoms, join points, etc wip/T21694a	Simon Peyton Jones	2022-08-25	6	-0/+47
	This MR fixes #21694, #21755. It also makes sure that #21948 and fix to #21694.

	* For #21694 the underlying problem was that we were calling arityType on an expression that had free join points. This is a Bad Bad Idea. See Note [No free join points in arityType].

	* To make "no free join points in arityType" work out I had to avoid trying to use eta-expansion for runRW#. This entailed a few changes in the Simplifier's treatment of runRW#. See GHC.Core.Opt.Simplify.Iteration Note [No eta-expansion in runRW#]

	* I also made andArityType work correctly with -fpedantic-bottoms; see Note [Combining case branches: andWithTail].

	* Rewrote Note [Combining case branches: optimistic one-shot-ness]

	* arityType previously treated join points differently to other let-bindings. This patch makes them uniform; arityType analyses the RHS of all bindings to get its ArityType, and extends am_sigs.

	  I realised that, now we have am_sigs giving the ArityType for let-bound Ids, we don't need the (pre-dating) special code in arityType for join points. But instead we need to extend the env for Rec bindings, which we weren't doing before. More uniform now. See Note [arityType for let-bindings].

	  This meant we could get rid of ae_joins, and in fact get rid of EtaExpandArity altogether. Simpler.

	* And finally, it was the strange treatment of join-point Ids in arityType (involving a fake ABot type) that led to a serious bug: #21755. Fixed by this refactoring, which treats them uniformly; but without breaking #18328.

	  In fact, the arity for recursive join bindings is pretty tricky; see the long Note [Arity for recursive join bindings] in GHC.Core.Opt.Simplify.Utils. That led to more refactoring, including deciding that an Id could have an Arity that is bigger than its JoinArity; see Note [Invariants on join points], item 2(b) in GHC.Core

	* Make sure that the "demand threshold" for join points in DmdAnal is no bigger than the join-arity. In GHC.Core.Opt.DmdAnal see Note [Demand signatures are computed for a threshold arity based on idArity]

	* I moved GHC.Core.Utils.exprIsDeadEnd into GHC.Core.Opt.Arity, where it more properly belongs.

	* Remove an old, redundant hack in FloatOut. The old Note was Note [Bottoming floats: eta expansion] in GHC.Core.Opt.SetLevels.

	Compile time improves very slightly on average:

	Metrics: compile_time/bytes allocated
	---------------------------------------------------------------------------------------
	T18223(normal)   ghc/alloc   725,808,720   747,839,216   +3.0% BAD
	T6048(optasm)    ghc/alloc   105,006,104   101,599,472   -3.2% GOOD
	geo. mean                                                -0.2%
	minimum                                                  -3.2%
	maximum                                                  +3.0%

	For some reason Windows was better

	T10421(normal)   ghc/alloc   125,888,360   124,129,168   -1.4% GOOD
	T18140(normal)   ghc/alloc    85,974,520    83,884,224   -2.4% GOOD
	T18698b(normal)  ghc/alloc   236,764,568   234,077,288   -1.1% GOOD
	T18923(normal)   ghc/alloc    75,660,528    73,994,512   -2.2% GOOD
	T6048(optasm)    ghc/alloc   112,232,512   108,182,520   -3.6% GOOD
	geo. mean                                                -0.6%

	I had a quick look at T18223 but it is knee deep in coercions and the size of everything looks similar before and after. I decided to accept that 3% increase in exchange for goodness elsewhere.

	Metric Decrease:
	    T10421
	    T18140
	    T18698b
	    T18923
	    T6048
	Metric Increase:
	    T18223
*	Don't mark lambda binders as OtherCon	Andreas Klebinger	2022-06-27	2	-10/+4
	We used to put OtherCon unfoldings on lambda binders of workers and sometimes also join points/specializations, with the assumption that since the wrapper would force these arguments once we execute the RHS they would indeed be in WHNF.

	This was wrong for reasons detailed in #21472. So now we purge evaluated unfoldings from *all* lambda binders.

	This fixes #21472, but at the cost of sometimes not using as efficient a calling convention. It can also change inlining behaviour as some occurrences will no longer look like value arguments when they did before.

	As a consequence we also change how we compute CBV information for arguments slightly. We now *always* determine the CBV convention for arguments during tidy. Earlier in the pipeline we merely mark functions as candidates for having their arguments treated as CBV.

	As before the process is described in the relevant notes:

	Note [CBV Function Ids]
	Note [Attaching CBV Marks to ids]
	Note [Never put `OtherCon` unfoldings on lambda binders]

	-------------------------
	Metric Decrease:
	    T12425
	    T13035
	    T18223
	    T18223
	    T18923
	    MultiLayerModulesTH_OneShot
	Metric Increase:
	    WWRec
	-------------------------
*	Simplify: Take care with eta reduction in recursive RHSs (#21652)	Sebastian Graf	2022-06-20	5	-6/+17
	Similar to the fix to #20836 in CorePrep, we now track the set of enclosing recursive binders in the SimplEnv and SimpleOptEnv. See Note [Eta reduction in recursive RHSs] for details.

	I also updated Note [Arity robustness] with the insights Simon and I had in a call discussing the issue.

	Fixes #21652.

	Unfortunately, we get a 5% ghc/alloc regression in T16577. That is due to additional eta reduction in GHC.Read.choose1 and the resulting ANF-isation of a large list literal at the top-level that didn't happen before (presumably because it was too interesting to float to the top-level). There's not much we can do about that.

	Metric Increase:
	    T16577
*	A bunch of changes related to eta reduction	Simon Peyton Jones	2022-05-30	2	-6/+9
	This is a large collection of changes all relating to eta reduction, originally triggered by #18993, but there followed a long saga.

	Specifics:

	* Move state-hack stuff from GHC.Types.Id (where it never belonged) to GHC.Core.Opt.Arity (which seems much more appropriate).

	* Add a crucial mkCast in the Cast case of GHC.Core.Opt.Arity.eta_expand; helps with T18223

	* Add clarifying notes about eta-reducing to PAPs. See Note [Do not eta reduce PAPs]

	* I moved tryEtaReduce from GHC.Core.Utils to GHC.Core.Opt.Arity, where it properly belongs. See Note [Eta reduce PAPs]

	* In GHC.Core.Opt.Simplify.Utils.tryEtaExpandRhs, pull out the code for when eta-expansion is wanted, to make wantEtaExpansion, and call that same function in GHC.Core.Opt.Simplify.simplStableUnfolding. It was previously inconsistent, but it's doing the same thing.

	* I did a substantial refactor of ArityType; see Note [ArityType]. This allowed me to do away with the somewhat mysterious takeOneShots; more generally it allows arityType to describe the function, leaving its clients to decide how to use that information.

	  I made ArityType abstract, so that clients have to use functions to access it.

	* Make GHC.Core.Opt.Simplify.Utils.rebuildLam (was stupidly called mkLam before) aware of the floats that the simplifier builds up, so that it can still do eta-reduction even if there are some floats. (Previously that would not happen.) That means passing the floats to rebuildLam, and an extra check when eta-reducing (etaFloatOk).

	* In GHC.Core.Opt.Simplify.Utils.tryEtaExpandRhs, make use of call-info in the idDemandInfo of the binder, as well as the CallArity info. The occurrence analyser did this but we were failing to take advantage here.

	  In the end I moved the heavy lifting to GHC.Core.Opt.Arity.findRhsArity; see Note [Combining arityType with demand info], and functions idDemandOneShots and combineWithDemandOneShots.

	  (These changes partly drove my refactoring of ArityType.)

	* In GHC.Core.Opt.Arity.findRhsArity
	  * I'm now taking account of the demand on the binder to give extra one-shot info. E.g. if the fn is always called with two args, we can give better one-shot info on the binders than if we just look at the RHS.
	  * Don't do any fixpointing in the non-recursive case -- simple short cut.
	  * Trim arity inside the loop. See Note [Trim arity inside the loop]

	* Make SimpleOpt respect the eta-reduction flag (Some associated refactoring here.)

	* I made the CallCtxt which the Simplifier uses distinguish between recursive and non-recursive right-hand sides.

	     data CallCtxt = ... | RhsCtxt RecFlag | ...

	  It affects only one thing:
	     - We call an RHS context interesting only if it is non-recursive; see Note [RHS of lets] in GHC.Core.Unfold

	* Remove eta-reduction in GHC.CoreToStg.Prep, a welcome simplification. See Note [No eta reduction needed in rhsToBody] in GHC.CoreToStg.Prep.

	Other incidental changes

	* Fix a fairly long-standing outright bug in the ApplyToVal case of GHC.Core.Opt.Simplify.mkDupableContWithDmds. I was failing to take the tail of 'dmds' in the recursive call, which meant the demands were All Wrong. I have no idea why this has not caused problems before now.

	* Delete dead function GHC.Core.Opt.Simplify.Utils.contIsRhsOrArg

	Metrics: compile_time/bytes allocated

	Test                                 Metric         Baseline       New value  Change
	---------------------------------------------------------------------------------------
	MultiLayerModulesTH_OneShot(normal)  ghc/alloc  2,743,297,692  2,619,762,992   -4.5% GOOD
	T18223(normal)                       ghc/alloc  1,103,161,360    972,415,992  -11.9% GOOD
	T3064(normal)                        ghc/alloc    201,222,500    184,085,360   -8.5% GOOD
	T8095(normal)                        ghc/alloc  3,216,292,528  3,254,416,960   +1.2%
	T9630(normal)                        ghc/alloc  1,514,131,032  1,557,719,312   +2.9% BAD
	parsing001(normal)                   ghc/alloc    530,409,812    525,077,696   -1.0%
	geo. mean                                                                      -0.1%

	Nofib:

	Program            Size   Allocs  Runtime  Elapsed  TotalMem
	--------------------------------------------------------------------------------
	banner            +0.0%    +0.4%    -8.9%    -8.7%      0.0%
	exact-reals       +0.0%    -7.4%   -36.3%   -37.4%      0.0%
	fannkuch-redux    +0.0%    -0.1%    -1.0%    -1.0%      0.0%
	fft2              -0.1%    -0.2%   -17.8%   -19.2%      0.0%
	fluid             +0.0%    -1.3%    -2.1%    -2.1%      0.0%
	gg                -0.0%    +2.2%    -0.2%    -0.1%      0.0%
	spectral-norm     +0.1%    -0.2%     0.0%     0.0%      0.0%
	tak               +0.0%    -0.3%    -9.8%    -9.8%      0.0%
	x2n1              +0.0%    -0.2%    -3.2%    -3.2%      0.0%
	--------------------------------------------------------------------------------
	Min               -3.5%    -7.4%   -58.7%   -59.9%      0.0%
	Max               +0.1%    +2.2%   +32.9%   +32.9%      0.0%
	Geometric Mean    -0.0%    -0.1%   -14.2%   -14.8%     -0.0%

	Metric Decrease:
	    MultiLayerModulesTH_OneShot
	    T18223
	    T3064
	    T15185
	    T14766
	Metric Increase:
	    T9630
*	Assume at least one evaluation for nested SubDemands (#21081, #21133) wip/T21081	Sebastian Graf	2022-05-03	6	-104/+122
	See the new `Note [SubDemand denotes at least one evaluation]`.

	A demand `n :* sd` on a let binder `x=e` now means

	> "`x` was evaluated `n` times and in any program trace it is evaluated, `e` is
	> evaluated deeply in sub-demand `sd`."

	The "any time it is evaluated" premise is what this patch adds. As a result, we get better nested strictness. For example (T21081)

	```hs
	f :: (Bool, Bool) -> (Bool, Bool)
	f pr = (case pr of (a,b) -> a /= b, True)
	-- before: <MP(L,L)>
	-- after:  <MP(SL,SL)>

	g :: Int -> (Bool, Bool)
	g x = let y = let z = odd x in (z,z) in f y
	```

	The change in demand signature "before" to "after" allows us to case-bind `z` here.

	Similarly good things happen for the `sd` in call sub-demands `Cn(sd)`, which allows for more eta-reduction (which is only sound with `-fno-pedantic-bottoms`, albeit).

	We also fix #21085, a surprising inconsistency with `Poly` to `Call` sub-demand expansion.

	In an attempt to fix a regression caused by less inlining due to eta-reduction in T15426, I eta-expanded the definition of `elemIndex` and `elemIndices`, thus fixing #21345 on the go.

	The main point of this patch is that it fixes #21081 and #21133.

	Annoyingly, I discovered that more precise demand signatures for join points can transform a program into a lazier program if that join point gets floated to the top-level, see #21392. There is no simple fix at the moment, but !5349 might. Thus, we accept a ~5% regression in `MultiLayerModulesTH_OneShot`, where #21392 bites us in `addListToUniqDSet`. T21392 reliably reproduces the issue.

	Surprisingly, ghc/alloc perf on Windows improves much more than on other jobs, by 0.4% in the geometric mean and by 2% in T16875.

	Metric Increase:
	    MultiLayerModulesTH_OneShot
	Metric Decrease:
	    T16875
*	Demand: Let `Boxed` win in `lubBoxity` (#21119)	Sebastian Graf	2022-03-16	2	-26/+19
	Previously, we let `Unboxed` win in `lubBoxity`, which is unsoundly optimistic in terms of Boxity analysis. "Unsoundly" in the sense that we sometimes unbox parameters that we better shouldn't unbox. Examples are #18907 and T19871.absent.

	Until now, we thought that this hack pulled its weight because it worked around some shortcomings of the phase separation between Boxity analysis and CPR analysis. But it is a gross hack which caused regressions itself that needed all kinds of fixes and workarounds. See for example #20767. It became impossible to work with in !7599, so I want to remove it.

	For example, at the moment, `lubDmd B dmd` will not unbox `dmd`, but `lubDmd A dmd` will. Given that `B` is supposed to be the bottom element of the lattice, it's hardly justifiable to get a better demand when `lub`bing with `A`.

	The consequence of letting `Boxed` win in `lubBoxity` is that we would regress #2387, #16040 and parts of #5075 and T19871.sumIO, until Boxity and CPR are able to communicate better. Fortunately, that is not the case since I could tweak the other source of optimism in Boxity analysis that is described in `Note [Unboxed demand on function bodies returning small products]` so that we *recursively* assume unboxed demands on function bodies returning small products. See the updated Note.

	`Note [Boxity for bottoming functions]` describes why we need bottoming functions to have signatures that say that they deeply unbox their arguments. In so doing, I had to tweak `finaliseArgBoxities` so that it will never unbox recursive data constructors. This is in line with our handling of them in CPR. I updated `Note [Which types are unboxed?]` to reflect that.

	In turn we fix #21119, #20767, #18907, T19871.absent and get a much simpler implementation (at least to think about). We can also drop the very ad-hoc definition of `deferAfterPreciseException` and its Note in favor of the simple, intuitive definition we used to have.

	Metric Decrease:
	    T16875
	    T18223
	    T18698a
	    T18698b
	    hard_hole_fits
	Metric Increase:
	    LargeRecord
	    MultiComponentModulesRecomp
	    T15703
	    T8095
	    T9872d

	Out of all the regressions, only the one in T9872d doesn't vanish in a perf build, where the compiler is bootstrapped with -O2 and thus SpecConstr. Reason for regressions:

	* T9872d is due to `ty_co_subst` taking its `LiftingContext` boxed. That is because the context is passed to a function argument, for example in `liftCoSubstTyVarBndrUsing`.
	* In T15703, LargeRecord and T8095, we get a bit more allocations in `expand_syn` and `piResultTys`, because a `TCvSubst` isn't unboxed. In both cases that guards against reboxing in some code paths.
	* The same is true for MultiComponentModulesRecomp, where we get less unboxing in `GHC.Unit.Finder.$wfindInstalledHomeModule`. In a perf build, allocations actually *improve* by over 4%!

	Results on NoFib:

	--------------------------------------------------------------------------------
	        Program         Allocs    Instrs
	--------------------------------------------------------------------------------
	         awards          -0.4%     +0.3%
	      cacheprof          -0.3%     +2.4%
	            fft          -1.5%     -5.1%
	       fibheaps          +1.2%     +0.8%
	          fluid          -0.3%     -0.1%
	            ida          +0.4%     +0.9%
	   k-nucleotide          +0.4%     -0.1%
	     last-piece         +10.5%    +13.9%
	           lift          -4.4%     +3.5%
	        mandel2         -99.7%    -99.8%
	           mate          -0.4%     +3.6%
	         parser          -1.0%     +0.1%
	         puzzle         -11.6%     +6.5%
	reverse-complem          -3.0%     +2.0%
	            scs          -0.5%     +0.1%
	         sphere          -0.4%     -0.2%
	      wave4main          -8.2%     -0.3%
	--------------------------------------------------------------------------------
	Summary excludes mandel2 because of excessive bias
	            Min         -11.6%     -5.1%
	            Max         +10.5%    +13.9%
	 Geometric Mean          -0.2%     +0.3%
	--------------------------------------------------------------------------------

	Not bad for a bug fix.

	The regression in `last-piece` could become a win if SpecConstr would work on non-recursive functions. The regression in `fibheaps` is due to `Note [Reboxed crud for bottoming calls]`, e.g., #21128.
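The "reboxing" that over-optimistic unboxing causes can be sketched by hand. The code below is an illustration only: `pick`, `pickWorker` and `pickWrapper` are hypothetical names, and the worker/wrapper pair is written manually to show the shape of the problem, not copied from GHC output.

```haskell
module Main (main) where

-- 'pick' returns its pair argument unchanged on one path, so that path
-- needs the box (the heap-allocated pair) itself, not just the components.
pick :: Bool -> (Int, Int) -> (Int, Int)
pick True  p      = p
pick False (a, b) = (b, a)

-- What an over-eager unboxing of the pair argument would amount to:
-- the worker takes the components separately ...
pickWorker :: Bool -> Int -> Int -> (Int, Int)
pickWorker True  a b = (a, b)  -- ... and must allocate a fresh pair here (reboxing)
pickWorker False a b = (b, a)

-- The wrapper takes the pair apart before calling the worker.
pickWrapper :: Bool -> (Int, Int) -> (Int, Int)
pickWrapper c (a, b) = pickWorker c a b

main :: IO ()
main = do
  print (pick True (1, 2), pickWrapper True (1, 2))    -- ((1,2),(1,2))
  print (pick False (1, 2), pickWrapper False (1, 2))  -- ((2,1),(2,1))
```

Both versions compute the same results on defined inputs, but the hand-written worker allocates a new pair on the `True` path where the original simply returns the one it was given. Letting `Boxed` win when the branches disagree avoids that extra allocation.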
*	Tag inference work.	Andreas Klebinger	2022-02-12	1	-2/+5
	This does three major things:
	* Enforce the invariant that all strict fields must contain tagged pointers.
	* Try to predict the tag on bindings in order to omit tag checks.
	* Allows functions to pass arguments unlifted (call-by-value).

	The former is "simply" achieved by wrapping any constructor allocations with a case which will evaluate the respective strict bindings.

	The prediction is done by a new data flow analysis based on the STG representation of a program. This also helps us to avoid generating redundant cases for the above invariant.

	StrictWorkers are created by W/W directly and SpecConstr indirectly. See the Note [Strict Worker Ids]

	Other minor changes:

	* Add StgUtil module containing a few functions needed by, but not specific to the tag analysis.

	-------------------------
	Metric Decrease:
	    T12545
	    T18698b
	    T18140
	    T18923
	    LargeRecord
	Metric Increase:
	    LargeRecord
	    ManyAlternatives
	    ManyConstructors
	    T10421
	    T12425
	    T12707
	    T13035
	    T13056
	    T13253
	    T13253-spj
	    T13379
	    T15164
	    T18282
	    T18304
	    T18698a
	    T1969
	    T20049
	    T3294
	    T4801
	    T5321FD
	    T5321Fun
	    T783
	    T9233
	    T9675
	    T9961
	    T19695
	    WWRec
	-------------------------
*	DmdAnal: Implement Boxity Analysis (#19871)	Sebastian Graf	2021-10-24	4	-31/+27
	This patch fixes some abundant reboxing of `DynFlags` in `GHC.HsToCore.Match.Literal.warnAboutOverflowedLit` (which was the topic of #19407) by introducing a Boxity analysis to GHC, done as part of demand analysis. This allows us to accurately capture ad-hoc unboxing decisions previously made in worker/wrapper in demand analysis now, where the boxity info can propagate through demand signatures.

	See the new `Note [Boxity analysis]`. The actual fix for #19407 is described in `Note [No lazy, Unboxed demand in demand signature]`, but `Note [Finalising boxity for demand signature]` is probably a better entry-point.

	To support the fix for #19407, I had to change (what was) `Note [Add demands for strict constructors]` a bit (now `Note [Unboxing evaluated arguments]`). In particular, we now take care of it in `finaliseBoxity` (which is only called from demand analysis) instead of `wantToUnboxArg`.

	I also had to resurrect `Note [Product demands for function body]` and rename it to `Note [Unboxed demand on function bodies returning small products]` to avoid huge regressions in `join004` and `join007`, thereby fixing #4267 again. See the updated Note for details.

	A nice side-effect is that the worker/wrapper transformation no longer needs to look at strictness info and other bits such as `InsideInlineableFun` flags (needed for `Note [Do not unbox class dictionaries]`) at all. It simply collects boxity info from argument demands and interprets them with a severely simplified `wantToUnboxArg`. All the smartness is in `finaliseBoxity`, which could be moved to DmdAnal completely, if it wasn't for the call to `dubiousDataConInstArgTys` which would be awkward to export.

	I spent some time figuring out the reason for why `T16197` failed prior to my amendments to `Note [Unboxing evaluated arguments]`. After having it figured out, I minimised it a bit and added `T16197b`, which simply compares computed strictness signatures and thus should be far simpler to eyeball.

	The 12% ghc/alloc regression in T11545 is because of the additional `Boxity` field in `Poly` and `Prod` that results in more allocation during `lubSubDmd` and `plusSubDmd`. I made sure in the ticky profiles that the number of calls to those functions stayed the same. We can bear such an increase here, as we recently improved it by -68% (in b760c1f).

	T18698* regress slightly because there is more unboxing of dictionaries happening and that causes Lint (mostly) to allocate more.

	Fixes #19871, #19407, #4267, #16859, #18907 and #13331.

	Metric Increase:
	    T11545
	    T18698a
	    T18698b
	Metric Decrease:
	    T12425
	    T16577
	    T18223
	    T18282
	    T4267
	    T9961
*	Bignum: allow Integer predicates to inline (#20361)	Sylvain Henry	2021-10-20	2	-30/+35
	T17516 allocations increase by 48% because Integer's predicates are inlined in some Ord instance methods. These methods become too big to be inlined while they probably should: this is tracked in #20516.

	Metric Increase:
	    T17516
*	Nested CPR light unleashed (#18174)	Sebastian Graf	2021-09-30	2	-18/+12
	This patch enables worker/wrapper for nested constructed products, as described in `Note [Nested CPR]`. The machinery for expressing Nested CPR was already there, since !5054. Worker/wrapper is equipped to exploit Nested CPR annotations since !5338. CPR analysis already handles applications in batches since !5753. This patch just needs to flip a few more switches:

	1. In `cprTransformDataConWork`, we need to look at the field expressions and their `CprType`s to see whether the evaluation of the expressions terminates quickly (= is in HNF) or if they are put in strict fields. If that is the case, then we retain their CPR info and may unbox nestedly later on. More details in `Note [Nested CPR]`.
	2. Enable nested `ConCPR` signatures in `GHC.Types.Cpr`.
	3. In the `asConCpr` call in `GHC.Core.Opt.WorkWrap.Utils`, pass CPR info of fields to the `Unbox`.
	4. Instead of giving CPR signatures to DataCon workers and wrappers, we now have `cprTransformDataConWork` for workers and treat wrappers by analysing their unfolding. As a result, the code from GHC.Types.Id.Make went away completely.
	5. I deactivated worker/wrappering for recursive DataCons and wrote a function `isRecDataCon` to detect them. We really don't want to give `repeat` or `replicate` the Nested CPR property. See Note [CPR for recursive data structures] for which kind of recursive DataCons we target.
	6. Fix a couple of tests and their outputs.

	I also documented that CPR can destroy sharing and lead to asymptotic increase in allocations (which is tracked by #13331/#19326) in `Note [CPR for data structures can destroy sharing]`.

	Nofib results:

	```
	--------------------------------------------------------------------------------
	        Program         Allocs    Instrs
	--------------------------------------------------------------------------------
	   ben-raytrace          -3.1%     -0.4%
	   binary-trees          +0.8%     -2.9%
	   digits-of-e2          +5.8%     +1.2%
	          event          +0.8%     -2.1%
	 fannkuch-redux          +0.0%     -1.4%
	           fish           0.0%     -1.5%
	         gamteb          -1.4%     -0.3%
	        mkhprog          +1.4%     +0.8%
	     multiplier          +0.0%     -1.9%
	            pic          -0.6%     -0.1%
	        reptile         -20.9%    -17.8%
	      wave4main          +4.8%     +0.4%
	           x2n1        -100.0%     -7.6%
	--------------------------------------------------------------------------------
	            Min         -95.0%    -17.8%
	            Max          +5.8%     +1.2%
	 Geometric Mean          -2.9%     -0.4%
	```

	The huge wins in x2n1 (loopy list) and reptile (see #19970) are due to refraining from unboxing (:). Other benchmarks like digits-of-e2 or wave4main regress because of that. Ultimately there are no great improvements due to Nested CPR alone, but at least it's a win. Binary sizes decrease by 0.6%.

	There are a significant number of metric decreases. The most notable ones (>1%):

	```
	ManyAlternatives(normal)     ghc/alloc   771656002.7   762187472.0  -1.2%
	ManyConstructors(normal)     ghc/alloc  4191073418.7  4114369216.0  -1.8%
	MultiLayerModules(normal)    ghc/alloc  3095678333.3  3128720704.0  +1.1%
	PmSeriesG(normal)            ghc/alloc    50096429.3    51495664.0  +2.8%
	PmSeriesS(normal)            ghc/alloc    63512989.3    64681600.0  +1.8%
	PmSeriesV(normal)            ghc/alloc    62575424.0    63767208.0  +1.9%
	T10547(normal)               ghc/alloc    29347469.3    29944240.0  +2.0%
	T11303b(normal)              ghc/alloc    46018752.0    47367576.0  +2.9%
	T12150(optasm)               ghc/alloc    81660890.7    82547696.0  +1.1%
	T12234(optasm)               ghc/alloc    59451253.3    60357952.0  +1.5%
	T12545(normal)               ghc/alloc  1705216250.7  1751278952.0  +2.7%
	T12707(normal)               ghc/alloc   981000472.0   968489800.0  -1.3% GOOD
	T13056(optasm)               ghc/alloc   389322664.0   372495160.0  -4.3% GOOD
	T13253(normal)               ghc/alloc   337174229.3   341954576.0  +1.4%
	T13701(normal)               ghc/alloc  2381455173.3  2439790328.0  +2.4% BAD
	T14052(ghci)                 ghc/alloc  2162530642.7  2139108784.0  -1.1%
	T14683(normal)               ghc/alloc  3049744728.0  2977535064.0  -2.4% GOOD
	T14697(normal)               ghc/alloc   362980213.3   369304512.0  +1.7%
	T15164(normal)               ghc/alloc  1323102752.0  1307480600.0  -1.2%
	T15304(normal)               ghc/alloc  1304607429.3  1291024568.0  -1.0%
	T16190(normal)               ghc/alloc   281450410.7   284878048.0  +1.2%
	T16577(normal)               ghc/alloc  7984960789.3  7811668768.0  -2.2% GOOD
	T17516(normal)               ghc/alloc  1171051192.0  1153649664.0  -1.5%
	T17836(normal)               ghc/alloc  1115569746.7  1098197592.0  -1.6%
	T17836b(normal)              ghc/alloc    54322597.3    55518216.0  +2.2%
	T17977(normal)               ghc/alloc    47071754.7    48403408.0  +2.8%
	T17977b(normal)              ghc/alloc    42579133.3    43977392.0  +3.3%
	T18923(normal)               ghc/alloc    71764237.3    72566240.0  +1.1%
	T1969(normal)                ghc/alloc   784821002.7   773971776.0  -1.4% GOOD
	T3294(normal)                ghc/alloc  1634913973.3  1614323584.0  -1.3% GOOD
	T4801(normal)                ghc/alloc   295619648.0   292776440.0  -1.0%
	T5321FD(normal)              ghc/alloc   278827858.7   276067280.0  -1.0%
	T5631(normal)                ghc/alloc   586618202.7   577579960.0  -1.5%
	T5642(normal)                ghc/alloc   494923048.0   487927208.0  -1.4%
	T5837(normal)                ghc/alloc    37758061.3    39261608.0  +4.0%
	T9020(optasm)                ghc/alloc   257362077.3   254672416.0  -1.0%
	T9198(normal)                ghc/alloc    49313365.3    50603936.0  +2.6% BAD
	T9233(normal)                ghc/alloc   704944258.7   685692712.0  -2.7% GOOD
	T9630(normal)                ghc/alloc  1476621560.0  1455192784.0  -1.5%
	T9675(optasm)                ghc/alloc   443183173.3   433859696.0  -2.1% GOOD
	T9872a(normal)               ghc/alloc  1720926653.3  1693190072.0  -1.6% GOOD
	T9872b(normal)               ghc/alloc  2185618061.3  2162277568.0  -1.1% GOOD
	T9872c(normal)               ghc/alloc  1765842405.3  1733618088.0  -1.8% GOOD
	TcPlugin_RewritePerf(normal) ghc/alloc  2388882730.7  2365504696.0  -1.0%
	WWRec(normal)                ghc/alloc   607073186.7   597512216.0  -1.6%
	T9203(normal)                run/alloc   107284064.0   102881832.0  -4.1%
	haddock.Cabal(normal)        run/alloc 24025329589.3 23768382560.0  -1.1%
	haddock.base(normal)         run/alloc 25660521653.3 25370321824.0  -1.1%
	haddock.compiler(normal)     run/alloc 74064171706.7 73358712280.0  -1.0%
	```

	The biggest exception to the rule is T13701 which seems to fluctuate as usual (not unlike T12545). T14697 has a similar quality, being a generated multi-module test. T5837 is small enough that it similarly doesn't measure anything significant besides module loading overhead. T13253 simply does one additional round of Simplification due to Nested CPR.

	There are also some apparent regressions in T9198, T12234 and PmSeriesG that we (@mpickering and I) were simply unable to reproduce locally. @mpickering tried to run the CI script in a local Docker container and actually found that T9198 and PmSeriesG *improved*. In MRs that were rebased on top this one, like !4229, I did not experience such increases. Let's not get hung up on these regression tests, they were meant to test for asymptotic regressions.

	The build-cabal test improves by 1.2% in -O0.

	Metric Increase:
	    T10421
	    T12234
	    T12545
	    T13035
	    T13056
	    T13701
	    T14697
	    T18923
	    T5837
	    T9198
	Metric Decrease:
	    ManyConstructors
	    T12545
	    T12707
	    T13056
	    T14683
	    T16577
	    T18223
	    T1969
	    T3294
	    T9203
	    T9233
	    T9675
	    T9872a
	    T9872b
	    T9872c
	    T9961
	    TcPlugin_RewritePerf
*	Avoid useless w/w split, take 2	Simon Peyton Jones	2021-06-05	11	-141/+64
\|	This commit:

    commit c6faa42bfb954445c09c5680afd4fb875ef03758
    Author: Simon Peyton Jones <simonpj@microsoft.com>
    Date:   Mon Mar 9 10:20:42 2020 +0000

        Avoid useless w/w split

        This patch is just a tidy-up for the post-strictness-analysis worker wrapper split. Consider

            f x = x

        Strictness analysis does not lead to a w/w split, so the obvious thing is to leave it 100% alone. But actually, because the RHS is small, we ended up adding a StableUnfolding for it. There is some reason to do this if we choose /not/ to do w/w on the grounds that the function is small. See Note [Don't w/w inline small non-loop-breaker things]

        But there is no reason if we would not have done w/w anyway.

        This patch just moves the conditional to later. Easy.

turns out to have a bug in it. Instead of /moving/ the conditional, I /duplicated/ it. Then in a subsequent unrelated tidy-up (087ac4eb) I removed the second (redundant) test!

This patch does what I originally intended.

There is also a small refactoring in GHC.Core.Unfold, to make the code clearer, but with no change in behaviour.

It does, however, have a generally good effect on compile times, because we aren't dealing with so many silly stable unfoldings.
Here are the non-zero changes:

Metrics: compile_time/bytes allocated
-------------------------------------
Test                      Metric     Baseline value  New value  Change
---------------------------------------------------------------------------
ManyAlternatives(normal)  ghc/alloc  791969344.0  792665048.0  +0.1%
ManyConstructors(normal)  ghc/alloc  4351126824.0  4358303528.0  +0.2%
PmSeriesG(normal)         ghc/alloc  50362552.0  50482208.0  +0.2%
PmSeriesS(normal)         ghc/alloc  63733024.0  63619912.0  -0.2%
T10421(normal)            ghc/alloc  121224624.0  119695448.0  -1.3% GOOD
T10421a(normal)           ghc/alloc  85256392.0  83714224.0  -1.8%
T10547(normal)            ghc/alloc  29253072.0  29258256.0  +0.0%
T10858(normal)            ghc/alloc  189343152.0  187972328.0  -0.7%
T11195(normal)            ghc/alloc  281208248.0  279727584.0  -0.5%
T11276(normal)            ghc/alloc  141966952.0  142046224.0  +0.1%
T11303b(normal)           ghc/alloc  46228360.0  46259024.0  +0.1%
T11545(normal)            ghc/alloc  2663128768.0  2667412656.0  +0.2%
T11822(normal)            ghc/alloc  138686944.0  138760176.0  +0.1%
T12227(normal)            ghc/alloc  482836000.0  475421056.0  -1.5% GOOD
T12234(optasm)            ghc/alloc  60710520.0  60781808.0  +0.1%
T12425(optasm)            ghc/alloc  104089000.0  104022424.0  -0.1%
T12545(normal)            ghc/alloc  1711759416.0  1705711528.0  -0.4%
T12707(normal)            ghc/alloc  991541120.0  991921776.0  +0.0%
T13035(normal)            ghc/alloc  108199872.0  108370704.0  +0.2%
T13056(optasm)            ghc/alloc  414642544.0  412580384.0  -0.5%
T13253(normal)            ghc/alloc  361701272.0  355838624.0  -1.6%
T13253-spj(normal)        ghc/alloc  157710168.0  157397768.0  -0.2%
T13379(normal)            ghc/alloc  370984400.0  371345888.0  +0.1%
T13701(normal)            ghc/alloc  2439764144.0  2441351984.0  +0.1%
T14052(ghci)              ghc/alloc  2154090896.0  2156671400.0  +0.1%
T15164(normal)            ghc/alloc  1478517688.0  1440317696.0  -2.6% GOOD
T15630(normal)            ghc/alloc  178053912.0  172489808.0  -3.1%
T16577(normal)            ghc/alloc  7859948896.0  7854524080.0  -0.1%
T17516(normal)            ghc/alloc  1271520128.0  1202096488.0  -5.5% GOOD
T17836(normal)            ghc/alloc  1123320632.0  1123922480.0  +0.1%
T17836b(normal)           ghc/alloc  54526280.0  54576776.0  +0.1%
T17977b(normal)           ghc/alloc  42706752.0  42730544.0  +0.1%
T18140(normal)            ghc/alloc  108834568.0  108693816.0  -0.1%
T18223(normal)            ghc/alloc  5539629264.0  5579500872.0  +0.7%
T18304(normal)            ghc/alloc  97589720.0  97196944.0  -0.4%
T18478(normal)            ghc/alloc  770755472.0  771232888.0  +0.1%
T18698a(normal)           ghc/alloc  408691160.0  374364992.0  -8.4% GOOD
T18698b(normal)           ghc/alloc  492419768.0  458809408.0  -6.8% GOOD
T18923(normal)            ghc/alloc  72177032.0  71368824.0  -1.1%
T1969(normal)             ghc/alloc  803523496.0  804655112.0  +0.1%
T3064(normal)             ghc/alloc  198411784.0  198608512.0  +0.1%
T4801(normal)             ghc/alloc  312416688.0  312874976.0  +0.1%
T5321Fun(normal)          ghc/alloc  325230680.0  325474448.0  +0.1%
T5631(normal)             ghc/alloc  592064448.0  593518968.0  +0.2%
T5837(normal)             ghc/alloc  37691496.0  37710904.0  +0.1%
T783(normal)              ghc/alloc  404629536.0  405064432.0  +0.1%
T9020(optasm)             ghc/alloc  266004608.0  266375592.0  +0.1%
T9198(normal)             ghc/alloc  49221336.0  49268648.0  +0.1%
T9233(normal)             ghc/alloc  913464984.0  742680256.0  -18.7% GOOD
T9675(optasm)             ghc/alloc  552296608.0  466322000.0  -15.6% GOOD
T9872a(normal)            ghc/alloc  1789910616.0  1793924472.0  +0.2%
T9872b(normal)            ghc/alloc  2315141376.0  2310338056.0  -0.2%
T9872c(normal)            ghc/alloc  1840422424.0  1841567224.0  +0.1%
T9872d(normal)            ghc/alloc  556713248.0  556838432.0  +0.0%
T9961(normal)             ghc/alloc  383809160.0  384601600.0  +0.2%
WWRec(normal)             ghc/alloc  773751272.0  753949608.0  -2.6% GOOD

Residency goes down too:

Metrics: compile_time/max_bytes_used
------------------------------------
Test             Metric   Baseline value  New value  Change
-----------------------------------------------------------
T10370(optasm)   ghc/max  42058448.0  39481672.0  -6.1%
T11545(normal)   ghc/max  43641392.0  43634752.0  -0.0%
T15304(normal)   ghc/max  29895824.0  29439032.0  -1.5%
T15630(normal)   ghc/max  8822568.0  8772328.0  -0.6%
T18698a(normal)  ghc/max  13882536.0  13787112.0  -0.7%
T18698b(normal)  ghc/max  14714112.0  13836408.0  -6.0%
T1969(normal)    ghc/max  24724128.0  24733496.0  +0.0%
T3064(normal)    ghc/max  14041152.0  14034768.0  -0.0%
T3294(normal)    ghc/max  32769248.0  32760312.0  -0.0%
T9630(normal)    ghc/max  41605120.0  41572184.0  -0.1%
T9675(optasm)    ghc/max  18652296.0  17253480.0  -7.5%

Metric Decrease: T10421 T12227 T15164 T17516 T18698a T18698b T9233 T9675 WWRec
Metric Increase: T12545
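The "Change" column in metric tables like the ones in this log is consistent with a plain relative change of the new value against the baseline. The following is a minimal sketch of that arithmetic; the names `percentChange` and `tenths` are hypothetical helpers for illustration and are not part of GHC's testsuite driver.

```haskell
-- Relative change of a metric in percent: negative means the new value
-- is smaller than the baseline (fewer bytes allocated).
percentChange :: Double -> Double -> Double
percentChange baseline new = (new - baseline) / baseline * 100

-- The change expressed in tenths of a percent, rounded to the nearest
-- integer, which corresponds to the one-decimal figures in the tables.
tenths :: Double -> Double -> Integer
tenths baseline new = round (percentChange baseline new * 10)
```

In GHCi, `tenths 913464984.0 742680256.0` (the T9233 row of the "Avoid useless w/w split, take 2" table) gives `-187`, matching the reported -18.7%.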
*	DmdAnal: Better syntax for demand signatures (#19016)	Sebastian Graf	2021-03-03	10	-72/+72
\|	The update of the Outputable instance resulted in a slew of documentation changes within Notes that used the old syntax. The most important doc changes are to `Note [Demand notation]` and the user's guide. Fixes #19016.
*	Demand: Simplify `CU(U)` to `U` (#19005)	Sebastian Graf	2020-12-12	2	-3/+3
\|	Both sub-demands encode the same information. This is a trivial change and already affects a few regression tests (e.g. `T5075`), so no separate regression test is necessary.
*	Demand: Interleave usage and strictness demands (#18903)	Sebastian Graf	2020-11-20	10	-48/+48
\|	As outlined in #18903, interleaving usage and strictness demands not only means a more compact demand representation, but also allows us to express demands that we weren't easily able to express before.

Call demands are relative in the sense that a call demand `Cn(cd)` on `g` says "`g` is called `n` times. Whenever `g` is called, the result is used according to `cd`". Example from #18903:

```hs
h :: Int -> Int
h m =
  let g :: Int -> (Int,Int)
      g 1 = (m, 0)
      g n = (2 * n, 2 `div` n)
      {-# NOINLINE g #-}
  in case m of
    1 -> 0
    2 -> snd (g m)
    _ -> uncurry (+) (g m)
```

Without the interleaved representation, we would just get `L` for the strictness demand on `g`. Now we are able to express that whenever `g` is called, its second component is used strictly in denoting `g` by `1C1(P(1P(U),SP(U)))`. This would allow Nested CPR to unbox the division, for example.

Fixes #18903.
While fixing regressions, I also discovered and fixed #18957.

Metric Decrease: T13253-spj
*	Arity: Emit "Exciting arity" warning only after second iteration (#18937)	Sebastian Graf	2020-11-13	1	-1/+1
\|	See Note [Exciting arity] for why we emit the warning at all and why we now only do so after the second iteration. Fixes #18937.
*	Arity: Rework `ArityType` to fix monotonicity (#18870)	Sebastian Graf	2020-11-13	3	-0/+22
\|	As we found out in #18870, `andArityType` is not monotone, with potentially severe consequences for termination of fixed-point iteration. That showed in an abundance of "Exciting arity" DEBUG messages that are emitted whenever we do more than one step in fixed-point iteration.

The solution necessitates also recording `OneShotInfo` info for `ABot` arity type. Thus we get the following definition for `ArityType`:

```
data ArityType = AT [OneShotInfo] Divergence
```

The majority of changes in this patch are the result of refactoring use sites of `ArityType` to match the new definition.

The regression test `T18870` asserts that we indeed don't emit any DEBUG output anymore for a function where we previously would have. Similarly, there's a regression test `T18937` for #18937, which we expect to be broken for now.

Fixes #18870.
*	Testsuite: Add dead arity analysis tests	Sebastian Graf	2020-10-17	55	-2537/+809
\|	We didn't seem to test these old tests at all, judging from their expected output.
*	Arity: Record arity types for non-recursive lets	Sebastian Graf	2020-10-17	3	-0/+107
\|	In #18793, we saw a compelling example which requires us to look at non-recursive let-bindings during arity analysis and unleash their arity types at use sites.

After the refactoring in the previous patch, the needed change is quite simple and very local to `arityType`'s defn for non-recursive `Let`.

Apart from that, we had to get rid of the second item of `Note [Dealing with bottoms]`, which was entirely a safety measure and hindered optimistic fixed-point iteration.

Fixes #18793.

The following metric increases are all caused by this commit and a result of the fact that we just do more work now:

Metric Increase: T3294 T12545 T12707
*	Testsuite: delete Windows line endings [skip ci] (#11631)	Thomas Miedema	2016-02-23	17	-112/+112
\|
*	Delete __GLASGOW_HASKELL__ ifdefs for stage0 < 7.6.	Thomas Miedema	2014-10-07	1	-19/+0
\|	Summary: My understanding is that ghc 7.10 should be buildable with the last 3 versions of ghc, i.e. 7.6, 7.8 and 7.10 itself.

Test Plan: x
Reviewers: austin
Reviewed By: austin
Subscribers: hvr, simonmar, ezyang, carter, thomie
Differential Revision: https://phabricator.haskell.org/D254
*	Move tests from tests/ghc-regress/* to just tests/*	David Terei	2011-07-20	37	-0/+2640


Justification in #22231. Short form: In a demand like `1C1(C1(L))` it was too easy to confuse which `1` belongs to which `C`. Now that should be more obvious. Fixes #22231
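To make the notation change concrete, here is a small sketch that prints the same call sub-demand in the old `Cn(sd)` style and the new `C(n,sd)` style. The `SubDemand` type below is a toy model invented for this illustration (cardinalities are plain strings, and only `L` and call sub-demands are modelled); it is not GHC's actual data type or `Outputable` instance.

```haskell
-- Toy model of sub-demands, for illustration only.
data SubDemand
  = TopSd                  -- rendered as L
  | Call String SubDemand  -- call cardinality (e.g. "1") and the result's sub-demand

-- Old rendering: the cardinality follows the C directly, as in C1(L).
oldFormat :: SubDemand -> String
oldFormat TopSd       = "L"
oldFormat (Call n sd) = "C" ++ n ++ "(" ++ oldFormat sd ++ ")"

-- New rendering: the cardinality moves inside the parentheses, as in C(1,L).
newFormat :: SubDemand -> String
newFormat TopSd       = "L"
newFormat (Call n sd) = "C(" ++ n ++ "," ++ newFormat sd ++ ")"

-- The sub-demand part of the demand 1C1(C1(L)) from the commit message.
example :: SubDemand
example = Call "1" (Call "1" TopSd)
```

With the leading evaluation cardinality `1` prepended, `"1" ++ oldFormat example` reproduces `1C1(C1(L))`, while `"1" ++ newFormat example` yields `1C(1,C(1,L))`, where each `1` sits visibly inside the parentheses of the `C` it belongs to.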


This patch fixes #21286, by not unboxing dictionaries in worker/wrapper (ever). The main payload is tiny:

* In `GHC.Core.Opt.DmdAnal.finaliseArgBoxities`, do not unbox dictionaries in `get_dmd`. See Note [Do not unbox class dictionaries] in that module
* I also found that imported wrappers were being fruitlessly specialised, so I fixed that too, in canSpecImport. See Note [Specialising imported functions] point (2).

In doing due diligence in the testsuite I fixed a number of other things:

* Improve Note [Specialising unfoldings] in GHC.Core.Unfold.Make, and Note [Inline specialisations] in GHC.Core.Opt.Specialise, and remove duplication between the two. The new Note describes how we specialise functions with an INLINABLE pragma. And simplify the defn of `spec_unf` in `GHC.Core.Opt.Specialise.specCalls`.
* Improve Note [Worker/wrapper for INLINABLE functions] in GHC.Core.Opt.WorkWrap. And (critically) make an actual change which is to propagate the user-written pragma from the original function to the wrapper; see `mkStrWrapperInlinePrag`.
* Write new Note [Specialising imported functions] in GHC.Core.Opt.Specialise

All this has a big effect on some compile times. This is compiler/perf, showing only changes over 1%:

Metrics: compile_time/bytes allocated
-------------------------------------
LargeRecord(normal)                  -50.2% GOOD
ManyConstructors(normal)              +1.0%
MultiLayerModulesTH_OneShot(normal)   +2.6%
PmSeriesG(normal)                     -1.1%
T10547(normal)                        -1.2%
T11195(normal)                        -1.2%
T11276(normal)                        -1.0%
T11303b(normal)                       -1.6%
T11545(normal)                        -1.4%
T11822(normal)                        -1.3%
T12150(optasm)                        -1.0%
T12234(optasm)                        -1.2%
T13056(optasm)                        -9.3% GOOD
T13253(normal)                        -3.8% GOOD
T15164(normal)                        -3.6% GOOD
T16190(normal)                        -2.1%
T16577(normal)                        -2.8% GOOD
T16875(normal)                        -1.6%
T17836(normal)                        +2.2%
T17977b(normal)                       -1.0%
T18223(normal)                       -33.3% GOOD
T18282(normal)                        -3.4% GOOD
T18304(normal)                        -1.4%
T18698a(normal)                       -1.4% GOOD
T18698b(normal)                       -1.3% GOOD
T19695(normal)                        -2.5% GOOD
T5837(normal)                         -2.3%
T9630(normal)                        -33.0% GOOD
WWRec(normal)                         -9.7% GOOD
hard_hole_fits(normal)                -2.1% GOOD
hie002(normal)                        +1.6%
geo. mean                             -2.2%
minimum                              -50.2%
maximum                               +2.6%

I diligently investigated some of the big drops.

* Caused by not doing w/w for dictionaries: T13056, T15164, WWRec, T18223
* Caused by not fruitlessly specialising wrappers: LargeRecord, T9630

For runtimes, here is perf/should_run:

Metrics: runtime/bytes allocated
--------------------------------
T12990(normal)             -3.8%
T5205(normal)              -1.3%
T9203(normal)             -10.7% GOOD
haddock.Cabal(normal)      +0.1%
haddock.base(normal)       -1.1%
haddock.compiler(normal)   -0.3%
lazy-bs-alloc(normal)      -0.2%
------------------------------------------
geo. mean                  -0.3%
minimum                   -10.7%
maximum                    +0.1%

I did not investigate exactly what happens in T9203.

Nofib is a wash:

+-------------------------------++--+-----------+-----------+
|                               ||  | tsv (rel) | std. err. |
+===============================++==+===========+===========+
| real/anna                     ||  |    -0.13% |      0.0% |
| real/fem                      ||  |    +0.13% |      0.0% |
| real/fulsom                   ||  |    -0.16% |      0.0% |
| real/lift                     ||  |    -1.55% |      0.0% |
| real/reptile                  ||  |    -0.11% |      0.0% |
| real/smallpt                  ||  |    +0.51% |      0.0% |
| spectral/constraints          ||  |    +0.20% |      0.0% |
| spectral/dom-lt               ||  |    +1.80% |      0.0% |
| spectral/expert               ||  |    +0.33% |      0.0% |
+===============================++==+===========+===========+
| geom mean                     ||  |           |           |
+-------------------------------++--+-----------+-----------+

I spent quite some time investigating dom-lt, but it's pretty complicated. See my note on !7847. Conclusion: it's just a delicate inlining interaction, and we have plenty of those.

Metric Decrease: LargeRecord T13056 T13253 T15164 T16577 T18223 T18282 T18698a T18698b T19695 T9630 WWRec hard_hole_fits T9203


In #21717 we saw a reportedly unsound strictness signature due to an unsound definition of plusSubDmd on Calls. This patch contains a description and the fix to the unsoundness as outlined in `Note [Call SubDemand vs. evaluation Demand]`.

This fix means we also get rid of the special handling of `-fpedantic-bottoms` in eta-reduction. Thanks to less strict and actually sound strictness results, we will no longer eta-reduce the problematic cases in the first place, even without `-fpedantic-bottoms`.

So fixing the unsoundness also makes our eta-reduction code simpler with fewer hacks to explain. But there is another, more unfortunate side-effect: We *unfix* #21085, but fortunately we have a new fix ready: See `Note [mkCall and plusSubDmd]`.

There's another change: I decided to make `Note [SubDemand denotes at least one evaluation]` a lot simpler by using `plusSubDmd` (instead of `lubPlusSubDmd`) even if both argument demands are lazy. That leads to less precise results, but in turn rids ourselves from the need for 4 different `OpMode`s and the complication of `Note [Manual specialisation of lub*Dmd/plus*Dmd]`. The result is simpler code that is in line with the paper draft on Demand Analysis.

I left the abandoned idea in `Note [Unrealised opportunity in plusDmd]` for posterity. The fallout in terms of regressions is negligible, as the testsuite and NoFib show.

```
Program           Allocs    Instrs
--------------------------------------------------------------------------------
hidden             +0.2%     -0.2%
linear             -0.0%     -0.7%
--------------------------------------------------------------------------------
Min                -0.0%     -0.7%
Max                +0.2%     +0.0%
Geometric Mean     +0.0%     -0.0%
```

Fixes #21717.


This MR fixes #21694, #21755. It also makes sure that #21948 and fix to #21694.

* For #21694 the underlying problem was that we were calling arityType on an expression that had free join points. This is a Bad Bad Idea. See Note [No free join points in arityType].

* To make "no free join points in arityType" work out I had to avoid trying to use eta-expansion for runRW#. This entailed a few changes in the Simplifier's treatment of runRW#. See GHC.Core.Opt.Simplify.Iteration Note [No eta-expansion in runRW#]

* I also made andArityType work correctly with -fpedantic-bottoms; see Note [Combining case branches: andWithTail].

* Rewrote Note [Combining case branches: optimistic one-shot-ness]

* arityType previously treated join points differently to other let-bindings. This patch makes them uniform; arityType analyses the RHS of all bindings to get its ArityType, and extends am_sigs.

  I realised that, now we have am_sigs giving the ArityType for let-bound Ids, we don't need the (pre-dating) special code in arityType for join points. But instead we need to extend the env for Rec bindings, which we weren't doing before. More uniform now. See Note [arityType for let-bindings].

  This meant we could get rid of ae_joins, and in fact get rid of EtaExpandArity altogether. Simpler.

* And finally, it was the strange treatment of join-point Ids in arityType (involving a fake ABot type) that led to a serious bug: #21755. Fixed by this refactoring, which treats them uniformly; but without breaking #18328.

  In fact, the arity for recursive join bindings is pretty tricky; see the long Note [Arity for recursive join bindings] in GHC.Core.Opt.Simplify.Utils. That led to more refactoring, including deciding that an Id could have an Arity that is bigger than its JoinArity; see Note [Invariants on join points], item 2(b) in GHC.Core

* Make sure that the "demand threshold" for join points in DmdAnal is no bigger than the join-arity. In GHC.Core.Opt.DmdAnal see Note [Demand signatures are computed for a threshold arity based on idArity]

* I moved GHC.Core.Utils.exprIsDeadEnd into GHC.Core.Opt.Arity, where it more properly belongs.

* Remove an old, redundant hack in FloatOut. The old Note was Note [Bottoming floats: eta expansion] in GHC.Core.Opt.SetLevels.

Compile time improves very slightly on average:

Metrics: compile_time/bytes allocated
---------------------------------------------------------------------------------------
T18223(normal)  ghc/alloc  725,808,720  747,839,216  +3.0% BAD
T6048(optasm)   ghc/alloc  105,006,104  101,599,472  -3.2% GOOD
geo. mean                                            -0.2%
minimum                                              -3.2%
maximum                                              +3.0%

For some reason Windows was better

T10421(normal)   ghc/alloc  125,888,360  124,129,168  -1.4% GOOD
T18140(normal)   ghc/alloc   85,974,520   83,884,224  -2.4% GOOD
T18698b(normal)  ghc/alloc  236,764,568  234,077,288  -1.1% GOOD
T18923(normal)   ghc/alloc   75,660,528   73,994,512  -2.2% GOOD
T6048(optasm)    ghc/alloc  112,232,512  108,182,520  -3.6% GOOD
geo. mean                                             -0.6%

I had a quick look at T18223 but it is knee deep in coercions and the size of everything looks similar before and after. I decided to accept that 3% increase in exchange for goodness elsewhere.

Metric Decrease: T10421 T18140 T18698b T18923 T6048
Metric Increase: T18223


We used to put OtherCon unfoldings on lambda binders of workers and sometimes also join points/specializations, with the assumption that since the wrapper would force these arguments once we execute the RHS they would indeed be in WHNF.

This was wrong for reasons detailed in #21472. So now we purge evaluated unfoldings from *all* lambda binders.

This fixes #21472, but at the cost of sometimes not using as efficient a calling convention. It can also change inlining behaviour as some occurrences will no longer look like value arguments when they did before.

As a consequence we also change how we compute CBV information for arguments slightly. We now *always* determine the CBV convention for arguments during tidy. Earlier in the pipeline we merely mark functions as candidates for having their arguments treated as CBV.

As before the process is described in the relevant notes:

    Note [CBV Function Ids]
    Note [Attaching CBV Marks to ids]
    Note [Never put `OtherCon` unfoldings on lambda binders]

-------------------------
Metric Decrease: T12425 T13035 T18223 T18223 T18923 MultiLayerModulesTH_OneShot
Metric Increase: WWRec
-------------------------


Similar to the fix to #20836 in CorePrep, we now track the set of enclosing recursive binders in the SimplEnv and SimpleOptEnv. See Note [Eta reduction in recursive RHSs] for details. I also updated Note [Arity robustness] with the insights Simon and I had in a call discussing the issue. Fixes #21652. Unfortunately, we get a 5% ghc/alloc regression in T16577. That is due to additional eta reduction in GHC.Read.choose1 and the resulting ANF-isation of a large list literal at the top-level that didn't happen before (presumably because it was too interesting to float to the top-level). There's not much we can do about that. Metric Increase: T16577


This is a large collection of changes all relating to eta reduction, originally triggered by #18993, but there followed a long saga.

Specifics:

* Move state-hack stuff from GHC.Types.Id (where it never belonged) to GHC.Core.Opt.Arity (which seems much more appropriate).

* Add a crucial mkCast in the Cast case of GHC.Core.Opt.Arity.eta_expand; helps with T18223

* Add clarifying notes about eta-reducing to PAPs. See Note [Do not eta reduce PAPs]

* I moved tryEtaReduce from GHC.Core.Utils to GHC.Core.Opt.Arity, where it properly belongs. See Note [Eta reduce PAPs]

* In GHC.Core.Opt.Simplify.Utils.tryEtaExpandRhs, pull out the code for when eta-expansion is wanted, to make wantEtaExpansion, and call that same function in GHC.Core.Opt.Simplify.simplStableUnfolding. It was previously inconsistent, but it's doing the same thing.

* I did a substantial refactor of ArityType; see Note [ArityType]. This allowed me to do away with the somewhat mysterious takeOneShots; more generally it allows arityType to describe the function, leaving its clients to decide how to use that information.

  I made ArityType abstract, so that clients have to use functions to access it.

* Make GHC.Core.Opt.Simplify.Utils.rebuildLam (was stupidly called mkLam before) aware of the floats that the simplifier builds up, so that it can still do eta-reduction even if there are some floats. (Previously that would not happen.) That means passing the floats to rebuildLam, and an extra check when eta-reducing (etaFloatOk).

* In GHC.Core.Opt.Simplify.Utils.tryEtaExpandRhs, make use of call-info in the idDemandInfo of the binder, as well as the CallArity info. The occurrence analyser did this but we were failing to take advantage here.

  In the end I moved the heavy lifting to GHC.Core.Opt.Arity.findRhsArity; see Note [Combining arityType with demand info], and functions idDemandOneShots and combineWithDemandOneShots.

  (These changes partly drove my refactoring of ArityType.)

* In GHC.Core.Opt.Arity.findRhsArity
  * I'm now taking account of the demand on the binder to give extra one-shot info. E.g. if the fn is always called with two args, we can give better one-shot info on the binders than if we just look at the RHS.
  * Don't do any fixpointing in the non-recursive case -- simple short cut.
  * Trim arity inside the loop. See Note [Trim arity inside the loop]

* Make SimpleOpt respect the eta-reduction flag (Some associated refactoring here.)

* I made the CallCtxt which the Simplifier uses distinguish between recursive and non-recursive right-hand sides.

      data CallCtxt = ... | RhsCtxt RecFlag | ...

  It affects only one thing:
  - We call an RHS context interesting only if it is non-recursive; see Note [RHS of lets] in GHC.Core.Unfold

* Remove eta-reduction in GHC.CoreToStg.Prep, a welcome simplification. See Note [No eta reduction needed in rhsToBody] in GHC.CoreToStg.Prep.

Other incidental changes

* Fix a fairly long-standing outright bug in the ApplyToVal case of GHC.Core.Opt.Simplify.mkDupableContWithDmds. I was failing to take the tail of 'dmds' in the recursive call, which meant the demands were All Wrong. I have no idea why this has not caused problems before now.

* Delete dead function GHC.Core.Opt.Simplify.Utils.contIsRhsOrArg

Metrics: compile_time/bytes allocated
Test                                 Metric     Baseline       New value      Change
---------------------------------------------------------------------------------------
MultiLayerModulesTH_OneShot(normal)  ghc/alloc  2,743,297,692  2,619,762,992   -4.5% GOOD
T18223(normal)                       ghc/alloc  1,103,161,360    972,415,992  -11.9% GOOD
T3064(normal)                        ghc/alloc    201,222,500    184,085,360   -8.5% GOOD
T8095(normal)                        ghc/alloc  3,216,292,528  3,254,416,960   +1.2%
T9630(normal)                        ghc/alloc  1,514,131,032  1,557,719,312   +2.9% BAD
parsing001(normal)                   ghc/alloc    530,409,812    525,077,696   -1.0%
geo. mean                                                                      -0.1%

Nofib:
Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
banner           +0.0%    +0.4%     -8.9%     -8.7%      0.0%
exact-reals      +0.0%    -7.4%    -36.3%    -37.4%      0.0%
fannkuch-redux   +0.0%    -0.1%     -1.0%     -1.0%      0.0%
fft2             -0.1%    -0.2%    -17.8%    -19.2%      0.0%
fluid            +0.0%    -1.3%     -2.1%     -2.1%      0.0%
gg               -0.0%    +2.2%     -0.2%     -0.1%      0.0%
spectral-norm    +0.1%    -0.2%      0.0%      0.0%      0.0%
tak              +0.0%    -0.3%     -9.8%     -9.8%      0.0%
x2n1             +0.0%    -0.2%     -3.2%     -3.2%      0.0%
--------------------------------------------------------------------------------
Min              -3.5%    -7.4%    -58.7%    -59.9%      0.0%
Max              +0.1%    +2.2%    +32.9%    +32.9%      0.0%
Geometric Mean   -0.0%    -0.1%    -14.2%    -14.8%     -0.0%

Metric Decrease: MultiLayerModulesTH_OneShot T18223 T3064 T15185 T14766
Metric Increase: T9630
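The `mkDupableContWithDmds` bug mentioned among the incidental changes ("failing to take the tail of 'dmds' in the recursive call") belongs to a common class of mistakes when walking two lists in lockstep. The sketch below illustrates that class of bug only: `Arg`, `Demand`, `pairBuggy` and `pairFixed` are hypothetical names made up for this example, and none of this is the Simplifier's actual code.

```haskell
-- Hypothetical stand-ins for arguments and their demands.
type Arg    = String
type Demand = String

-- First demand in the list, defaulting to the lazy demand "L" when the
-- list has run out.
headDmd :: [Demand] -> Demand
headDmd (d:_) = d
headDmd []    = "L"

-- Buggy: the recursive call passes the whole demand list again, so every
-- argument is paired with the demand meant for the first one.
pairBuggy :: [Demand] -> [Arg] -> [(Arg, Demand)]
pairBuggy _    []     = []
pairBuggy dmds (a:as) = (a, headDmd dmds) : pairBuggy dmds as

-- Fixed: the recursive call takes the tail of the demand list.
pairFixed :: [Demand] -> [Arg] -> [(Arg, Demand)]
pairFixed _    []     = []
pairFixed dmds (a:as) = (a, headDmd dmds) : pairFixed (drop 1 dmds) as
```

For demands `["S","L","A"]` and arguments `["x","y","z"]`, `pairBuggy` pairs all three arguments with `"S"`, whereas `pairFixed` pairs them with `"S"`, `"L"` and `"A"` respectively. A mistake of this shape type-checks, which is one plausible reason it can go unnoticed for a long time.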


See the new `Note [SubDemand denotes at least one evaluation]`.

A demand `n :* sd` on a let binder `x=e` now means

> "`x` was evaluated `n` times and in any program trace it is evaluated, `e` is
> evaluated deeply in sub-demand `sd`."

The "any time it is evaluated" premise is what this patch adds. As a result, we get better nested strictness. For example (T21081)

```hs
f :: (Bool, Bool) -> (Bool, Bool)
f pr = (case pr of (a,b) -> a /= b, True)
-- before: <MP(L,L)>
-- after:  <MP(SL,SL)>

g :: Int -> (Bool, Bool)
g x = let y = let z = odd x in (z,z) in f y
```

The change in demand signature "before" to "after" allows us to case-bind `z` here.

Similarly good things happen for the `sd` in call sub-demands `Cn(sd)`, which allows for more eta-reduction (which is only sound with `-fno-pedantic-bottoms`, albeit).

We also fix #21085, a surprising inconsistency with `Poly` to `Call` sub-demand expansion.

In an attempt to fix a regression caused by less inlining due to eta-reduction in T15426, I eta-expanded the definition of `elemIndex` and `elemIndices`, thus fixing #21345 on the go.

The main point of this patch is that it fixes #21081 and #21133.

Annoyingly, I discovered that more precise demand signatures for join points can transform a program into a lazier program if that join point gets floated to the top-level, see #21392. There is no simple fix at the moment, but !5349 might. Thus, we accept a ~5% regression in `MultiLayerModulesTH_OneShot`, where #21392 bites us in `addListToUniqDSet`. T21392 reliably reproduces the issue.

Surprisingly, ghc/alloc perf on Windows improves much more than on other jobs, by 0.4% in the geometric mean and by 2% in T16875.

Metric Increase: MultiLayerModulesTH_OneShot
Metric Decrease: T16875


Previously, we let `Unboxed` win in `lubBoxity`, which is unsoundly optimistic in terms ob Boxity analysis. "Unsoundly" in the sense that we sometimes unbox parameters that we better shouldn't unbox. Examples are #18907 and T19871.absent. Until now, we thought that this hack pulled its weight becuase it worked around some shortcomings of the phase separation between Boxity analysis and CPR analysis. But it is a gross hack which caused regressions itself that needed all kinds of fixes and workarounds. See for example #20767. It became impossible to work with in !7599, so I want to remove it. For example, at the moment, `lubDmd B dmd` will not unbox `dmd`, but `lubDmd A dmd` will. Given that `B` is supposed to be the bottom element of the lattice, it's hardly justifiable to get a better demand when `lub`bing with `A`. The consequence of letting `Boxed` win in `lubBoxity` is that we *would* regress #2387, #16040 and parts of #5075 and T19871.sumIO, until Boxity and CPR are able to communicate better. Fortunately, that is not the case since I could tweak the other source of optimism in Boxity analysis that is described in `Note [Unboxed demand on function bodies returning small products]` so that we *recursively* assume unboxed demands on function bodies returning small products. See the updated Note. `Note [Boxity for bottoming functions]` describes why we need bottoming functions to have signatures that say that they deeply unbox their arguments. In so doing, I had to tweak `finaliseArgBoxities` so that it will never unbox recursive data constructors. This is in line with our handling of them in CPR. I updated `Note [Which types are unboxed?]` to reflect that. In turn we fix #21119, #20767, #18907, T19871.absent and get a much simpler implementation (at least to think about). We can also drop the very ad-hoc definition of `deferAfterPreciseException` and its Note in favor of the simple, intuitive definition we used to have. 
Metric Decrease: T16875 T18223 T18698a T18698b hard_hole_fits
Metric Increase: LargeRecord MultiComponentModulesRecomp T15703 T8095 T9872d

Out of all the regressions, only the one in T9872d doesn't vanish in a perf build, where the compiler is bootstrapped with -O2 and thus SpecConstr.

Reason for regressions:
* T9872d is due to `ty_co_subst` taking its `LiftingContext` boxed. That is because the context is passed to a function argument, for example in `liftCoSubstTyVarBndrUsing`.
* In T15703, LargeRecord and T8095, we get a bit more allocations in `expand_syn` and `piResultTys`, because a `TCvSubst` isn't unboxed. In both cases that guards against reboxing in some code paths.
* The same is true for MultiComponentModulesRecomp, where we get less unboxing in `GHC.Unit.Finder.$wfindInstalledHomeModule`. In a perf build, allocations actually *improve* by over 4%!

Results on NoFib:

--------------------------------------------------------------------------------
        Program         Allocs    Instrs
--------------------------------------------------------------------------------
         awards          -0.4%     +0.3%
      cacheprof          -0.3%     +2.4%
            fft          -1.5%     -5.1%
       fibheaps          +1.2%     +0.8%
          fluid          -0.3%     -0.1%
            ida          +0.4%     +0.9%
   k-nucleotide          +0.4%     -0.1%
     last-piece         +10.5%    +13.9%
           lift          -4.4%     +3.5%
        mandel2         -99.7%    -99.8%
           mate          -0.4%     +3.6%
         parser          -1.0%     +0.1%
         puzzle         -11.6%     +6.5%
reverse-complem          -3.0%     +2.0%
            scs          -0.5%     +0.1%
         sphere          -0.4%     -0.2%
      wave4main          -8.2%     -0.3%
--------------------------------------------------------------------------------
Summary excludes mandel2 because of excessive bias
            Min         -11.6%     -5.1%
            Max         +10.5%    +13.9%
 Geometric Mean          -0.2%     +0.3%
--------------------------------------------------------------------------------

Not bad for a bug fix.

The regression in `last-piece` could become a win if SpecConstr worked on non-recursive functions. The regression in `fibheaps` is due to `Note [Reboxed crud for bottoming calls]`, e.g., #21128.
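The lattice argument in this commit can be made concrete with a small stand-alone model. This is only a sketch: the `Boxity` type is redeclared locally rather than imported from GHC, and the names `lubBoxityOptimistic` and `lubBoxityConservative` are made up here to contrast a join where `Unboxed` wins with one where `Boxed` wins.

```haskell
-- Stand-alone model of the two-point Boxity lattice; not GHC's own module.
data Boxity = Boxed | Unboxed
  deriving (Eq, Show)

-- Optimistic join (hypothetical name): Unboxed wins unless both sides are Boxed.
lubBoxityOptimistic :: Boxity -> Boxity -> Boxity
lubBoxityOptimistic Boxed Boxed = Boxed
lubBoxityOptimistic _     _     = Unboxed

-- Conservative join (hypothetical name): Boxed wins unless both sides are Unboxed.
lubBoxityConservative :: Boxity -> Boxity -> Boxity
lubBoxityConservative Unboxed Unboxed = Unboxed
lubBoxityConservative _       _       = Boxed
```

With the conservative join, a single code path that needs the value boxed is enough to keep it boxed, which is what protects against reboxing.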


This does three major things:
* Enforce the invariant that all strict fields must contain tagged pointers.
* Try to predict the tag on bindings in order to omit tag checks.
* Allow functions to pass arguments unlifted (call-by-value).

The first is "simply" achieved by wrapping any constructor allocations with a case which will evaluate the respective strict bindings.

The prediction is done by a new data flow analysis based on the STG representation of a program. This also helps us to avoid generating redundant cases for the above invariant.

StrictWorkers are created by W/W directly and SpecConstr indirectly. See Note [Strict Worker Ids].

Other minor changes:
* Add StgUtil module containing a few functions needed by, but not specific to, the tag analysis.

-------------------------
Metric Decrease: T12545 T18698b T18140 T18923 LargeRecord
Metric Increase: LargeRecord ManyAlternatives ManyConstructors T10421 T12425 T12707 T13035 T13056 T13253 T13253-spj T13379 T15164 T18282 T18304 T18698a T1969 T20049 T3294 T4801 T5321FD T5321Fun T783 T9233 T9675 T9961 T19695 WWRec
-------------------------
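A source-level analogy may help with the strict-field invariant. The patch itself works on STG, so the following is only a sketch under that caveat; `Pair`, `mkPair` and `firstOf` are hypothetical names. Forcing the strict argument before allocation plays the role of the wrapping `case`.

```haskell
-- A constructor with one strict and one lazy field.
data Pair = MkPair !Int Int

-- Evaluate the strict field's argument before building the constructor.
-- This mirrors, at source level, the "case" wrapped around the allocation.
mkPair :: Int -> Int -> Pair
mkPair x y = x `seq` MkPair x y

-- A consumer may rely on the first field already being evaluated,
-- which is what makes omitting a tag check possible in principle.
firstOf :: Pair -> Int
firstOf (MkPair a _) = a
```

The lazy second field is never touched by `firstOf`, so `firstOf (mkPair 1 undefined)` still returns normally.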


This patch fixes some abundant reboxing of `DynFlags` in `GHC.HsToCore.Match.Literal.warnAboutOverflowedLit` (which was the topic of #19407) by introducing a Boxity analysis to GHC, done as part of demand analysis. This allows us to accurately capture, in demand analysis, the ad-hoc unboxing decisions previously made in worker/wrapper, so that the boxity info can propagate through demand signatures. See the new `Note [Boxity analysis]`. The actual fix for #19407 is described in `Note [No lazy, Unboxed demand in demand signature]`, but `Note [Finalising boxity for demand signature]` is probably a better entry-point.

To support the fix for #19407, I had to change (what was) `Note [Add demands for strict constructors]` a bit (now `Note [Unboxing evaluated arguments]`). In particular, we now take care of it in `finaliseBoxity` (which is only called from demand analysis) instead of `wantToUnboxArg`.

I also had to resurrect `Note [Product demands for function body]` and rename it to `Note [Unboxed demand on function bodies returning small products]` to avoid huge regressions in `join004` and `join007`, thereby fixing #4267 again. See the updated Note for details.

A nice side-effect is that the worker/wrapper transformation no longer needs to look at strictness info and other bits such as `InsideInlineableFun` flags (needed for `Note [Do not unbox class dictionaries]`) at all. It simply collects boxity info from argument demands and interprets them with a severely simplified `wantToUnboxArg`. All the smartness is in `finaliseBoxity`, which could be moved to DmdAnal completely, if it weren't for the call to `dubiousDataConInstArgTys` which would be awkward to export.

I spent some time figuring out why `T16197` failed prior to my amendments to `Note [Unboxing evaluated arguments]`. Having figured it out, I minimised it a bit and added `T16197b`, which simply compares computed strictness signatures and thus should be far simpler to eyeball.
The 12% ghc/alloc regression in T11545 is because of the additional `Boxity` field in `Poly` and `Prod` that results in more allocation during `lubSubDmd` and `plusSubDmd`. I made sure in the ticky profiles that the number of calls to those functions stayed the same. We can bear such an increase here, as we recently improved it by -68% (in b760c1f). T18698* regress slightly because there is more unboxing of dictionaries happening and that causes Lint (mostly) to allocate more. Fixes #19871, #19407, #4267, #16859, #18907 and #13331. Metric Increase: T11545 T18698a T18698b Metric Decrease: T12425 T16577 T18223 T18282 T4267 T9961
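To illustrate what "reboxing" means here, the following sketch writes a worker/wrapper split by hand. It is not the code from #19407; `usePair`, `f`, `fWrapper` and `fWorker` are hypothetical names, and GHC's real output differs in detail.

```haskell
-- Hypothetical consumer that needs its argument as a boxed pair.
usePair :: (Int, Int) -> Int
usePair p = fst p - snd p
{-# NOINLINE usePair #-}

-- Original function: uses the components on one path,
-- passes the whole pair along on the other.
f :: (Int, Int) -> Int
f p@(a, b)
  | a > 0     = a + b
  | otherwise = usePair p

-- Hand-written split, as if the pair argument had been unboxed.
fWrapper :: (Int, Int) -> Int
fWrapper (a, b) = fWorker a b

fWorker :: Int -> Int -> Int
fWorker a b
  | a > 0     = a + b
  | otherwise = usePair (a, b)  -- reboxing: a fresh pair is allocated here
```

A boxity analysis that notices the boxed use in the `otherwise` branch can decide to keep the argument boxed and avoid the extra allocation.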


T17516 allocations increase by 48% because Integer's predicates are inlined in some Ord instance methods. These methods then become too big to be inlined, although they probably should be: this is tracked in #20516.

Metric Increase: T17516


This patch enables worker/wrapper for nested constructed products, as described in `Note [Nested CPR]`. The machinery for expressing Nested CPR was already there, since !5054. Worker/wrapper is equipped to exploit Nested CPR annotations since !5338. CPR analysis already handles applications in batches since !5753. This patch just needs to flip a few more switches:

1. In `cprTransformDataConWork`, we need to look at the field expressions and their `CprType`s to see whether the evaluation of the expressions terminates quickly (= is in HNF) or if they are put in strict fields. If that is the case, then we retain their CPR info and may unbox nestedly later on. More details in `Note [Nested CPR]`.
2. Enable nested `ConCPR` signatures in `GHC.Types.Cpr`.
3. In the `asConCpr` call in `GHC.Core.Opt.WorkWrap.Utils`, pass CPR info of fields to the `Unbox`.
4. Instead of giving CPR signatures to DataCon workers and wrappers, we now have `cprTransformDataConWork` for workers and treat wrappers by analysing their unfolding. As a result, the code from GHC.Types.Id.Make went away completely.
5. I deactivated worker/wrappering for recursive DataCons and wrote a function `isRecDataCon` to detect them. We really don't want to give `repeat` or `replicate` the Nested CPR property. See Note [CPR for recursive data structures] for which kind of recursive DataCons we target.
6. Fix a couple of tests and their outputs.

I also documented that CPR can destroy sharing and lead to asymptotic increase in allocations (which is tracked by #13331/#19326) in `Note [CPR for data structures can destroy sharing]`.
Nofib results:
```
--------------------------------------------------------------------------------
        Program         Allocs    Instrs
--------------------------------------------------------------------------------
   ben-raytrace          -3.1%     -0.4%
   binary-trees          +0.8%     -2.9%
   digits-of-e2          +5.8%     +1.2%
          event          +0.8%     -2.1%
 fannkuch-redux          +0.0%     -1.4%
           fish           0.0%     -1.5%
         gamteb          -1.4%     -0.3%
        mkhprog          +1.4%     +0.8%
     multiplier          +0.0%     -1.9%
            pic          -0.6%     -0.1%
        reptile         -20.9%    -17.8%
      wave4main          +4.8%     +0.4%
           x2n1        -100.0%     -7.6%
--------------------------------------------------------------------------------
            Min         -95.0%    -17.8%
            Max          +5.8%     +1.2%
 Geometric Mean          -2.9%     -0.4%
```
The huge wins in x2n1 (loopy list) and reptile (see #19970) are due to refraining from unboxing (:). Other benchmarks like digits-of-e2 or wave4main regress because of that. Ultimately there are no great improvements due to Nested CPR alone, but at least it's a win. Binary sizes decrease by 0.6%.

There are a significant number of metric decreases. The most notable ones (>1%):
```
ManyAlternatives(normal)      ghc/alloc    771656002.7    762187472.0  -1.2%
ManyConstructors(normal)      ghc/alloc   4191073418.7   4114369216.0  -1.8%
MultiLayerModules(normal)     ghc/alloc   3095678333.3   3128720704.0  +1.1%
PmSeriesG(normal)             ghc/alloc     50096429.3     51495664.0  +2.8%
PmSeriesS(normal)             ghc/alloc     63512989.3     64681600.0  +1.8%
PmSeriesV(normal)             ghc/alloc     62575424.0     63767208.0  +1.9%
T10547(normal)                ghc/alloc     29347469.3     29944240.0  +2.0%
T11303b(normal)               ghc/alloc     46018752.0     47367576.0  +2.9%
T12150(optasm)                ghc/alloc     81660890.7     82547696.0  +1.1%
T12234(optasm)                ghc/alloc     59451253.3     60357952.0  +1.5%
T12545(normal)                ghc/alloc   1705216250.7   1751278952.0  +2.7%
T12707(normal)                ghc/alloc    981000472.0    968489800.0  -1.3% GOOD
T13056(optasm)                ghc/alloc    389322664.0    372495160.0  -4.3% GOOD
T13253(normal)                ghc/alloc    337174229.3    341954576.0  +1.4%
T13701(normal)                ghc/alloc   2381455173.3   2439790328.0  +2.4%  BAD
T14052(ghci)                  ghc/alloc   2162530642.7   2139108784.0  -1.1%
T14683(normal)                ghc/alloc   3049744728.0   2977535064.0  -2.4% GOOD
T14697(normal)                ghc/alloc    362980213.3    369304512.0  +1.7%
T15164(normal)                ghc/alloc   1323102752.0   1307480600.0  -1.2%
T15304(normal)                ghc/alloc   1304607429.3   1291024568.0  -1.0%
T16190(normal)                ghc/alloc    281450410.7    284878048.0  +1.2%
T16577(normal)                ghc/alloc   7984960789.3   7811668768.0  -2.2% GOOD
T17516(normal)                ghc/alloc   1171051192.0   1153649664.0  -1.5%
T17836(normal)                ghc/alloc   1115569746.7   1098197592.0  -1.6%
T17836b(normal)               ghc/alloc     54322597.3     55518216.0  +2.2%
T17977(normal)                ghc/alloc     47071754.7     48403408.0  +2.8%
T17977b(normal)               ghc/alloc     42579133.3     43977392.0  +3.3%
T18923(normal)                ghc/alloc     71764237.3     72566240.0  +1.1%
T1969(normal)                 ghc/alloc    784821002.7    773971776.0  -1.4% GOOD
T3294(normal)                 ghc/alloc   1634913973.3   1614323584.0  -1.3% GOOD
T4801(normal)                 ghc/alloc    295619648.0    292776440.0  -1.0%
T5321FD(normal)               ghc/alloc    278827858.7    276067280.0  -1.0%
T5631(normal)                 ghc/alloc    586618202.7    577579960.0  -1.5%
T5642(normal)                 ghc/alloc    494923048.0    487927208.0  -1.4%
T5837(normal)                 ghc/alloc     37758061.3     39261608.0  +4.0%
T9020(optasm)                 ghc/alloc    257362077.3    254672416.0  -1.0%
T9198(normal)                 ghc/alloc     49313365.3     50603936.0  +2.6%  BAD
T9233(normal)                 ghc/alloc    704944258.7    685692712.0  -2.7% GOOD
T9630(normal)                 ghc/alloc   1476621560.0   1455192784.0  -1.5%
T9675(optasm)                 ghc/alloc    443183173.3    433859696.0  -2.1% GOOD
T9872a(normal)                ghc/alloc   1720926653.3   1693190072.0  -1.6% GOOD
T9872b(normal)                ghc/alloc   2185618061.3   2162277568.0  -1.1% GOOD
T9872c(normal)                ghc/alloc   1765842405.3   1733618088.0  -1.8% GOOD
TcPlugin_RewritePerf(normal)  ghc/alloc   2388882730.7   2365504696.0  -1.0%
WWRec(normal)                 ghc/alloc    607073186.7    597512216.0  -1.6%
T9203(normal)                 run/alloc    107284064.0    102881832.0  -4.1%
haddock.Cabal(normal)         run/alloc  24025329589.3  23768382560.0  -1.1%
haddock.base(normal)          run/alloc  25660521653.3  25370321824.0  -1.1%
haddock.compiler(normal)      run/alloc  74064171706.7  73358712280.0  -1.0%
```
The biggest exception to the rule is T13701 which seems to fluctuate as usual (not unlike T12545).
T14697 has a similar quality, being a generated multi-module test. T5837 is small enough that it similarly doesn't measure anything significant besides module loading overhead. T13253 simply does one additional round of Simplification due to Nested CPR.

There are also some apparent regressions in T9198, T12234 and PmSeriesG that we (@mpickering and I) were simply unable to reproduce locally. @mpickering tried to run the CI script in a local Docker container and actually found that T9198 and PmSeriesG *improved*. In MRs that were rebased on top of this one, like !4229, I did not experience such increases. Let's not get hung up on these regression tests, they were meant to test for asymptotic regressions.

The build-cabal test improves by 1.2% in -O0.

Metric Increase: T10421 T12234 T12545 T13035 T13056 T13701 T14697 T18923 T5837 T9198
Metric Decrease: ManyConstructors T12545 T12707 T13056 T14683 T16577 T18223 T1969 T3294 T9203 T9233 T9675 T9872a T9872b T9872c T9961 TcPlugin_RewritePerf
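As a rough picture of what exploiting Nested CPR amounts to, here is a hand-written worker/wrapper pair for a function returning a product with strict fields. This is a sketch only: `SP`, `sumProd`, `wSumProd` and `sumProdWW` are hypothetical names, and the split GHC actually generates differs in detail.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (Int (I#), Int#, (+#), (*#))

-- A product with strict fields, so the fields' own CPR info can be retained.
data SP = SP !Int !Int
  deriving (Eq, Show)

-- Original function: returns a constructed product of constructed products.
sumProd :: Int -> Int -> SP
sumProd x y = SP (x + y) (x * y)

-- Hand-written worker: returns both fields unboxed, with no heap allocation
-- for the SP constructor or the boxed Ints.
wSumProd :: Int# -> Int# -> (# Int#, Int# #)
wSumProd x y = (# x +# y, x *# y #)

-- Hand-written wrapper: unboxes the arguments and reboxes the nested result.
sumProdWW :: Int -> Int -> SP
sumProdWW (I# x) (I# y) =
  case wSumProd x y of
    (# s, p #) -> SP (I# s) (I# p)
```

When the wrapper is inlined at a call site that immediately takes the result apart, the reboxing in `sumProdWW` can cancel against the caller's pattern match.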


This commit:

    commit c6faa42bfb954445c09c5680afd4fb875ef03758
    Author: Simon Peyton Jones <simonpj@microsoft.com>
    Date:   Mon Mar 9 10:20:42 2020 +0000

    Avoid useless w/w split

    This patch is just a tidy-up for the post-strictness-analysis worker wrapper split. Consider

        f x = x

    Strictness analysis does not lead to a w/w split, so the obvious thing is to leave it 100% alone. But actually, because the RHS is small, we ended up adding a StableUnfolding for it. There is some reason to do this if we choose /not/ to do w/w on the grounds that the function is small. See Note [Don't w/w inline small non-loop-breaker things]

    But there is no reason if we would not have done w/w anyway.

    This patch just moves the conditional to later. Easy.

turns out to have a bug in it. Instead of /moving/ the conditional, I /duplicated/ it. Then in a subsequent unrelated tidy-up (087ac4eb) I removed the second (redundant) test!

This patch does what I originally intended.

There is also a small refactoring in GHC.Core.Unfold, to make the code clearer, but with no change in behaviour.

It does, however, have a generally good effect on compile times, because we aren't dealing with so many silly stable unfoldings.
Here are the non-zero changes:

Metrics: compile_time/bytes allocated
-------------------------------------
Test                       Metric     Baseline value     New value  Change
---------------------------------------------------------------------------
ManyAlternatives(normal)   ghc/alloc    791969344.0    792665048.0   +0.1%
ManyConstructors(normal)   ghc/alloc   4351126824.0   4358303528.0   +0.2%
PmSeriesG(normal)          ghc/alloc     50362552.0     50482208.0   +0.2%
PmSeriesS(normal)          ghc/alloc     63733024.0     63619912.0   -0.2%
T10421(normal)             ghc/alloc    121224624.0    119695448.0   -1.3% GOOD
T10421a(normal)            ghc/alloc     85256392.0     83714224.0   -1.8%
T10547(normal)             ghc/alloc     29253072.0     29258256.0   +0.0%
T10858(normal)             ghc/alloc    189343152.0    187972328.0   -0.7%
T11195(normal)             ghc/alloc    281208248.0    279727584.0   -0.5%
T11276(normal)             ghc/alloc    141966952.0    142046224.0   +0.1%
T11303b(normal)            ghc/alloc     46228360.0     46259024.0   +0.1%
T11545(normal)             ghc/alloc   2663128768.0   2667412656.0   +0.2%
T11822(normal)             ghc/alloc    138686944.0    138760176.0   +0.1%
T12227(normal)             ghc/alloc    482836000.0    475421056.0   -1.5% GOOD
T12234(optasm)             ghc/alloc     60710520.0     60781808.0   +0.1%
T12425(optasm)             ghc/alloc    104089000.0    104022424.0   -0.1%
T12545(normal)             ghc/alloc   1711759416.0   1705711528.0   -0.4%
T12707(normal)             ghc/alloc    991541120.0    991921776.0   +0.0%
T13035(normal)             ghc/alloc    108199872.0    108370704.0   +0.2%
T13056(optasm)             ghc/alloc    414642544.0    412580384.0   -0.5%
T13253(normal)             ghc/alloc    361701272.0    355838624.0   -1.6%
T13253-spj(normal)         ghc/alloc    157710168.0    157397768.0   -0.2%
T13379(normal)             ghc/alloc    370984400.0    371345888.0   +0.1%
T13701(normal)             ghc/alloc   2439764144.0   2441351984.0   +0.1%
T14052(ghci)               ghc/alloc   2154090896.0   2156671400.0   +0.1%
T15164(normal)             ghc/alloc   1478517688.0   1440317696.0   -2.6% GOOD
T15630(normal)             ghc/alloc    178053912.0    172489808.0   -3.1%
T16577(normal)             ghc/alloc   7859948896.0   7854524080.0   -0.1%
T17516(normal)             ghc/alloc   1271520128.0   1202096488.0   -5.5% GOOD
T17836(normal)             ghc/alloc   1123320632.0   1123922480.0   +0.1%
T17836b(normal)            ghc/alloc     54526280.0     54576776.0   +0.1%
T17977b(normal)            ghc/alloc     42706752.0     42730544.0   +0.1%
T18140(normal)             ghc/alloc    108834568.0    108693816.0   -0.1%
T18223(normal)             ghc/alloc   5539629264.0   5579500872.0   +0.7%
T18304(normal)             ghc/alloc     97589720.0     97196944.0   -0.4%
T18478(normal)             ghc/alloc    770755472.0    771232888.0   +0.1%
T18698a(normal)            ghc/alloc    408691160.0    374364992.0   -8.4% GOOD
T18698b(normal)            ghc/alloc    492419768.0    458809408.0   -6.8% GOOD
T18923(normal)             ghc/alloc     72177032.0     71368824.0   -1.1%
T1969(normal)              ghc/alloc    803523496.0    804655112.0   +0.1%
T3064(normal)              ghc/alloc    198411784.0    198608512.0   +0.1%
T4801(normal)              ghc/alloc    312416688.0    312874976.0   +0.1%
T5321Fun(normal)           ghc/alloc    325230680.0    325474448.0   +0.1%
T5631(normal)              ghc/alloc    592064448.0    593518968.0   +0.2%
T5837(normal)              ghc/alloc     37691496.0     37710904.0   +0.1%
T783(normal)               ghc/alloc    404629536.0    405064432.0   +0.1%
T9020(optasm)              ghc/alloc    266004608.0    266375592.0   +0.1%
T9198(normal)              ghc/alloc     49221336.0     49268648.0   +0.1%
T9233(normal)              ghc/alloc    913464984.0    742680256.0  -18.7% GOOD
T9675(optasm)              ghc/alloc    552296608.0    466322000.0  -15.6% GOOD
T9872a(normal)             ghc/alloc   1789910616.0   1793924472.0   +0.2%
T9872b(normal)             ghc/alloc   2315141376.0   2310338056.0   -0.2%
T9872c(normal)             ghc/alloc   1840422424.0   1841567224.0   +0.1%
T9872d(normal)             ghc/alloc    556713248.0    556838432.0   +0.0%
T9961(normal)              ghc/alloc    383809160.0    384601600.0   +0.2%
WWRec(normal)              ghc/alloc    773751272.0    753949608.0   -2.6% GOOD

Residency goes down too:

Metrics: compile_time/max_bytes_used
------------------------------------
Test              Metric   Baseline value   New value  Change
-----------------------------------------------------------
T10370(optasm)    ghc/max    42058448.0    39481672.0   -6.1%
T11545(normal)    ghc/max    43641392.0    43634752.0   -0.0%
T15304(normal)    ghc/max    29895824.0    29439032.0   -1.5%
T15630(normal)    ghc/max     8822568.0     8772328.0   -0.6%
T18698a(normal)   ghc/max    13882536.0    13787112.0   -0.7%
T18698b(normal)   ghc/max    14714112.0    13836408.0   -6.0%
T1969(normal)     ghc/max    24724128.0    24733496.0   +0.0%
T3064(normal)     ghc/max    14041152.0    14034768.0   -0.0%
T3294(normal)     ghc/max    32769248.0    32760312.0   -0.0%
T9630(normal)     ghc/max    41605120.0    41572184.0   -0.1%
T9675(optasm)     ghc/max    18652296.0    17253480.0   -7.5%

Metric Decrease: T10421 T12227 T15164 T17516 T18698a T18698b T9233 T9675 WWRec
Metric Increase: T12545


The update of the Outputable instance resulted in a slew of documentation changes within Notes that used the old syntax. The most important doc changes are to `Note [Demand notation]` and the user's guide. Fixes #19016.


Both sub-demands encode the same information. This is a trivial change and already affects a few regression tests (e.g. `T5075`), so no separate regression test is necessary.


As outlined in #18903, interleaving usage and strictness demands not only means a more compact demand representation, but also allows us to express demands that we weren't easily able to express before.

Call demands are *relative* in the sense that a call demand `Cn(cd)` on `g` says "`g` is called `n` times. *Whenever `g` is called*, the result is used according to `cd`". Example from #18903:

```hs
h :: Int -> Int
h m =
  let g :: Int -> (Int,Int)
      g 1 = (m, 0)
      g n = (2 * n, 2 `div` n)
      {-# NOINLINE g #-}
  in case m of
    1 -> 0
    2 -> snd (g m)
    _ -> uncurry (+) (g m)
```

Without the interleaved representation, we would just get `L` for the strictness demand on `g`. Now we are able to express that whenever `g` is called, its second component is used strictly in denoting `g` by `1C1(P(1P(U),SP(U)))`. This would allow Nested CPR to unbox the division, for example.

Fixes #18903.
While fixing regressions, I also discovered and fixed #18957.

Metric Decrease: T13253-spj


See Note [Exciting arity] for why we emit the warning at all and why we now only do so after the second iteration.

Fixes #18937.


As we found out in #18870, `andArityType` is not monotone, with potentially severe consequences for termination of fixed-point iteration. That showed in an abundance of "Exciting arity" DEBUG messages that are emitted whenever we do more than one step in fixed-point iteration.

The solution necessitates also recording `OneShotInfo` info for `ABot` arity type. Thus we get the following definition for `ArityType`:

```
data ArityType = AT [OneShotInfo] Divergence
```

The majority of changes in this patch are the result of refactoring use sites of `ArityType` to match the new definition.

The regression test `T18870` asserts that we indeed don't emit any DEBUG output anymore for a function where we previously would have. Similarly, there's a regression test `T18937` for #18937, which we expect to be broken for now.

Fixes #18870.
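The connection between monotonicity and termination can be seen in a generic fixed-point loop. This is not GHC's arity analysis; `fixIter` is a hypothetical helper that iterates a step function until the value stops changing and reports how many times the step function was applied.

```haskell
-- Iterate 'step' from a starting value until a fixed point is reached.
-- Returns the fixed point and the number of applications of 'step'.
-- Terminates only if the sequence eventually stabilises, which a monotone
-- step function on a finite-height lattice guarantees.
fixIter :: Eq a => (a -> a) -> a -> (a, Int)
fixIter step = go 1
  where
    go n x
      | x' == x   = (x, n)
      | otherwise = go (n + 1) x'
      where
        x' = step x
```

For example, `fixIter (min 3 . (+ 1)) (0 :: Int)` climbs 0, 1, 2, 3 and stops once the capped value no longer changes. A non-monotone step function can oscillate instead, in which case the loop takes many steps or never stops.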


We didn't seem to test these old tests at all, judging from their expected output.


In #18793, we saw a compelling example which requires us to look at non-recursive let-bindings during arity analysis and unleash their arity types at use sites.

After the refactoring in the previous patch, the needed change is quite simple and very local to `arityType`'s defn for non-recursive `Let`.

Apart from that, we had to get rid of the second item of `Note [Dealing with bottoms]`, which was entirely a safety measure and hindered optimistic fixed-point iteration.

Fixes #18793.

The following metric increases are all caused by this commit and a result of the fact that we just do more work now:

Metric Increase: T3294 T12545 T12707
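A tiny illustration of why a let-binding's arity matters at its use site follows. It is not the program from #18793; `f` and `fEta` are hypothetical names chosen for this sketch.

```haskell
-- 'f' syntactically takes one argument, but its body is a non-recursive
-- let whose result is the let-bound lambda 'g' of arity 1.
f :: Int -> Int -> Int
f x =
  let g = \y -> x + y
  in g

-- If arity analysis looks at 'g' and uses its arity at the use site,
-- it can conclude that 'f' behaves like this two-argument function.
fEta :: Int -> Int -> Int
fEta x y = x + y
```

Both definitions compute the same results; the difference is only in how many arguments the compiler knows the function can take before doing any work.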



Summary:
My understanding is that ghc 7.10 should be buildable with the last 3 versions of ghc, i.e. 7.6, 7.8 and 7.10 itself.

Test Plan: x

Reviewers: austin

Reviewed By: austin

Subscribers: hvr, simonmar, ezyang, carter, thomie

Differential Revision: https://phabricator.haskell.org/D254