| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
| |
submodule updates: nofib, haddock
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch removes all CafInfo predictions and various hacks to preserve
predicted CafInfos from the compiler and assigns final CafInfos to
interface Ids after code generation. SRT analysis is extended to support
static data, and Cmm generator is modified to allow generating
static_link fields after SRT analysis.
This also fixes `-fcatch-bottoms`, which introduces error calls in case
expressions in CorePrep, which runs *after* CoreTidy (which is where we
decide on CafInfos) and turns previously non-CAFFY things into CAFFY.
Fixes #17648
Fixes #9718
Evaluation
==========
NoFib
-----
Boot with: `make boot mode=fast`
Run: `make mode=fast EXTRA_RUNTEST_OPTS="-cachegrind" NoFibRuns=1`
--------------------------------------------------------------------------------
Program Size Allocs Instrs Reads Writes
--------------------------------------------------------------------------------
CS -0.0% 0.0% -0.0% -0.0% -0.0%
CSD -0.0% 0.0% -0.0% -0.0% -0.0%
FS -0.0% 0.0% -0.0% -0.0% -0.0%
S -0.0% 0.0% -0.0% -0.0% -0.0%
VS -0.0% 0.0% -0.0% -0.0% -0.0%
VSD -0.0% 0.0% -0.0% -0.0% -0.5%
VSM -0.0% 0.0% -0.0% -0.0% -0.0%
anna -0.1% 0.0% -0.0% -0.0% -0.0%
ansi -0.0% 0.0% -0.0% -0.0% -0.0%
atom -0.0% 0.0% -0.0% -0.0% -0.0%
awards -0.0% 0.0% -0.0% -0.0% -0.0%
banner -0.0% 0.0% -0.0% -0.0% -0.0%
bernouilli -0.0% 0.0% -0.0% -0.0% -0.0%
binary-trees -0.0% 0.0% -0.0% -0.0% -0.0%
boyer -0.0% 0.0% -0.0% -0.0% -0.0%
boyer2 -0.0% 0.0% -0.0% -0.0% -0.0%
bspt -0.0% 0.0% -0.0% -0.0% -0.0%
cacheprof -0.0% 0.0% -0.0% -0.0% -0.0%
calendar -0.0% 0.0% -0.0% -0.0% -0.0%
cichelli -0.0% 0.0% -0.0% -0.0% -0.0%
circsim -0.0% 0.0% -0.0% -0.0% -0.0%
clausify -0.0% 0.0% -0.0% -0.0% -0.0%
comp_lab_zift -0.0% 0.0% -0.0% -0.0% -0.0%
compress -0.0% 0.0% -0.0% -0.0% -0.0%
compress2 -0.0% 0.0% -0.0% -0.0% -0.0%
constraints -0.0% 0.0% -0.0% -0.0% -0.0%
cryptarithm1 -0.0% 0.0% -0.0% -0.0% -0.0%
cryptarithm2 -0.0% 0.0% -0.0% -0.0% -0.0%
cse -0.0% 0.0% -0.0% -0.0% -0.0%
digits-of-e1 -0.0% 0.0% -0.0% -0.0% -0.0%
digits-of-e2 -0.0% 0.0% -0.0% -0.0% -0.0%
dom-lt -0.0% 0.0% -0.0% -0.0% -0.0%
eliza -0.0% 0.0% -0.0% -0.0% -0.0%
event -0.0% 0.0% -0.0% -0.0% -0.0%
exact-reals -0.0% 0.0% -0.0% -0.0% -0.0%
exp3_8 -0.0% 0.0% -0.0% -0.0% -0.0%
expert -0.0% 0.0% -0.0% -0.0% -0.0%
fannkuch-redux -0.0% 0.0% -0.0% -0.0% -0.0%
fasta -0.0% 0.0% -0.0% -0.0% -0.0%
fem -0.0% 0.0% -0.0% -0.0% -0.0%
fft -0.0% 0.0% -0.0% -0.0% -0.0%
fft2 -0.0% 0.0% -0.0% -0.0% -0.0%
fibheaps -0.0% 0.0% -0.0% -0.0% -0.0%
fish -0.0% 0.0% -0.0% -0.0% -0.0%
fluid -0.1% 0.0% -0.0% -0.0% -0.0%
fulsom -0.0% 0.0% -0.0% -0.0% -0.0%
gamteb -0.0% 0.0% -0.0% -0.0% -0.0%
gcd -0.0% 0.0% -0.0% -0.0% -0.0%
gen_regexps -0.0% 0.0% -0.0% -0.0% -0.0%
genfft -0.0% 0.0% -0.0% -0.0% -0.0%
gg -0.0% 0.0% -0.0% -0.0% -0.0%
grep -0.0% 0.0% -0.0% -0.0% -0.0%
hidden -0.0% 0.0% -0.0% -0.0% -0.0%
hpg -0.1% 0.0% -0.0% -0.0% -0.0%
ida -0.0% 0.0% -0.0% -0.0% -0.0%
infer -0.0% 0.0% -0.0% -0.0% -0.0%
integer -0.0% 0.0% -0.0% -0.0% -0.0%
integrate -0.0% 0.0% -0.0% -0.0% -0.0%
k-nucleotide -0.0% 0.0% -0.0% -0.0% -0.0%
kahan -0.0% 0.0% -0.0% -0.0% -0.0%
knights -0.0% 0.0% -0.0% -0.0% -0.0%
lambda -0.0% 0.0% -0.0% -0.0% -0.0%
last-piece -0.0% 0.0% -0.0% -0.0% -0.0%
lcss -0.0% 0.0% -0.0% -0.0% -0.0%
life -0.0% 0.0% -0.0% -0.0% -0.0%
lift -0.0% 0.0% -0.0% -0.0% -0.0%
linear -0.1% 0.0% -0.0% -0.0% -0.0%
listcompr -0.0% 0.0% -0.0% -0.0% -0.0%
listcopy -0.0% 0.0% -0.0% -0.0% -0.0%
maillist -0.0% 0.0% -0.0% -0.0% -0.0%
mandel -0.0% 0.0% -0.0% -0.0% -0.0%
mandel2 -0.0% 0.0% -0.0% -0.0% -0.0%
mate -0.0% 0.0% -0.0% -0.0% -0.0%
minimax -0.0% 0.0% -0.0% -0.0% -0.0%
mkhprog -0.0% 0.0% -0.0% -0.0% -0.0%
multiplier -0.0% 0.0% -0.0% -0.0% -0.0%
n-body -0.0% 0.0% -0.0% -0.0% -0.0%
nucleic2 -0.0% 0.0% -0.0% -0.0% -0.0%
para -0.0% 0.0% -0.0% -0.0% -0.0%
paraffins -0.0% 0.0% -0.0% -0.0% -0.0%
parser -0.1% 0.0% -0.0% -0.0% -0.0%
parstof -0.1% 0.0% -0.0% -0.0% -0.0%
pic -0.0% 0.0% -0.0% -0.0% -0.0%
pidigits -0.0% 0.0% -0.0% -0.0% -0.0%
power -0.0% 0.0% -0.0% -0.0% -0.0%
pretty -0.0% 0.0% -0.3% -0.4% -0.4%
primes -0.0% 0.0% -0.0% -0.0% -0.0%
primetest -0.0% 0.0% -0.0% -0.0% -0.0%
prolog -0.0% 0.0% -0.0% -0.0% -0.0%
puzzle -0.0% 0.0% -0.0% -0.0% -0.0%
queens -0.0% 0.0% -0.0% -0.0% -0.0%
reptile -0.0% 0.0% -0.0% -0.0% -0.0%
reverse-complem -0.0% 0.0% -0.0% -0.0% -0.0%
rewrite -0.0% 0.0% -0.0% -0.0% -0.0%
rfib -0.0% 0.0% -0.0% -0.0% -0.0%
rsa -0.0% 0.0% -0.0% -0.0% -0.0%
scc -0.0% 0.0% -0.3% -0.5% -0.4%
sched -0.0% 0.0% -0.0% -0.0% -0.0%
scs -0.0% 0.0% -0.0% -0.0% -0.0%
simple -0.1% 0.0% -0.0% -0.0% -0.0%
solid -0.0% 0.0% -0.0% -0.0% -0.0%
sorting -0.0% 0.0% -0.0% -0.0% -0.0%
spectral-norm -0.0% 0.0% -0.0% -0.0% -0.0%
sphere -0.0% 0.0% -0.0% -0.0% -0.0%
symalg -0.0% 0.0% -0.0% -0.0% -0.0%
tak -0.0% 0.0% -0.0% -0.0% -0.0%
transform -0.0% 0.0% -0.0% -0.0% -0.0%
treejoin -0.0% 0.0% -0.0% -0.0% -0.0%
typecheck -0.0% 0.0% -0.0% -0.0% -0.0%
veritas -0.0% 0.0% -0.0% -0.0% -0.0%
wang -0.0% 0.0% -0.0% -0.0% -0.0%
wave4main -0.0% 0.0% -0.0% -0.0% -0.0%
wheel-sieve1 -0.0% 0.0% -0.0% -0.0% -0.0%
wheel-sieve2 -0.0% 0.0% -0.0% -0.0% -0.0%
x2n1 -0.0% 0.0% -0.0% -0.0% -0.0%
--------------------------------------------------------------------------------
Min -0.1% 0.0% -0.3% -0.5% -0.5%
Max -0.0% 0.0% -0.0% -0.0% -0.0%
Geometric Mean -0.0% -0.0% -0.0% -0.0% -0.0%
--------------------------------------------------------------------------------
Program Size Allocs Instrs Reads Writes
--------------------------------------------------------------------------------
circsim -0.1% 0.0% -0.0% -0.0% -0.0%
constraints -0.0% 0.0% -0.0% -0.0% -0.0%
fibheaps -0.0% 0.0% -0.0% -0.0% -0.0%
gc_bench -0.0% 0.0% -0.0% -0.0% -0.0%
hash -0.0% 0.0% -0.0% -0.0% -0.0%
lcss -0.0% 0.0% -0.0% -0.0% -0.0%
power -0.0% 0.0% -0.0% -0.0% -0.0%
spellcheck -0.0% 0.0% -0.0% -0.0% -0.0%
--------------------------------------------------------------------------------
Min -0.1% 0.0% -0.0% -0.0% -0.0%
Max -0.0% 0.0% -0.0% -0.0% -0.0%
Geometric Mean -0.0% +0.0% -0.0% -0.0% -0.0%
Manual inspection of programs in testsuite/tests/programs
---------------------------------------------------------
I built these programs with a bunch of dump flags and `-O` and compared
STG, Cmm, and Asm dumps and file sizes.
(Below the numbers in parenthesis show number of modules in the program)
These programs have identical compiler (same .hi and .o sizes, STG, and
Cmm and Asm dumps):
- Queens (1), andre_monad (1), cholewo-eval (2), cvh_unboxing (3),
andy_cherry (7), fun_insts (1), hs-boot (4), fast2haskell (2),
jl_defaults (1), jq_readsPrec (1), jules_xref (1), jtod_circint (4),
jules_xref2 (1), lennart_range (1), lex (1), life_space_leak (1),
bargon-mangler-bug (7), record_upd (1), rittri (1), sanders_array (1),
strict_anns (1), thurston-module-arith (2), okeefe_neural (1),
joao-circular (6), 10queens (1)
Programs with different compiler outputs:
- jl_defaults (1): For some reason GHC HEAD marks a lot of top-level
`[Int]` closures as CAFFY for no reason. With this patch we no longer
make them CAFFY and generate less SRT entries. For some reason Main.o
is slightly larger with this patch (1.3%) and the executable sizes are
the same. (I'd expect both to be smaller)
- launchbury (1): Same as jl_defaults: top-level `[Int]` closures marked
as CAFFY for no reason. Similarly `Main.o` is 1.4% larger but the
executable sizes are the same.
- galois_raytrace (13): Differences are in the Parse module. There are a
lot, but some of the changes are caused by the fact that for some
reason (I think a bug) GHC HEAD marks the dictionary for `Functor
Identity` as CAFFY. Parse.o is 0.4% larger, the executable size is the
same.
- north_array: We now generate less SRT entries because some of array
primops used in this program like `NewArrayOp` get eliminated during
Stg-to-Cmm and turn some CAFFY things into non-CAFFY. Main.o gets 24%
larger (9224 bytes from 9000 bytes), executable sizes are the same.
- seward-space-leak: Difference in this program is better shown by this
smaller example:
module Lib where
data CDS
= Case [CDS] [(Int, CDS)]
| Call CDS CDS
instance Eq CDS where
Case sels1 rets1 == Case sels2 rets2 =
sels1 == sels2 && rets1 == rets2
Call a1 b1 == Call a2 b2 =
a1 == a2 && b1 == b2
_ == _ =
False
In this program GHC HEAD builds a new SRT for the recursive group of
`(==)`, `(/=)` and the dictionary closure. Then `/=` points to `==`
in its SRT field, and `==` uses the SRT object as its SRT. With this
patch we use the closure for `/=` as the SRT and add `==` there. Then
`/=` gets an empty SRT field and `==` points to `/=` in its SRT
field.
This change looks fine to me.
Main.o gets 0.07% larger, executable sizes are identical.
head.hackage
------------
head.hackage's CI script builds 428 packages from Hackage using this
patch with no failures.
Compiler performance
--------------------
The compiler perf tests report that the compiler allocates slightly more
(worst case observed so far is 4%). However most programs in the test
suite are small, single file programs. To benchmark compiler performance
on something more realistic I build Cabal (the library, 236 modules)
with different optimisation levels. For the "max residency" row I run
GHC with `+RTS -s -A100k -i0 -h` for more accurate numbers. Other rows
are generated with just `-s`. (This is because `-i0` causes running GC
much more frequently and as a result "bytes copied" gets inflated by
more than 25x in some cases)
* -O0
| | GHC HEAD | This MR | Diff |
| --------------- | -------------- | -------------- | ------ |
| Bytes allocated | 54,413,350,872 | 54,701,099,464 | +0.52% |
| Bytes copied | 4,926,037,184 | 4,990,638,760 | +1.31% |
| Max residency | 421,225,624 | 424,324,264 | +0.73% |
* -O1
| | GHC HEAD | This MR | Diff |
| --------------- | --------------- | --------------- | ------ |
| Bytes allocated | 245,849,209,992 | 246,562,088,672 | +0.28% |
| Bytes copied | 26,943,452,560 | 27,089,972,296 | +0.54% |
| Max residency | 982,643,440 | 991,663,432 | +0.91% |
* -O2
| | GHC HEAD | This MR | Diff |
| --------------- | --------------- | --------------- | ------ |
| Bytes allocated | 291,044,511,408 | 291,863,910,912 | +0.28% |
| Bytes copied | 37,044,237,616 | 36,121,690,472 | -2.49% |
| Max residency | 1,071,600,328 | 1,086,396,256 | +1.38% |
Extra compiler allocations
--------------------------
Runtime allocations of programs are as reported above (NoFib section).
The compiler now allocates more than before. Main source of allocation
in this patch compared to base commit is the new SRT algorithm
(GHC.Cmm.Info.Build). Below is some of the extra work we do with this
patch, numbers generated by profiled stage 2 compiler when building a
pathological case (the test 'ManyConstructors') with '-O2':
- We now sort the final STG for a module, which means traversing the
entire program, generating free variable set for each top-level
binding, doing SCC analysis, and re-ordering the program. In
ManyConstructors this step allocates 97,889,952 bytes.
- We now do SRT analysis on static data, which in a program like
ManyConstructors causes analysing 10,000 bindings that we would
previously just skip. This step allocates 70,898,352 bytes.
- We now maintain an SRT map for the entire module as we compile Cmm
groups:
data ModuleSRTInfo = ModuleSRTInfo
{ ...
, moduleSRTMap :: SRTMap
}
(SRTMap is just a strict Map from the 'containers' library)
This map gets an entry for most bindings in a module (exceptions are
THUNKs and CAFFY static functions). For ManyConstructors this map
gets 50015 entries.
- Once we're done with code generation we generate a NameSet from SRTMap
for the non-CAFFY names in the current module. This set gets the same
number of entries as the SRTMap.
- Finally we update CafInfos in ModDetails for the non-CAFFY Ids, using
the NameSet generated in the previous step. This usually does the
least amount of allocation among the work listed here.
Only place with this patch where we do less work in the CAF analysis in
the tidying pass (CoreTidy). However that doesn't save us much, as the
pass still needs to traverse the whole program and update IdInfos for
other reasons. Only thing we don't here do is the `hasCafRefs` pass over
the RHS of bindings, which is a stateless pass that returns a boolean
value, so it doesn't allocate much.
(Metric changes blow are all increased allocations)
Metric changes
--------------
Metric Increase:
ManyAlternatives
ManyConstructors
T13035
T14683
T1969
T9961
|
|
|
|
|
|
| |
incomplete-uni-patterns and incomplete-record-updates will be in -Wall at a
future date, so prepare for that by disabling those warnings on files that
trigger them.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is part two of fixing #17334.
There are two parts to this commit:
- A bugfix for computing loop levels
- A bugfix of basic block invariants in the NCG.
-----------------------------------------------------------
In the first bug we ended up with a CFG of the sort: [A -> B -> C]
This was represented via maps as fromList [(A,B),(B,C)] and later
transformed into a adjacency array. However the transformation did
not include block C in the array (since we only looked at the keys of
the map).
This was still fine until we tried to look up successors for C and tried
to read outside of the array bounds when accessing C.
In order to prevent this in the future I refactored to code to include
all nodes as keys in the map representation. And make this a invariant
which is checked in a few places.
Overall I expect this to make the code more robust as now any failed
lookup will represent an error, versus failed lookups sometimes being
expected and sometimes not.
In terms of performance this makes some things cheaper (getting a list
of all nodes) and others more expensive (adding a new edge). Overall
this adds up to no noteable performance difference.
-----------------------------------------------------------
Part 2: When the NCG generated a new basic block, it did
not always insert a NEWBLOCK meta instruction in the stream which
caused a quite subtle bug.
During instruction selection a statement `s`
in a block B with control of the sort: B -> C
will sometimes result in control
flow of the sort:
┌ < ┐
v ^
B -> B1 ┴ -> C
as is the case for some atomic operations.
Now to keep the CFG in sync when introducing B1 we clearly
want to insert it between B and C. However there is
a catch when we have to deal with self loops.
We might start with code and a CFG of these forms:
loop:
stmt1 ┌ < ┐
.... v ^
stmtX loop ┘
stmtY
....
goto loop:
Now we introduce B1:
┌ ─ ─ ─ ─ ─┐
loop: │ ┌ < ┐ │
instrs v │ │ ^
.... loop ┴ B1 ┴ ┘
instrsFromX
stmtY
goto loop:
This is simple, all outgoing edges from loop now simply
start from B1 instead and the code generator knows which
new edges it introduced for the self loop of B1.
Disaster strikes if the statement Y follows the same pattern.
If we apply the same rule that all outgoing edges change then
we end up with:
loop ─> B1 ─> B2 ┬─┐
│ │ └─<┤ │
│ └───<───┘ │
└───────<────────┘
This is problematic. The edge B1->B1 is modified as expected.
However the modification is wrong!
The assembly in this case looked like this:
_loop:
<instrs>
_B1:
...
cmpxchgq ...
jne _B1
<instrs>
<end _B1>
_B2:
...
cmpxchgq ...
jne _B2
<instrs>
jmp loop
There is no edge _B2 -> _B1 here. It's still a self loop onto _B1.
The problem here is that really B1 should be two basic blocks.
Otherwise we have control flow in the *middle* of a basic block.
A contradiction!
So to account for this we add yet another basic block marker:
_B:
<instrs>
_B1:
...
cmpxchgq ...
jne _B1
jmp _B1'
_B1':
<instrs>
<end _B1>
_B2:
...
Now when inserting B2 we will only look at the outgoing edges of B1' and
everything will work out nicely.
You might also wonder why we don't insert jumps at the end of _B1'. There is
no way another block ends up jumping to the labels _B1 or _B2 since they are
essentially invisible to other blocks. View them as control flow labels local
to the basic block if you'd like.
Not doing this ultimately caused (part 2 of) #17334.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For backends maintaining the CFG during codegen
we can now find loops and their nesting level.
This is based on the Cmm CFG and dominator analysis.
As a result we can estimate edge frequencies a lot better
for methods, resulting in far better code layout.
Speedup on nofib: ~1.5%
Increase in compile times: ~1.9%
To make this feasible this commit adds:
* Dominator analysis based on the Lengauer-Tarjan Algorithm.
* An algorithm estimating global edge frequences from branch
probabilities - In CFG.hs
A few static branch prediction heuristics:
* Expect to take the backedge in loops.
* Expect to take the branch NOT exiting a loop.
* Expect integer vs constant comparisons to be false.
We also treat heap/stack checks special for branch prediction
to avoid them being treated as loops.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Statements can change the basic block in which instructions
are placed during instruction selection.
We have to keep track of this switch of the current basic block
as we need this information in order to properly update the CFG.
This commit implements this change and fixes #17334.
We do so by having stmtToInstr return the new block id
if a statement changed the basic block.
|
|
|
|
|
|
|
|
| |
- Remove unneeded ones
- Use <..> for inter-package.
Besides general clean up, helps distinguish between the RTS we link
against vs the RTS we compile for.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Introduces a new flag `-fmax-pmcheck-deltas` to achieve that. Deprecates
the old `-fmax-pmcheck-iter` mechanism in favor of this new flag.
From the user's guide:
Pattern match checking can be exponential in some cases. This limit makes sure
we scale polynomially in the number of patterns, by forgetting refined
information gained from a partially successful match. For example, when
matching `x` against `Just 4`, we split each incoming matching model into two
sub-models: One where `x` is not `Nothing` and one where `x` is `Just y` but
`y` is not `4`. When the number of incoming models exceeds the limit, we
continue checking the next clause with the original, unrefined model.
This also retires the incredibly hard to understand "maximum number of
refinements" mechanism, because the current mechanism is more general
and should catch the same exponential cases like PrelRules at the same
time.
-------------------------
Metric Decrease:
T11822
-------------------------
|
| |
|
|
|
|
|
|
| |
Add StgToCmm module hierarchy. Platform modules that are used in several
other places (NCG, LLVM codegen, Cmm transformations) are put into
GHC.Platform.
|
|
|
|
|
|
|
|
|
|
|
| |
These kinds of imports are necessary in some cases such as
importing instances of typeclasses or intentionally creating
dependencies in the build system, but '-Wunused-imports' can't
detect when they are no longer needed. This commit removes the
unused ones currently in the code base (not including test files
or submodules), with the hope that doing so may increase
parallelism in the build system by removing unnecessary
dependencies.
|
|
|
|
|
|
|
| |
Unfortunately this will require more work; register allocation is
quite broken.
This reverts commit acd795583625401c5554f8e04ec7efca18814011.
|
|
|
|
|
|
|
| |
This adds support for constructing vector types from Float#, Double# etc
and performing arithmetic operations on them
Cleaned-Up-By: Ben Gamari <ben@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here the following changes are introduced:
- A read barrier machine op is added to Cmm.
- The order in which a closure's fields are read and written is changed.
- Memory barriers are added to RTS code to ensure correctness on
out-or-order machines with weak memory ordering.
Cmm has a new CallishMachOp called MO_ReadBarrier. On weak memory machines, this
is lowered to an instruction that ensures memory reads that occur after said
instruction in program order are not performed before reads coming before said
instruction in program order. On machines with strong memory ordering properties
(e.g. X86, SPARC in TSO mode) no such instruction is necessary, so
MO_ReadBarrier is simply erased. However, such an instruction is necessary on
weakly ordered machines, e.g. ARM and PowerPC.
Weam memory ordering has consequences for how closures are observed and mutated.
For example, consider a closure that needs to be updated to an indirection. In
order for the indirection to be safe for concurrent observers to enter, said
observers must read the indirection's info table before they read the
indirectee. Furthermore, the entering observer makes assumptions about the
closure based on its info table contents, e.g. an INFO_TYPE of IND imples the
closure has an indirectee pointer that is safe to follow.
When a closure is updated with an indirection, both its info table and its
indirectee must be written. With weak memory ordering, these two writes can be
arbitrarily reordered, and perhaps even interleaved with other threads' reads
and writes (in the absence of memory barrier instructions). Consider this
example of a bad reordering:
- An updater writes to a closure's info table (INFO_TYPE is now IND).
- A concurrent observer branches upon reading the closure's INFO_TYPE as IND.
- A concurrent observer reads the closure's indirectee and enters it. (!!!)
- An updater writes the closure's indirectee.
Here the update to the indirectee comes too late and the concurrent observer has
jumped off into the abyss. Speculative execution can also cause us issues,
consider:
- An observer is about to case on a value in closure's info table.
- The observer speculatively reads one or more of closure's fields.
- An updater writes to closure's info table.
- The observer takes a branch based on the new info table value, but with the
old closure fields!
- The updater writes to the closure's other fields, but its too late.
Because of these effects, reads and writes to a closure's info table must be
ordered carefully with respect to reads and writes to the closure's other
fields, and memory barriers must be placed to ensure that reads and writes occur
in program order. Specifically, updates to a closure must follow the following
pattern:
- Update the closure's (non-info table) fields.
- Write barrier.
- Update the closure's info table.
Observing a closure's fields must follow the following pattern:
- Read the closure's info pointer.
- Read barrier.
- Read the closure's (non-info table) fields.
This patch updates RTS code to obey this pattern. This should fix long-standing
SMP bugs on ARM (specifically newer aarch64 microarchitectures supporting
out-of-order execution) and PowerPC. This fixes issue #15449.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
|
|
|
|
|
|
| |
ghc-pkg needs to be aware of platforms so it can figure out which
subdire within the user package db to use. This is admittedly
roundabout, but maybe Cabal could use the same notion of a platform as
GHC to good affect too.
|
|
|
|
|
| |
Previously log and exp were primitives yet log1p and expm1 were FFI
calls. Fix this non-uniformity.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* simplifies registers to have GPR, Float and Double, by removing the SSE2 and X87 Constructors
* makes -msse2 assumed/default for x86 platforms, fixing a long standing nondeterminism in rounding
behavior in 32bit haskell code
* removes the 80bit floating point representation from the supported float sizes
* theres still 1 tiny bit of x87 support needed,
for handling float and double return values in FFI calls wrt the C ABI on x86_32,
but this one piece does not leak into the rest of NCG.
* Lots of code thats not been touched in a long time got deleted as a
consequence of all of this
all in all, this change paves the way towards a lot of future further
improvements in how GHC handles floating point computations, along with
making the native code gen more accessible to a larger pool of contributors.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixes #16052
When the offset in `setByteArray#` is statically known, we can provide
better alignment guarantees then just 1 byte.
Also, memset can now do 64-bit wide sets.
The current memset intrinsic is not optimal however and can be
improved for the case when we know that we deal with
(baseAddress at known alignment) + offset
For instance, on 64-bit
`setByteArray# s 1# 23# 0#`
given that bytearray is 8 bytes aligned could be unrolled into
`movb, movw, movl, movq, movq`; but currently it is
`movb x23` since alignment of 1 is all we can embed into MO_Memset op.
|
|
|
|
|
|
| |
This commit includes the necessary changes in code and
documentation to support a primop that reverses a word's
bits. It also includes a test.
|
| |
|
|
|
|
|
|
|
| |
under -mbmi2
This works similarly to existing implementation for popCount.
Trac ticket: #16086.
|
|
|
|
| |
This reverts commit 76c8fd674435a652c75a96c85abbf26f1f221876.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
This patch implements a new code layout algorithm.
It has been tested for x86 and is disabled on other platforms.
Performance varies slightly be CPU/Machine but in general seems to be better
by around 2%.
Nofib shows only small differences of about +/- ~0.5% overall depending on
flags/machine performance in other benchmarks improved significantly.
Other benchmarks includes at least the benchmarks of: aeson, vector, megaparsec, attoparsec,
containers, text and xeno.
While the magnitude of gains differed three different CPUs where tested with
all getting faster although to differing degrees. I tested: Sandy Bridge(Xeon), Haswell,
Skylake
* Library benchmark results summarized:
* containers: ~1.5% faster
* aeson: ~2% faster
* megaparsec: ~2-5% faster
* xml library benchmarks: 0.2%-1.1% faster
* vector-benchmarks: 1-4% faster
* text: 5.5% faster
On average GHC compile times go down, as GHC compiled with the new layout
is faster than the overhead introduced by using the new layout algorithm,
Things this patch does:
* Move code responsilbe for block layout in it's own module.
* Move the NcgImpl Class into the NCGMonad module.
* Extract a control flow graph from the input cmm.
* Update this cfg to keep it in sync with changes during
asm codegen. This has been tested on x64 but should work on x86.
Other platforms still use the old codelayout.
* Assign weights to the edges in the CFG based on type and limited static
analysis which are then used for block layout.
* Once we have the final code layout eliminate some redundant jumps.
In particular turn a sequences of:
jne .foo
jmp .bar
foo:
into
je bar
foo:
..
Test Plan: ci
Reviewers: bgamari, jmct, jrtc27, simonmar, simonpj, RyanGlScott
Reviewed By: RyanGlScott
Subscribers: RyanGlScott, trommler, jmct, carter, thomie, rwbarton
GHC Trac Issues: #15124
Differential Revision: https://phabricator.haskell.org/D4726
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the first step of implementing:
https://github.com/ghc-proposals/ghc-proposals/pull/74
The main highlights/changes:
primops.txt.pp gets two new sections for two new primitive types for
signed and unsigned 8-bit integers (Int8# and Word8 respectively) along
with basic arithmetic and comparison operations. PrimRep/RuntimeRep get
two new constructors for them. All of the primops translate into the
existing MachOPs.
For CmmCalls the codegen will now zero-extend the values at call
site (so that they can be moved to the right register) and then truncate
them back their original width.
x86 native codegen needed some updates, since it wasn't able to deal
with the new widths, but all the changes are quite localized. LLVM
backend seems to just work.
This is the second attempt at merging this, after the first attempt in
D4475 had to be backed out due to regressions on i386.
Bumps binary submodule.
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
Test Plan: ./validate (on both x86-{32,64})
Reviewers: bgamari, hvr, goldfire, simonmar
Subscribers: rwbarton, carter
Differential Revision: https://phabricator.haskell.org/D5258
|
|
|
|
|
|
|
|
|
| |
This unfortunately broke i386 support since it introduced references to
byte-sized registers that don't exist on that architecture.
Reverts binary submodule
This reverts commit 5d5307f943d7581d7013ffe20af22233273fba06.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the first step of implementing:
https://github.com/ghc-proposals/ghc-proposals/pull/74
The main highlights/changes:
- `primops.txt.pp` gets two new sections for two new primitive types
for signed and unsigned 8-bit integers (`Int8#` and `Word8`
respectively) along with basic arithmetic and comparison
operations. `PrimRep`/`RuntimeRep` get two new constructors for
them. All of the primops translate into the existing `MachOP`s.
- For `CmmCall`s the codegen will now zero-extend the values at call
site (so that they can be moved to the right register) and then
truncate them back their original width.
- x86 native codegen needed some updates, since it wasn't able to deal
with the new widths, but all the changes are quite localized. LLVM
backend seems to just work.
Bumps binary submodule.
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
Test Plan: ./validate with new tests
Reviewers: hvr, goldfire, bgamari, simonmar
Subscribers: Abhiroop, dfeuer, rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4475
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Optimisation: we don't have to test the parity flag if we
know the test has already excluded the unordered case: eg >
and >= test for a zero carry flag, which can only occur for
ordered operands.
By reversing comparisons we can avoid testing the parity
for < and <= as well. This works since:
* If any of the arguments is an NaN CF gets set. Resulting in a false result.
* Since this allows us to rule out NaN we can exchange the arguments and invert the
direction of the arrows.
Test Plan: ci/nofib
Reviewers: carter, bgamari, alpmestan
Reviewed By: alpmestan
Subscribers: alpmestan, simonpj, jmct, rwbarton, thomie
GHC Trac Issues: #15196
Differential Revision: https://phabricator.haskell.org/D4990
|
|
|
|
|
|
|
|
|
|
| |
Reviewers: hvr, bgamari, simonmar, jrtc27
Reviewed By: bgamari
Subscribers: alpmestan, rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D5034
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Jump tables always point to blocks when we first generate them. However
there are rare situations where we can shortcut one of these blocks to a
static address during the asm shortcutting pass.
While we already updated the data section accordingly this patch also
extends this to the references stored in JMP_TBL.
Test Plan: ci
Reviewers: bgamari
Reviewed By: bgamari
Subscribers: thomie, carter
GHC Trac Issues: #15104
Differential Revision: https://phabricator.haskell.org/D4595
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
This change makes it possible to generate a static 32-bit relative label
offset on x86_64. Currently we can only generate word-sized label
offsets.
This will be used in D4634 to shrink info tables. See D4632 for more
details.
Test Plan: See D4632
Reviewers: bgamari, niteria, michalt, erikd, jrtc27, osa1
Subscribers: thomie, carter
Differential Revision: https://phabricator.haskell.org/D4633
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is mostly for congruence with 'subWordC#' and '{add,sub}IntC#'.
I found 'plusWord2#' while implementing this, which both lacks
documentation and has a slightly different specification than
'addWordC#', which means the generic implementation is unnecessarily
complex.
While I was at it, I also added lacking meta-information on PrimOps
and refactored 'subWordC#'s generic implementation to be branchless.
Reviewers: bgamari, simonmar, jrtc27, dfeuer
Reviewed By: bgamari, dfeuer
Subscribers: dfeuer, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4592
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is required by D4288. However, this only handles i386; we will
likely also need to do the same for PPC and SPARC, lest they break when
D4288 is re-merged.
Test Plan: Validate
Reviewers: simonmar
Reviewed By: simonmar
Subscribers: rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4362
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It's better to fall through to the likely case than to jump to it.
We optimize for this in CmmContFlowOpt when likely:False.
This commit extends the logic there to handle cases with likely:True
as well.
Test Plan: ci
Reviewers: bgamari, simonmar
Reviewed By: bgamari
Subscribers: simonmar, alexbiehl, rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4306
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds support for the bit deposit and extraction operations provided
by the BMI and BMI2 instruction set extensions on modern amd64 machines.
Implement x86 code generator for pdep and pext. Properly initialise
bmiVersion field.
pdep and pext test cases
Fix pattern match for pdep and pext instructions
Fix build of pdep and pext code for 32-bit architectures
Test Plan: Validate
Reviewers: austin, simonmar, bgamari, angerman
Reviewed By: bgamari
Subscribers: trommler, carter, angerman, thomie, rwbarton, newhoggy
GHC Trac Issues: #14206
Differential Revision: https://phabricator.haskell.org/D4236
|
|
|
|
|
|
| |
This broke the 32-bit build.
This reverts commit f5dc8ccc29429d0a1d011f62b6b430f6ae50290c.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds support for the bit deposit and extraction operations provided
by the BMI and BMI2 instruction set extensions on modern amd64 machines.
Test Plan: Validate
Reviewers: austin, simonmar, bgamari, hvr, goldfire, erikd
Reviewed By: bgamari
Subscribers: goldfire, erikd, trommler, newhoggy, rwbarton, thomie
GHC Trac Issues: #14206
Differential Revision: https://phabricator.haskell.org/D4063
|
|
|
|
|
|
|
|
|
|
|
|
| |
Depends on D4090
Reviewers: austin, bgamari, erikd, simonmar, alexbiehl
Reviewed By: bgamari
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D4091
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here we add a flag to instruct the native code generator to add
alignment checks in all info table dereferences. This is helpful in
catching pointer tagging issues.
Thanks to @jrtc27 for uncovering the tagging issues on Sparc which
inspired this flag.
Test Plan: Validate
Reviewers: simonmar, austin, erikd
Reviewed By: simonmar
Subscribers: rwbarton, trofi, thomie, jrtc27
Differential Revision: https://phabricator.haskell.org/D4101
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This switches the compiler/ component to get compiled with
-XNoImplicitPrelude and a `import GhcPrelude` is inserted in all
modules.
This is motivated by the upcoming "Prelude" re-export of
`Semigroup((<>))` which would cause lots of name clashes in every
modulewhich imports also `Outputable`
Reviewers: austin, goldfire, bgamari, alanz, simonmar
Reviewed By: bgamari
Subscribers: goldfire, rwbarton, thomie, mpickering, bgamari
Differential Revision: https://phabricator.haskell.org/D3989
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This fixes #14221, where the NCG and the DWARF code were apparently
giving two different names to the same block.
Test Plan: Validate with DWARF support enabled.
Reviewers: simonmar, austin
Subscribers: rwbarton, thomie
GHC Trac Issues: #14221
Differential Revision: https://phabricator.haskell.org/D3977
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously due to #12759 we disabled PIE support entirely. However, this
breaks the user's ability to produce PIEs. Add an explicit flag, -fPIE,
allowing the user to build PIEs.
Test Plan: Validate
Reviewers: rwbarton, austin, simonmar
Subscribers: trommler, simonmar, trofi, jrtc27, thomie
GHC Trac Issues: #12759, #13702
Differential Revision: https://phabricator.haskell.org/D3589
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GHC 8.2.1 is out, so now GHC's support window only extends back to GHC
8.0. This means we can delete gobs of code that was only used for GHC
7.10 support. Hooray!
Test Plan: ./validate
Reviewers: hvr, bgamari, austin, goldfire, simonmar
Reviewed By: bgamari
Subscribers: Phyx, rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3781
|