summaryrefslogtreecommitdiff
path: root/docs/comm/the-beast/ncg.html
blob: 5810a352126232bfe592597efad525e5a261f55e (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
  <head>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
    <title>The GHC Commentary - The Native Code Generator</title>
  </head>

  <body BGCOLOR="FFFFFF">
    <h1>The GHC Commentary - The Native Code Generator</h1>
    <p>
      On some platforms (currently x86 and PowerPC, with bitrotted
      support for Sparc and Alpha), GHC can generate assembly code
      directly, without having to go via C.  This can sometimes almost
      halve compilation time, and avoids the fragility and
      horribleness of the <a href="mangler.html">mangler</a>.  The NCG
      is enabled by default for
      non-optimising compilation on supported platforms.  For most programs
      it generates code which runs only 1-3% slower
      (depending on platform and type of code) than that
      created by gcc on x86s, so it is well worth using even with
      optimised compilation.  FP-intensive x86 programs see a bigger
      slowdown, and all Sparc code runs about 5% slower due to
      us not filling branch delay slots.
    <p>
      The NCG has always been something of a second-class citizen
      inside GHC, an unloved child, rather.  This means that its
      integration into the compiler as a whole is rather clumsy, which
      brings some problems described below.  That apart, the NCG
      proper is fairly cleanly designed, as target-independent as it
      reasonably can be, and so should not be difficult to retarget.
    <p>
      <b>NOTE!</b> The native code generator was largely rewritten as part
      of the C-- backend changes, around May 2004.  Unfortunately the
      rest of this document still refers to the old version, and was written
      with relation to the CVS head as of end-Jan 2002.  Some of it is relevant,
      some of it isn't.

    <h2>Overview</h2>
      The top-level code generator fn is
    <p>
    <code>absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc)</code>
    <p>
    The returned <code>SDoc</code> is for debugging, so is empty unless
    you specify <code>-ddump-stix</code>.  The <code>Pretty.Doc</code>
    bit is the final assembly code.  Translation involves three main
    phases, the first and third of which are target-independent.
    <ul>
    <li><b>Translation into the <code>Stix</code> representation.</b>  Stix
        is a simple tree-like RTL-style language, in which you can
        mention:
        <p>
        <ul>
        <li>An infinite number of temporary, virtual registers.
        <li>The STG "magic" registers (<code>MagicId</code>), such as
            the heap and stack pointers.
        <li>Literals and low-level machine ops (<code>MachOp</code>).
        <li>Simple address computations.
        <li>Reads and writes of: memory, virtual regs, and various STG
            regs. 
        <li>Labels and <code>if ... goto ...</code> style control-flow.
        </ul>
        <p>
        Stix has two main associated types:
        <p>
        <ul>
        <li><code>StixStmt</code> -- trees executed for their side
        effects: assignments, control transfers, and auxiliary junk
        such as segment changes and literal data.
        <li><code>StixExpr</code> -- trees which denote a value.
        </ul>
        <p>
        Translation into Stix is almost completely
        target-independent.  Needed dependencies are knowledge of
        word size and endianness, used when generating code to do 
        deal with half-word fields in info tables.  This could be
        abstracted out easily enough.  Also, the Stix translation
        needs to know which <code>MagicId</code>s map to registers
        on the given target, and which are stored in offsets from
        <code>BaseReg</code>.
        <p>
        After initial Stix generation, the trees are cleaned up with
        constant-folding and a little copy-propagation ("Stix
        inlining", as the code misleadingly calls it).  We take
        the opportunity to translate <code>MagicId</code>s which are
        stored in memory on the given target, into suitable memory
        references.  Those which are stored in registers are left
        alone.  There is also a half-hearted attempt to lift literal
        strings to the top level in cases where nested strings have
        been observed to give incorrect code in the past.
        <p>
        Primitive machine-level operations will already be phrased in
        terms of <code>MachOp</code>s in the presented Abstract C, and
        these are passed through unchanged.  We comment only that the
        <code>MachOp</code>s have been chosen so as to be easy to
        implement on all targets, and their meaning is intended to be
        unambiguous, and the same on all targets, regardless of word
        size or endianness.
        <p>
        <b>A note on <code>MagicId</code>s.</b>
        Those which are assigned to
        registers on the current target are left unmodified.  Those
        which are not are stored in memory as offsets from
        <code>BaseReg</code> (which is assumed to permanently have the
        value <code>(&MainCapability.r)</code>), so the constant folder
        calculates the offsets and inserts suitable loads/stores.  One
        complication is that not all archs have <code>BaseReg</code>
        itself in a register, so for those (sparc), we instead
        generate the address as an offset from the static symbol
        <code>MainCapability</code>, since the register table lives in
        there.
        <p>
        Finally, <code>BaseReg</code> does occasionally itself get
        mentioned in Stix expression trees, and in this case what is
        denoted is precisely <code>(&MainCapability.r)</code>, not, as
        in all other cases, the value of memory at some offset from
        the start of the register table.  Since what it denotes is an
        r-value and not an l-value, assigning <code>BaseReg</code> is
        meaningless, so the machinery checks to ensure this never
        happens.  All these details are taken into account by the
        constant folder.
    <p>
    <li><b>Instruction selection.</b>  This is the only majorly
        target-specific phase.  It turns Stix statements and
        expressions into sequences of <code>Instr</code>, a data
        type which is different for each architecture.
        <code>Instr</code>, unsurprisingly, has various supporting
        types, such as <code>Reg</code>, <code>Operand</code>,
        <code>Imm</code>, etc.  The generated instructions may refer
        to specific machine registers, or to arbitrary virtual
        registers, either those created within the instruction
        selector, or those mentioned in the Stix passed to it.
        <p>
        The instruction selectors live in <code>MachCode.lhs</code>.
        The core functions, for each target, are:
        <p>
        <code>
        getAmode :: StixExpr -> NatM Amode
        <br>getRegister :: StixExpr -> NatM Register
        <br>assignMem_IntCode :: PrimRep -> StixExpr -> StixExpr -> NatM InstrBlock
        <br>assignReg_IntCode :: PrimRep -> StixReg  -> StixExpr -> NatM InstrBlock
        </code>
        <p>
        The insn selectors use the "maximal munch" algorithm.  The
        bizarrely-misnamed <code>getRegister</code> translates
        expressions.  A simplified version of its type is:
        <p>
        <code>getRegister :: StixExpr -> NatM (OrdList Instr, Reg)</code>
        <p>
        That is: it (monadically) turns a <code>StixExpr</code> into a
        sequence of instructions, and a register, with the meaning
        that after executing the (possibly empty) sequence of
        instructions, the (possibly virtual) register will
        hold the resulting value.  The real situation is complicated
        by the presence of fixed registers, and is detailed below.
        <p>
        Maximal munch is a greedy algorithm and is known not to give
        globally optimal code sequences, but it is good enough, and
        fast and simple.  Early incarnations of the NCG used something
        more sophisticated, but that is long gone now.
        <p>
        Similarly, <code>getAmode</code> translates a value, intended
        to denote an address, into a sequence of insns leading up to
        a (processor-specific) addressing mode.  This stuff could be
        done using the general <code>getRegister</code> selector, but
        would necessarily generate poorer code, because the calculated
        address would be forced into a register, which might be
        unnecessary if it could partially or wholly be calculated
        using an addressing mode.
        <p>
        Finally, <code>assignMem_IntCode</code> and
        <code>assignReg_IntCode</code> create instruction sequences to
        calculate a value and store it in the given register, or at
        the given address.  Because these guys translate a statement,
        not a value, they just return a sequence of insns and no
        associated register.  Floating-point and 64-bit integer
        assignments have analogous selectors.
        <p>
        Apart from the complexities of fixed vs floating registers,
        discussed below, the instruction selector is as simple
        as it can be.  It looks long and scary but detailed
        examination reveals it to be fairly straightforward.
    <p>
    <li><b>Register allocation.</b> The register allocator,
        <code>AsmRegAlloc.lhs</code> takes sequences of
        <code>Instr</code>s which mention a mixture of real and
        virtual registers, and returns a modified sequence referring
        only to real ones.  It is gloriously and entirely
        target-independent.  Well, not exactly true.  Instead it
        regards <code>Instr</code> (instructions) and <code>Reg</code>
        (virtual and real registers) as abstract types, to which it has
        the following interface:
        <p>
        <code>
        insnFuture :: Instr -> InsnFuture
        <br>regUsage :: Instr -> RegUsage
        <br>patchRegs :: Instr -> (Reg -> Reg) -> Instr
        </code>
        <p>
        <code>insnFuture</code> is used to (re)construct the graph of
        all possible control transfers between the insns to be
        allocated.  <code>regUsage</code> returns the sets of registers
        read and written by an instruction.  And
        <code>patchRegs</code> is used to apply the allocator's final
        decision on virtual-to-real reg mapping to an instruction.
        <p>
        Clearly these 3 fns have to be written anew for each
        architecture.  They are defined in
        <code>RegAllocInfo.lhs</code>.  Think twice, no, thrice,
        before modifying them: making false claims about insn
        behaviour will lead to hard-to-find register allocation
        errors.
        <p>
        <code>AsmRegAlloc.lhs</code> contains detailed comments about
	how the allocator works.  Here is a summary.  The head honcho
        <p>
        <code>allocUsingTheseRegs :: [Instr] -> [Reg] -> (Bool, [Instr])</code>
        <p>
        takes a list of instructions and a list of real registers
        available for allocation, and maps as many of the virtual regs
        in the input into real ones as it can.  The returned
        <code>Bool</code> indicates whether or not it was
        successful.  If so, that's the end of it.  If not, the caller
        of <code>allocUsingTheseRegs</code> will attempt spilling.
        More of that later.  What <code>allocUsingTheseRegs</code>
        does is:
        <p>
        <ul>
        <li>Implicitly number each instruction by its position in the
            input list.
        <p>
        <li>Using <code>insnFuture</code>, create the set of all flow
            edges -- possible control transfers -- within this set of
            insns.
        <p>
        <li>Using <code>regUsage</code> and iterating around the flow
            graph from the previous step, calculate, for each virtual
            register, the set of flow edges on which it is live.
        <p>
        <li>Make a real-register committment map, which gives the set
            of edges for which each real register is committed (in
            use).  These sets are initially empty.  For each virtual
            register, attempt to find a real register whose current
            committment does not intersect that of the virtual
            register -- ie, is uncommitted on all edges that the
            virtual reg is live.  If successful, this means the vreg
            can be assigned to the realreg, so add the vreg's set to
            the realreg's committment.  
        <p>
        <li>If all the vregs were assigned to a realreg, use
            <code>patchInstr</code> to apply the mapping to the insns themselves.
        </ul>
        <p>
        <b>Spilling</b>
        <p>
        If <code>allocUsingTheseRegs</code> fails, a baroque
        mechanism comes into play.  We now know that much simpler
        schemes are available to do the same thing and give better
        results.  
        Anyways:
        <p>
        The logic above <code>allocUsingTheseRegs</code>, in
        <code>doGeneralAlloc</code> and <code>runRegAllocate</code>,
        observe that allocation has failed with some set R of real
        registers.  So they apply <code>runRegAllocate</code> a second
        time to the code, but remove (typically) two registers from R
        before doing so.  This naturally fails too, but returns a
        partially-allocated sequence.  <code>doGeneralAlloc</code>
        then inserts spill code into the sequence, and finally re-runs
        <code>allocUsingTheseRegs</code>, but supplying the original,
        unadulterated R.  This is guaranteed to succeed since the two
        registers previously removed from R are sufficient to allocate
        all the spill/restore instructions added.
        <p>
        Because x86 is very short of registers, and in the worst case
        needs three removed from R, a softly-softly approach is used.
        <code>doGeneralAlloc</code> first tries with zero regs removed
        from R, then if that fails one, then two, etc.  This means
        <code>allocUsingTheseRegs</code> may get run several times
        before a successful arrangement is arrived at.
        <code>findReservedRegs</code> cooks up the sets of spill
        registers to try with.
        <p>
        The resulting machinery is complicated and the generated spill
        code is appalling.  The saving grace is that spills are very
        rare so it doesn't matter much.  I did not invent this -- I inherited it.
        <p>
        <b>Dealing with common cases fast</b>
        <p>
        The entire reg-alloc mechanism described so far is general and
        correct, but expensive overkill for many simple code blocks.
        So to begin with we use
        <code>doSimpleAlloc</code>, which attempts to do something
        simple.  It exploits the observation that if the total number
        of virtual registers does not exceed the number of real ones
        available, we can simply dole out a new realreg each time we
        see mention of a new vreg, with no regard for control flow.
        <code>doSimpleAlloc</code> therefore attempts this in a
        single pass over the code.  It gives up if it runs out of real
        regs or sees any condition which renders the above observation
        invalid (fixed reg uses, for example).  
        <p>
        This clever hack handles the majority of code blocks quickly.
        It was copied from the previous reg-allocator (the
        Mattson/Partain/Marlow/Gill one).
    </ul>

<p>
<h2>Complications, observations, and possible improvements</h2>

<h3>Real vs virtual registers in the instruction selectors</h3>

The instruction selectors for expression trees, namely
<code>getRegister</code>, are complicated by the fact that some 
expressions can only be computed into a specific register, whereas
the majority can be computed into any register.  We take x86 as an
example, but the problem applies to all archs.
<p>
Terminology: <em>rreg</em> means real register, a real machine
register.  <em>vreg</em> means one of an infinite set of virtual
registers.  The type <code>Reg</code> is the sum of <em>rreg</em> and
<em>vreg</em>.  The instruction selector generates sequences with
unconstrained use of vregs, leaving the register allocator to map them
all into rregs.
<p>
Now, where was I ?  Oh yes.  We return to the type of
<code>getRegister</code>, which despite its name, selects instructions
to compute the value of an expression tree.  
<pre>
   getRegister :: StixExpr -> NatM Register

   data Register
     = Fixed   PrimRep Reg InstrBlock
     | Any     PrimRep (Reg -> InstrBlock)

   type InstrBlock -- sequence of instructions
</pre>
At first this looks eminently reasonable (apart from the stupid
name).  <code>getRegister</code>, and nobody else, knows whether or
not a given expression has to be computed into a fixed rreg or can be
computed into any rreg or vreg.  In the first case, it returns
<code>Fixed</code> and indicates which rreg the result is in.  In the
second case it defers committing to any specific target register by
returning a function from <code>Reg</code> to <code>InstrBlock</code>,
and the caller can specify the target reg as it sees fit.
<p>
Unfortunately, that forces <code>getRegister</code>'s callers (usually
itself) to use a clumsy and confusing idiom in the common case where
they do not care what register the result winds up in.  The reason is
that although a value might be computed into a fixed rreg, we are
forbidden (on pain of segmentation fault :) from subsequently
modifying the fixed reg.  This and other rules are record in "Rules of
the game" inside <code>MachCode.lhs</code>.
<p>
Why can't fixed registers be modified post-hoc?  Consider a simple
expression like <code>Hp+1</code>.  Since the heap pointer
<code>Hp</code> is definitely in a fixed register, call it R,
<code>getRegister</code> on subterm <code>Hp</code> will simply return
<code>Fixed</code> with an empty sequence and R.  But we can't just
emit an increment instruction for R, because that trashes
<code>Hp</code>; instead we first have to copy it into a fresh vreg
and increment that.
<p>
With all that in mind, consider now writing a <code>getRegister</code>
clause for terms of the form <code>(1 + E)</code>.  Contrived, yes,
but illustrates the matter.  First we do
<code>getRegister</code> on E.  Now we are forced to examine what 
comes back.
<pre>
   getRegister (OnePlus e)
      = getRegister e           `thenNat`   \ e_result ->
        case e_result of
           Fixed e_code e_fixed 
              -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
           Any e_any 
              -> Any (\dst -> e_any dst ++ [INC dst])
</pre>
This seems unreasonably cumbersome, yet the instruction selector is
full of such idioms.  A good example of the complexities induced by
this scheme is shown by <code>trivialCode</code> for x86 in
<code>MachCode.lhs</code>.  This deals with general integer dyadic
operations on x86 and has numerous cases.  It was difficult to get
right.
<p>
An alternative suggestion is to simplify the type of
<code>getRegister</code> to this:
<pre>
   getRegister :: StixExpr -> NatM (InstrBloc, VReg)
   type VReg = .... a vreg ...
</pre>
and then we could safely write
<pre>
   getRegister (OnePlus e)
      = getRegister e        `thenNat`  \ (e_code, e_vreg) ->
        returnNat (e_code ++ [INC e_vreg], e_vreg)
</pre>
which is about as straightforward as you could hope for.
Unfortunately, it requires <code>getRegister</code> to insert moves of
values which naturally compute into an rreg, into a vreg.  Consider:
<pre>
   1 + ccall some-C-fn
</pre>
On x86 the ccall result is returned in rreg <code>%eax</code>.  The
resulting sequence, prior to register allocation, would be:
<pre>
   # push args
   call some-C-fn
   # move %esp to nuke args
   movl   %eax, %vreg
   incl   %vreg
</pre>
If, as is likely, <code>%eax</code> is not held live beyond this point
for any other purpose, the move into a fresh register is pointless;
we'd have been better off leaving the value in <code>%eax</code> as
long as possible.
<p>
The simplified <code>getRegister</code> story is attractive.  It would
clean up the instruction selectors significantly and make it simpler
to write new ones.  The only drawback is that it generates redundant
register moves.  I suggest that eliminating these should be the job
of the register allocator.  Indeed:
<ul>
<li>There has been some work on this already ("Iterated register
    coalescing" ?), so this isn't a new idea.
<p>
<li>You could argue that the existing scheme inappropriately blurs the
    boundary between the instruction selector and the register
    allocator.  The instruction selector should .. well .. just
    select instructions, without having to futz around worrying about
    what kind of registers subtrees get generated into.  Register
    allocation should be <em>entirely</em> the domain of the register
    allocator, with the proviso that it should endeavour to allocate
    registers so as to minimise the number of non-redundant reg-reg
    moves in the final output.
</ul>


<h3>Selecting insns for 64-bit values/loads/stores on 32-bit platforms</h3>

Note that this stuff doesn't apply on 64-bit archs, since the
<code>getRegister</code> mechanism applies there.

The relevant functions are:
<pre>
   assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
   assignReg_I64Code :: StixReg  -> StixExpr -> NatM InstrBlock
   iselExpr64        :: StixExpr -> NatM ChildCode64

   data ChildCode64     -- a.k.a "Register64"
      = ChildCode64 
           InstrBlock   -- code
           VRegUnique   -- unique for the lower 32-bit temporary
</pre>
<code>iselExpr64</code> is the 64-bit, plausibly-named analogue of
<code>getRegister</code>, and <code>ChildCode64</code> is the analogue
of <code>Register</code>.  The aim here was to generate working 64
bit code as simply as possible.  To this end, I used the 
simplified <code>getRegister</code> scheme described above, in which
<code>iselExpr64</code>generates its results into two vregs which
can always safely be modified afterwards.
<p>
Virtual registers are, unsurprisingly, distinguished by their
<code>Unique</code>s.  There is a small difficulty in how to
know what the vreg for the upper 32 bits of a value is, given the vreg
for the lower 32 bits.  The simple solution adopted is to say that
any low-32 vreg may also have a hi-32 counterpart which shares the
same unique, but is otherwise regarded as a separate entity.
<code>getHiVRegFromLo</code> gets one from the other.
<pre>
   data VRegUnique
      = VRegUniqueLo Unique          -- lower part of a split quantity
      | VRegUniqueHi Unique          -- upper part thereof
</pre>
Apart from that, 64-bit code generation is really simple.  The sparc
and x86 versions are almost copy-n-pastes of each other, with minor
adjustments for endianness.  The generated code isn't wonderful but
is certainly acceptable, and it works.



<h3>Shortcomings and inefficiencies in the register allocator</h3>

<h4>Redundant reconstruction of the control flow graph</h4>

The allocator goes to considerable computational expense to construct
all the flow edges in the group of instructions it's allocating for,
by using the <code>insnFuture</code> function in the
<code>Instr</code> pseudo-abstract type.
<p>
This is really silly, because all that information is present at the
abstract C stage, but is thrown away in the translation to Stix.
So a good thing to do is to modify that translation to
produce a directed graph of Stix straight-line code blocks,
and to preserve that structure through the insn selector, so the
allocator can see it.  
<p>
This would eliminate the fragile, hacky, arch-specific
<code>insnFuture</code> mechanism, and probably make the whole
compiler run measurably faster.  Register allocation is a fair chunk
of the time of non-optimising compilation (10% or more), and
reconstructing the flow graph is an expensive part of reg-alloc.
It would probably accelerate the vreg liveness computation too.

<h4>Really ridiculous method for doing spilling</h4>

This is a more ambitious suggestion, but ... reg-alloc should be
reimplemented, using the scheme described in "Quality and speed in
linear-scan register allocation."  (Traub?)  For straight-line code
blocks, this gives an elegant one-pass algorithm for assigning
registers and creating the minimal necessary spill code, without the
need for reserving spill registers ahead of time.
<p>
I tried it in Rigr, replacing the previous spiller which used the
current GHC scheme described above, and it cut the number of spill
loads and stores by a factor of eight.  Not to mention being simpler,
easier to understand and very fast.
<p>
The Traub paper also describes how to extend their method to multiple
basic blocks, which will be needed for GHC.  It comes down to
reconciling multiple vreg-to-rreg mappings at points where control
flow merges.

<h4>Redundant-move support for revised instruction selector suggestion</h4>

As mentioned above, simplifying the instruction selector will require
the register allocator to try and allocate source and destination
vregs to the same rreg in reg-reg moves, so as to make as many as
possible go away.  Without that, the revised insn selector would
generate worse code than at present.  I know this stuff has been done
but know nothing about it.  The Linear-scan reg-alloc paper mentioned
above does indeed mention a bit about it in the context of single
basic blocks, but I don't know if that's sufficient.


   
<h3>x86 arcana that you should know about</h3>

The main difficulty with x86 is that many instructions have fixed
register constraints, which can occasionally make reg-alloc fail
completely.  And the FPU doesn't have the flat register model which
the reg-alloc abstraction (implicitly) assumes.
<p>
Our strategy is: do a good job for the common small subset, that is
integer loads, stores, address calculations, basic ALU ops (+, -,
and, or, xor), and jumps.  That covers the vast majority of 
executed insns.  And indeed we do do a good job, with a loss of
less than 2% compared with gcc.
<p>
Initially we tried to handle integer instructions with awkward
register constraints (mul, div, shifts by non-constant amounts) via
various jigglings of the spiller et al.  This never worked robustly,
and putting platform-specific tweaks in the generic infrastructure is
a big No-No.  (Not quite true; shifts by a non-constant amount are
still done by a giant kludge, and should be moved into this new
framework.)
<p>
Fortunately, all such insns are rare.  So the current scheme is to 
pretend that they don't have any such constraints.  This fiction is
carried all the way through the register allocator.  When the insn
finally comes to be printed, we emit a sequence which copies the
operands through memory (<code>%esp</code>-relative), satisfying the
constraints of the real instruction.  This localises the gruesomeness
to just one place.  Here, for example, is the code generated for
integer divison of <code>%esi</code> by <code>%ecx</code>:
<pre>
   # BEGIN IQUOT %ecx, %esi
   pushl $0
   pushl %eax  
   pushl %edx
   pushl %ecx
   movl  %esi,% eax
   cltd
   idivl 0(%esp)
   movl %eax, 12(%esp)
   popl %edx  
   popl %edx
   popl %eax
   popl %esi
   # END   IQUOT %ecx, %esi
</pre>
This is not quite as appalling as it seems, if you consider that the
division itself typically takes 16+ cycles, whereas the rest of the
insns probably go through in about 1 cycle each.
<p>
This trick is taken to extremes for FP operations.
<p>
All notions of the x86 FP stack and its insns have been removed.
Instead, we pretend, to the instruction selector and register
allocator, that x86 has six floating point registers,
<code>%fake0</code> .. <code>%fake5</code>, which can be used in the
usual flat manner.  We further claim that x86 has floating point
instructions very similar to SPARC and Alpha, that is, a simple
3-operand register-register arrangement.  Code generation and register
allocation proceed on this basis.
<p>
When we come to print out the final assembly, our convenient fiction
is converted to dismal reality.  Each fake instruction is
independently converted to a series of real x86 instructions.
<code>%fake0</code> .. <code>%fake5</code> are mapped to
<code>%st(0)</code> .. <code>%st(5)</code>.  To do reg-reg arithmetic
operations, the two operands are pushed onto the top of the FP stack,
the operation done, and the result copied back into the relevant
register.  When one of the operands is also the destination, we emit a
slightly less scummy translation.  There are only six
<code>%fake</code> registers because 2 are needed for the translation,
and x86 has 8 in total.
<p>
The translation is inefficient but is simple and it works.  A cleverer
translation would handle a sequence of insns, simulating the FP stack
contents, would not impose a fixed mapping from <code>%fake</code> to
<code>%st</code> regs, and hopefully could avoid most of the redundant
reg-reg moves of the current translation.
<p>
There are, however, two unforeseen bad side effects:
<ul>
<li>This doesn't work properly, because it doesn't observe the normal
    conventions for x86 FP code generation.  It turns out that each of
    the 8 elements in the x86 FP register stack has a tag bit which
    indicates whether or not that register is notionally in use or
    not.  If you do a FPU operation which happens to read a
    tagged-as-empty register, you get an x87 FPU (stack invalid)
    exception, which is normally handled by the FPU without passing it
    to the OS: the program keeps going, but the resulting FP values
    are garbage.  The OS can ask for the FPU to pass it FP
    stack-invalid exceptions, but it usually doesn't.
    <p>
    Anyways: inside NCG created x86 FP code this all works fine.
    However, the NCG's fiction of a flat register set does not operate
    the x87 register stack in the required stack-like way.  When
    control returns to a gcc-generated world, the stack tag bits soon
    cause stack exceptions, and thus garbage results.
    <p>
    The only fix I could think of -- and it is horrible -- is to clear
    all the tag bits just before the next STG-level entry, in chunks
    of code which use FP insns.  <code>i386_insert_ffrees</code>
    inserts the relevant <code>ffree</code> insns into such code
    blocks.  It depends critically on <code>is_G_instr</code> to
    detect such blocks.
<p>
<li>It's very difficult to read the generated assembly and
    reason about it when debugging, because there's so much clutter.
    We print the fake insns as comments in the output, and that helps
    a bit. 
</ul>



<h3>Generating code for ccalls</h3>

For reasons I don't really understand, the instruction selectors for
generating calls to C (<code>genCCall</code>) have proven surprisingly
difficult to get right, and soaked up a lot of debugging time.  As a
result, I have once again opted for schemes which are simple and not
too difficult to argue as correct, even if they don't generate
excellent code.
<p>
The sparc ccall generator in particular forces all arguments into
temporary virtual registers before moving them to the final
out-registers (<code>%o0</code> .. <code>%o5</code>).  This creates
some unnecessary reg-reg moves.  The reason is explained in a
comment in the code.


<h3>Duplicate implementation for many STG macros</h3>

This has been discussed at length already.  It has caused a couple of
nasty bugs due to subtle untracked divergence in the macro
translations.  The macro-expander really should be pushed up into the
Abstract C phase, so the problem can't happen.
<p>
Doing so would have the added benefit that the NCG could be used to
compile more "ways" -- well, at least the 'p' profiling way.


<h3>How to debug the NCG without losing your sanity/hair/cool</h3>

Last, but definitely not least ...
<p>
The usual syndrome is that some program, when compiled via C, works,
but not when compiled via the NCG.  Usually the problem is fairly 
simple to fix, once you find the specific code block which has been
mistranslated.  But the latter can be nearly impossible, since most
modules generate at least hundreds and often thousands of them.
<p>
My solution: cheat.
<p>
Because the via-C and native routes diverge only late in the day,
it is not difficult to construct a 1-1 correspondence between basic
blocks on the two routes.  So, if the program works via C but not on
the NCG, do the following:
<ul>
<li>Recompile <code>AsmCodeGen.lhs</code> in the afflicted compiler
    with <code>-DDEBUG_NCG</code>, so that it inserts
    <code>___ncg_debug_marker</code>s
    into the assembly it emits.
<p>
<li>Using a binary search on modules, find the module which is causing
    the problem.
<p>
<li>Compile that module to assembly code, with identical flags, twice,
    once via C and once via NCG.
    Call the outputs <code>ModuleName.s-gcc</code> and
    <code>ModuleName.s-nat</code>.  Check that the latter does indeed have 
    <code>___ncg_debug_marker</code>s in it; otherwise the next steps fail.
<p>
<li>Build (with a working compiler) the program
    <code>fptools/ghc/utils/debugNCG/diff_gcc_nat</code>.
<p>
<li>Run: <code>diff_gcc_nat ModuleName.s</code>.  This will
    construct the 1-1 correspondence, and emits on stdout 
    a cppable assembly output.  Place this in a file -- I always
    call it <code>synth.S</code>.  Note, the capital S is important;
    otherwise it won't get cpp'd.  You can feed this file directly to
    ghc and it will automatically get cpp'd; you don't have to do so
    yourself.
<p>
<li>By messing with the <code>#define</code>s at the top of
    <code>synth.S</code>, do a binary search to find the incorrect
    block.  Keep a careful record of where you are in the search; it
    is easy to get confused.  Remember also that multiple blocks may
    be wrong, which also confuses matters.  Finally, I usually start
    off by re-checking that I can build the executable with all the
    <code>#define</code>s set to 0 and then all to 1.  This ensures
    you won't get halfway through the search and then get stuck due to
    some snafu with gcc-specific literals.  Usually I set
    <code>UNMATCHED_GCC</code> to 1 all the time, and this bit should
    contain only literal data.
    <code>UNMATCHED_NAT</code> should be empty.
</ul>
<p>
<code>diff_gcc_nat</code> was known to work correctly last time I used
it, in December 01, for both x86 and sparc.  If it doesn't work, due
to changes in assembly syntax, or whatever, make it work.  The
investment is well worth it.  Searching for the incorrect block(s) any
other way is a total time waster.



</ul>




    <p><small>
<!-- hhmts start -->
Last modified: Fri Feb  1 16:14:11 GMT 2002
<!-- hhmts end -->
    </small>
  </body>
</html>