1 files changed, 0 insertions, 749 deletions
diff --git a/docs/comm/the-beast/ncg.html b/docs/comm/the-beast/ncg.html
deleted file mode 100644
index 84beac2d51..0000000000
--- a/docs/comm/the-beast/ncg.html
+++ /dev/null
@@ -1,749 +0,0 @@
-<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
-<html>
-  <head>
-    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
-    <title>The GHC Commentary - The Native Code Generator</title>
-  </head>
-
-  <body BGCOLOR="FFFFFF">
-    <h1>The GHC Commentary - The Native Code Generator</h1>
-    <p>
-      On some platforms (currently x86 and PowerPC, with bitrotted
-      support for Sparc and Alpha), GHC can generate assembly code
-      directly, without having to go via C.  This can sometimes almost
-      halve compilation time, and avoids the fragility and
-      horribleness of the <a href="mangler.html">mangler</a>.  The NCG
-      is enabled by default for
-      non-optimising compilation on supported platforms.  For most programs
-      it generates code which runs only 1-3% slower
-      (depending on platform and type of code) than that
-      created by gcc on x86s, so it is well worth using even with
-      optimised compilation.  FP-intensive x86 programs see a bigger
-      slowdown, and all Sparc code runs about 5% slower due to
-      us not filling branch delay slots.
-    <p>
-      The NCG has always been something of a second-class citizen
-      inside GHC, an unloved child, rather.  This means that its
-      integration into the compiler as a whole is rather clumsy, which
-      brings some problems described below.  That apart, the NCG
-      proper is fairly cleanly designed, as target-independent as it
-      reasonably can be, and so should not be difficult to retarget.
-    <p>
-      <b>NOTE!</b> The native code generator was largely rewritten as part
-      of the C-- backend changes, around May 2004.  Unfortunately the
-      rest of this document still refers to the old version, and was written
-      with relation to the CVS head as of end-Jan 2002.  Some of it is relevant,
-      some of it isn't.
-
-    <h2>Overview</h2>
-      The top-level code generator fn is
-    <p>
-    <code>absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc)</code>
-    <p>
-    The returned <code>SDoc</code> is for debugging, so is empty unless
-    you specify <code>-ddump-stix</code>.  The <code>Pretty.Doc</code>
-    bit is the final assembly code.  Translation involves three main
-    phases, the first and third of which are target-independent.
-    <ul>
-    <li><b>Translation into the <code>Stix</code> representation.</b>  Stix
-        is a simple tree-like RTL-style language, in which you can
-        mention:
-        <p>
-        <ul>
-        <li>An infinite number of temporary, virtual registers.
-        <li>The STG "magic" registers (<code>MagicId</code>), such as
-            the heap and stack pointers.
-        <li>Literals and low-level machine ops (<code>MachOp</code>).
-        <li>Simple address computations.
-        <li>Reads and writes of: memory, virtual regs, and various STG
-            regs. 
-        <li>Labels and <code>if ... goto ...</code> style control-flow.
-        </ul>
-        <p>
-        Stix has two main associated types:
-        <p>
-        <ul>
-        <li><code>StixStmt</code> -- trees executed for their side
-        effects: assignments, control transfers, and auxiliary junk
-        such as segment changes and literal data.
-        <li><code>StixExpr</code> -- trees which denote a value.
-        </ul>
-        <p>
-        Translation into Stix is almost completely
-        target-independent.  Needed dependencies are knowledge of
-        word size and endianness, used when generating code to do 
-        deal with half-word fields in info tables.  This could be
-        abstracted out easily enough.  Also, the Stix translation
-        needs to know which <code>MagicId</code>s map to registers
-        on the given target, and which are stored in offsets from
-        <code>BaseReg</code>.
-        <p>
-        After initial Stix generation, the trees are cleaned up with
-        constant-folding and a little copy-propagation ("Stix
-        inlining", as the code misleadingly calls it).  We take
-        the opportunity to translate <code>MagicId</code>s which are
-        stored in memory on the given target, into suitable memory
-        references.  Those which are stored in registers are left
-        alone.  There is also a half-hearted attempt to lift literal
-        strings to the top level in cases where nested strings have
-        been observed to give incorrect code in the past.
-        <p>
-        Primitive machine-level operations will already be phrased in
-        terms of <code>MachOp</code>s in the presented Abstract C, and
-        these are passed through unchanged.  We comment only that the
-        <code>MachOp</code>s have been chosen so as to be easy to
-        implement on all targets, and their meaning is intended to be
-        unambiguous, and the same on all targets, regardless of word
-        size or endianness.
-        <p>
-        <b>A note on <code>MagicId</code>s.</b>
-        Those which are assigned to
-        registers on the current target are left unmodified.  Those
-        which are not are stored in memory as offsets from
-        <code>BaseReg</code> (which is assumed to permanently have the
-        value <code>(&MainCapability.r)</code>), so the constant folder
-        calculates the offsets and inserts suitable loads/stores.  One
-        complication is that not all archs have <code>BaseReg</code>
-        itself in a register, so for those (sparc), we instead
-        generate the address as an offset from the static symbol
-        <code>MainCapability</code>, since the register table lives in
-        there.
-        <p>
-        Finally, <code>BaseReg</code> does occasionally itself get
-        mentioned in Stix expression trees, and in this case what is
-        denoted is precisely <code>(&MainCapability.r)</code>, not, as
-        in all other cases, the value of memory at some offset from
-        the start of the register table.  Since what it denotes is an
-        r-value and not an l-value, assigning <code>BaseReg</code> is
-        meaningless, so the machinery checks to ensure this never
-        happens.  All these details are taken into account by the
-        constant folder.
-    <p>
-    <li><b>Instruction selection.</b>  This is the only majorly
-        target-specific phase.  It turns Stix statements and
-        expressions into sequences of <code>Instr</code>, a data
-        type which is different for each architecture.
-        <code>Instr</code>, unsurprisingly, has various supporting
-        types, such as <code>Reg</code>, <code>Operand</code>,
-        <code>Imm</code>, etc.  The generated instructions may refer
-        to specific machine registers, or to arbitrary virtual
-        registers, either those created within the instruction
-        selector, or those mentioned in the Stix passed to it.
-        <p>
-        The instruction selectors live in <code>MachCode.lhs</code>.
-        The core functions, for each target, are:
-        <p>
-        <code>
-        getAmode :: StixExpr -> NatM Amode
-        <br>getRegister :: StixExpr -> NatM Register
-        <br>assignMem_IntCode :: PrimRep -> StixExpr -> StixExpr -> NatM InstrBlock
-        <br>assignReg_IntCode :: PrimRep -> StixReg  -> StixExpr -> NatM InstrBlock
-        </code>
-        <p>
-        The insn selectors use the "maximal munch" algorithm.  The
-        bizarrely-misnamed <code>getRegister</code> translates
-        expressions.  A simplified version of its type is:
-        <p>
-        <code>getRegister :: StixExpr -> NatM (OrdList Instr, Reg)</code>
-        <p>
-        That is: it (monadically) turns a <code>StixExpr</code> into a
-        sequence of instructions, and a register, with the meaning
-        that after executing the (possibly empty) sequence of
-        instructions, the (possibly virtual) register will
-        hold the resulting value.  The real situation is complicated
-        by the presence of fixed registers, and is detailed below.
-        <p>
-        Maximal munch is a greedy algorithm and is known not to give
-        globally optimal code sequences, but it is good enough, and
-        fast and simple.  Early incarnations of the NCG used something
-        more sophisticated, but that is long gone now.
-        <p>
-        Similarly, <code>getAmode</code> translates a value, intended
-        to denote an address, into a sequence of insns leading up to
-        a (processor-specific) addressing mode.  This stuff could be
-        done using the general <code>getRegister</code> selector, but
-        would necessarily generate poorer code, because the calculated
-        address would be forced into a register, which might be
-        unnecessary if it could partially or wholly be calculated
-        using an addressing mode.
-        <p>
-        Finally, <code>assignMem_IntCode</code> and
-        <code>assignReg_IntCode</code> create instruction sequences to
-        calculate a value and store it in the given register, or at
-        the given address.  Because these guys translate a statement,
-        not a value, they just return a sequence of insns and no
-        associated register.  Floating-point and 64-bit integer
-        assignments have analogous selectors.
-        <p>
-        Apart from the complexities of fixed vs floating registers,
-        discussed below, the instruction selector is as simple
-        as it can be.  It looks long and scary but detailed
-        examination reveals it to be fairly straightforward.
-    <p>
-    <li><b>Register allocation.</b> The register allocator,
-        <code>AsmRegAlloc.lhs</code> takes sequences of
-        <code>Instr</code>s which mention a mixture of real and
-        virtual registers, and returns a modified sequence referring
-        only to real ones.  It is gloriously and entirely
-        target-independent.  Well, not exactly true.  Instead it
-        regards <code>Instr</code> (instructions) and <code>Reg</code>
-        (virtual and real registers) as abstract types, to which it has
-        the following interface:
-        <p>
-        <code>
-        insnFuture :: Instr -> InsnFuture
-        <br>regUsage :: Instr -> RegUsage
-        <br>patchRegs :: Instr -> (Reg -> Reg) -> Instr
-        </code>
-        <p>
-        <code>insnFuture</code> is used to (re)construct the graph of
-        all possible control transfers between the insns to be
-        allocated.  <code>regUsage</code> returns the sets of registers
-        read and written by an instruction.  And
-        <code>patchRegs</code> is used to apply the allocator's final
-        decision on virtual-to-real reg mapping to an instruction.
-        <p>
-        Clearly these 3 fns have to be written anew for each
-        architecture.  They are defined in
-        <code>RegAllocInfo.lhs</code>.  Think twice, no, thrice,
-        before modifying them: making false claims about insn
-        behaviour will lead to hard-to-find register allocation
-        errors.
-        <p>
-        <code>AsmRegAlloc.lhs</code> contains detailed comments about
-	how the allocator works.  Here is a summary.  The head honcho
-        <p>
-        <code>allocUsingTheseRegs :: [Instr] -> [Reg] -> (Bool, [Instr])</code>
-        <p>
-        takes a list of instructions and a list of real registers
-        available for allocation, and maps as many of the virtual regs
-        in the input into real ones as it can.  The returned
-        <code>Bool</code> indicates whether or not it was
-        successful.  If so, that's the end of it.  If not, the caller
-        of <code>allocUsingTheseRegs</code> will attempt spilling.
-        More of that later.  What <code>allocUsingTheseRegs</code>
-        does is:
-        <p>
-        <ul>
-        <li>Implicitly number each instruction by its position in the
-            input list.
-        <p>
-        <li>Using <code>insnFuture</code>, create the set of all flow
-            edges -- possible control transfers -- within this set of
-            insns.
-        <p>
-        <li>Using <code>regUsage</code> and iterating around the flow
-            graph from the previous step, calculate, for each virtual
-            register, the set of flow edges on which it is live.
-        <p>
-        <li>Make a real-register committment map, which gives the set
-            of edges for which each real register is committed (in
-            use).  These sets are initially empty.  For each virtual
-            register, attempt to find a real register whose current
-            committment does not intersect that of the virtual
-            register -- ie, is uncommitted on all edges that the
-            virtual reg is live.  If successful, this means the vreg
-            can be assigned to the realreg, so add the vreg's set to
-            the realreg's committment.  
-        <p>
-        <li>If all the vregs were assigned to a realreg, use
-            <code>patchInstr</code> to apply the mapping to the insns themselves.
-        </ul>
-        <p>
-        <b>Spilling</b>
-        <p>
-        If <code>allocUsingTheseRegs</code> fails, a baroque
-        mechanism comes into play.  We now know that much simpler
-        schemes are available to do the same thing and give better
-        results.  
-        Anyways:
-        <p>
-        The logic above <code>allocUsingTheseRegs</code>, in
-        <code>doGeneralAlloc</code> and <code>runRegAllocate</code>,
-        observe that allocation has failed with some set R of real
-        registers.  So they apply <code>runRegAllocate</code> a second
-        time to the code, but remove (typically) two registers from R
-        before doing so.  This naturally fails too, but returns a
-        partially-allocated sequence.  <code>doGeneralAlloc</code>
-        then inserts spill code into the sequence, and finally re-runs
-        <code>allocUsingTheseRegs</code>, but supplying the original,
-        unadulterated R.  This is guaranteed to succeed since the two
-        registers previously removed from R are sufficient to allocate
-        all the spill/restore instructions added.
-        <p>
-        Because x86 is very short of registers, and in the worst case
-        needs three removed from R, a softly-softly approach is used.
-        <code>doGeneralAlloc</code> first tries with zero regs removed
-        from R, then if that fails one, then two, etc.  This means
-        <code>allocUsingTheseRegs</code> may get run several times
-        before a successful arrangement is arrived at.
-        <code>findReservedRegs</code> cooks up the sets of spill
-        registers to try with.
-        <p>
-        The resulting machinery is complicated and the generated spill
-        code is appalling.  The saving grace is that spills are very
-        rare so it doesn't matter much.  I did not invent this -- I inherited it.
-        <p>
-        <b>Dealing with common cases fast</b>
-        <p>
-        The entire reg-alloc mechanism described so far is general and
-        correct, but expensive overkill for many simple code blocks.
-        So to begin with we use
-        <code>doSimpleAlloc</code>, which attempts to do something
-        simple.  It exploits the observation that if the total number
-        of virtual registers does not exceed the number of real ones
-        available, we can simply dole out a new realreg each time we
-        see mention of a new vreg, with no regard for control flow.
-        <code>doSimpleAlloc</code> therefore attempts this in a
-        single pass over the code.  It gives up if it runs out of real
-        regs or sees any condition which renders the above observation
-        invalid (fixed reg uses, for example).  
-        <p>
-        This clever hack handles the majority of code blocks quickly.
-        It was copied from the previous reg-allocator (the
-        Mattson/Partain/Marlow/Gill one).
-    </ul>
-
-<p>
-<h2>Complications, observations, and possible improvements</h2>
-
-<h3>Real vs virtual registers in the instruction selectors</h3>
-
-The instruction selectors for expression trees, namely
-<code>getRegister</code>, are complicated by the fact that some 
-expressions can only be computed into a specific register, whereas
-the majority can be computed into any register.  We take x86 as an
-example, but the problem applies to all archs.
-<p>
-Terminology: <em>rreg</em> means real register, a real machine
-register.  <em>vreg</em> means one of an infinite set of virtual
-registers.  The type <code>Reg</code> is the sum of <em>rreg</em> and
-<em>vreg</em>.  The instruction selector generates sequences with
-unconstrained use of vregs, leaving the register allocator to map them
-all into rregs.
-<p>
-Now, where was I ?  Oh yes.  We return to the type of
-<code>getRegister</code>, which despite its name, selects instructions
-to compute the value of an expression tree.  
-<pre>
-   getRegister :: StixExpr -> NatM Register
-
-   data Register
-     = Fixed   PrimRep Reg InstrBlock
-     | Any     PrimRep (Reg -> InstrBlock)
-
-   type InstrBlock -- sequence of instructions
-</pre>
-At first this looks eminently reasonable (apart from the stupid
-name).  <code>getRegister</code>, and nobody else, knows whether or
-not a given expression has to be computed into a fixed rreg or can be
-computed into any rreg or vreg.  In the first case, it returns
-<code>Fixed</code> and indicates which rreg the result is in.  In the
-second case it defers committing to any specific target register by
-returning a function from <code>Reg</code> to <code>InstrBlock</code>,
-and the caller can specify the target reg as it sees fit.
-<p>
-Unfortunately, that forces <code>getRegister</code>'s callers (usually
-itself) to use a clumsy and confusing idiom in the common case where
-they do not care what register the result winds up in.  The reason is
-that although a value might be computed into a fixed rreg, we are
-forbidden (on pain of segmentation fault :) from subsequently
-modifying the fixed reg.  This and other rules are record in "Rules of
-the game" inside <code>MachCode.lhs</code>.
-<p>
-Why can't fixed registers be modified post-hoc?  Consider a simple
-expression like <code>Hp+1</code>.  Since the heap pointer
-<code>Hp</code> is definitely in a fixed register, call it R,
-<code>getRegister</code> on subterm <code>Hp</code> will simply return
-<code>Fixed</code> with an empty sequence and R.  But we can't just
-emit an increment instruction for R, because that trashes
-<code>Hp</code>; instead we first have to copy it into a fresh vreg
-and increment that.
-<p>
-With all that in mind, consider now writing a <code>getRegister</code>
-clause for terms of the form <code>(1 + E)</code>.  Contrived, yes,
-but illustrates the matter.  First we do
-<code>getRegister</code> on E.  Now we are forced to examine what 
-comes back.
-<pre>
-   getRegister (OnePlus e)
-      = getRegister e           `thenNat`   \ e_result ->
-        case e_result of
-           Fixed e_code e_fixed 
-              -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
-           Any e_any 
-              -> Any (\dst -> e_any dst ++ [INC dst])
-</pre>
-This seems unreasonably cumbersome, yet the instruction selector is
-full of such idioms.  A good example of the complexities induced by
-this scheme is shown by <code>trivialCode</code> for x86 in
-<code>MachCode.lhs</code>.  This deals with general integer dyadic
-operations on x86 and has numerous cases.  It was difficult to get
-right.
-<p>
-An alternative suggestion is to simplify the type of
-<code>getRegister</code> to this:
-<pre>
-   getRegister :: StixExpr -> NatM (InstrBloc, VReg)
-   type VReg = .... a vreg ...
-</pre>
-and then we could safely write
-<pre>
-   getRegister (OnePlus e)
-      = getRegister e        `thenNat`  \ (e_code, e_vreg) ->
-        returnNat (e_code ++ [INC e_vreg], e_vreg)
-</pre>
-which is about as straightforward as you could hope for.
-Unfortunately, it requires <code>getRegister</code> to insert moves of
-values which naturally compute into an rreg, into a vreg.  Consider:
-<pre>
-   1 + ccall some-C-fn
-</pre>
-On x86 the ccall result is returned in rreg <code>%eax</code>.  The
-resulting sequence, prior to register allocation, would be:
-<pre>
-   # push args
-   call some-C-fn
-   # move %esp to nuke args
-   movl   %eax, %vreg
-   incl   %vreg
-</pre>
-If, as is likely, <code>%eax</code> is not held live beyond this point
-for any other purpose, the move into a fresh register is pointless;
-we'd have been better off leaving the value in <code>%eax</code> as
-long as possible.
-<p>
-The simplified <code>getRegister</code> story is attractive.  It would
-clean up the instruction selectors significantly and make it simpler
-to write new ones.  The only drawback is that it generates redundant
-register moves.  I suggest that eliminating these should be the job
-of the register allocator.  Indeed:
-<ul>
-<li>There has been some work on this already ("Iterated register
-    coalescing" ?), so this isn't a new idea.
-<p>
-<li>You could argue that the existing scheme inappropriately blurs the
-    boundary between the instruction selector and the register
-    allocator.  The instruction selector should .. well .. just
-    select instructions, without having to futz around worrying about
-    what kind of registers subtrees get generated into.  Register
-    allocation should be <em>entirely</em> the domain of the register
-    allocator, with the proviso that it should endeavour to allocate
-    registers so as to minimise the number of non-redundant reg-reg
-    moves in the final output.
-</ul>
-
-
-<h3>Selecting insns for 64-bit values/loads/stores on 32-bit platforms</h3>
-
-Note that this stuff doesn't apply on 64-bit archs, since the
-<code>getRegister</code> mechanism applies there.
-
-The relevant functions are:
-<pre>
-   assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
-   assignReg_I64Code :: StixReg  -> StixExpr -> NatM InstrBlock
-   iselExpr64        :: StixExpr -> NatM ChildCode64
-
-   data ChildCode64     -- a.k.a "Register64"
-      = ChildCode64 
-           InstrBlock   -- code
-           VRegUnique   -- unique for the lower 32-bit temporary
-</pre>
-<code>iselExpr64</code> is the 64-bit, plausibly-named analogue of
-<code>getRegister</code>, and <code>ChildCode64</code> is the analogue
-of <code>Register</code>.  The aim here was to generate working 64
-bit code as simply as possible.  To this end, I used the 
-simplified <code>getRegister</code> scheme described above, in which
-<code>iselExpr64</code>generates its results into two vregs which
-can always safely be modified afterwards.
-<p>
-Virtual registers are, unsurprisingly, distinguished by their
-<code>Unique</code>s.  There is a small difficulty in how to
-know what the vreg for the upper 32 bits of a value is, given the vreg
-for the lower 32 bits.  The simple solution adopted is to say that
-any low-32 vreg may also have a hi-32 counterpart which shares the
-same unique, but is otherwise regarded as a separate entity.
-<code>getHiVRegFromLo</code> gets one from the other.
-<pre>
-   data VRegUnique
-      = VRegUniqueLo Unique          -- lower part of a split quantity
-      | VRegUniqueHi Unique          -- upper part thereof
-</pre>
-Apart from that, 64-bit code generation is really simple.  The sparc
-and x86 versions are almost copy-n-pastes of each other, with minor
-adjustments for endianness.  The generated code isn't wonderful but
-is certainly acceptable, and it works.
-
-
-
-<h3>Shortcomings and inefficiencies in the register allocator</h3>
-
-<h4>Redundant reconstruction of the control flow graph</h4>
-
-The allocator goes to considerable computational expense to construct
-all the flow edges in the group of instructions it's allocating for,
-by using the <code>insnFuture</code> function in the
-<code>Instr</code> pseudo-abstract type.
-<p>
-This is really silly, because all that information is present at the
-abstract C stage, but is thrown away in the translation to Stix.
-So a good thing to do is to modify that translation to
-produce a directed graph of Stix straight-line code blocks,
-and to preserve that structure through the insn selector, so the
-allocator can see it.  
-<p>
-This would eliminate the fragile, hacky, arch-specific
-<code>insnFuture</code> mechanism, and probably make the whole
-compiler run measurably faster.  Register allocation is a fair chunk
-of the time of non-optimising compilation (10% or more), and
-reconstructing the flow graph is an expensive part of reg-alloc.
-It would probably accelerate the vreg liveness computation too.
-
-<h4>Really ridiculous method for doing spilling</h4>
-
-This is a more ambitious suggestion, but ... reg-alloc should be
-reimplemented, using the scheme described in "Quality and speed in
-linear-scan register allocation."  (Traub?)  For straight-line code
-blocks, this gives an elegant one-pass algorithm for assigning
-registers and creating the minimal necessary spill code, without the
-need for reserving spill registers ahead of time.
-<p>
-I tried it in Rigr, replacing the previous spiller which used the
-current GHC scheme described above, and it cut the number of spill
-loads and stores by a factor of eight.  Not to mention being simpler,
-easier to understand and very fast.
-<p>
-The Traub paper also describes how to extend their method to multiple
-basic blocks, which will be needed for GHC.  It comes down to
-reconciling multiple vreg-to-rreg mappings at points where control
-flow merges.
-
-<h4>Redundant-move support for revised instruction selector suggestion</h4>
-
-As mentioned above, simplifying the instruction selector will require
-the register allocator to try and allocate source and destination
-vregs to the same rreg in reg-reg moves, so as to make as many as
-possible go away.  Without that, the revised insn selector would
-generate worse code than at present.  I know this stuff has been done
-but know nothing about it.  The Linear-scan reg-alloc paper mentioned
-above does indeed mention a bit about it in the context of single
-basic blocks, but I don't know if that's sufficient.
-
-
-   
-<h3>x86 arcana that you should know about</h3>
-
-The main difficulty with x86 is that many instructions have fixed
-register constraints, which can occasionally make reg-alloc fail
-completely.  And the FPU doesn't have the flat register model which
-the reg-alloc abstraction (implicitly) assumes.
-<p>
-Our strategy is: do a good job for the common small subset, that is
-integer loads, stores, address calculations, basic ALU ops (+, -,
-and, or, xor), and jumps.  That covers the vast majority of 
-executed insns.  And indeed we do a good job, with a loss of
-less than 2% compared with gcc.
-<p>
-Initially we tried to handle integer instructions with awkward
-register constraints (mul, div, shifts by non-constant amounts) via
-various jigglings of the spiller et al.  This never worked robustly,
-and putting platform-specific tweaks in the generic infrastructure is
-a big No-No.  (Not quite true; shifts by a non-constant amount are
-still done by a giant kludge, and should be moved into this new
-framework.)
-<p>
-Fortunately, all such insns are rare.  So the current scheme is to 
-pretend that they don't have any such constraints.  This fiction is
-carried all the way through the register allocator.  When the insn
-finally comes to be printed, we emit a sequence which copies the
-operands through memory (<code>%esp</code>-relative), satisfying the
-constraints of the real instruction.  This localises the gruesomeness
-to just one place.  Here, for example, is the code generated for
-integer divison of <code>%esi</code> by <code>%ecx</code>:
-<pre>
-   # BEGIN IQUOT %ecx, %esi
-   pushl $0
-   pushl %eax  
-   pushl %edx
-   pushl %ecx
-   movl  %esi,% eax
-   cltd
-   idivl 0(%esp)
-   movl %eax, 12(%esp)
-   popl %edx  
-   popl %edx
-   popl %eax
-   popl %esi
-   # END   IQUOT %ecx, %esi
-</pre>
-This is not quite as appalling as it seems, if you consider that the
-division itself typically takes 16+ cycles, whereas the rest of the
-insns probably go through in about 1 cycle each.
-<p>
-This trick is taken to extremes for FP operations.
-<p>
-All notions of the x86 FP stack and its insns have been removed.
-Instead, we pretend, to the instruction selector and register
-allocator, that x86 has six floating point registers,
-<code>%fake0</code> .. <code>%fake5</code>, which can be used in the
-usual flat manner.  We further claim that x86 has floating point
-instructions very similar to SPARC and Alpha, that is, a simple
-3-operand register-register arrangement.  Code generation and register
-allocation proceed on this basis.
-<p>
-When we come to print out the final assembly, our convenient fiction
-is converted to dismal reality.  Each fake instruction is
-independently converted to a series of real x86 instructions.
-<code>%fake0</code> .. <code>%fake5</code> are mapped to
-<code>%st(0)</code> .. <code>%st(5)</code>.  To do reg-reg arithmetic
-operations, the two operands are pushed onto the top of the FP stack,
-the operation done, and the result copied back into the relevant
-register.  When one of the operands is also the destination, we emit a
-slightly less scummy translation.  There are only six
-<code>%fake</code> registers because 2 are needed for the translation,
-and x86 has 8 in total.
-<p>
-The translation is inefficient but is simple and it works.  A cleverer
-translation would handle a sequence of insns, simulating the FP stack
-contents, would not impose a fixed mapping from <code>%fake</code> to
-<code>%st</code> regs, and hopefully could avoid most of the redundant
-reg-reg moves of the current translation.
-<p>
-There are, however, two unforeseen bad side effects:
-<ul>
-<li>This doesn't work properly, because it doesn't observe the normal
-    conventions for x86 FP code generation.  It turns out that each of
-    the 8 elements in the x86 FP register stack has a tag bit which
-    indicates whether or not that register is notionally in use or
-    not.  If you do a FPU operation which happens to read a
-    tagged-as-empty register, you get an x87 FPU (stack invalid)
-    exception, which is normally handled by the FPU without passing it
-    to the OS: the program keeps going, but the resulting FP values
-    are garbage.  The OS can ask for the FPU to pass it FP
-    stack-invalid exceptions, but it usually doesn't.
-    <p>
-    Anyways: inside NCG created x86 FP code this all works fine.
-    However, the NCG's fiction of a flat register set does not operate
-    the x87 register stack in the required stack-like way.  When
-    control returns to a gcc-generated world, the stack tag bits soon
-    cause stack exceptions, and thus garbage results.
-    <p>
-    The only fix I could think of -- and it is horrible -- is to clear
-    all the tag bits just before the next STG-level entry, in chunks
-    of code which use FP insns.  <code>i386_insert_ffrees</code>
-    inserts the relevant <code>ffree</code> insns into such code
-    blocks.  It depends critically on <code>is_G_instr</code> to
-    detect such blocks.
-<p>
-<li>It's very difficult to read the generated assembly and
-    reason about it when debugging, because there's so much clutter.
-    We print the fake insns as comments in the output, and that helps
-    a bit. 
-</ul>
-
-
-
-<h3>Generating code for ccalls</h3>
-
-For reasons I don't really understand, the instruction selectors for
-generating calls to C (<code>genCCall</code>) have proven surprisingly
-difficult to get right, and soaked up a lot of debugging time.  As a
-result, I have once again opted for schemes which are simple and not
-too difficult to argue as correct, even if they don't generate
-excellent code.
-<p>
-The sparc ccall generator in particular forces all arguments into
-temporary virtual registers before moving them to the final
-out-registers (<code>%o0</code> .. <code>%o5</code>).  This creates
-some unnecessary reg-reg moves.  The reason is explained in a
-comment in the code.
-
-
-<h3>Duplicate implementation for many STG macros</h3>
-
-This has been discussed at length already.  It has caused a couple of
-nasty bugs due to subtle untracked divergence in the macro
-translations.  The macro-expander really should be pushed up into the
-Abstract C phase, so the problem can't happen.
-<p>
-Doing so would have the added benefit that the NCG could be used to
-compile more "ways" -- well, at least the 'p' profiling way.
-
-
-<h3>How to debug the NCG without losing your sanity/hair/cool</h3>
-
-Last, but definitely not least ...
-<p>
-The usual syndrome is that some program, when compiled via C, works,
-but not when compiled via the NCG.  Usually the problem is fairly 
-simple to fix, once you find the specific code block which has been
-mistranslated.  But the latter can be nearly impossible, since most
-modules generate at least hundreds and often thousands of them.
-<p>
-My solution: cheat.
-<p>
-Because the via-C and native routes diverge only late in the day,
-it is not difficult to construct a 1-1 correspondence between basic
-blocks on the two routes.  So, if the program works via C but not on
-the NCG, do the following:
-<ul>
-<li>Recompile <code>AsmCodeGen.lhs</code> in the afflicted compiler
-    with <code>-DDEBUG_NCG</code>, so that it inserts
-    <code>___ncg_debug_marker</code>s
-    into the assembly it emits.
-<p>
-<li>Using a binary search on modules, find the module which is causing
-    the problem.
-<p>
-<li>Compile that module to assembly code, with identical flags, twice,
-    once via C and once via NCG.
-    Call the outputs <code>ModuleName.s-gcc</code> and
-    <code>ModuleName.s-nat</code>.  Check that the latter does indeed have 
-    <code>___ncg_debug_marker</code>s in it; otherwise the next steps fail.
-<p>
-<li>Build (with a working compiler) the program
-    <code>fptools/ghc/utils/debugNCG/diff_gcc_nat</code>.
-<p>
-<li>Run: <code>diff_gcc_nat ModuleName.s</code>.  This will
-    construct the 1-1 correspondence, and emits on stdout 
-    a cppable assembly output.  Place this in a file -- I always
-    call it <code>synth.S</code>.  Note, the capital S is important;
-    otherwise it won't get cpp'd.  You can feed this file directly to
-    ghc and it will automatically get cpp'd; you don't have to do so
-    yourself.
-<p>
-<li>By messing with the <code>#define</code>s at the top of
-    <code>synth.S</code>, do a binary search to find the incorrect
-    block.  Keep a careful record of where you are in the search; it
-    is easy to get confused.  Remember also that multiple blocks may
-    be wrong, which also confuses matters.  Finally, I usually start
-    off by re-checking that I can build the executable with all the
-    <code>#define</code>s set to 0 and then all to 1.  This ensures
-    you won't get halfway through the search and then get stuck due to
-    some snafu with gcc-specific literals.  Usually I set
-    <code>UNMATCHED_GCC</code> to 1 all the time, and this bit should
-    contain only literal data.
-    <code>UNMATCHED_NAT</code> should be empty.
-</ul>
-<p>
-<code>diff_gcc_nat</code> was known to work correctly last time I used
-it, in December 01, for both x86 and sparc.  If it doesn't work, due
-to changes in assembly syntax, or whatever, make it work.  The
-investment is well worth it.  Searching for the incorrect block(s) any
-other way is a total time waster.
-
-
-
-</ul>
-
-
-
-
-    <p><small>
-<!-- hhmts start -->
-Last modified: Mon Aug 19 11:41:43 CEST 2013
-<!-- hhmts end -->
-    </small>
-  </body>
-</html>