diff options
Diffstat (limited to 'docs/comm/the-beast/ncg.html')
| -rw-r--r-- | docs/comm/the-beast/ncg.html | 749 |
1 files changed, 0 insertions, 749 deletions
diff --git a/docs/comm/the-beast/ncg.html b/docs/comm/the-beast/ncg.html deleted file mode 100644 index 84beac2d51..0000000000 --- a/docs/comm/the-beast/ncg.html +++ /dev/null @@ -1,749 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> -<html> - <head> - <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"> - <title>The GHC Commentary - The Native Code Generator</title> - </head> - - <body BGCOLOR="FFFFFF"> - <h1>The GHC Commentary - The Native Code Generator</h1> - <p> - On some platforms (currently x86 and PowerPC, with bitrotted - support for Sparc and Alpha), GHC can generate assembly code - directly, without having to go via C. This can sometimes almost - halve compilation time, and avoids the fragility and - horribleness of the <a href="mangler.html">mangler</a>. The NCG - is enabled by default for - non-optimising compilation on supported platforms. For most programs - it generates code which runs only 1-3% slower - (depending on platform and type of code) than that - created by gcc on x86s, so it is well worth using even with - optimised compilation. FP-intensive x86 programs see a bigger - slowdown, and all Sparc code runs about 5% slower due to - us not filling branch delay slots. - <p> - The NCG has always been something of a second-class citizen - inside GHC, an unloved child, rather. This means that its - integration into the compiler as a whole is rather clumsy, which - brings some problems described below. That apart, the NCG - proper is fairly cleanly designed, as target-independent as it - reasonably can be, and so should not be difficult to retarget. - <p> - <b>NOTE!</b> The native code generator was largely rewritten as part - of the C-- backend changes, around May 2004. Unfortunately the - rest of this document still refers to the old version, and was written - with relation to the CVS head as of end-Jan 2002. Some of it is relevant, - some of it isn't. - - <h2>Overview</h2> - The top-level code generator fn is - <p> - <code>absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc)</code> - <p> - The returned <code>SDoc</code> is for debugging, so is empty unless - you specify <code>-ddump-stix</code>. The <code>Pretty.Doc</code> - bit is the final assembly code. Translation involves three main - phases, the first and third of which are target-independent. - <ul> - <li><b>Translation into the <code>Stix</code> representation.</b> Stix - is a simple tree-like RTL-style language, in which you can - mention: - <p> - <ul> - <li>An infinite number of temporary, virtual registers. - <li>The STG "magic" registers (<code>MagicId</code>), such as - the heap and stack pointers. - <li>Literals and low-level machine ops (<code>MachOp</code>). - <li>Simple address computations. - <li>Reads and writes of: memory, virtual regs, and various STG - regs. - <li>Labels and <code>if ... goto ...</code> style control-flow. - </ul> - <p> - Stix has two main associated types: - <p> - <ul> - <li><code>StixStmt</code> -- trees executed for their side - effects: assignments, control transfers, and auxiliary junk - such as segment changes and literal data. - <li><code>StixExpr</code> -- trees which denote a value. - </ul> - <p> - Translation into Stix is almost completely - target-independent. Needed dependencies are knowledge of - word size and endianness, used when generating code to do - deal with half-word fields in info tables. This could be - abstracted out easily enough. Also, the Stix translation - needs to know which <code>MagicId</code>s map to registers - on the given target, and which are stored in offsets from - <code>BaseReg</code>. - <p> - After initial Stix generation, the trees are cleaned up with - constant-folding and a little copy-propagation ("Stix - inlining", as the code misleadingly calls it). We take - the opportunity to translate <code>MagicId</code>s which are - stored in memory on the given target, into suitable memory - references. Those which are stored in registers are left - alone. There is also a half-hearted attempt to lift literal - strings to the top level in cases where nested strings have - been observed to give incorrect code in the past. - <p> - Primitive machine-level operations will already be phrased in - terms of <code>MachOp</code>s in the presented Abstract C, and - these are passed through unchanged. We comment only that the - <code>MachOp</code>s have been chosen so as to be easy to - implement on all targets, and their meaning is intended to be - unambiguous, and the same on all targets, regardless of word - size or endianness. - <p> - <b>A note on <code>MagicId</code>s.</b> - Those which are assigned to - registers on the current target are left unmodified. Those - which are not are stored in memory as offsets from - <code>BaseReg</code> (which is assumed to permanently have the - value <code>(&MainCapability.r)</code>), so the constant folder - calculates the offsets and inserts suitable loads/stores. One - complication is that not all archs have <code>BaseReg</code> - itself in a register, so for those (sparc), we instead - generate the address as an offset from the static symbol - <code>MainCapability</code>, since the register table lives in - there. - <p> - Finally, <code>BaseReg</code> does occasionally itself get - mentioned in Stix expression trees, and in this case what is - denoted is precisely <code>(&MainCapability.r)</code>, not, as - in all other cases, the value of memory at some offset from - the start of the register table. Since what it denotes is an - r-value and not an l-value, assigning <code>BaseReg</code> is - meaningless, so the machinery checks to ensure this never - happens. All these details are taken into account by the - constant folder. - <p> - <li><b>Instruction selection.</b> This is the only majorly - target-specific phase. It turns Stix statements and - expressions into sequences of <code>Instr</code>, a data - type which is different for each architecture. - <code>Instr</code>, unsurprisingly, has various supporting - types, such as <code>Reg</code>, <code>Operand</code>, - <code>Imm</code>, etc. The generated instructions may refer - to specific machine registers, or to arbitrary virtual - registers, either those created within the instruction - selector, or those mentioned in the Stix passed to it. - <p> - The instruction selectors live in <code>MachCode.lhs</code>. - The core functions, for each target, are: - <p> - <code> - getAmode :: StixExpr -> NatM Amode - <br>getRegister :: StixExpr -> NatM Register - <br>assignMem_IntCode :: PrimRep -> StixExpr -> StixExpr -> NatM InstrBlock - <br>assignReg_IntCode :: PrimRep -> StixReg -> StixExpr -> NatM InstrBlock - </code> - <p> - The insn selectors use the "maximal munch" algorithm. The - bizarrely-misnamed <code>getRegister</code> translates - expressions. A simplified version of its type is: - <p> - <code>getRegister :: StixExpr -> NatM (OrdList Instr, Reg)</code> - <p> - That is: it (monadically) turns a <code>StixExpr</code> into a - sequence of instructions, and a register, with the meaning - that after executing the (possibly empty) sequence of - instructions, the (possibly virtual) register will - hold the resulting value. The real situation is complicated - by the presence of fixed registers, and is detailed below. - <p> - Maximal munch is a greedy algorithm and is known not to give - globally optimal code sequences, but it is good enough, and - fast and simple. Early incarnations of the NCG used something - more sophisticated, but that is long gone now. - <p> - Similarly, <code>getAmode</code> translates a value, intended - to denote an address, into a sequence of insns leading up to - a (processor-specific) addressing mode. This stuff could be - done using the general <code>getRegister</code> selector, but - would necessarily generate poorer code, because the calculated - address would be forced into a register, which might be - unnecessary if it could partially or wholly be calculated - using an addressing mode. - <p> - Finally, <code>assignMem_IntCode</code> and - <code>assignReg_IntCode</code> create instruction sequences to - calculate a value and store it in the given register, or at - the given address. Because these guys translate a statement, - not a value, they just return a sequence of insns and no - associated register. Floating-point and 64-bit integer - assignments have analogous selectors. - <p> - Apart from the complexities of fixed vs floating registers, - discussed below, the instruction selector is as simple - as it can be. It looks long and scary but detailed - examination reveals it to be fairly straightforward. - <p> - <li><b>Register allocation.</b> The register allocator, - <code>AsmRegAlloc.lhs</code> takes sequences of - <code>Instr</code>s which mention a mixture of real and - virtual registers, and returns a modified sequence referring - only to real ones. It is gloriously and entirely - target-independent. Well, not exactly true. Instead it - regards <code>Instr</code> (instructions) and <code>Reg</code> - (virtual and real registers) as abstract types, to which it has - the following interface: - <p> - <code> - insnFuture :: Instr -> InsnFuture - <br>regUsage :: Instr -> RegUsage - <br>patchRegs :: Instr -> (Reg -> Reg) -> Instr - </code> - <p> - <code>insnFuture</code> is used to (re)construct the graph of - all possible control transfers between the insns to be - allocated. <code>regUsage</code> returns the sets of registers - read and written by an instruction. And - <code>patchRegs</code> is used to apply the allocator's final - decision on virtual-to-real reg mapping to an instruction. - <p> - Clearly these 3 fns have to be written anew for each - architecture. They are defined in - <code>RegAllocInfo.lhs</code>. Think twice, no, thrice, - before modifying them: making false claims about insn - behaviour will lead to hard-to-find register allocation - errors. - <p> - <code>AsmRegAlloc.lhs</code> contains detailed comments about - how the allocator works. Here is a summary. The head honcho - <p> - <code>allocUsingTheseRegs :: [Instr] -> [Reg] -> (Bool, [Instr])</code> - <p> - takes a list of instructions and a list of real registers - available for allocation, and maps as many of the virtual regs - in the input into real ones as it can. The returned - <code>Bool</code> indicates whether or not it was - successful. If so, that's the end of it. If not, the caller - of <code>allocUsingTheseRegs</code> will attempt spilling. - More of that later. What <code>allocUsingTheseRegs</code> - does is: - <p> - <ul> - <li>Implicitly number each instruction by its position in the - input list. - <p> - <li>Using <code>insnFuture</code>, create the set of all flow - edges -- possible control transfers -- within this set of - insns. - <p> - <li>Using <code>regUsage</code> and iterating around the flow - graph from the previous step, calculate, for each virtual - register, the set of flow edges on which it is live. - <p> - <li>Make a real-register committment map, which gives the set - of edges for which each real register is committed (in - use). These sets are initially empty. For each virtual - register, attempt to find a real register whose current - committment does not intersect that of the virtual - register -- ie, is uncommitted on all edges that the - virtual reg is live. If successful, this means the vreg - can be assigned to the realreg, so add the vreg's set to - the realreg's committment. - <p> - <li>If all the vregs were assigned to a realreg, use - <code>patchInstr</code> to apply the mapping to the insns themselves. - </ul> - <p> - <b>Spilling</b> - <p> - If <code>allocUsingTheseRegs</code> fails, a baroque - mechanism comes into play. We now know that much simpler - schemes are available to do the same thing and give better - results. - Anyways: - <p> - The logic above <code>allocUsingTheseRegs</code>, in - <code>doGeneralAlloc</code> and <code>runRegAllocate</code>, - observe that allocation has failed with some set R of real - registers. So they apply <code>runRegAllocate</code> a second - time to the code, but remove (typically) two registers from R - before doing so. This naturally fails too, but returns a - partially-allocated sequence. <code>doGeneralAlloc</code> - then inserts spill code into the sequence, and finally re-runs - <code>allocUsingTheseRegs</code>, but supplying the original, - unadulterated R. This is guaranteed to succeed since the two - registers previously removed from R are sufficient to allocate - all the spill/restore instructions added. - <p> - Because x86 is very short of registers, and in the worst case - needs three removed from R, a softly-softly approach is used. - <code>doGeneralAlloc</code> first tries with zero regs removed - from R, then if that fails one, then two, etc. This means - <code>allocUsingTheseRegs</code> may get run several times - before a successful arrangement is arrived at. - <code>findReservedRegs</code> cooks up the sets of spill - registers to try with. - <p> - The resulting machinery is complicated and the generated spill - code is appalling. The saving grace is that spills are very - rare so it doesn't matter much. I did not invent this -- I inherited it. - <p> - <b>Dealing with common cases fast</b> - <p> - The entire reg-alloc mechanism described so far is general and - correct, but expensive overkill for many simple code blocks. - So to begin with we use - <code>doSimpleAlloc</code>, which attempts to do something - simple. It exploits the observation that if the total number - of virtual registers does not exceed the number of real ones - available, we can simply dole out a new realreg each time we - see mention of a new vreg, with no regard for control flow. - <code>doSimpleAlloc</code> therefore attempts this in a - single pass over the code. It gives up if it runs out of real - regs or sees any condition which renders the above observation - invalid (fixed reg uses, for example). - <p> - This clever hack handles the majority of code blocks quickly. - It was copied from the previous reg-allocator (the - Mattson/Partain/Marlow/Gill one). - </ul> - -<p> -<h2>Complications, observations, and possible improvements</h2> - -<h3>Real vs virtual registers in the instruction selectors</h3> - -The instruction selectors for expression trees, namely -<code>getRegister</code>, are complicated by the fact that some -expressions can only be computed into a specific register, whereas -the majority can be computed into any register. We take x86 as an -example, but the problem applies to all archs. -<p> -Terminology: <em>rreg</em> means real register, a real machine -register. <em>vreg</em> means one of an infinite set of virtual -registers. The type <code>Reg</code> is the sum of <em>rreg</em> and -<em>vreg</em>. The instruction selector generates sequences with -unconstrained use of vregs, leaving the register allocator to map them -all into rregs. -<p> -Now, where was I ? Oh yes. We return to the type of -<code>getRegister</code>, which despite its name, selects instructions -to compute the value of an expression tree. -<pre> - getRegister :: StixExpr -> NatM Register - - data Register - = Fixed PrimRep Reg InstrBlock - | Any PrimRep (Reg -> InstrBlock) - - type InstrBlock -- sequence of instructions -</pre> -At first this looks eminently reasonable (apart from the stupid -name). <code>getRegister</code>, and nobody else, knows whether or -not a given expression has to be computed into a fixed rreg or can be -computed into any rreg or vreg. In the first case, it returns -<code>Fixed</code> and indicates which rreg the result is in. In the -second case it defers committing to any specific target register by -returning a function from <code>Reg</code> to <code>InstrBlock</code>, -and the caller can specify the target reg as it sees fit. -<p> -Unfortunately, that forces <code>getRegister</code>'s callers (usually -itself) to use a clumsy and confusing idiom in the common case where -they do not care what register the result winds up in. The reason is -that although a value might be computed into a fixed rreg, we are -forbidden (on pain of segmentation fault :) from subsequently -modifying the fixed reg. This and other rules are record in "Rules of -the game" inside <code>MachCode.lhs</code>. -<p> -Why can't fixed registers be modified post-hoc? Consider a simple -expression like <code>Hp+1</code>. Since the heap pointer -<code>Hp</code> is definitely in a fixed register, call it R, -<code>getRegister</code> on subterm <code>Hp</code> will simply return -<code>Fixed</code> with an empty sequence and R. But we can't just -emit an increment instruction for R, because that trashes -<code>Hp</code>; instead we first have to copy it into a fresh vreg -and increment that. -<p> -With all that in mind, consider now writing a <code>getRegister</code> -clause for terms of the form <code>(1 + E)</code>. Contrived, yes, -but illustrates the matter. First we do -<code>getRegister</code> on E. Now we are forced to examine what -comes back. -<pre> - getRegister (OnePlus e) - = getRegister e `thenNat` \ e_result -> - case e_result of - Fixed e_code e_fixed - -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst])) - Any e_any - -> Any (\dst -> e_any dst ++ [INC dst]) -</pre> -This seems unreasonably cumbersome, yet the instruction selector is -full of such idioms. A good example of the complexities induced by -this scheme is shown by <code>trivialCode</code> for x86 in -<code>MachCode.lhs</code>. This deals with general integer dyadic -operations on x86 and has numerous cases. It was difficult to get -right. -<p> -An alternative suggestion is to simplify the type of -<code>getRegister</code> to this: -<pre> - getRegister :: StixExpr -> NatM (InstrBloc, VReg) - type VReg = .... a vreg ... -</pre> -and then we could safely write -<pre> - getRegister (OnePlus e) - = getRegister e `thenNat` \ (e_code, e_vreg) -> - returnNat (e_code ++ [INC e_vreg], e_vreg) -</pre> -which is about as straightforward as you could hope for. -Unfortunately, it requires <code>getRegister</code> to insert moves of -values which naturally compute into an rreg, into a vreg. Consider: -<pre> - 1 + ccall some-C-fn -</pre> -On x86 the ccall result is returned in rreg <code>%eax</code>. The -resulting sequence, prior to register allocation, would be: -<pre> - # push args - call some-C-fn - # move %esp to nuke args - movl %eax, %vreg - incl %vreg -</pre> -If, as is likely, <code>%eax</code> is not held live beyond this point -for any other purpose, the move into a fresh register is pointless; -we'd have been better off leaving the value in <code>%eax</code> as -long as possible. -<p> -The simplified <code>getRegister</code> story is attractive. It would -clean up the instruction selectors significantly and make it simpler -to write new ones. The only drawback is that it generates redundant -register moves. I suggest that eliminating these should be the job -of the register allocator. Indeed: -<ul> -<li>There has been some work on this already ("Iterated register - coalescing" ?), so this isn't a new idea. -<p> -<li>You could argue that the existing scheme inappropriately blurs the - boundary between the instruction selector and the register - allocator. The instruction selector should .. well .. just - select instructions, without having to futz around worrying about - what kind of registers subtrees get generated into. Register - allocation should be <em>entirely</em> the domain of the register - allocator, with the proviso that it should endeavour to allocate - registers so as to minimise the number of non-redundant reg-reg - moves in the final output. -</ul> - - -<h3>Selecting insns for 64-bit values/loads/stores on 32-bit platforms</h3> - -Note that this stuff doesn't apply on 64-bit archs, since the -<code>getRegister</code> mechanism applies there. - -The relevant functions are: -<pre> - assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock - assignReg_I64Code :: StixReg -> StixExpr -> NatM InstrBlock - iselExpr64 :: StixExpr -> NatM ChildCode64 - - data ChildCode64 -- a.k.a "Register64" - = ChildCode64 - InstrBlock -- code - VRegUnique -- unique for the lower 32-bit temporary -</pre> -<code>iselExpr64</code> is the 64-bit, plausibly-named analogue of -<code>getRegister</code>, and <code>ChildCode64</code> is the analogue -of <code>Register</code>. The aim here was to generate working 64 -bit code as simply as possible. To this end, I used the -simplified <code>getRegister</code> scheme described above, in which -<code>iselExpr64</code>generates its results into two vregs which -can always safely be modified afterwards. -<p> -Virtual registers are, unsurprisingly, distinguished by their -<code>Unique</code>s. There is a small difficulty in how to -know what the vreg for the upper 32 bits of a value is, given the vreg -for the lower 32 bits. The simple solution adopted is to say that -any low-32 vreg may also have a hi-32 counterpart which shares the -same unique, but is otherwise regarded as a separate entity. -<code>getHiVRegFromLo</code> gets one from the other. -<pre> - data VRegUnique - = VRegUniqueLo Unique -- lower part of a split quantity - | VRegUniqueHi Unique -- upper part thereof -</pre> -Apart from that, 64-bit code generation is really simple. The sparc -and x86 versions are almost copy-n-pastes of each other, with minor -adjustments for endianness. The generated code isn't wonderful but -is certainly acceptable, and it works. - - - -<h3>Shortcomings and inefficiencies in the register allocator</h3> - -<h4>Redundant reconstruction of the control flow graph</h4> - -The allocator goes to considerable computational expense to construct -all the flow edges in the group of instructions it's allocating for, -by using the <code>insnFuture</code> function in the -<code>Instr</code> pseudo-abstract type. -<p> -This is really silly, because all that information is present at the -abstract C stage, but is thrown away in the translation to Stix. -So a good thing to do is to modify that translation to -produce a directed graph of Stix straight-line code blocks, -and to preserve that structure through the insn selector, so the -allocator can see it. -<p> -This would eliminate the fragile, hacky, arch-specific -<code>insnFuture</code> mechanism, and probably make the whole -compiler run measurably faster. Register allocation is a fair chunk -of the time of non-optimising compilation (10% or more), and -reconstructing the flow graph is an expensive part of reg-alloc. -It would probably accelerate the vreg liveness computation too. - -<h4>Really ridiculous method for doing spilling</h4> - -This is a more ambitious suggestion, but ... reg-alloc should be -reimplemented, using the scheme described in "Quality and speed in -linear-scan register allocation." (Traub?) For straight-line code -blocks, this gives an elegant one-pass algorithm for assigning -registers and creating the minimal necessary spill code, without the -need for reserving spill registers ahead of time. -<p> -I tried it in Rigr, replacing the previous spiller which used the -current GHC scheme described above, and it cut the number of spill -loads and stores by a factor of eight. Not to mention being simpler, -easier to understand and very fast. -<p> -The Traub paper also describes how to extend their method to multiple -basic blocks, which will be needed for GHC. It comes down to -reconciling multiple vreg-to-rreg mappings at points where control -flow merges. - -<h4>Redundant-move support for revised instruction selector suggestion</h4> - -As mentioned above, simplifying the instruction selector will require -the register allocator to try and allocate source and destination -vregs to the same rreg in reg-reg moves, so as to make as many as -possible go away. Without that, the revised insn selector would -generate worse code than at present. I know this stuff has been done -but know nothing about it. The Linear-scan reg-alloc paper mentioned -above does indeed mention a bit about it in the context of single -basic blocks, but I don't know if that's sufficient. - - - -<h3>x86 arcana that you should know about</h3> - -The main difficulty with x86 is that many instructions have fixed -register constraints, which can occasionally make reg-alloc fail -completely. And the FPU doesn't have the flat register model which -the reg-alloc abstraction (implicitly) assumes. -<p> -Our strategy is: do a good job for the common small subset, that is -integer loads, stores, address calculations, basic ALU ops (+, -, -and, or, xor), and jumps. That covers the vast majority of -executed insns. And indeed we do a good job, with a loss of -less than 2% compared with gcc. -<p> -Initially we tried to handle integer instructions with awkward -register constraints (mul, div, shifts by non-constant amounts) via -various jigglings of the spiller et al. This never worked robustly, -and putting platform-specific tweaks in the generic infrastructure is -a big No-No. (Not quite true; shifts by a non-constant amount are -still done by a giant kludge, and should be moved into this new -framework.) -<p> -Fortunately, all such insns are rare. So the current scheme is to -pretend that they don't have any such constraints. This fiction is -carried all the way through the register allocator. When the insn -finally comes to be printed, we emit a sequence which copies the -operands through memory (<code>%esp</code>-relative), satisfying the -constraints of the real instruction. This localises the gruesomeness -to just one place. Here, for example, is the code generated for -integer divison of <code>%esi</code> by <code>%ecx</code>: -<pre> - # BEGIN IQUOT %ecx, %esi - pushl $0 - pushl %eax - pushl %edx - pushl %ecx - movl %esi,% eax - cltd - idivl 0(%esp) - movl %eax, 12(%esp) - popl %edx - popl %edx - popl %eax - popl %esi - # END IQUOT %ecx, %esi -</pre> -This is not quite as appalling as it seems, if you consider that the -division itself typically takes 16+ cycles, whereas the rest of the -insns probably go through in about 1 cycle each. -<p> -This trick is taken to extremes for FP operations. -<p> -All notions of the x86 FP stack and its insns have been removed. -Instead, we pretend, to the instruction selector and register -allocator, that x86 has six floating point registers, -<code>%fake0</code> .. <code>%fake5</code>, which can be used in the -usual flat manner. We further claim that x86 has floating point -instructions very similar to SPARC and Alpha, that is, a simple -3-operand register-register arrangement. Code generation and register -allocation proceed on this basis. -<p> -When we come to print out the final assembly, our convenient fiction -is converted to dismal reality. Each fake instruction is -independently converted to a series of real x86 instructions. -<code>%fake0</code> .. <code>%fake5</code> are mapped to -<code>%st(0)</code> .. <code>%st(5)</code>. To do reg-reg arithmetic -operations, the two operands are pushed onto the top of the FP stack, -the operation done, and the result copied back into the relevant -register. When one of the operands is also the destination, we emit a -slightly less scummy translation. There are only six -<code>%fake</code> registers because 2 are needed for the translation, -and x86 has 8 in total. -<p> -The translation is inefficient but is simple and it works. A cleverer -translation would handle a sequence of insns, simulating the FP stack -contents, would not impose a fixed mapping from <code>%fake</code> to -<code>%st</code> regs, and hopefully could avoid most of the redundant -reg-reg moves of the current translation. -<p> -There are, however, two unforeseen bad side effects: -<ul> -<li>This doesn't work properly, because it doesn't observe the normal - conventions for x86 FP code generation. It turns out that each of - the 8 elements in the x86 FP register stack has a tag bit which - indicates whether or not that register is notionally in use or - not. If you do a FPU operation which happens to read a - tagged-as-empty register, you get an x87 FPU (stack invalid) - exception, which is normally handled by the FPU without passing it - to the OS: the program keeps going, but the resulting FP values - are garbage. The OS can ask for the FPU to pass it FP - stack-invalid exceptions, but it usually doesn't. - <p> - Anyways: inside NCG created x86 FP code this all works fine. - However, the NCG's fiction of a flat register set does not operate - the x87 register stack in the required stack-like way. When - control returns to a gcc-generated world, the stack tag bits soon - cause stack exceptions, and thus garbage results. - <p> - The only fix I could think of -- and it is horrible -- is to clear - all the tag bits just before the next STG-level entry, in chunks - of code which use FP insns. <code>i386_insert_ffrees</code> - inserts the relevant <code>ffree</code> insns into such code - blocks. It depends critically on <code>is_G_instr</code> to - detect such blocks. -<p> -<li>It's very difficult to read the generated assembly and - reason about it when debugging, because there's so much clutter. - We print the fake insns as comments in the output, and that helps - a bit. -</ul> - - - -<h3>Generating code for ccalls</h3> - -For reasons I don't really understand, the instruction selectors for -generating calls to C (<code>genCCall</code>) have proven surprisingly -difficult to get right, and soaked up a lot of debugging time. As a -result, I have once again opted for schemes which are simple and not -too difficult to argue as correct, even if they don't generate -excellent code. -<p> -The sparc ccall generator in particular forces all arguments into -temporary virtual registers before moving them to the final -out-registers (<code>%o0</code> .. <code>%o5</code>). This creates -some unnecessary reg-reg moves. The reason is explained in a -comment in the code. - - -<h3>Duplicate implementation for many STG macros</h3> - -This has been discussed at length already. It has caused a couple of -nasty bugs due to subtle untracked divergence in the macro -translations. The macro-expander really should be pushed up into the -Abstract C phase, so the problem can't happen. -<p> -Doing so would have the added benefit that the NCG could be used to -compile more "ways" -- well, at least the 'p' profiling way. - - -<h3>How to debug the NCG without losing your sanity/hair/cool</h3> - -Last, but definitely not least ... -<p> -The usual syndrome is that some program, when compiled via C, works, -but not when compiled via the NCG. Usually the problem is fairly -simple to fix, once you find the specific code block which has been -mistranslated. But the latter can be nearly impossible, since most -modules generate at least hundreds and often thousands of them. -<p> -My solution: cheat. -<p> -Because the via-C and native routes diverge only late in the day, -it is not difficult to construct a 1-1 correspondence between basic -blocks on the two routes. So, if the program works via C but not on -the NCG, do the following: -<ul> -<li>Recompile <code>AsmCodeGen.lhs</code> in the afflicted compiler - with <code>-DDEBUG_NCG</code>, so that it inserts - <code>___ncg_debug_marker</code>s - into the assembly it emits. -<p> -<li>Using a binary search on modules, find the module which is causing - the problem. -<p> -<li>Compile that module to assembly code, with identical flags, twice, - once via C and once via NCG. - Call the outputs <code>ModuleName.s-gcc</code> and - <code>ModuleName.s-nat</code>. Check that the latter does indeed have - <code>___ncg_debug_marker</code>s in it; otherwise the next steps fail. -<p> -<li>Build (with a working compiler) the program - <code>fptools/ghc/utils/debugNCG/diff_gcc_nat</code>. -<p> -<li>Run: <code>diff_gcc_nat ModuleName.s</code>. This will - construct the 1-1 correspondence, and emits on stdout - a cppable assembly output. Place this in a file -- I always - call it <code>synth.S</code>. Note, the capital S is important; - otherwise it won't get cpp'd. You can feed this file directly to - ghc and it will automatically get cpp'd; you don't have to do so - yourself. -<p> -<li>By messing with the <code>#define</code>s at the top of - <code>synth.S</code>, do a binary search to find the incorrect - block. Keep a careful record of where you are in the search; it - is easy to get confused. Remember also that multiple blocks may - be wrong, which also confuses matters. Finally, I usually start - off by re-checking that I can build the executable with all the - <code>#define</code>s set to 0 and then all to 1. This ensures - you won't get halfway through the search and then get stuck due to - some snafu with gcc-specific literals. Usually I set - <code>UNMATCHED_GCC</code> to 1 all the time, and this bit should - contain only literal data. - <code>UNMATCHED_NAT</code> should be empty. -</ul> -<p> -<code>diff_gcc_nat</code> was known to work correctly last time I used -it, in December 01, for both x86 and sparc. If it doesn't work, due -to changes in assembly syntax, or whatever, make it work. The -investment is well worth it. Searching for the incorrect block(s) any -other way is a total time waster. - - - -</ul> - - - - - <p><small> -<!-- hhmts start --> -Last modified: Mon Aug 19 11:41:43 CEST 2013 -<!-- hhmts end --> - </small> - </body> -</html> |
