ghc/docs/comm/the-beast/ghci.html


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
  <head>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
    <title>The GHC Commentary - GHCi</title>
  </head>

  <body BGCOLOR="FFFFFF">
    <h1>The GHC Commentary - GHCi</h1>

    This isn't a coherent description of how GHCi works, sorry.  What
    it is (currently) is a dumping ground for various bits of info
    pertaining to GHCi, which ought to be recorded somewhere.

    <h2>Debugging the interpreter</h2>

    The usual symptom is that some expression / program crashes when
    running on the interpreter (commonly), or gets wierd results
    (rarely).  Unfortunately, finding out what the problem really is
    has proven to be extremely difficult.  In retrospect it may be
    argued a design flaw that GHC's implementation of the STG
    execution mechanism provides only the weakest of support for
    automated internal consistency checks.  This makes it hard to
    debug.
    <p>
    Execution failures in the interactive system can be due to
    problems with the bytecode interpreter, problems with the bytecode
    generator, or problems elsewhere.  From the bugs seen so far, 
    the bytecode generator is often the culprit, with the interpreter
    usually being correct.
    <p>
    Here are some tips for tracking down interactive nonsense:
    <ul>
      <li>Find the smallest source fragment which causes the problem.
      <p>
      <li>Using an RTS compiled with <code>-DDEBUG</code> (nb, that
          means the RTS from the previous stage!), run with <code>+RTS
          -D2</code> to get a listing in great detail from the
          interpreter.  Note that the listing is so voluminous that
          this is impractical unless you have been diligent in
          the previous step.
      <p>
      <li>At least in principle, using the trace and a bit of GDB
          poking around at the time of death, you can figure out what
          the problem is.  In practice you quickly get depressed at
          the hopelessness of ever making sense of the mass of
          details.  Well, I do, anyway.
      <p>
      <li><code>+RTS -D2</code> tries hard to print useful
          descriptions of what's on the stack, and often succeeds.
          However, it has no way to map addresses to names in
          code/data loaded by our runtime linker.  So the C function
          <code>ghci_enquire</code> is provided.  Given an address, it
          searches the loaded symbol tables for symbols close to that
          address.  You can run it from inside GDB:
      <pre>
      (gdb) p ghci_enquire ( 0x50a406f0 )
      0x50a406f0 + -48  ==  `PrelBase_Czh_con_info'
      0x50a406f0 + -12  ==  `PrelBase_Izh_static_info'
      0x50a406f0 + -48  ==  `PrelBase_Czh_con_entry'
      0x50a406f0 + -24  ==  `PrelBase_Izh_con_info'
      0x50a406f0 +  16  ==  `PrelBase_ZC_con_entry'
      0x50a406f0 +   0  ==  `PrelBase_ZMZN_static_entry'
      0x50a406f0 + -36  ==  `PrelBase_Czh_static_entry'
      0x50a406f0 + -24  ==  `PrelBase_Izh_con_entry'
      0x50a406f0 +  64  ==  `PrelBase_EQ_static_info'
      0x50a406f0 +   0  ==  `PrelBase_ZMZN_static_info'
      0x50a406f0 +  48  ==  `PrelBase_LT_static_entry'
      $1 = void
      </pre>
         In this case the enquired-about address is
         <code>PrelBase_ZMZN_static_entry</code>.  If no symbols are
         close to the given addr, nothing is printed.  Not a great
         mechanism, but better than nothing.
      <p>
      <li>We have had various problems in the past due to the bytecode
          generator (<code>compiler/ghci/ByteCodeGen.lhs</code>) being
          confused about the true set of free variables of an
          expression.  The compilation scheme for <code>let</code>s
          applies the BCO for the RHS of the let to its free
          variables, so if the free-var annotation is wrong or
          misleading, you end up with code which has wrong stack
          offsets, which is usually fatal.
      <p>
      <li>The baseline behaviour of the interpreter is to interpret
          BCOs, and hand all other closures back to the scheduler for
          evaluation.  However, this causes a huge number of expensive
          context switches, so the interpreter knows how to enter the
          most common non-BCO closure types by itself.
          <p>
          These optimisations complicate the interpreter.
          If you think you have an interpreter problem, re-enable the
          define <code>REFERENCE_INTERPRETER</code> in
          <code>ghc/rts/Interpreter.c</code>.  All optimisations are
          thereby disabled, giving the baseline
          I-only-know-how-to-enter-BCOs behaviour.
      <p>
      <li>Following the traces is often problematic because execution
          hops back and forth between the interpreter, which is
          traced, and compiled code, which you can't see.
          Particularly annoying is when the stack looks OK in the
          interpreter, then compiled code runs for a while, and later
          we arrive back in the interpreter, with the stack corrupted,
          and usually in a completely different place from where we
          left off.
          <p>
          If this is biting you baaaad, it may be worth copying
          sources for the compiled functions causing the problem, into
          your interpreted module, in the hope that you stay in the
          interpreter more of the time.  Of course this doesn't work
          very well if you've defined
          <code>REFERENCE_INTERPRETER</code> in
          <code>ghc/rts/Interpreter.c</code>.
      <p>
      <li>There are various commented-out pieces of code in
          <code>Interpreter.c</code> which can be used to get the
          stack sanity-checked after every entry, and even after after
          every bytecode instruction executed.  Note that some
          bytecodes (<code>PUSH_UBX</code>) leave the stack in
          an unwalkable state, so the <code>do_print_stack</code>
          local variable is used to suppress the stack walk after
          them.
    </ul>


    <h2>Useful stuff to know about the interpreter</h2>

    The code generation scheme is straightforward (naive, in fact).
    <code>-ddump-bcos</code> prints each BCO along with the Core it
    was generated from, which is very handy.
    <ul>
    <li>Simple lets are compiled in-line.  For the general case, let
        v = E in ..., E is compiled into a new BCO which takes as
        args its free variables, and v is bound to AP(the new BCO,
        free vars of E).
    <p>
    <li><code>case</code>s as usual, become: push the return
        continuation, enter the scrutinee.  There is some magic to
        make all combinations of compiled/interpreted calls and
        returns work, described below.  In the interpreted case, all
        case alts are compiled into a single big return BCO, which
        commences with instructions implementing a switch tree.
    </ul>
    <p>
    <b>ARGCHECK magic</b>
    <p>
    You may find ARGCHECK instructions at the start of BCOs which
    don't appear to need them; case continuations in particular.
    These play an important role: they force objects which should
    evaluated to BCOs to actually be BCOs.
    <p>
    Typically, there may be an application node somewhere in the heap.
    This is a thunk which when leant on turns into a BCO for a return
    continuation.  The thunk may get entered with an update frame on
    top of the stack.  This is legitimate since from one viewpoint
    this is an AP which simply reduces to a data object, so does not
    have functional type.  However, once the AP turns itself into a
    BCO (so to speak) we cannot simply enter the BCO, because that
    expects to see args on top of the stack, not an update frame.
    Therefore any BCO which expects something on the stack above an
    update frame, even non-function BCOs, start with an ARGCHECK.  In
    this case it fails, the update is done, the update frame is
    removed, and the BCO re-entered.  Subsequent entries of the BCO of
    course go unhindered.
    <p>
    The optimised (<code>#undef REFERENCE_INTERPRETER</code>) handles
    this case specially, so that a trip through the scheduler is
    avoided.  When reading traces from <code>+RTS -D2 -RTS</code>, you
    may see BCOs which appear to execute their initial ARGCHECK insn
    twice.  The first time it fails; the interpreter does the update
    immediately and re-enters with no further comment.
    <p>
    This is all a bit ugly, and, as SimonM correctly points out, it
    would have been cleaner to make BCOs unpointed (unthunkable)
    objects, so that a pointer to something <code>:: BCO#</code>
    really points directly at a BCO.
    <p>
    <b>Stack management</b>
    <p>
    There isn't any attempt to stub the stack, minimise its growth, or
    generally remove unused pointers ahead of time.  This is really
    due to lazyness on my part, although it does have the minor
    advantage that doing something cleverer would almost certainly
    increase the number of bytecodes that would have to be executed.
    Of course we SLIDE out redundant stuff, to get the stack back to
    the sequel depth, before returning a HNF, but that's all.  As
    usual this is probably a cause of major space leaks.
    <p>
    <b>Building constructors</b>
    <p>
    Constructors are built on the stack and then dumped into the heap
    with a single PACK instruction, which simply copies the top N
    words of the stack verbatim into the heap, adds an info table, and zaps N
    words from the stack.  The constructor args are pushed onto the
    stack one at a time.  One upshot of this is that unboxed values
    get pushed untaggedly onto the stack (via PUSH_UBX), because that's how they
    will be in the heap.  That in turn means that the stack is not 
    always walkable at arbitrary points in BCO execution, although
    naturally it is whenever GC might occur.
    <p>
    Function closures created by the interpreter use the AP-node
    (tagged) format, so although their fields are similarly
    constructed on the stack, there is never a stack walkability
    problem.
    <p>
    <b>Unpacking constructors</b>
    <p>
    At the start of a case continuation, the returned constructor is
    unpacked onto the stack, which means that unboxed fields have to
    be tagged.  Rather than burdening all such continuations with a
    complex, general mechanism, I split it into two.  The
    allegedly-common all-pointers case uses a single UNPACK insn
    to fish out all fields with no further ado.  The slow case uses a
    sequence of more complex UPK_TAG insns, one for each field (I
    think).  This seemed like a good compromise to me.
    <p>
    <b>Perspective</b>
    <p>
    I designed the bytecode mechanism with the experience of both STG
    hugs and Classic Hugs in mind.  The latter has an small
    set of bytecodes, a small interpreter loop, and runs amazingly
    fast considering the cruddy code it has to interpret.  The former
    had a large interpretative loop with many different opcodes,
    including multiple minor variants of the same thing, which
    made it difficult to optimise and maintain, yet it performed more
    or less comparably with Classic Hugs.
    <p>
    My design aims were therefore to minimise the interpreter's
    complexity whilst maximising performance.  This means reducing the
    number of opcodes implemented, whilst reducing the number of insns
    despatched.  In particular there are only two opcodes, PUSH_UBX
    and UPK_TAG, which deal with tags.  STG Hugs had dozens of opcodes
    for dealing with tagged data.  In cases where the common
    all-pointers case is significantly simpler (UNPACK) I deal with it
    specially.  Finally, the number of insns executed is reduced a
    little by merging multiple pushes, giving PUSH_LL and PUSH_LLL.
    These opcode pairings were determined by using the opcode-pair
    frequency profiling stuff which is ifdef-d out in
    <code>Interpreter.c</code>.  These significantly improve
    performance without having much effect on the uglyness or
    complexity of the interpreter.
    <p>
    Overall, the interpreter design is something which turned out
    well, and I was pleased with it.  Unfortunately I cannot say the
    same of the bytecode generator.

    <h2><code>case</code> returns between interpreted and compiled code</h2>

    Variants of the following scheme have been drifting around in GHC
    RTS documentation for several years.  Since what follows is
    actually what is implemented, I guess it supersedes all other
    documentation.  Beware; the following may make your brain melt.
    In all the pictures below, the stack grows downwards.
    <p>
    <b>Returning to interpreted code</b>.
    <p>
    Interpreted returns employ a set of polymorphic return infotables.
    Each element in the set corresponds to one of the possible return
    registers (R1, D1, F1) that compiled code will place the returned
    value in.  In fact this is a bit misleading, since R1 can be used
    to return either a pointer or an int, and we need to distinguish
    these cases.  So, supposing the set of return registers is {R1p,
    R1n, D1, F1}, there would be four corresponding infotables,
    <code>stg_ctoi_ret_R1p_info</code>, etc.  In the pictures below we
    call them <code>stg_ctoi_ret_REP_info</code>.  
    <p>
    These return itbls are polymorphic, meaning that all 8 vectored
    return codes and the direct return code are identical.
    <p>
    Before the scrutinee is entered, the stack is arranged like this:
    <pre>
   |        |
   +--------+
   |  BCO   | -------> the return contination BCO
   +--------+
   | itbl * | -------> stg_ctoi_ret_REP_info, with all 9 codes as follows:
   +--------+
                          BCO* bco = Sp[1];
                          push R1/F1/D1 depending on REP
                          push bco
                          yield to sched
    </pre>
    On entry, the interpreted contination BCO expects the stack to look
    like this:
    <pre>
   |        |
   +--------+
   |  BCO   | -------> the return contination BCO
   +--------+
   | itbl * | -------> ret_REP_ctoi_info, with all 9 codes as follows:
   +--------+
   : VALUE  :  (the returned value, shown with : since it may occupy
   +--------+   multiple stack words)
    </pre>
    A machine code return will park the returned value in R1/F1/D1,
    and enter the itbl on the top of the stack.  Since it's our magic
    itbl, this pushes the returned value onto the stack, which is
    where the interpreter expects to find it.  It then pushes the BCO
    (again) and yields.  The scheduler removes the BCO from the top,
    and enters it, so that the continuation is interpreted with the
    stack as shown above.
    <p>
    An interpreted return will create the value to return at the top
    of the stack.  It then examines the return itbl, which must be
    immediately underneath the return value, to see if it is one of
    the magic <code>stg_ctoi_ret_REP_info</code> set.  Since this is so,
    it knows it is returning to an interpreted contination.  It
    therefore simply enters the BCO which it assumes it immediately
    underneath the itbl on the stack.

    <p>
    <b>Returning to compiled code</b>.
    <p>
    Before the scrutinee is entered, the stack is arranged like this:
    <pre>
                        ptr to vec code 8 ------> return vector code 8
   |        |           ....
   +--------+           ptr to vec code 1 ------> return vector code 1
   | itbl * | --        Itbl end
   +--------+   \       ....   
                 \      Itbl start
                  ----> direct return code
    </pre>
    The scrutinee value is then entered.
    The case continuation(s) expect the stack to look the same, with
    the returned HNF in a suitable return register, R1, D1, F1 etc.
    <p>
    A machine code return knows whether it is doing a vectored or
    direct return, and, if the former, which vector element it is.
    So, for a direct return we jump to <code>Sp[0]</code>, and for a
    vectored return, jump to <code>((CodePtr*)(Sp[0]))[ - ITBL_LENGTH
    - vector number ]</code>.  This is (of course) the scheme that
    compiled code has been using all along.
    <p>
    An interpreted return will, as described just above, have examined
    the itbl immediately beneath the return value it has just pushed,
    and found it not to be one of the <code>ret_REP_ctoi_info</code> set,
    so it knows this must be a return to machine code.  It needs to
    pop the return value, currently on the stack, into R1/F1/D1, and
    jump through the info table.  Unfortunately the first part cannot
    be accomplished directly since we are not in Haskellised-C world.
    <p>
    We therefore employ a second family of magic infotables, indexed,
    like the first, on the return representation, and therefore with
    names of the form <code>stg_itoc_ret_REP_info</code>.  (Note:
    <code>itoc</code>; the previous bunch were <code>ctoi</code>).
    This is pushed onto the stack (note, tagged values have their tag
    zapped), giving:
    <pre>
   |        |
   +--------+
   | itbl * | -------> arbitrary machine code return itbl
   +--------+
   : VALUE  :  (the returned value, possibly multiple words)
   +--------+
   | itbl * | -------> stg_itoc_ret_REP_info, with code:
   +--------+
                          pop myself (stg_itoc_ret_REP_info) off the stack
                          pop return value into R1/D1/F1
                          do standard machine code return to itbl at t.o.s.
    </pre>
    We then return to the scheduler, asking it to enter the itbl at
    t.o.s.  When entered, <code>stg_itoc_ret_REP_info</code> removes
    itself from the stack, pops the return value into the relevant
    return register, and returns to the itbl to which we were trying
    to return in the first place.  
    <p>
    Amazingly enough, this stuff all actually works!  Well, mostly ...
    <p>
    <b>Unboxed tuples: a Right Royal Spanner In The Works</b>
    <p>
    The above scheme depends crucially on having magic infotables
    <code>stg_{itoc,ctoi}_ret_REP_info</code> for each return
    representation <code>REP</code>.  It unfortunately fails miserably
    in the face of unboxed tuple returns, because the set of required
    tables would be infinite; this despite the fact that for any given
    unboxed tuple return type, the scheme could be made to work fine.
    <p>
    This is a serious problem, because it prevents interpreted
    code from doing <code>IO</code>-typed returns, since <code>IO
    t</code> is implemented as <code>(# t, RealWorld# #)</code> or
    thereabouts.  This restriction in turn rules out FFI stuff in the
    interpreter.  Not good.
    <p>
    Although we have no way to make general unboxed tuples work, we
    can at least make <code>IO</code>-types work using the following
    ultra-kludgey observation: <code>RealWorld#</code> doesn't really
    exist and so has zero size, in compiled code.  In turn this means
    that a type of the form <code>(# t, RealWorld# #)</code> has the
    same representation as plain <code>t</code> does.  So the bytecode
    generator, whilst rejecting code with general unboxed tuple
    returns, recognises and accepts this special case.  Which means
    that <code>IO</code>-typed stuff works in the interpreter.  Just.
    <p>
    If anyone asks, I will claim I was out of radio contact, on a
    6-month walking holiday to the south pole, at the time this was
    ... er ... dreamt up.


<p><small>
   
<!-- hhmts start -->
Last modified: Thursday February  7 15:33:49 GMT 2002
<!-- hhmts end -->
    </small>
  </body>
</html>