summaryrefslogtreecommitdiff
path: root/docs/storage-mgt/ldv.tex
blob: 9084ae1ea466b56fa414dfdf6271fcab88354c96 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
\documentclass{article}
\usepackage{code,a4wide}

\usepackage{graphics,epsfig,epic,eepic,epsfig}

\setlength{\parskip}{0.25cm}
\setlength{\parsep}{0.25cm}
\setlength{\topsep}{0cm}
\setlength{\parindent}{0cm}
\renewcommand{\textfraction}{0.2}
\renewcommand{\floatpagefraction}{0.7}


% Terminology
\newcommand{\block}{block}
\newcommand{\Block}{Block}
\newcommand{\segment}{segment}
\newcommand{\Segment}{Segment}
\newcommand{\step}{step}
\newcommand{\Step}{Step}

\newcommand{\note}[1]{{\em $\spadesuit$ #1}}

\begin{document}
\title{Implementation of Lag/Drag/Void/Use Profiling}
\author{Sungwoo Park \\ Simon Marlow}

\makeatactive
\maketitle

\section{Lag/Drag/Void/Use Profiling}

\emph{Lag/Drag/Void/Use} (LDVU) profiling~\cite{RR} is a profiling technique 
which yields a summary of the biography of all the dynamic closures created
during program execution.
In this profiling scheme,
the biography of a closure is determined by four important events associated
with the closure: \emph{creation}, \emph{first use}, 
\emph{last use}, and \emph{destruction} (see Figure~\ref{fig-ldv}).
The intervals between these successive events correspond to three phases
for the closure: \emph{lag} (between creation and first use), 
\emph{use} (between first use and last use), and 
\emph{drag} (between last use and destruction).
If the closure is never used, it is considered to remain in the \emph{void}
phase all its lifetime.

\begin{figure}[ht]
\begin{center}
\input{ldv.eepic}
\caption{The biography of a closure}
\label{fig-ldv}
\end{center}
\end{figure}

The LDVU profiler regularly performs heap censuses during program execution.
Each time a heap census is performed, the LDVU profiler increments a global
time, which is used for timing all the events (such as creation and destruction
of a closure) occurring during program execution.
Hence, for instance, all closures creating between two successive heap censuses
have the same creation time and belong to the same \emph{generation}.\footnote{In
this document, a generation is related with heap censuses, not garbage collections
as in other profiling schemes.}
After the program terminates, it yields a post-mortem report on how much 
of the \emph{live} graph is in one of the four phases at the moment of each 
heap census.

It must be emphasized that the LDVU profiler considers only live closures;
it should not take into consideration dead closures which do not constitute
the graph. Therefore, the result of LDVU profiling does not depend on the
frequency of garbage collections.

This document describes the implementation of LDVU profiling on the Glasgow
Haskell Compiler runtime system.\footnote{Unless otherwise noted, all identifiers
are defined in @LdvProfile.c@}.

\section{An Overview of the Implementation}

Every closure is augmented with an additional word in its profiling header
to accommodate three additional pieces of information: 
1) state flag indicating whether the closure has been used at least once or not.
2) creation time; 3) time of most recent use if any so far.
We refer to such a word as an LDV word.

The LDVU profiler maintains a global time, stored in @ldvTime@.
It is incremented each time a heap census is performed.
During a heap census, the profiler scans all live closures and computes the 
following: 
1) the total size of all closures which have never been used; 
2) the total size of all closures which have been used at least once 
in the past.\footnote{There is another category of closures, namely, 
\emph{inherently used} closures. We will explain 
in Section~\ref{sec-heap-censuses}.}
It is not until the whole program execution finishes that the profiler 
can actually decide the total size corresponding to each of the four phases for
a particular heap census. It is only when a closure is destroyed that the profiler
can determine how long the closure has been in a specific phase. 
Therefore, it is not sufficient to perform heap censuses periodically in order to
compute the profiling statistics: the runtime system needs to intercept
all events associated with any closures and update necessary information.

All events associated with closures are handled by one of the three 
macros defined
in @rts/include/StgLdv.h@: @LDV_recordCreate()@, @LDV_recordUse()@, and 
@LDV_recordDead()@.

\begin{itemize}
\item{@LDV_recordCreate()@} is called when a closure is created and updates its 
creation time field.

\item{@LDV_recordUse()@} is called when a closure is used and updates its most recent
use time field.

\item{@LDV_recordDead()@} is called when a closure @c@ is removed from the graph.
It does not update its LDV word (because @c@ is about to be destroyed).
Instead, it updates the statistics on LDVU profiling according to the following
observation:
if @c@ has never been used (which is indicated by the state flag in its LDV 
word), 
@c@ contributes to the void phase from its creation time to the last census
time; if @c@ was used at least once (which is also indicated by the state flag),
@c@ contributes to the @drag@ phase after its last use time. 
\end{itemize}

At the end of the program execution, the profiler performs a last census during
which all closures in the heap are declared to be dead and @LDV_recordDead()@
is invoked on each of them. 
Then, the profiler computes the final statistics. 

\section{LDV Words}

We choose to share the LDV word for both retainer profiling and LDVU 
profiling, which cannot take place simultaneously. 
This is the reason why there is a
union structure inside the @StgProHeader@ structure.
The field @hp.ldvw@ in the @StgProfHeader@ structure corresponds to the LDV 
word:
\begin{code}
typedef struct {
  ...
  union {
    retainerSet *rs;          // Retainer Set
    StgWord ldvw;             // Lag/Drag/Void Word
  } hp;
} StgProfHeader;
\end{code}
For instance, the LDV word of a closure @c@ can now be accessed with 
@c->header.prof.hp.ldvw@ (or by @LDVW(c)@ where @LDVW()@ is a macro in 
@rts/include/StgLdvProf.h@).

An LDV word is divided into three fields, whose position is specified
by three constants in @rts/include/StgLdvProf.h@:
\begin{itemize}
\item{@LDV_STATE_MASK@} corresponds to the state flag. 
\item{@LDV_CREATE_MASK@} corresponds to the creation time.
\item{@LDV_LAST_MASK@} corresponds to the most recent use time.
\end{itemize}
The constant @LDV_SHIFT@ specifies how many bits are allocated for 
creation time or most recent use time.
For instance, the creation time of a closure @c@ can be obtained by
@(LDVW(c) & LDV_CREATE_MASK) >> LDV_SHIFT@.

The creation time field and the most recent use time field can be set only by the 
macros @LDV_recordCreate()@ and @LDV_recordUse()@. 
@LDV_recordCreate()@ must be called whenever a new dynamic closure is created,
and this is handily accomplished by rewriting the macro @SET_PROF_HDR()@ 
(in @rts/include/ClosureMacros.h@) (we do not need to change @SET_STATIC_PROF_HDR()@
because static closures are not involved in LDVU profiling at all):

\begin{code}
#define SET_PROF_HDR(c,ccs_)            \
        ((c)->header.prof.ccs = ccs_,   \
        LDV_recordCreate((c)))
\end{code}

There are a few cases in which the info table of a closure changes through
an explicit invocation of @SET_INFO()@ or a direct assignment to its @header.info@
field: 1) an indirection closure is replaced by an old-generation 
indirection closure; 2) a thunk is replaced by a blackhole; 3) a thunk is replaced 
by an indirection closure when its evaluation result becomes available.

\emph{We regard such a situation as
the destruction of an old closure followed by the creation of a new closure
at the same memory address.}\footnote{This would be unnecessary if the two closures
are of the same size, but it is not always the case. We choose to distinguish
the two closures for the sake of consistency.}
For instance, when an @IND_PERM@ closure is replaced by an @IND_OLDGEN_PERM@
closures (during scavenging in @GC.c@), we wrap the invocation of @SET_INFO()@ with 
the invocations of @LDV_recordDead()@ and @LDV_recordCreate()@ as follows 
(@LDV_recordDead()@ requires the actual size of the closures being destroyed):

\begin{code}
  LDV_recordDead((StgClosure *)p, sizeofW(StgInd) - sizeofW(StgProfHeader));
  SET_INFO(((StgClosure *)p), &stg_IND_OLDGEN_PERM_info);
  LDV_recordCreate((StgClosure *)p);
\end{code}

\textbf{To do:}
A direct assignment to the @header.info@ field implies that its cost centre 
field is not initialized. This is no problem in the case of @EVACUATED@ closures 
because they will 
not be used again after a garbage collection. However, I am not sure if this is safe
for @BLACKHOLE_BQ@ closures (in @StgMiscClosures.hc@) when retainer profiling, 
which employs cost centre stacks, is going on. 
If it is safe, please leave a comment there.

@LDV_recordUse()@ is called on a closure whenever it is used, or \emph{entered}.
Its state flag changes if necessary to indicate that it has been used, and
the current global time is stored in its last use time field. 

\section{Global Time \texttt{ldvTime} and Retainer Profiling}

The global time, stored in @ldvTime@, records the current time period.
It is initialized to $1$ and incremented after each time a heap census
is completed through an invocation of @LdvCensus()@.  Note that each
value of @ldvTime@ represents a time \emph{period}, not a point in
time.

All closures created between two successive invocations of
@LdvCensus()@ have the same creation time.  If a closure is used at
least once between two successive heap censuses, we consider the
closure to be in the use phase during the corresponding time period
(because we just set its last use time field to the current value of
@ldvTime@ whenever it is used).  Notice that a closure with a creation
time $t_c$ may be destroyed before the actual heap census for time
$t_c$ and thus may \emph{not} be observed during the heap census for
time $t_c$.  Such a closure does not show up in the profile at all.

In addition, the value of @ldvTime@ indicates which of LDVU profiling
and retainer profiling is currently active: during LDVU profiling, it
is initialized to $1$ in @initLdvProfiling()@ and then increments as
LDVU profiling proceeds; during retainer profiling, however, it is
always fixed to $0$.  Thus, wherever a piece of code shared by both
retainer profiling and LDVU profiling comes to play, we usually need
to first examine the value of @ldvTime@ if necessary. For instance,
consider the macro @LDV_recordUse()@:

\begin{code}
#define LDV_recordUse(c)                              \
  if (ldvTime > 0)                                    \
    LDVW((c)) = (LDVW((c)) & LDV_CREATE_MASK) | ldvTime | LDV_STATE_USE; 
\end{code}

If retainer profiling is being performed, @ldvTime@ is equal to $0$,
and @LDV_recordUse()@ causes no side effect.\footnote{Due to this
interference with LDVU profiling, retainer profiling slows down a bit;
for instance, checking @ldvTime@ against $0$ in the above example
would always evaluate to @false@ during retainer profiling.
However, this is the price to be paid for our decision not to employ a
separate field for LDVU profiling.}

As another example, consider @LDV_recordCreate()@:

\begin{code}
#define LDV_recordCreate(c)   \
  LDVW((c)) = (ldvTime << LDV_SHIFT) | LDV_STATE_CREATE
\end{code}

The above definition of @LDV_recordCreate()@ works without any problem
even for retainer profiling: during retainer profiling, 
a retainer set field (@hp.ldvw@) must be initialized to a null pointer.
Since @ldvTime@ is fixed to $0$, @LDV_recordCreate()@ initializes 
retainer set fields correctly.

\section{Heap Censuses}
\label{sec-heap-censuses}

The LDVU profiler performs heap censuses periodically by invoking the
function @LdvCensus()@.  Because we need to know exactly which
closures in the heap are live at census time, we always precede the
census with a major garbage collection.

During a census, we examine each closure one by one and compute the
following three quantities:

\begin{enumerate}
\item the total size of all \emph{inherently used} closures.
\item the total size of all closures which have not been used (yet).
\item the total size of all closures which have been used at least once. 
\end{enumerate}

For most closures, a \emph{use} consists of entering the closure.  For
unlifted objects which are never entered (eg. @ARR_WORDS@), it would
be difficult to determine their points of use because such points are
scattered around the implementation in various primitive operations.
For this reason we consider all unlifted objects as ``inherently
used''.  The following types of closures are considered to be
inherently used: @TSO@, @MVAR@, @MUT_ARR_PTRS@, @MUT_ARR_PTRS_FROZEN@,
@ARR_WORDS@, @WEAK@, @MUT_VAR@, @MUT_CONS@, @FOREIGN@, @BCO@, and
@STABLE_NAME@.

The three quantities are stored in an @LdvGenInfo@ array @gi[]@.
@gi[]@ is indexed by time period.  For instance, @gi[ldvTime]@ stores
the three quantaties for the current global time period.  The
structure @LdvGenInfo@ is defined as follows:

\begin{code}
typedef struct {
  ...
  int inherentlyUsed;   // total size of 'inherently used' closures
  int notUsed;          // total size of 'not used yet' closures
  int used;             // total size of 'used at least once' closures
  ...
} LdvGenInfo;
\end{code} 

The above three quantities account for mutually exclusive sets of closures.
In other words, if a closure is not inherently used, it belongs to 
either the second or the third.

\subsection{Taking a Census of the Live Heap}

During a heap census, we need to visit every live closure once, so we
perform a linear scan of the live heap after a major GC.  We can take
advantage of the following facts to implement a linear scan for heap
censuses:

\begin{itemize}
\item The nursery is empty. The small object pool and the large object pool, 
      however, may \emph{not} be empty. This is because the garbage collector
      invokes @scheduleFinalizer()@ after removing dead closures, and
      @scheduleFinalizer()@ may create new closures through @allocate()@.
\item @IND@, @IND_OLDGEN@, and @EVACUATED@ closures do not appear in
the live heap.
\end{itemize}

There is one small complication when traversing the live heap: the
garbage collector may have replaced @WEAK@ objects with @DEAD_WEAK@
objects, which have a smaller size and hence leave some space before
the next object.  To avoid this problem we change the size of
@DEAD_WEAK@ objects to match that of @WEAK@ objects when profiling is
enabled (see @StgMiscClosures.hc@).

\section{Destruction of Closures}

In order to compute the total size of closures for each of the four
phases, we must report the destruction of every closure (except
inherently used closures) to the LDVU profiler by invoking
@LDV_recordDead()@.  @LDV_recordDead()@ must not be called on any
inherently used closure because any invocation of @LDV_recordDead()@
affects the statistics regarding void and drag phases, which no
inherently used closure can be in.

@LDV_recordDead()@ updates two fields @voidNew@ and @dragNew@ in the
@LdvGenInfo@ array @gi[]@:

\begin{code}
typedef struct {
  ...
  int voidNew; 
  int dragnew;
  ...
} LdvGenInfo;
\end{code} 

@gi[ldvTime].voidNew@ accumulates the size of all closures satisfying
the following two conditions: 1) observed during the heap census at
time @ldvTime@; 2) now known to have been in the void phase at time
@ldvTime@.  It is updated when a closure which has never been used is
destroyed.  Suppose that a closure @c@ which has never been used is
about to be destroyed.  If its creation time is $t_c$, we judge that
@c@ has been in the void phase all its lifetime, namely, from time
$t_c$ to @ldvTime@.  Since @c@ will not be observed during the next
heap census, which corresponds to time @ldvTime@, @c@ contributes to
the void phase of times $t_c$ through @ldvTime@ - 1.  Therefore, we
increase the @voidNew@ field of @gi[@$t_c$@]@ through @gi[ldvTime - 1]@
 by the size of @c@.\footnote{In the actual implementation, we
update @gi[$t_c$]@ and @gi[ldvTime]@ (not @gi[ldvTime@$ - $1@]@) only:
@gi[$t_c$]@ and @gi[ldvTime]@ are increased and decreased by the size
of @c@, respectively.  After finishing the program execution, we can
correctly adjust all the fields as follows: @gi[$t_c$]@ is computed as
$\sum_{i=0}^{t_c}$@gi[$i$]@.  }

@gi[ldvTime].dragNew@ accumulates the size of all closures satisfying the following
two conditions: 1) observed during the heap census at time @ldvTime@;
2) now known to have been in the drag phase at time @ldvTime@.
It is updated when a closure which has been used at least once is destroyed.
Suppose that a closure @c@ which has been used last at time $t_l$ is about to
be destroyed.
We judge that @c@ has been in the drag phase from time $t_l + 1$ to 
time @ldvTime@$ - 1$ (if $t_l + 1 > $@ldvTime@$ - 1$, nothing happens).
Therefore, we increase the @dragNew@ field of @gi[@$t_l + 1$@]@ through 
@gi[ldvTime@$ - 1$@]@
by the size of @c@.\footnote{As in the case of @voidNew@, we update
@gi[@$t_l + 1$@]@ and @gi[ldvTime]@ only.}

Now we need to find out all the cases of closure destruction.
There are four cases in which a closure is destroyed:

\begin{enumerate}
\item A closure is overwritten with a blackhole: 
  @UPD_BH_UPDATABLE()@ in @rts/include/StgMacros.h@,
  @threadLazyBlackHole()@ and @threadSqueezeStack()@ in @GC.c@,
  the entry code for @BLACKHOLE@ closures in @StgMiscClosures.hc@ (a 
  @BLACKHOLE@ closure is changed into a @BLACKHOLE_BQ@ closure).
  We call either @LDV_recordDead()@ or @LDV_recordDead_FILL_SLOP_DYNAMIC()@.

\item A weak pointer is overwritten with a dead weak pointer:
  @finalizzeWeakzh_fast()@ in @PrimOps.hc@, 
  @finalizeWeakPointersNow()@ and @scheduleFinalizers()@ in @Weak.c@.
  Since a weak pointer is inherently used, we do not call @LDV_recordDead()@.

\item A closure is overwritten with an indirection closure:
  @updateWithIndirection()@ and @updateWithPermIndirection()@ in @Storage.h@,
  @scavenge()@ in @GC.c@, in which an @IND_PERM@ closure is explicitly replaced
  with an @IND_OLDGEN_PERM@ closure during scavenging.
  We call either @LDV_recordDead()@ or @LDV_recordDead_FILL_SLOP_DYNAMIC()@.
  
\item Closures are removed permanently from the graph during garbage
collections.  We locate and dispose of all dead closures by linearly
scanning the from-space right before tidying up.  This is feasible
because any closures which is about to be removed from the graph still
remains in the from-space until tidying up is completed.  The next
subsection explains how to implement this idea.
\end{enumerate}

\subsection{Linear scan of the from-space during garbage collections}

In order to implement linear scan of the from-space during a garbage collection 
(before tidying up),
we need to take into consideration the following facts:

\begin{itemize}
\item The pointer @free@ of a block in the nursery may incorrectly point to
a byte past its actual boundary.
This happens because 
the Haskell mutator first increases @hpLim@ without comparing it with the
actual boundary when allocating fresh memory for a new closure.
@hpLim@ is later assigned to the pointer @free@ of the corresponding memory
block, which means that during a heap census, the pointer @hpLim@ may not
be trusted. 
Notice that @hpLim@ is not available during LDVU profiling; it is valid
only during the Haskell mutator time.

\item The from-space may well contain a good number of @EVACUATED@ closures,
and they must be skipped over.

\item The from-space includes the nursery. 
Furthermore, a closure in the nursery may not necessarily be adjacent to the next 
closure because slop words may lie between the two closures;
the Haskell mutator may allocate more space than actually needed in the
nursery when creating a closure, potentially leaving slop words. 
\end{itemize}

The first problem is easily solved by limiting the scan up to the
actual block boundary for each nursery block (see
@processNurseryForDead()@).  In other words, for a nursery block
descriptor @bd@, whichever of @bd->start@$ + $@BLOCK_SIZE_W@ and
@bd->free@ is smaller is used as the actual boundary.

We solve the second problem by exploiting LDV words of @EVACUATED@
closures: we store the size of an evacuated closure, which now resides
in the to-space, in the LDV word of the new @EVACUATED@ closure
occupying its memory.  This is easily implemented by inserting a call
to the macro @SET_EVACUAEE_FOR_LDV()@ in @copy()@ and @copyPart()@ (in
@GC.c@).  Thus, when we encounter an @EVACUATED@ closure while
linearly scanning the nursery, we can skip a correct number of words
by referring to its LDV word.

The third problem could be partially solved by always monitoring @Hp@
during the Haskell mutator time: whenever @Hp@ is increased, we fill
with zeroes as many words as the change of @HP@. Then, we could skip
any trailing zero words when linearly scanning the nursery.
Alternatively we could initialize the entire nursery with zeroes after
each garbage collection and not worry about any change made to @Hp@
during the Haskell mutator time.  The number of zero words to be
written to the nursery could be reduced in the first approach, for we
do not have to fill the header for a new closure.  Nevertheless we
choose to employ the second approach because it simplifies the
implementation code significantly (see @resetNurseries()@ in
@Storage.c@).  Moreover, the second approach compensates for its
redundant initialization cost by providing faster execution (due to a
single memory write loop in contrast to frequent memory write loops in
the first approach).  Also, we attribute the initialization cost to
the runtime system and thus the Haskell mutator behavior is little
affected.

There is further complication though: occasionally a closure is
overwritten with a closure of a smaller size, leaving some slop
between itself and the next closure in the heap.  There are two cases:

\begin{enumerate}
\item A closure is overwritten with a blackhole. 
\item A closure is overwritten with an indirection closure.
\end{enumerate}

In either case, an existing closure is destroyed after being replaced
with a new closure.  If the two closures are of the same size, no slop
words are introduced and we only need to invoke @LDV_recordDead()@ on
the existing closure, which cannot be an inherently used closure.  If
not, that is, the new closure is smaller than the existing closure
(the opposite cannot happen), we need to fill one or more slop words
with zeroes as well as invoke @LDV_recordDead()@ on the existing
closure.  The macro @LDV_recordDead_FILL_SLOP_DYNAMIC()@ accomplishes
these two tasks: it determines the size of the existing closure,
invokes @LDV_recordDead()@, and fills the slop words with zeroes.
After excluding all cases in which the two closures are of the same
size, we invoke @LDV_recordDead_FILL_SLOP_DYNAMIC()@ only from:

\begin{enumerate}
\item @threadLazyBlackHole()@ and @threadSqueezeStack()@ in @GC.c@
(for lazy blackholing),
\item @UPD_BH_UPDATABLE()@ in
@rts/include/StgMacros.h@ (for eager blackholing, which isn't the
default),
\item @updateWithIndirection()@ and @updateWithPermIndirection()@ 
in @Storage.h@.\footnote{Actually slop words created in 
@updateWithIndirection()@ cannot survive major garbage collections.
Still we invoke @LDV\_recordDead\_FILL\_SLOP\_DYNAMIC()@ to support linear
scan of the heap during a garbage collection, which is discussed in the next
section.}
\end{enumerate}

The linear scan of the from-space is initiated by the garbage
collector.  From the function @LdvCensusForDead()@, every dead closure
in the from-space is visited through an invocation of
@processHeapClosureForDead()@.

\subsection{Final scan of the heap}

Since a closure surviving the final garbage collection is implicitly destroyed
when the runtime system shuts down, we must invoke @processHeapClosureForDead@
on \emph{every} closure in the heap once more after the final garbage collection.
The function @LdvCensusKillAll()@, which is invoked from @shutdownHaskell()@
in @RtsStartup.c@, traverses the entire heap and visits each closure.
It also stops LDVU profiling by resetting @ldvTime@ to $0$. 

It may be that after LDVU profiling stops, new closures may be created
and even garbage collections may be performed.
We choose to ignore these closures because they are all concerned about
finalizing weak pointers (in @finalizeWeakPointersNow()@).
It can be catastrophic to invoke @LdvCensusKillAll()@ after finishing
@finalizeWeakPointersNow()@: @finalizeWeakPointersNow()@ calls
@rts_evalIO()@, which is essentially initiating a new program execution,
and no assumptions made upon LDVU profiling hold any longer. 

\section{Time of Use}

In order to yield correct LDVU profiling results, we must make sure
that @LDV_recordUse()@ be called on a closure whenever it is used;
otherwise, most of closures would be reported to be in the void phase.
@rts/include/StgLdvProf.h@ provides an entry macro @LDV_ENTER@ which
expands to @LDV_recordUse()@.  The compiler arranges to invoke
@LDV_ENTER@ in the entry code for every dynamic closure it generates
code for (constructors, thunks and functions).  We also have to add
@LDV_ENTER@ calls to the closures statically compiled into the RTS:
@PAP@s, @AP_UPD@s, standard thunk forms (in @StgStdThunks.hc@, and
several others in @StgMiscClosures.hc@.

\section{Computing Final Statistics}

After the final scan of the heap, we can accurately determine the total
size of closures in one of the four phases at the moment of each heap census.
The structure @LdvGenInfo@ is augmented with two additional fields 
@voidTotal@ and @dragTotal@:

\begin{code}
typedef struct {
  ...
  int voidTotal;
  int dragTotal;
  ...
} LdvGenInfo;
\end{code} 

@gi[@$i$@].voidTotal@ and @gi[@$i$@].dragTotal@ are computed
from @gi[@$i$@].voidNew@ and @gi[@$i$@].dragNew@, respectively.\footnote{Due
to a slight optimization described before, @gi[@$i$@].voidTotal@ is actually
computed as $\sum_{1 \leq j \leq i}$@gi[@$j$@].voidNew@. 
@gi[@$i$@].dragTotal@ is computed in a similar way.}
Then, the total size of closures in the lag phase @gi[@$i$@].lagTotal@ is computed
as @gi[@$i$@].notUsed@$-$@gi[@$i$@].voidTotal@ (because any unused closure
is either in the void phase or in the lag phase).
Similarly, 
the total size of closures in the use phase @gi[@$i$@].useTotal@ is computed
as @gi[@$i$@].used@$-$@gi[@$i$@].dragTotal@ (because any used closure
is either in the use phase or in the drag phase).
@endLdvProfiling()@, called from @endHeapProfiling@ in @ProfHeap.c@, computes these 
final statistics.

\section{Usage}

The runtime system option @-hL@ tells the executable program to
perform LDVU profiling and produce a @.hp@ file:

\begin{code}
$ Foo.out +RTS -hL
\end{code}

The option @-i@ can be used to 
specify a desired interval at which LDVU profiling is performed.
The default and minimum value is half a second:

\begin{code}
$ Foo.out +RTS -hL -i2.5 -RTS
\end{code}

The @.hp@ file can be supplied to the @hp2ps@ program to create a postscript
file showing the progress of LDVU profiling in a graph:

\begin{code}
$ hp2ps Foo.hs
$ gv Foo.ps
\end{code}

The horizontal axis of the graph is in the Haskell mutator time, which excludes
the runtime system time such as garbage collection time and LDVU profiling
time. 
The Haskell mutator runs a bit slower than it would without performing
LDVU profiling, but the difference is minute.
Also, the timer employed to periodically perform retainer profiling
is not perfectly accurate. Therefore, the result may slightly vary for each
execution of retainer profiling.

\textbf{To do:} Currently the LDVU profiling is not supported with @-G1@ option.

\textbf{To do:} When we perform LDVU profiling, the Haskell mutator time seems to
be affected by @-S@ or @-s@ runtime option. For instance, the following 
two options should result in nearly same profiling outputs, but
the second run (without @-Sstderr@ option) spends almost twice as
long in the Haskell mutator as the first run:
1) @+RTS -Sstderr -hL -RTS@; 2) @+RTS -hL -RTS@.
This is quite a subtle bug because this weird phenomenon is not 
observed in retainer profiling, yet the implementation of 
@mut_user_time_during_LDV()@ is completely analogous to that of 
@mut_user_time_during_RP()@. The overall shapes of the resultant graphs 
are almost the same, though.

\section{Files}

This section gives a summary of changes made to the GHC in 
implementing LDVU profiling.
Only three files (@rts/include/StgLdvProf.h@, @LdvProfile.c@, and 
@LdvProfile.h@) are new, and all others exist in the GHC.

@\rts\include@ directory:

\begin{description}
\item[StgLdvProf.h] defines type @LdvGenInfo@, constants, and macros related
with LDVU profiling.
\item[ClosureMacros.h] changes macro @SET_PROF_HDR()@.
\item[Stg.h] includes th header file @StgLdvProf.h@.
\item[StgMacros.h] changes macros @UPD_BH_UPDATABLE()@.
\end{description}

@\rts@ directory:

\begin{description}
\item[GC.c] invokes @LdvCensusForDead()@ before tidying up, sets @hasBeenAnyGC@ to 
  @true@, and changes @copy()@ and @copyPart()@.
  Invokes @LDV_recordDead()@ and @LDV_recordDead_FILL_SLOP_DYNAMIC()@.
\item[Itimer.c] changes @handle_tick()@.
\item[LdvProfile.c] implements the LDVU profiling engine. 
\item[LdvProfile.h] is the header for @LdvProfile.c@.
\item[PrimOps.hc] changes @finalizzeWeakzh_fast()@.
\item[ProfHeap.c] changes @initHeapProfiling()@ and @endHeapProfiling()@.
\item[Profiling.c] changes @initProfilingLogFile@ and @report_ccs_profiling()@.
\item[Proftimer.c] declares @ticks_to_retainer_ldv_profiling@,
  @performRetainerLdvProfiling@, and @doContextSwitch@.
\item[Proftimer.h] is the header for @Proftimer.c@. Defines @PROFILING_MIN_PERIOD@,
  which specifies the minimum profiling period and the default profiling period.
%\item[RtsAPI.c] implements @setProfileHeader()@.
\item[RtsFlags.c]
  sets @RtsFlags.ProfFlags.doHeapProfile@,
  adds a string to @usage_text[]@ in @setupRtsFlags()@.
\item[RtsFlags.h] defines constants @HEAP_BY_LDV@ and @LDVchar@.
\item[RtsStartup.c] changes @shutDownHaskell()@.
\item[Schedule.c] changes @schedule()@.
\item[Stats.c] 
  declares @LDV_start_time@, @LDV_tot_time@, @LDVe_start_time@, 
  @LDVe_tot_time@.
  Changes @mut_user_time_during_GC()@, @mut_user_time()@,
  @stat_startExit()@,
  @stat_endExit()@, and
  @stat_exit()@.
  Defines
  @mut_user_time_during_LDV()@,
  @stat_startLDV()@, and
  @stat_endLDV()@.
\item[Stats.h] is hte header for @Stats.c@.
\item[StgMiscClosures.hc] inserts entry macros in 
  @stg_IND_entry()@, @stg_IND_PERM_entry()@, @stg_IND_OLDGEN_entry()@,
  @stg_IND_OLDGEN_PERM_entry()@, @stg_BLACKHOLE_entry()@, @stg_BLACKHOLE_BQ_entry()@,
  and @stg_CAF_BLACKHOLE_entry()@.
  Invokes @LDV_recordDead()@ in @stg_BLACKHOLE_entry@.
  Redefines @stg_DEAD_WEAK_info@.
\item[Storage.c] changes @initStorage()@, @resetNurseries()@, and @allocNursery()@.
\item[Storage.h] changes @updateWithIndirection()@ and @updateWithPermIndirection()@.
\item[Updates.hc] inserts entry macros in @stg_PAP_entry()@ and @stg_AP_UPD_entry()@.
\item[Weak.c] changes @scheduleFinalizers()@.
\end{description}

\bibliographystyle{plain}
\bibliography{reference}

\end{document}