%
% (c) The OBFUSCATION-THROUGH-GRATUITOUS-PREPROCESSOR-ABUSE Project,
% Glasgow University, 1990-2000
%
% \documentstyle[preprint]{acmconf}
\documentclass[11pt]{article}
\oddsidemargin 0.1 in % Note that \oddsidemargin = \evensidemargin
\evensidemargin 0.1 in
\marginparwidth 0.85in % Narrow margins require narrower marginal notes
\marginparsep 0 in
\sloppy
%\usepackage{epsfig}
\usepackage{shortvrb}
\MakeShortVerb{\@}
%\newcommand{\note}[1]{{\em Note: #1}}
\newcommand{\note}[1]{{{\bf Note:}\sl #1}}
\newcommand{\ToDo}[1]{{{\bf ToDo:}\sl #1}}
\newcommand{\Arg}[1]{\mbox{${\tt arg}_{#1}$}}
\newcommand{\bottom}{\perp}
\newcommand{\secref}[1]{Section~\ref{sec:#1}}
\newcommand{\figref}[1]{Figure~\ref{fig:#1}}
\newcommand{\Section}[2]{\section{#1}\label{sec:#2}}
\newcommand{\Subsection}[2]{\subsection{#1}\label{sec:#2}}
\newcommand{\Subsubsection}[2]{\subsubsection{#1}\label{sec:#2}}
% DIMENSION OF TEXT:
\textheight 8.5 in
\textwidth 6.25 in
\topmargin 0 in
\headheight 0 in
\headsep .25 in
\setlength{\parskip}{0.15cm}
\setlength{\parsep}{0.15cm}
\setlength{\topsep}{0cm} % Reduces space before and after verbatim,
% which is implemented using trivlist
\setlength{\parindent}{0cm}
\renewcommand{\textfraction}{0.2}
\renewcommand{\floatpagefraction}{0.7}
\begin{document}
\title{The GHCi Draft Design, round 2}
\author{MSR Cambridge Haskell Crew \\
Microsoft Research Ltd., Cambridge}
\maketitle
%%%\tableofcontents
%%%\newpage
%%-----------------------------------------------------------------%%
\section{Details}
\subsection{Outline of the design}
\label{sec:details-intro}
The design falls into three major parts:
\begin{itemize}
\item The compilation manager (CM), which coordinates the
system and supplies a HEP-like interface to clients.
\item The module compiler (@compile@), which translates individual
modules to interpretable or machine code.
\item The linker (@link@),
which maintains the executable image in interpreted mode.
\end{itemize}
There are also three auxiliary parts: the finder, which locates
source, object and interface files; the summariser, which quickly
finds dependency information for modules; and the static info
(compiler flags and package details), which is unchanged over the
course of a session.
This section continues with an overview of the session-lifetime data
structures. Then follows the finder (section~\ref{sec:finder}),
summariser (section~\ref{sec:summariser}),
static info (section~\ref{sec:staticinfo}),
and finally the three big sections
(\ref{sec:manager},~\ref{sec:compiler},~\ref{sec:linker})
on the compilation manager, compiler and linker respectively.
\subsubsection*{Some terminology}
Lifetimes: the phrase {\bf session lifetime} covers a complete run of
GHCI, encompassing multiple recompilation runs. {\bf Module lifetime}
is much shorter: it is that of data which is needed to translate a
single module and then discarded, for example Core, AbstractC and Stix trees.
Data structures with module lifetime are well documented and understood.
This document is mostly concerned with session-lifetime data.
Most of these structures are ``owned'' by CM, since that's
the only major component of GHCI which deals with session-lifetime
issues.
Modules and packages: {\bf home} refers to modules in this package,
precisely the ones tracked and updated by the compilation manager.
{\bf Package} refers to all other packages, which are assumed static.
\subsubsection*{A summary of all session-lifetime data structures}
These structures have session lifetime but not necessarily global
visibility. Subsequent sections elaborate who can see what.
\begin{itemize}
\item {\bf Home Symbol Table (HST)} (owner: CM) holds the post-renaming
environments created by compiling each home module.
\item {\bf Home Interface Table (HIT)} (owner: CM) holds in-memory
representations of the interface file created by compiling
each home module.
\item {\bf Unlinked Images (UI)} (owner: CM) are executable but as-yet
unlinked translations of home modules only.
\item {\bf Module Graph (MG)} (owner: CM) is the current module graph.
\item {\bf Static Info (SI)} (owner: CM) is the package configuration
information (PCI) and compiler flags (FLAGS).
\item {\bf Persistent Compiler State (PCS)} (owner: @compile@)
is @compile@'s private cache of information about package
modules.
\item {\bf Persistent Linker State (PLS)} (owner: @link@) is
@link@'s private information concerning the current
state of the (in-memory) executable image.
\end{itemize}
%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
\subsection{The finder (\mbox{\tt type Finder})}
\label{sec:finder}
@Path@ could be an indication of a location in a filesystem, or it
could be some more generic kind of resource identifier, a URL for
example.
\begin{verbatim}
data Path = ...
\end{verbatim}
And some names. @Module@s are now used as primary keys for various
maps, so they are given a @Unique@.
\begin{verbatim}
type ModName = String -- a module name
type PkgName = String -- a package name
data Module = ... -- contains ModName and a Unique, at least
\end{verbatim}
A @ModLocation@ says where a module is, what it's called and in what
form it is.
\begin{verbatim}
data ModLocation = SourceOnly Module Path -- .hs
| ObjectCode Module Path Path -- .o, .hi
| InPackage Module PkgName
-- examine PCI to determine package Path
\end{verbatim}
The module finder generates @ModLocation@s from @ModName@s. We expect
it will assume packages to be static, but we want to be able to track
changes in home modules during the session. Specifically, we want to
be able to notice that a module's object and interface have been
updated, presumably by a compile run outside of the GHCI session.
Hence the two-stage type:
\begin{verbatim}
type Finder = ModName -> IO ModLocation
newFinder :: PCI -> IO Finder
\end{verbatim}
@newFinder@ examines the package information right at the start, but
returns an @IO@-typed function which can inspect home module changes
later in the session.
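
As a rough illustration of the two-stage type, the following sketch
indexes package modules once, when the finder is created, and probes
the filesystem for home modules on every call. It is only a sketch
under assumptions that are not part of the design: @Path@ is taken to
be a @FilePath@, @Module@ is reduced to its name, @PCI@ is simplified
to package names paired with the modules they supply, home modules are
assumed to live in the current directory, and object code is preferred
whenever both the object and the interface file exist.
\begin{verbatim}
import qualified Data.Map as Map
import System.Directory (doesFileExist)

type ModName = String
type PkgName = String
type Path    = FilePath               -- assumption: paths are file paths
type Module  = ModName                -- assumption: the Unique is omitted
type PCI     = [(PkgName, [ModName])] -- assumption: simplified PkgInfo

data ModLocation = SourceOnly Module Path
                 | ObjectCode Module Path Path
                 | InPackage  Module PkgName
                 deriving (Eq, Show)

type Finder = ModName -> IO ModLocation

newFinder :: PCI -> IO Finder
newFinder pci = return finder
  where
    -- Stage 1: package modules are static, so index them once.
    pkgIndex = Map.fromList [ (m, p) | (p, ms) <- pci, m <- ms ]

    -- Stage 2: home modules may change, so look at the disk each time.
    -- Assumption: prefer object code when both .o and .hi are present;
    -- otherwise fall back to source without checking that it exists.
    finder m = case Map.lookup m pkgIndex of
      Just p  -> return (InPackage m p)
      Nothing -> do
        let obj = m ++ ".o"
            hi  = m ++ ".hi"
        haveObj <- doesFileExist obj
        haveHi  <- doesFileExist hi
        return $ if haveObj && haveHi
                   then ObjectCode m obj hi
                   else SourceOnly m (m ++ ".hs")
\end{verbatim}
Because the returned function re-checks the disk on each call, an
object file produced by a compile run outside the session would be
noticed the next time the module is looked up.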
%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
\subsection{The summariser (\mbox{\tt summarise})}
\label{sec:summariser}
A @ModSummary@ records the minimum information needed to establish the
module graph and determine whose source has changed. @ModSummary@s
can be created quickly.
\begin{verbatim}
data ModSummary = ModSummary
ModLocation -- location and kind
(Maybe (String, Fingerprint))
-- source and fingerprint if .hs
(Maybe [ModName]) -- imports if .hs or .hi
type Fingerprint = ... -- file timestamp, or source checksum?
summarise :: ModLocation -> IO ModSummary
\end{verbatim}
The summary contains the location and source text, and the location
contains the name. We would like to remove the assumption that
sources live on disk, but I'm not sure this is good enough yet.
\ToDo{Should @ModSummary@ contain source text for interface files too?}
\ToDo{Also say that @ModIFace@ contains its module's @ModSummary@ (why?).}
%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
\subsection{Static information (SI)}
\label{sec:staticinfo}
PCI, the package configuration information, is a list of @PkgInfo@,
each containing at least the following:
\begin{verbatim}
data PkgInfo
= PkgInfo PkgName -- my name
Path -- path to my base location
[PkgName] -- who I depend on
[ModName] -- modules I supply
[Unlinked] -- paths to my object files
type PCI = [PkgInfo]
\end{verbatim}
The @Path@s in it, including those in the @Unlinked@s, are set up
when GHCI starts.
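
The ``who I depend on'' field is what lets a client of PCI work out
which packages must be linked for a given set of directly imported
packages. A minimal sketch of that computation follows; the record
field names (@pkgName@, @pkgDeps@) are invented here for illustration,
and the other @PkgInfo@ fields are omitted.
\begin{verbatim}
import qualified Data.Set as Set

type PkgName = String

-- Simplified PkgInfo; the field names are hypothetical.
data PkgInfo = PkgInfo { pkgName :: PkgName, pkgDeps :: [PkgName] }

type PCI = [PkgInfo]

-- All packages reachable from the roots via "depends on" edges,
-- including the roots themselves.
packageClosure :: PCI -> [PkgName] -> Set.Set PkgName
packageClosure pci = go Set.empty
  where
    depsOf p = concat [ pkgDeps i | i <- pci, pkgName i == p ]

    go seen []       = seen
    go seen (p:rest)
      | p `Set.member` seen = go seen rest
      | otherwise           = go (Set.insert p seen) (depsOf p ++ rest)
\end{verbatim}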
FLAGS is a bunch of compiler options. We haven't figured out yet how
to partition them into those for the whole session vs those for
specific source files, so currently the best we can do is:
\begin{verbatim}
data FLAGS = ...
\end{verbatim}
The static information (SI) is simply the pair of these:
\begin{verbatim}
data SI = SI PCI
FLAGS
\end{verbatim}
%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
\subsection{The Compilation Manager (CM)}
\label{sec:manager}
\subsubsection{Data structures owned by CM}
CM maintains two maps (HST, HIT) and a set (UI). It's important to
realise that CM only knows about the map/set-ness, and has no idea
what a @ModDetails@, @ModIFace@ or @Linkable@ is. Only @compile@ and
@link@ know that, and CM passes these types around without
inspecting them.
\begin{itemize}
\item
{\bf Home Symbol Table (HST)} @:: FiniteMap Module ModDetails@
The @ModDetails@ (a couple of layers down) contain tycons, classes,
instances, etc, collectively known as ``entities''. References from
other modules to these entities are direct, with no intervening
indirections of any kind; conversely, these entities refer directly
to other entities, regardless of module boundaries. HST only holds
information for home modules; the corresponding wired-up details
for package (non-home) modules are created on demand in the package
symbol table (PST) inside the persistent compiler's state (PCS).
CM maintains the HST, which is passed to, but not modified by,
@compile@. If compilation of a module is successful, @compile@
returns the resulting @ModDetails@ (inside the @CompResult@) which
CM then adds to HST.
CM throws away arbitrarily large parts of HST at the start of a
rebuild, and uses @compile@ to incrementally reconstruct it.
\item
{\bf Home Interface Table (HIT)} @:: FiniteMap Module ModIFace@
(Completely private to CM; nobody else sees this).
Compilation of a module always creates a @ModIFace@, which contains
the unlinked symbol table entries. CM maintains this
@FiniteMap Module ModIFace@ with session lifetime. CM never throws away
@ModIFace@s, but it does update them, by passing old ones to
@compile@ if they exist, and getting new ones back.
CM acquires @ModIFace@s from @compile@, which it only applies
to modules in the home package. As a result, HIT only contains
@ModIFace@s for modules in the home package. Those from other
packages reside in the package interface table (PIT) which is a
component of PCS.
\item
{\bf Unlinked Images (UI)} @:: Set Linkable@
The @Linkable@s in UI represent executable but as-yet unlinked
module translations. A @Linkable@ can contain the name of an
object, archive or DLL file. In interactive mode, it may also be
the STG trees derived from translating a module. So @compile@
returns a @Linkable@ from each successful run, namely that of
translating the module at hand.
At link-time, CM supplies @link@ with the @Linkable@s for the upward
closure of all home modules which have changed. It also examines the
@ModSummary@s for all home modules, and by examining their imports
and the SI.PCI (package configuration info) it can determine the
@Linkable@s from all required imported packages too.
@Linkable@s and @ModIFace@s have a close relationship. Each
translated module has a corresponding @Linkable@ somewhere.
However, there may be @Linkable@s with no corresponding modules
(the RTS, for example). Conversely, multiple modules may share a
single @Linkable@ -- as is the case for any module from a
multi-module package. For these reasons it seems appropriate to
keep the two concepts distinct. @Linkable@s also provide
information about the sequence in which individual package
components should be linked, and that isn't the business of any
specific module to know.
CM passes @compile@ a module's old @ModIFace@, if it has one, in
the hope that the module won't need recompiling. If recompilation
turns out to be unnecessary, @compile@ just returns the new
@ModDetails@ created from the old @ModIFace@, and CM re-uses that
@ModIFace@. If the module {\em is} recompiled (or
scheduled to be loaded from disk), @compile@ returns both the
new @ModIFace@ and new @Linkable@.
\item
{\bf Module Graph (MG)} @:: known-only-to-CM@
Records, for CM's purposes, the current module graph,
up-to-dateness and summaries. More details when I get to them.
Only contains home modules.
\end{itemize}
Probably all this stuff is rolled together into the Persistent CM
State (PCMS):
\begin{verbatim}
data PCMS = PCMS HST HIT UI MG
emptyPCMS :: IO PCMS
\end{verbatim}
\subsubsection{What CM implements}
It pretty much implements the HEP interface. First, though, define a
containing structure for the state of the entire CM system and its
subsystems @compile@ and @link@:
\begin{verbatim}
data CmState
= CmState PCMS -- CM's stuff
PCS -- compile's stuff
PLS -- link's stuff
SI -- the static info, never changes
Finder -- the finder
\end{verbatim}
The @CmState@ is threaded through the HEP interface. In reality
this might be done using @IORef@s, but for clarity:
\begin{verbatim}
type ModHandle = ... (opaque to CM/HEP clients) ...
type HValue = ... (opaque to CM/HEP clients) ...
cmInit :: FLAGS
-> [PkgInfo]
-> IO CmState
cmLoadModule :: CmState
-> ModName
-> IO (CmState, Either [SDoc] ModHandle)
cmGetExpr :: ModHandle
-> CmState
-> String -> IO (CmState, Either [SDoc] HValue)
cmRunExpr :: HValue -> IO () -- don't need CmState here
\end{verbatim}
Almost all the huff and puff in this document pertains to @cmLoadModule@.
\subsubsection{Implementing \mbox{\tt cmInit}}
@cmInit@ creates an empty @CmState@ using @emptyPCMS@, @emptyPCS@ and
@emptyPLS@, makes SI from the supplied flags and package info, and
creates the finder by passing the package info to @newFinder@.
\subsubsection{Implementing \mbox{\tt cmLoadModule}}
\begin{enumerate}
\item {\bf Downsweep:} using @finder@ and @summarise@, chase from
the given module to
establish the new home module graph (MG). Do not chase into
package modules.
\item Remove from HIT, HST, UI any modules in the old MG which are
not in the new one. The old MG is then replaced by the new one.
\item Topologically sort MG to generate a bottom-to-top traversal
order, giving a worklist.
\item {\bf Upsweep:} call @compile@ on each module in the worklist in
turn, passing it
the ``correct'' HST, PCS, the old @ModIFace@ if
available, and the summary. The HST is ``correct'' in the sense that
it contains only the modules in this module's downward
closure, so that @compile@ can construct the correct instance
and rule environments simply as the union of those in
the module's downward closure.
If @compile@ doesn't return a new interface/linkable pair,
compilation wasn't necessary. Either way, update HST with
the new @ModDetails@; if a compilation {\em did} occur, also
update HIT and UI with the new @ModIFace@ and @Linkable@.
Keep going until the root module is successfully done, or
compilation fails.
\item If the previous step terminated because compilation failed,
define the successful set as those modules in successfully
completed SCCs, i.e. all @Linkable@s returned by @compile@ excluding
those from modules in any cycle which includes the module which failed.
Remove from HST, HIT, UI and MG all modules mentioned in MG which
are not in the successful set. Call @link@ with the successful
set,
which should succeed. The net effect is to back off to a point
at which those modules which are still aboard are correctly
compiled and linked.
If the previous step terminated successfully,
call @link@ passing it the @Linkable@s in the upward closure of
all those modules for which @compile@ produced a new @Linkable@.
\end{enumerate}
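
Steps 3 to 5 can be sketched with @Data.Graph@ from the @containers@
library, whose @stronglyConnComp@ returns components in reverse
topological order, that is, dependencies first, which is the
bottom-to-top order the upsweep needs. The sketch below is not the CM
implementation: the module graph is assumed to be a map from each home
module to the modules it imports, and @compileOne@ is a hypothetical
stand-in for @compile@ that merely reports success or failure.
\begin{verbatim}
import Data.Graph (SCC, flattenSCC, stronglyConnComp)
import qualified Data.Map as Map

type ModName = String
type MG      = Map.Map ModName [ModName]  -- assumption: module -> imports

-- Step 3: bottom-to-top worklist, grouped into SCCs (import cycles).
-- Imports that are not keys of the map (package modules) are ignored.
worklist :: MG -> [SCC ModName]
worklist mg = stronglyConnComp [ (m, m, is) | (m, is) <- Map.toList mg ]

-- Steps 4 and 5: compile SCC by SCC, stopping at the first SCC that
-- contains a failure.  Returns the modules of the successfully
-- completed SCCs, plus a flag saying whether the whole worklist
-- was processed.
upsweep :: Monad m
        => (ModName -> m Bool) -> [SCC ModName] -> m ([ModName], Bool)
upsweep compileOne = go []
  where
    go done []           = return (reverse done, True)
    go done (scc : rest) = do
      oks <- mapM compileOne (flattenSCC scc)
      if and oks
        then go (reverse (flattenSCC scc) ++ done) rest
        else return (reverse done, False)
\end{verbatim}
When the flag is @False@, the returned list plays the role of the
successful set: every module of the failing SCC is excluded from it,
even those members that happened to compile.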
As a small optimisation, do this:
\begin{enumerate}
\item[3a.] Remove from the worklist any module M where M's source
hasn't changed and neither has the source of any module in M's
downward closure. This has the effect of not starting the upsweep
right at the bottom of the graph when that's not needed.
Source-change checking can be done quickly by CM by comparing
summaries of modules in MG against corresponding
summaries from the old MG.
\end{enumerate}
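
This optimisation amounts to a downward-closure computation plus a
comparison of fingerprints. A sketch follows, assuming that each
summary is reduced to a fingerprint (here an @Int@), that the import
lists mention home modules only, and that old and new summaries are
available as maps; all names are illustrative.
\begin{verbatim}
import qualified Data.Map as Map
import qualified Data.Set as Set

type ModName     = String
type Fingerprint = Int                          -- assumption: any Eq type
type MG          = Map.Map ModName [ModName]    -- module -> home imports
type Summaries   = Map.Map ModName Fingerprint  -- module -> fingerprint

-- A module together with everything it transitively imports.
downwardClosure :: MG -> ModName -> Set.Set ModName
downwardClosure mg root = go Set.empty [root]
  where
    go seen []       = seen
    go seen (m:rest)
      | m `Set.member` seen = go seen rest
      | otherwise =
          go (Set.insert m seen) (Map.findWithDefault [] m mg ++ rest)

-- True if the module has a summary in both graphs and they agree.
unchanged :: Summaries -> Summaries -> ModName -> Bool
unchanged old new m =
  case (Map.lookup m old, Map.lookup m new) of
    (Just a, Just b) -> a == b
    _                -> False

-- Step 3a: drop modules whose whole downward closure is unchanged.
pruneWorklist :: MG -> Summaries -> Summaries -> [ModName] -> [ModName]
pruneWorklist mg old new = filter (not . stable)
  where
    stable m = all (unchanged old new) (Set.toList (downwardClosure mg m))
\end{verbatim}
A module that is new to the graph has no old summary, so it is never
considered stable and always stays on the worklist.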
%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
\subsection{The compiler (\mbox{\tt compile})}
\label{sec:compiler}
\subsubsection{Data structures owned by \mbox{\tt compile}}
{\bf Persistent Compiler State (PCS)} @:: known-only-to-compile@
This contains info about foreign packages only, acting as a cache,
which is private to @compile@. The cache never becomes out of
date. There are three parts to it:
\begin{itemize}
\item
{\bf Package Interface Table (PIT)} @:: FiniteMap Module ModIFace@
@compile@ reads interfaces from modules in foreign packages, and
caches them in the PIT. Subsequent imports of the same module get
them directly out of the PIT, avoiding slow lexing/parsing phases.
Because foreign packages are assumed never to become out of date,
all contents of PIT remain valid forever. @compile@ of course
tries to find package interfaces in PIT in preference to reading
them from files.
Both successful and failed runs of @compile@ can add arbitrary
numbers of new interfaces to the PIT. The failed runs don't matter
because we assume that packages are static, so the data cached even
by a failed run is valid forever (ie for the rest of the session).
\item
{\bf Package Symbol Table (PST)} @:: FiniteMap Module ModDetails@
Adding a package interface to PIT doesn't make it directly usable
to @compile@, because it first needs to be wired (renamed +
typechecked) into the spaghetti of the HST. On the other hand,
most modules only use a few entities from any imported interface,
so wiring-in the interface at PIT-entry time might be a big time
waster. Also, wiring in an interface could mean reading other
interfaces, and we don't want to do that unnecessarily.
The PST avoids these problems by allowing incremental wiring-in to
happen. Pieces of foreign interfaces are copied out of the holding
pen (HP), renamed, typechecked, and placed in the PST, but only as
@compile@ discovers it needs them. In the process of incremental
renaming/typechecking, @compile@ may need to read more package
interfaces, which are added to the PIT and hence to
HP.~\ToDo{How? When?}
CM passes the PST to @compile@ and is returned an updated version
on both success and failure.
\item
{\bf Holding Pen (HP)} @:: HoldingPen@
HP holds parsed but not-yet renamed-or-typechecked fragments of
package interfaces. As typechecking of other modules progresses,
fragments are removed (``slurped'') from HP, renamed and
typechecked, and placed in PCS.PST (see above). Slurping a
fragment may require new interfaces to be read into HP. The hope
is, though, that many fragments will never get slurped, reducing
the total number of interfaces read (as compared to eager slurping).
\end{itemize}
PCS is opaque to CM; only @compile@ knows what's in it, and how to
update it. Because packages are assumed static, PCS never becomes
out of date. So CM only needs to be able to create an empty PCS,
with @emptyPCS@, and thence just passes it through @compile@ with
no further ado.
In return, @compile@ must promise not to store in PCS any
information pertaining to the home modules. If it did so, CM would
need to have a way to remove this information prior to commencing a
rebuild, which conflicts with PCS's opaqueness to CM.
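
The caching behaviour of the PIT can be sketched as a lookup that
falls back to reading the interface and records the result. In this
sketch @readIface@ is a hypothetical stand-in for whatever lexes and
parses an interface file, @Module@ and @ModIFace@ are reduced to
placeholders, and @Data.Map@ stands in for @FiniteMap@.
\begin{verbatim}
import qualified Data.Map as Map

type Module      = String           -- placeholder for the real Module
newtype ModIFace = ModIFace String  -- placeholder for a parsed interface
type PIT         = Map.Map Module ModIFace

-- Return the interface for a package module, consulting the PIT first.
-- The updated PIT is worth keeping even if the caller's compilation
-- later fails, since package interfaces are assumed never to go stale.
lookupPkgIface :: Monad m
               => (Module -> m ModIFace)   -- hypothetical interface reader
               -> PIT -> Module -> m (ModIFace, PIT)
lookupPkgIface readIface pit m0 =
  case Map.lookup m0 pit of
    Just iface -> return (iface, pit)
    Nothing    -> do
      iface <- readIface m0
      return (iface, Map.insert m0 iface pit)
\end{verbatim}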
\subsubsection{What {\tt compile} does}
@compile@ is necessarily somewhat complex. We've decided to do away
with private global variables -- they make the design specification
less clear, although the implementation might use them. Without
further ado:
\begin{verbatim}
compile :: SI -- obvious
-> Finder -- to find modules
-> ModSummary -- summary, including source
-> Maybe ModIFace
-- old interface, if avail
-> HST -- for home module ModDetails
-> PCS -- IN: the persistent compiler state
-> IO CompResult
data CompResult
= CompOK ModDetails -- new details (== HST additions)
(Maybe (ModIFace, Linkable))
-- interface and code; Nothing => compilation
-- not needed (old interface and code are still valid)
PCS -- updated PCS
[SDoc] -- warnings
| CompErrs PCS -- updated PCS
[SDoc] -- warnings and errors
data PCS
= MkPCS PIT -- package interfaces
PST -- post slurping global symtab contribs
HoldingPen -- pre slurping interface bits and pieces
emptyPCS :: IO PCS -- since CM has no other way to make one
\end{verbatim}
Although @compile@ is passed three of the global structures (FLAGS,
HST and PCS), it only modifies PCS. The rest are modified by CM as it
sees fit, from the stuff returned in the @CompResult@.
@compile@ is allowed to return an updated PCS even if compilation
errors occur, since the information in it pertains only to foreign
packages and is assumed to be always-correct.
What @compile@ does: \ToDo{A bit vague ... needs refining. How does
@finder@ come into the game?}
\begin{itemize}
\item Figure out if this module needs recompilation.
\begin{itemize}
\item If there's no old @ModIFace@, it does. Else:
\item Compare the @ModSummary@ supplied with that in the
old @ModIFace@. If the source has changed, recompilation
is needed. Else:
\item Compare the usage version numbers in the old @ModIFace@ with
those in the imported @ModIFace@s. All needed interfaces
for this should be in either HIT or PIT. If any version
numbers differ, recompilation is needed.
\item Otherwise it isn't needed.
\end{itemize}
\item
If recompilation is not needed, create a new @ModDetails@ from the
old @ModIFace@, looking up information in HST and PCS.PST as
necessary. Return the new details, a @Nothing@ denoting
compilation was not needed, the PCS \ToDo{I don't think the PCS
should be updated, but who knows?}, and an empty warning list.
\item
Otherwise, compilation is needed.
If the module is only available in object+interface form, read the
interface, make up details, create a linkable pointing at the
object code. \ToDo{Does this involve reading any more interfaces? Does
it involve updating PST?}
Otherwise, translate from source, then create and return the new
details, interface, linkable, updated PST, and warnings.
When looking for a new interface, search HST, then PCS.PIT, and only
then read from disk, in which case add the new interface(s) to
PCS.PIT.
\ToDo{If compiling a module with a boot-interface file, check the
boot interface against the inferred interface.}
\end{itemize}
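
The recompilation check at the top of this list reduces to a small
pure function. The sketch below assumes a simplified interface record
carrying the fingerprint of the source it was built from, its own
version, and the versions of the imports it used; these field names
are invented for illustration and do not match @ParsedIface@.
\begin{verbatim}
import qualified Data.Map as Map

type ModName     = String
type Fingerprint = Int
type Version     = Int

-- Simplified stand-in for ModIFace; the field names are hypothetical.
data IFace = IFace
  { ifSource  :: Fingerprint          -- fingerprint of the source compiled
  , ifVersion :: Version              -- this module's own version
  , ifUsages  :: [(ModName, Version)] -- import versions seen at compile time
  }

data Decision = Recompile String | UpToDate
  deriving (Eq, Show)

checkRecomp :: Fingerprint            -- fingerprint in the new summary
            -> Maybe IFace            -- old interface, if any
            -> Map.Map ModName IFace  -- current interfaces of the imports
            -> Decision
checkRecomp _ Nothing _ = Recompile "no old interface"
checkRecomp fp (Just old) current
  | fp /= ifSource old       = Recompile "source changed"
  | any stale (ifUsages old) = Recompile "imported interface changed"
  | otherwise                = UpToDate
  where
    -- An import is stale if its current version differs from the one
    -- recorded, or if its interface cannot be found at all.
    stale (m, v) = fmap ifVersion (Map.lookup m current) /= Just v
\end{verbatim}
The guards are tried in the same order as the bullets above, so a
missing old interface or a changed source short-circuits the more
expensive comparison of usage versions.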
\subsubsection{Contents of \mbox{\tt ModDetails},
\mbox{\tt ModIFace} and \mbox{\tt HoldingPen}}
Only @compile@ can see inside these three types -- they are opaque to
everyone else. @ModDetails@ holds the post-renaming,
post-typechecking environment created by compiling a module.
\begin{verbatim}
data ModDetails
= ModDetails {
moduleExports :: Avails,
moduleEnv :: GlobalRdrEnv, -- == FM RdrName [Name]
typeEnv :: FM Name TyThing, -- TyThing is in TcEnv.lhs
instEnv :: InstEnv,
fixityEnv :: FM Name Fixity,
ruleEnv :: FM Id [Rule]
}
\end{verbatim}
@ModIFace@ is nearly the same as @ParsedIface@ from @RnMonad.lhs@:
\begin{verbatim}
type ModIFace = ParsedIface -- not really, but ...
data ParsedIface
= ParsedIface {
pi_mod :: Module, -- Complete with package info
pi_vers :: Version, -- Module version number
pi_orphan :: WhetherHasOrphans, -- Whether this module has orphans
pi_usages :: [ImportVersion OccName], -- Usages
pi_exports :: [ExportItem], -- Exports
pi_insts :: [RdrNameInstDecl], -- Local instance declarations
pi_decls :: [(Version, RdrNameHsDecl)], -- Local definitions
pi_fixity :: (Version, [RdrNameFixitySig]), -- Local fixity declarations,
-- with their version
pi_rules :: (Version, [RdrNameRuleDecl]), -- Rules, with their version
pi_deprecs :: [RdrNameDeprecation] -- Deprecations
}
\end{verbatim}
@HoldingPen@ is a cleaned-up version of that found in @RnMonad.lhs@,
retaining just the 3 pieces actually comprising the holding pen:
\begin{verbatim}
data HoldingPen
= HoldingPen {
iDecls :: DeclsMap, -- A single, global map of Names to decls
iInsts :: IfaceInsts,
-- The as-yet un-slurped instance decls; this bag is depleted when we
-- slurp an instance decl so that we don't slurp the same one twice.
-- Each is 'gated' by the names that must be available before
-- this instance decl is needed.
iRules :: IfaceRules
-- Similar to instance decls, only for rules
}
\end{verbatim}
%%-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --%%
\subsection{The linker (\mbox{\tt link})}
\label{sec:linker}
\subsubsection{Data structures owned by the linker}
In the same way that @compile@ has a persistent compiler state (PCS),
the linker has a persistent (session-lifetime) state, the Persistent
Linker State (PLS). In batch mode PLS is entirely irrelevant,
because there is only a single link step, so it can be a unit value
ignored by everybody. In interactive mode PLS is composed of the
following three parts:
\begin{itemize}
\item
\textbf{The Source Symbol Table (SST)}@ :: FiniteMap RdrName HValue@
The source symbol table is used when linking interpreted code.
Unlinked interpreted code consists of an STG tree where
the leaves are @RdrName@s. The linker's job is to resolve these to
actual addresses (the alternative is to resolve these lazily when
the code is run, but this requires passing the full symbol table
through the interpreter and the repeated lookups will probably be
expensive).
The source symbol table therefore maps @RdrName@s to @HValue@s, for
every @RdrName@ that currently \emph{has} an @HValue@, including all
exported functions from object code modules that are currently
linked in. Linking therefore turns a @StgTree RdrName@ into an
@StgTree HValue@.
It is important that we can prune this symbol table by throwing away
the mappings for an entire module, whenever we recompile/relink a
given module. The representation is therefore probably a two-level
mapping, from module names, to function/constructor names, to
@HValue@s.
\item \textbf{The Object Symbol Table (OST)}@ :: FiniteMap String Addr@
This is a lower level symbol table, mapping symbol names in object
modules to their addresses in memory. It is used only when
resolving the external references in an object module, and contains
only entries that are defined in object modules.
Why have two symbol tables? Well, there is a clear distinction
between the two: the source symbol table maps Haskell symbols to
Haskell values, and the object symbol table maps object symbols to
addresses. There is some overlap, in that Haskell symbols certainly
have addresses, and we could look up a Haskell symbol's address by
manufacturing the right object symbol and looking that up in the
object symbol table, but this is likely to be slow and would force
us to extend the object symbol table with all the symbols
``exported'' by interpreted code. Doing it this way enables us to
decouple the object management subsystem from the rest of the linker
with a minimal interface; something like
\begin{verbatim}
loadObject :: Unlinked -> IO Object
unloadModule :: Unlinked -> IO ()
lookupSymbol :: String -> IO Addr
\end{verbatim}
Rather unfortunately we need @lookupSymbol@ in order to populate the
source symbol table when linking in a new compiled module. Our
object management subsystem is currently written in C, so decoupling
this interface as much as possible is highly desirable.
\item
{\bf Linked Image (LI)} @:: no-explicit-representation@
LI isn't explicitly represented in the system, but we record it
here for completeness anyway. LI is the current set of
linked-together module, package and other library fragments
constituting the current executable mass. LI comprises:
\begin{itemize}
\item Machine code (@.o@, @.a@, @.DLL@ file images) in memory.
These are loaded from disk when needed, and stored in
@malloc@ville. To simplify storage management, they are
never freed or reused. When no longer needed,
they are simply abandoned. New linkings of the same object
code produce new copies in memory. We hope this will not be
too much of a space leak.
\item STG trees, which live in the GHCI heap and are managed by the
storage manager in the usual way. They are held alive (are
reachable) via the @HValue@s in the SST. Such @HValue@s are
applications of the interpreter function to the trees
themselves. Linking a tree comprises travelling over the
tree, replacing all the @Id@s with pointers directly to the
relevant @_closure@ labels, as determined by searching the
OST. Once the leaves are linked, trees are wrapped with the
interpreter function. The resulting @HValue@s then behave
indistinguishably from compiled versions of the same code.
\end{itemize}
Because object code is outside the heap and never deallocated,
whilst interpreted code is held alive via the SST, there's no need
to have a data structure which ``is'' the linked image.
For batch compilation, LI doesn't exist because OST doesn't exist,
and because @link@ doesn't load code into memory, instead just
invokes the system linker.
\ToDo{Do we need to say anything about CAFs and SRTs? Probably ...}
\end{itemize}
As with PCS, CM has no way to create an initial PLS, so we supply
@emptyPLS@ for that purpose.
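
The two-level representation suggested above for the source symbol
table might look roughly as follows. This is only a sketch: it uses
@Data.Map@ in place of @FiniteMap@, and @ModName@, @OccName@,
@addBinding@, @lookupSST@ and @pruneModule@ are names invented for
illustration, as is the toy @HValue@.
\begin{verbatim}
import qualified Data.Map as Map

-- Hypothetical stand-ins; the real RdrName and HValue are richer.
type ModName = String
type OccName = String
newtype HValue = HValue Int deriving (Eq, Show)

-- Two-level source symbol table: module name, then entity name.
type SST = Map.Map ModName (Map.Map OccName HValue)

-- Record the value bound to one top-level name of a module.
-- A later binding for the same name replaces the earlier one.
addBinding :: ModName -> OccName -> HValue -> SST -> SST
addBinding m n v = Map.insertWith Map.union m (Map.singleton n v)

-- Resolve a qualified name, if it currently has a value.
lookupSST :: ModName -> OccName -> SST -> Maybe HValue
lookupSST m n sst = Map.lookup m sst >>= Map.lookup n

-- Throw away every mapping for a module in one step, as needed
-- before that module is recompiled or relinked.
pruneModule :: ModName -> SST -> SST
pruneModule = Map.delete
\end{verbatim}
With this shape, pruning a module is a single deletion on the outer
map, which is the property the text asks for.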
\subsubsection{The linker's interface}
In practice, the PLS might be hidden in the I/O monad rather
than passed around explicitly. (The same might be true for PCS).
Anyway:
\begin{verbatim}
data PLS -- as described above; opaque to everybody except the linker

link :: PCI -> ??? -> [[Linkable]] -> PLS -> IO LinkResult

data LinkResult = LinkOK   PLS
                | LinkErrs PLS [SDoc]

emptyPLS :: IO PLS     -- since CM has no other way to make one
\end{verbatim}
CM uses @link@ as follows:
After repeatedly using @compile@ to compile all modules which are
out-of-date, @link@ is invoked. The @[[Linkable]]@ argument to
@link@ represents the list of (recursive groups of) home modules which
have been newly compiled, along with @Linkable@s for each of
the packages in use (the compilation manager knows which external
packages are referenced by the home package). The order of the list
is important: it is sorted in such a way that linking any prefix of
the list will result in an image with no unresolved references. Note
that for batch linking there may be further restrictions; for example
it may not be possible to link recursive groups containing libraries.
@link@ does the following:
\begin{itemize}
\item
In batch mode, do nothing. In interactive mode,
examine the supplied @[[Linkable]]@ to determine which home
module @Unlinked@s are new. Remove precisely these @Linkable@s
from PLS. (In fact we really need to remove their upwards
transitive closure, but I think it is an invariant that CM will
supply an upwards transitive closure of new modules).
See below for descriptions of @Linkable@ and @Unlinked@.
\item
Batch system: invoke the external linker to link everything in one go.
Interactive: bind the @Unlinked@s for the newly compiled modules,
plus those for any newly required packages, into PLS.
Note that it is the linker's responsibility to remember which
objects and packages have already been linked. By comparing this
with the @Linkable@s supplied to @link@, it can determine which
of the linkables in LI are out of date.
\end{itemize}
If linking in of a group should fail for some reason, @link@ should
not modify its PLS at all. In other words, linking each group
is atomic; it either succeeds or fails.
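
One way to get this per-group atomicity is to build the new state
functionally and only adopt it if the whole group links. The sketch
below assumes a toy @PLS@ (a map from symbols to addresses) and a
hypothetical @Group@ type; it also assumes that groups linked before
the failing one stay linked, which is one reading of the requirement.
\begin{verbatim}
import qualified Data.Map as Map

-- Hypothetical stand-ins for the real types.
type Symbol = String
type PLS    = Map.Map Symbol Int          -- symbol -> address, say
data Group  = Group { defines :: [(Symbol, Int)], needs :: [Symbol] }

data LinkResult = LinkOK PLS | LinkErrs PLS [String]

-- Link one group: succeed only if every needed symbol is either
-- already in the state or defined by the group itself.
linkGroup :: PLS -> Group -> Either [String] PLS
linkGroup pls g
  | null missing = Right candidate
  | otherwise    = Left ["unresolved: " ++ s | s <- missing]
  where
    candidate = Map.union (Map.fromList (defines g)) pls
    missing   = [s | s <- needs g, not (Map.member s candidate)]

-- Link groups left to right.  On failure, return the state as it
-- was before the failing group, so that group leaves no trace.
linkGroups :: PLS -> [Group] -> LinkResult
linkGroups pls []       = LinkOK pls
linkGroups pls (g : gs) =
  case linkGroup pls g of
    Right pls' -> linkGroups pls' gs
    Left errs  -> LinkErrs pls errs
\end{verbatim}
Because the list is ordered so that every prefix is closed, a group
may refer to itself and to earlier groups, but never to later ones.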
\subsubsection*{\mbox{\tt Unlinked} and \mbox{\tt Linkable}}
Two important types: @Unlinked@ and @Linkable@. The latter is a
higher-level representation involving several of the former.
An @Unlinked@ is a reference to unlinked executable code, something
a linker could take as input:
\begin{verbatim}
data Unlinked = DotO   Path
              | DotA   Path
              | DotDLL Path
              | Trees  [StgTree RdrName]
\end{verbatim}
The first three describe the location of a file (presumably)
containing the code to link. @Trees@, which only exists in
interactive mode, gives a list of @StgTree@s, in which the unresolved
references are @RdrNames@ -- hence its non-linkedness. Once linked,
those @RdrNames@ are replaced with pointers to the machine code
implementing them.
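
Linking a tree is then just a traversal that replaces each leaf by
the result of a symbol table lookup. The following is a minimal
sketch with a deliberately tiny, made-up @StgTree@ (names at the
leaves only) and a toy @HValue@; @linkTree@ is a hypothetical name.
\begin{verbatim}
import qualified Data.Map as Map

-- Hypothetical stand-ins: a toy tree with names at the leaves.
type RdrName = String
newtype HValue = HValue Int deriving (Eq, Show)
data StgTree a = Leaf a | Node [StgTree a] deriving (Eq, Show)

-- Replace every leaf using the symbol table.  The result is either
-- the list of names that could not be resolved, or the linked tree.
linkTree :: Map.Map RdrName HValue -> StgTree RdrName
         -> Either [RdrName] (StgTree HValue)
linkTree sst (Leaf n) =
  maybe (Left [n]) (Right . Leaf) (Map.lookup n sst)
linkTree sst (Node ts) =
  case ([ns | Left ns <- rs], [t | Right t <- rs]) of
    ([], linked) -> Right (Node linked)
    (errs, _)    -> Left (concat errs)
  where rs = map (linkTree sst) ts
\end{verbatim}
Collecting all unresolved names, rather than stopping at the first,
makes it easy to report every missing symbol in one go.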
A @Linkable@ gathers together several @Unlinked@s and associates them
with either a module or package:
\begin{verbatim}
data Linkable = LM Module [Unlinked]   -- a module
              | LP PkgName             -- a package
\end{verbatim}
The order of the @Unlinked@s in the list is important, as
they are linked in left-to-right order. The @Unlinked@ objects for a
particular package can be obtained from the package configuration (see
Section \ref{sec:staticinfo}).
\ToDo{When adding @Addr@s from an object module to SST, we need to
somehow find out the @RdrName@s of the symbols exported by that
module.
So we'd need to pass in the @ModDetails@ or @ModIFace@ or some such?}
%%-----------------------------------------------------------------%%
\section{Background ideas}
\subsubsection*{Out of date, but correct in spirit}
\subsection{Restructuring the system}
At the moment @hsc@ compiles one source module into C or assembly.
This functionality is pushed inside a function called @compile@,
introduced shortly. The main new chunk of code is CM, the compilation manager,
which supervises multiple runs of @compile@ so as to create up-to-date
translations of a whole bunch of modules, as quickly as possible.
CM also employs some minor helper functions, @finder@, @summarise@ and
@link@, to do its work.
Our intent is to allow CM to be used as the basis either of a
multi-module, batch mode compilation system, or to supply an
interactive environment similar to that of Hugs.
Only minor modifications to the behaviour of @compile@ and @link@
are needed to give these different behaviours.
CM and @compile@, and, for interactive use, an interpreter, are the
main code components. The most important data structure is the global
symbol table; much design effort has been expended thereupon.
\subsection{How the global symbol table is implemented}
The top level symbol table is a @FiniteMap@ @ModuleName@
@ModuleDetails@. @ModuleDetails@ contains essentially the environment
created by compiling a module. CM manages this finite map, adding and
deleting module entries as required.
The @ModuleDetails@ for a module @M@ contains descriptions of all
tycons, classes, instances, values, unfoldings, etc (henceforth
referred to as ``entities''), available from @M@. These are just
trees in the GHCI heap. References from other modules to these
entities are direct -- when you have a @TyCon@ in your hand, you really
have a pointer directly to the @TyCon@ structure in the defining module,
rather than some kind of index into a global symbol table. So there
is a global symbol table, but it has a distributed (spaghetti-like?)
nature.
This gives fast and convenient access to tycon, class, instance,
etc, information. But because there are no levels of indirection,
there's a problem when we replace @M@ with an updated version of @M@.
We then need to find all references to entities in the old @M@'s
spaghetti, and replace them with pointers to the new @M@'s spaghetti.
This problem motivates a large part of the design.
\subsection{Implementing incremental recompilation -- simple version}
Given the following module graph
\begin{verbatim}
         D
        / \
       /   \
      B     C
       \   /
        \ /
         A
\end{verbatim}
(@D@ imports @B@ and @C@, @B@ imports @A@, @C@ imports @A@) the aim is to do the
least possible amount of compilation to bring @D@ back up to date. The
simplest scheme we can think of is:
\begin{itemize}
\item {\bf Downsweep}:
starting with @D@, re-establish what the current module graph is
(it might have changed since last time). This means getting a
@ModuleSummary@ of @D@. The summary can be quickly generated,
contains @D@'s import lists, and gives some way of knowing whether
@D@'s source has changed since the last time it was summarised.
Transitively follow summaries from @D@, thereby establishing the
module graph.
\item
Remove from the global symbol table (the @FiniteMap@ @ModuleName@
@ModuleDetails@) the upwards closure of all modules in this package
which are out-of-date with respect to their previous versions. Also
remove all modules no longer reachable from @D@.
\item {\bf Upsweep}:
Starting at the lowest point in the still-in-date module graph,
start compiling upwards, towards @D@. At each module, call
@compile@, passing it a @FiniteMap@ @ModuleName@ @ModuleDetails@,
and getting a new @ModuleDetails@ for the module, which is added to
the map.
When compiling a module, the compiler must be able to know which
entries in the map are for modules in its strict downwards closure,
and which aren't, so that it can manufacture the instance
environment correctly (as the union of the instances in its downwards
closure).
\item
Once @D@ has been compiled, invoke some kind of linking phase
if compiling in batch mode. For interactive use, linking can either
be done all at the end, or as you go along.
\end{itemize}
In this simple world, recompilation visits the upwards closure of
all changed modules. That means when a module @M@ is recompiled,
we can be sure no-one has any references to entities in the old @M@,
because modules importing @M@ will have already been removed from the
top-level finite map in the second step above.
The upshot is that we don't need to worry about updating links to @M@ in
the global symbol table -- there shouldn't be any to update.
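
The upwards closure used in the second step can be computed by a
simple fixed-point iteration over the import graph. The sketch below
assumes the graph is available as a map from each module to its
direct imports; @Imports@ and @upwardsClosure@ are names made up for
this example.
\begin{verbatim}
import qualified Data.Map as Map
import qualified Data.Set as Set

type ModuleName = String

-- The import graph: each module mapped to the modules it imports.
type Imports = Map.Map ModuleName [ModuleName]

-- All modules that import, directly or indirectly, any module in
-- the given set, together with that set itself.
upwardsClosure :: Imports -> Set.Set ModuleName -> Set.Set ModuleName
upwardsClosure imports changed
  | Set.null new = changed
  | otherwise    = upwardsClosure imports (Set.union changed new)
  where
    new = Set.fromList
            [ m | (m, deps) <- Map.toList imports
                , m `Set.notMember` changed
                , any (`Set.member` changed) deps ]
\end{verbatim}
For the graph above, a change to @B@ alone gives the closure @B@ and
@D@, whereas a change to @A@ gives all four modules.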
\ToDo{What about mutually recursive modules?}
CM will happily chase through module interfaces in other packages in
the downsweep. But it will only process modules in this package
during the upsweep. So it assumes that modules in other packages
never become out of date. This is a design decision -- we could have
decided otherwise.
In fact we go further, and require other packages to be compiled,
i.e. to consist of a collection of interface files, and one or more
object files. CM will never apply @compile@ to a foreign package
module, so there's no way a package can be built on the fly from source.
We require @compile@ to cache foreign package interfaces it reads, so
that subsequent uses don't have to re-read them. The cache never
becomes out of date, since we've assumed that the source of foreign
packages doesn't change during the course of a session (run of GHCI).
As well as caching interfaces, @compile@ must cache, in some sense,
the linkable code for modules. In batch compilation this might simply
mean remembering the names of object files to link, whereas in
interactive mode @compile@ probably needs to load object code into
memory in preparation for in-memory linking.
Important signatures for this simple scheme are:
\begin{verbatim}
finder    :: ModuleName -> ModLocation

summarise :: ModLocation -> IO ModSummary

compile   :: ModSummary
          -> FM ModName ModDetails
          -> IO CompileResult

data CompileResult = CompOK  ModDetails
                   | CompErr [ErrMsg]

link :: [ModLocation] -> [PackageLocation] -> IO Bool  -- linked ok?
\end{verbatim}
\subsection{Implementing incremental recompilation -- clever version}
So far, our upsweep, which is the computationally expensive bit,
recompiles a module if either its source is out of date, or it
imports a module which has been recompiled. Sometimes we know
we can do better than this:
\begin{verbatim}
module B where              module A where
import A ( f )              {-# NOINLINE f #-}
... f ...                   f x = x + 42
\end{verbatim}
If the definition of @f@ is changed to @f x = x + 43@, the simple
upsweep would recompile @B@ unnecessarily. We would like to detect
this situation and avoid propagating recompilation all the way to the
top. There are two parts to this: detecting when a module doesn't
need recompilation, and managing inter-module references in the
global symbol table.
\subsubsection*{Detecting when a module doesn't need recompilation}
To do this, we introduce a new concept: the @ModuleIFace@. This is
effectively an in-memory interface file. References to entities in
other modules are done via strings, rather than being pointers
directly to those entities. Recall that, by comparison,
@ModuleDetails@ do contain pointers directly to the entities they
refer to. So a @ModuleIFace@ is not part of the global symbol table.
As before, compiling a module produces a @ModuleDetails@ (inside the
@CompileResult@), but it also produces a @ModuleIFace@. The latter
records, amongst other things, the version numbers of all imported entities
needed for the compilation of that module. @compile@ optionally also
takes the old @ModuleIFace@ as input during compilation:
\begin{verbatim}
data CompileResult = CompOK  ModDetails ModIFace
                   | CompErr [ErrMsg]

compile :: ModSummary
        -> FM ModName ModDetails
        -> Maybe ModIFace
        -> IO CompileResult
\end{verbatim}
Now, if the @ModuleSummary@ indicates this module's source hasn't
changed, we only need to recompile it if something it depends on has
changed. @compile@ can detect this by inspecting the imported entity
version numbers in the module's old @ModuleIFace@, and comparing them
with the version numbers from the entities in the modules being
imported. If they are all the same, nothing it depends on has
changed, so there's no point in recompiling.
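
The version comparison might be sketched as follows. We assume, purely
for illustration, that an old @ModuleIFace@ can be boiled down to a map
from imported entities to the version numbers seen last time, and that
the current versions are available in the same form; @UsedVersions@,
@CurrentVersions@ and @importsChanged@ are invented names.
\begin{verbatim}
import qualified Data.Map as Map

type ModuleName = String
type EntityName = String
type Version    = Int

-- Hypothetical: the versions of imported entities that an old
-- interface recorded when its module was last compiled.
type UsedVersions = Map.Map (ModuleName, EntityName) Version

-- Hypothetical: the versions currently exported by each module.
type CurrentVersions = Map.Map (ModuleName, EntityName) Version

-- True if some entity used last time has vanished or has a
-- different version now.  Entities we never used are ignored.
importsChanged :: UsedVersions -> CurrentVersions -> Bool
importsChanged used current =
  or [ Map.lookup k current /= Just v | (k, v) <- Map.toList used ]
\end{verbatim}
Only the entities actually used are consulted, so a change to some
other export of an imported module does not force recompilation.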
\subsubsection*{Managing inter-module references in the global symbol table}
In the above example with @A@, @B@ and @f@, the specified change to @f@ would
require @A@ but not @B@ to be recompiled. That generates a new
@ModuleDetails@ for @A@. Problem is, if we leave @B@'s @ModuleDetails@
unchanged, they continue to refer (directly) to the @f@ in @A@'s old
@ModuleDetails@. This is not good, especially if equality between
entities is implemented using pointer equality.
One solution is to throw away @B@'s @ModuleDetails@ and recompile @B@.
But this is precisely what we're trying to avoid, as it's expensive.
Instead, a cheaper mechanism achieves the same thing: recreate @B@'s
details directly from the old @ModuleIFace@. The @ModuleIFace@ will
(textually) mention @f@; @compile@ can then find a pointer to the
up-to-date global symbol table entry for @f@, and place that pointer
in @B@'s @ModuleDetails@. The @ModuleDetails@ are, therefore,
regenerated just by a quick lookup pass over the module's former
@ModuleIFace@. All this applies, of course, only when @compile@ has
concluded it doesn't need to recompile @B@.
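
The lookup pass could be sketched like this. The types are much
reduced, hypothetical versions of the real ones: an @Entity@ carries
an @Int@ standing in for its identity, a @ModuleIFace@ is just a list
of textual references, and a @ModuleDetails@ is just the entities a
module defines. @resolveUses@ is an invented name.
\begin{verbatim}
import qualified Data.Map as Map

type ModuleName = String
type EntityName = String

-- Hypothetical entity: the Int stands for the identity given to it
-- by the compilation that created it.
data Entity = Entity { entName :: EntityName, entUnique :: Int }
  deriving (Eq, Show)

-- Textual references to entities in other modules.
newtype ModuleIFace = ModuleIFace
  { ifaceUses :: [(ModuleName, EntityName)] }

-- The entities a module defines, by name.
newtype ModuleDetails = ModuleDetails
  { defined :: Map.Map EntityName Entity }

-- Turn each textual reference into a direct link to the current
-- entity.  Nothing means some reference no longer resolves, in
-- which case a real recompilation is needed after all.
resolveUses :: Map.Map ModuleName ModuleDetails -> ModuleIFace
            -> Maybe [Entity]
resolveUses table = mapM resolve . ifaceUses
  where
    resolve (m, n) = Map.lookup m table >>= Map.lookup n . defined
\end{verbatim}
After @A@ is recompiled and its new details are in the table, running
this pass over @B@'s old interface yields links to the new @f@.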
Now @compile@'s signature becomes a little clearer. @compile@ has to
recompile the module, generating a fresh @ModuleDetails@ and
@ModuleIFace@, if any of the following hold:
\begin{itemize}
\item
The old @ModuleIFace@ wasn't supplied, for some reason (perhaps
we've never compiled this module before?)
\item
The module's source has changed.
\item
The module's source hasn't changed, but inspection of @ModuleIFaces@
for this and its imports indicates that an imported entity has
changed.
\end{itemize}
If none of those are true, we're in luck: quickly knock up a new
@ModuleDetails@ from the old @ModuleIFace@, and return them both.
As a result, the upsweep still visits all modules in the upwards
closure of those whose sources have changed. However, at some point
we hopefully make a transition from generating new @ModuleDetails@ the
expensive way (recompilation) to a cheap way (recycling old
@ModuleIFaces@). Either way, all modules still get new
@ModuleDetails@, so the global symbol table is correctly
reconstructed.
\subsection{How linking works, roughly}
When @compile@ translates a module, it produces a @ModuleDetails@,
@ModuleIFace@ and a @Linkable@. The @Linkable@ contains the
translated but un-linked code for the module. And when @compile@
ventures into an interface in a package it hasn't seen so far, it
copies the package's object code into memory, producing one or more
@Linkable@s. CM keeps track of these linkables.
Once all modules have been @compile@d, CM invokes @link@, supplying
all the @Linkable@s it knows about. If @compile@ had also been
linking incrementally as it went along, @link@ doesn't have to do
anything. On the other hand, @compile@ could choose not to be
incremental, and leave @link@ to do all the work.
@Linkable@s are opaque to CM. For batch compilation, a @Linkable@
can record just the name of an object file, DLL, archive, or whatever,
in which case the CM's call to @link@ supplies exactly the set of
file names to be linked. @link@ can pass these verbatim to the
standard system linker.
%%-----------------------------------------------------------------%%
\section{Ancient stuff}
\subsubsection*{Should be selectively merged into ``Background ideas''}
\subsection{Overall}
Top level structure is:
\begin{itemize}
\item The Compilation Manager (CM) calculates and maintains module
dependencies, and knows how to create up-to-date object or bytecode
for a given module. In doing so it may need to recompile
arbitrary other modules, based on its knowledge of the module
dependencies.
\item On top of the CM are the ``user-level'' services. We envisage
both a HEP-like interface, for interactive use, and an
@hmake@ style batch compiler facility.
\item The CM only deals with inter-module issues. It knows nothing
about how to recompile an individual module, nor where the compiled
result for a module lives, nor how to tell if
a module is up to date, nor how to find the dependencies of a module.
Instead, these services are supplied abstractly to CM via a
@Compiler@ record. To a first approximation, a @Compiler@
contains
the same functionality as @hsc@ has had until now -- the ability to
translate a single Haskell module to C/assembly/object/bytecode.
Different clients of CM (HEP vs @hmake@) may supply different
@Compiler@s, since they need slightly different behaviours.
Specifically, HEP needs a @Compiler@ which creates bytecode
in memory, and knows how to link it, whereas @hmake@ wants
the traditional behaviour of emitting assembly code to disk,
and making no attempt at linkage.
\end{itemize}
\subsection{Open questions}
\begin{itemize}
\item
Error reporting from @open@ and @compile@.
\item
Instance environment management
\item
We probably need to make interface files say what
packages they depend on (so that we can figure out
which packages to load/link).
\item
CM is parameterised both by the client that uses it and by the @Compiler@
supplied. But it doesn't make sense to have a HEP-style client
attached to a @hmake@-style @Compiler@. So, really, the
parameterising entity should contain both aspects, not just the
current @Compiler@ contents.
\end{itemize}
\subsection{Assumptions}
\begin{itemize}
\item Packages other than the ``current'' one are assumed to be
already compiled.
\item
The ``current'' package is usually ``MAIN'',
but we can set it with a command-line flag.
One invocation of ghci has only one ``current'' package.
\item
Packages are not mutually recursive
\item
All the object code for a package P is in libP.a or libP.dll
\end{itemize}
\subsection{Stuff we need to be able to do}
\begin{itemize}
\item Create the environment in which a module has been translated,
so that interactive queries can be satisfied as if ``in'' that
module.
\end{itemize}
%%-----------------------------------------------------------------%%
\section{The Compilation Manager}
CM (@compilationManager@) is a functor, thus:
\begin{verbatim}
compilationManager :: Compiler -> IO HEP  -- IO so that it can create
                                          -- global vars (IORefs)
data HEP = HEP {
        load          :: ModuleName -> IO (),
        compileString :: ModuleName -> String -> IO HValue,
        ....
   }
newCompiler :: IO Compiler   -- ??? this is a peer of compilationManager?
run :: HValue -> IO ()       -- Run an HValue of type IO ()
                             -- In HEP?
\end{verbatim}
@load@ is the central action of CM: its job is to bring a module and
all its descendants into an executable state, by doing the following:
\begin{enumerate}
\item
Use @summarise@ to descend the module hierarchy, starting from the
nominated root, creating @ModuleSummary@s, and
building a map @ModuleName@ @->@ @ModuleSummary@. @summarise@
expects to be passed absolute paths to files. Use @finder@ to
convert module names to file paths.
\item
Topologically sort the map,
using dependency info in the @ModuleSummary@s.
\item
Clean up the symbol table by deleting the upward closure of
changed modules.
\item
Working bottom to top, call @compile@ on the upward closure of
all modules whose source has changed. A module's source has
changed when @sourceHasChanged@ indicates there is a difference
between old and new summaries for the module. Update the running
@FiniteMap@ @ModuleName@ @ModuleDetails@ with the new details
for this module. Ditto for the running
@FiniteMap@ @ModuleName@ @ModuleIFace@.
\item
Call @compileDone@ to signify that we've reached the top, so
that the batch system can now link.
\end{enumerate}
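
The topological sort in the second step can be had almost for free
from @Data.Graph@ in the @containers@ library, whose
@stronglyConnComp@ returns components with dependencies first. The
sketch below assumes a made-up @ModuleSummary@ holding only a name
and an import list; @compileOrder@ and @flatOrder@ are hypothetical
names.
\begin{verbatim}
import Data.Graph (SCC, flattenSCCs, stronglyConnComp)

type ModuleName = String

-- Hypothetical summary: just a module's name and its imports.
data ModuleSummary = ModuleSummary
  { msName :: ModuleName, msImports :: [ModuleName] }

-- Dependencies first, importers last.  Mutually recursive modules
-- come out together as a single cyclic component.
compileOrder :: [ModuleSummary] -> [SCC ModuleName]
compileOrder summaries =
  stronglyConnComp [ (msName s, msName s, msImports s) | s <- summaries ]

-- The same order with the grouping forgotten.
flatOrder :: [ModuleSummary] -> [ModuleName]
flatOrder = flattenSCCs . compileOrder
\end{verbatim}
Working through the result from left to right is then the
bottom-to-top order wanted by the fourth step.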
%%-----------------------------------------------------------------%%
\section{A compiler}
Most of the system's complexity is hidden inside the functions
supplied in the @Compiler@ record:
\begin{verbatim}
data Compiler = Compiler {
        finder    :: PackageConf -> [Path] -> IO (ModuleName -> ModuleLocation),
        summarise :: ModuleLocation -> IO ModuleSummary,
        compile   :: ModuleSummary
                  -> Maybe ModuleIFace
                  -> FiniteMap ModuleName ModuleDetails
                  -> IO CompileResult,
        compileDone     :: IO (),
        compileStarting :: IO ()   -- still needed? I don't think so.
   }
type ModuleName = String   -- or some such
type Path       = String   -- an absolute file name
\end{verbatim}
\subsection{The module \mbox{\tt finder}}
The @finder@, given a package configuration file and a list of
directories to look in, will map module names to @ModuleLocation@s,
in which the @Path@s are filenames, probably with an absolute path
to them.
\begin{verbatim}
data ModuleLocation = SourceOnly Path        -- .hs
                    | ObjectCode Path Path   -- .o & .hi
                    | InPackage  Path        -- .hi
\end{verbatim}
@SourceOnly@ and @ObjectCode@ are unremarkable. For sanity,
we require that a module's object and interface be in the same
directory. @InPackage@ indicates that the module is in a
different package.
@Module@ values -- perhaps all @Name@ish things -- contain the name of
their package. That's so that
\begin{itemize}
\item Correct code can be generated for in-DLL vs out-of-DLL refs.
\item We don't have version number dependencies for symbols
imported from different packages.
\end{itemize}
Somehow or other, it will be possible to know all the packages
required, so that the linker can load them.
We could detect package dependencies by recording them in the
@compile@r's @ModuleIFace@ cache, and with that and the
package config info, figure out the complete set of packages
to link. Or look at the command line args on startup.
\ToDo{Need some way to tell incremental linkers about packages,
since in general we'll need to load and link them before
linking any modules in the current package.}
\subsection{The module \mbox{\tt summarise}r}
Given a filename of a module (\ToDo{presumably source or iface}),
create a summary of it. A @ModuleSummary@ should contain only enough
information for CM to construct an up-to-date picture of the
dependency graph. Rather than expose CM to details of timestamps,
etc, @summarise@ merely provides an up-to-date summary of any module.
CM can extract the list of dependencies from a @ModuleSummary@, but
other than that has no idea what's inside it.
\begin{verbatim}
data ModuleSummary = ... (abstract) ...
depsFromSummary :: ModuleSummary -> [ModuleName] -- module names imported
sourceHasChanged :: ModuleSummary -> ModuleSummary -> Bool
\end{verbatim}
@summarise@ is intended to be fast -- a @stat@ of the source or
interface to see if it has changed, and, if so, a quick semi-parse to
determine the new imports.
\subsection{The module \mbox{\tt compile}r}
@compile@ traffics in @ModuleIFace@s and @ModuleDetails@.
A @ModuleIFace@ is an in-memory representation of the contents of an
interface file, including version numbers, unfoldings and pragmas, and
the linkable code for the module. @ModuleIFace@s are un-renamed,
using @HsSym@/@RdrNames@ rather than (globally distinct) @Names@.
@ModuleDetails@, by contrast, is an in-memory representation of the
static environment created by compiling a module. It is phrased in
terms of post-renaming @Names@, @TyCon@s, etc, so it's basically a
renamed-to-global-uniqueness rendition of a @ModuleIFace@.
In an interactive session, we'll want to be able to evaluate
expressions as if they had been compiled in the scope of some
specified module. This means that the @ModuleDetails@ must contain
the type of everything defined in the module, rather than just the
types of exported stuff. As a consequence, @ModuleIFace@ must also
contain the type of everything, because it should always be possible
to generate a module's @ModuleDetails@ from its @ModuleIFace@.
CM maintains two mappings, one from @ModuleName@s to @ModuleIFace@s,
the other from @ModuleName@s to @ModuleDetails@. It passes the latter
to each call of @compile@. This is used to supply information about
modules compiled prior to this one (lower down in the graph). The
returned @CompileResult@ supplies a new @ModuleDetails@ for the module
if compilation succeeded, and CM adds this to the mapping. The
@CompileResult@ also supplies a new @ModuleIFace@, which is either the
same as that supplied to @compile@, if @compile@ decided not to
retranslate the module, or is the result of a fresh translation (from
source). So these mappings are an explicitly-passed-around part of
the global system state.
@compile@ may {\em optionally} also accumulate @ModuleIFace@s for
modules in different packages -- that is, interfaces which we read,
but never attempt to recompile source for. Such interfaces, being
from foreign packages, never change, so @compile@ can accumulate them
in perpetuity in a private global variable. Indeed, a major motivator
of this design is to facilitate this caching of interface files,
reading of which is a serious bottleneck for the current compiler.
When CM restarts compilation down at the bottom of the module graph,
it first needs to throw away all \ToDo{all?} @ModuleDetails@ in the
upward closure of the out-of-date modules. So @ModuleDetails@ don't
persist across recompilations. But @ModuleIFace@s do, since they
are conceptually equivalent to interface files.
\subsubsection*{What @compile@ returns}
@compile@ returns a @CompileResult@ to CM.
Note that @compile@'s foreign-package interface cache can
become augmented even as a result of reading interfaces for a
compilation attempt which ultimately fails, although it will not be
augmented with a new @ModuleIFace@ for the failed module.
\begin{verbatim}
-- CompileResult is not abstract to the Compilation Manager
data CompileResult
   = CompOK   ModuleIFace
              ModuleDetails  -- compiled ok, here are new details
                             -- and new iface
   | CompErr  [SDoc]         -- compilation gave errors
   | NoChange                -- no change required, meaning:
                             -- exports, unfoldings, strictness, etc,
                             -- unchanged, and executable code unchanged
\end{verbatim}
\subsubsection*{Re-establishing local-to-global name mappings}
Consider
\begin{verbatim}
module Upper where                  module Lower ( f ) where
import Lower ( f )                  f = ...
g = ... f ...
\end{verbatim}
When @Lower@ is first compiled, @f@ is allocated a @Unique@
(presumably inside an @Id@ or @Name@?). When @Upper@ is then
compiled, its reference to @f@ is attached directly to the
@Id@ created when compiling @Lower@.
If the definition of @f@ is now changed, but not the type,
unfolding, strictness, or any other thing which affects the way
it should be called, we will have to recompile @Lower@, but not
@Upper@. This creates a problem -- @g@ will then refer to
the old @Id@ for @f@, not the new one. This may or may not
matter, but it seems safer to ensure that all @Unique@-based
references into child modules are always up to date.
So @compile@ recreates the @ModuleDetails@ for @Upper@ from
the @ModuleIFace@ of @Upper@ and the @ModuleDetails@ of @Lower@.
The rule is: if a module is up to date with respect to its
source, but a child @C@ has changed, then either:
\begin{itemize}
\item On examination of the version numbers in @C@'s
interface/@ModuleIFace@ that we used last time, we discover that
an @Id@/@TyCon@/class/instance we depend on has changed. So
we need to retranslate the module from its source, generating
a new @ModuleIFace@ and @ModuleDetails@.
\item Or: nothing that we depend on in @C@'s interface has changed.
So we quickly recreate a new @ModuleDetails@ from the existing
@ModuleIFace@, creating fresh links to the new @Unique@-world
entities in @C@'s new @ModuleDetails@.
\end{itemize}
Upshot: we need to redo @compile@ on all modules all the way up,
rather than just the ones that need retranslation. However, we hope
that most modules won't need retranslation -- just regeneration of the
@ModuleDetails@ from the @ModuleIFace@. In effect, the @ModuleIFace@
is a quickly-compilable representation of the module's contents, just
enough to create the @ModuleDetails@.
\ToDo{Is there anything in @ModuleDetails@ which can't be
recreated from @ModuleIFace@ ?}
So the @ModuleIFace@s persist across calls to @HEP.load@, whereas
@ModuleDetails@ are reconstructed on every compilation pass. This
means that @ModuleIFace@s have the same lifetime as the byte/object
code, and so should somehow contain their code.
The behind-the-scenes @ModuleIFace@ cache has some kind of holding-pen
arrangement, to lazify the copying-out of stuff from it, and thus to
minimise redundant interface reading. \ToDo{Burble burble. More
details.}
When CM starts working back up the module graph with @compile@, it
needs to remove from the travelling @FiniteMap@ @ModuleName@
@ModuleDetails@ the details for all modules in the upward closure of
the compilation start points. However, since we're going to visit
precisely those modules and no others on the way back up, we might as
well just zap the old @ModuleDetails@ incrementally. This does
mean that the @FiniteMap@ @ModuleName@ @ModuleDetails@ will be
inconsistent until we reach the top.
In interactive mode, each @compile@ call on a module for which no
object code is available, or for which it is out of date wrt source,
should emit bytecode into memory, update the resulting @ModuleIFace@
with the address of the bytecode image, and link the image.
In batch mode, @compile@ instead emits assembly or object code onto
disk, and records somewhere \ToDo{where?} that this object file needs
to go into the final link.
When we reach the top, @compileDone@ is called, to signify that batch
linking can now proceed, if need be.
Modules in other packages never get a @ModuleIFace@ or @ModuleDetails@
entry in CM's maps -- those maps are only for modules in this package.
As previously mentioned, @compile@ may optionally cache @ModuleIFace@s
for foreign package modules. When reading such an interface, we don't
need to read the version info for individual symbols, since foreign
packages are assumed static.
\subsubsection*{What's in a \mbox{\tt ModuleIFace}?}
Current interface file contents?
\subsubsection*{What's in a \mbox{\tt ModuleDetails}?}
There is no global symbol table @:: Name -> ???@. To look up a
@Name@, first extract the @ModuleName@ from it, look that up in
the passed-in @FiniteMap@ @ModuleName@ @ModuleDetails@,
and finally look in the relevant @Env@.
\ToDo{Do we still have the @HoldingPen@, or is it now composed from
per-module bits too?}
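
The two-step lookup might look roughly like this. The sketch makes
several assumptions: @Data.Map@ stands in for @FiniteMap@, a @Name@
is simplified to a module name plus an occurrence name, @typeEnv@ is
a map rather than a function, and @lookupName@ is a hypothetical
helper.
\begin{verbatim}
module Main where

import qualified Data.Map as Map

type ModuleName = String
type OccName    = String

-- Hypothetical simplification: a Name carries its defining module.
data Name = Name { nameModule :: ModuleName, nameOcc :: OccName }
  deriving (Eq, Ord, Show)

-- Payloads are plain strings here, not real Class/TyCon/Id values.
data TyThing = AClass String | ATyCon String | AnId String
  deriving Show

-- Only the field needed for the lookup is modelled.
newtype ModuleDetails = ModuleDetails
  { typeEnv :: Map.Map Name TyThing }

type DetailsMap = Map.Map ModuleName ModuleDetails

-- Step 1: find the details of the defining module.
-- Step 2: look the Name up in that module's own environment.
lookupName :: DetailsMap -> Name -> Maybe TyThing
lookupName details name = do
  md <- Map.lookup (nameModule name) details
  Map.lookup name (typeEnv md)

main :: IO ()
main = do
  let n       = Name "Foo" "bar"
      details = Map.fromList
        [ ("Foo", ModuleDetails (Map.fromList [(n, AnId "bar")])) ]
  print (lookupName details n)
  print (lookupName details (Name "Baz" "quux"))
\end{verbatim}
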
\begin{verbatim}
data ModuleDetails = ModuleDetails {

     moduleExports :: what it exports (Names)
                      -- roughly a subset of the .hi file contents

     moduleEnv     :: RdrName -> Name
                      -- maps top-level entities in this module to
                      -- globally distinct (Uniq-ified) Names

     moduleDefs    :: Bag Name -- All the things in the global symbol table
                               -- defined by this module

     package       :: Package -- what package am I in?

     lastCompile   :: Date -- of last compilation

     instEnv       :: InstEnv -- local inst env
     typeEnv       :: Name -> TyThing -- local tycon env?
  }

-- A (globally unique) symbol table entry. Note that Ids contain
-- unfoldings.
data TyThing = AClass Class
             | ATyCon TyCon
             | AnId Id
\end{verbatim}
What's the stuff in @ModuleDetails@ used for?
\begin{itemize}
\item @moduleExports@ so that the stuff which is visible from outside
the module can be calculated.
\item @moduleEnv@: \ToDo{umm err}
\item @moduleDefs@: one reason we want this is so that we can nuke the
global symbol table contribs from this module when it leaves the
system. \ToDo{except ... we don't have a global symbol table any
more.}
\item @package@: we will need to chase arbitrarily deep into the
interfaces of other packages. Of course we don't want to
recompile those, but as we've read their interfaces, we may
as well cache that info. So @package@ indicates whether this
module is in the default package, or, if not, which it is in.
Also, when we come to linking, we'll need to know which
packages are demanded, so we know to load their objects.
\item @lastCompile@: When the module was last compiled. If the
source is older than that, then a recompilation can only be
required if children have changed.
\item @typeEnv@: obvious??
\item @instEnv@: the instances contributed by this module only. The
Report allegedly says that when a module is translated, the
available
instance env is all the instances in the downward closure of
itself in the module graph.
We choose to use this simple representation -- each module
holds just its own instances -- and do the naive thing when
creating an inst env for compilation with. If this turns out
to be a performance problem we'll revisit the design.
\end{itemize}
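
The naive construction of an instance environment described under
@instEnv@ can be sketched as below. The @imports@ field, the
@Instance@ type and the functions @downwardClosure@ and @instEnvFor@
are assumptions made for the example, not part of the design above.
\begin{verbatim}
module Main where

import qualified Data.Map as Map
import qualified Data.Set as Set

type ModuleName = String
type Instance   = String   -- hypothetical stand-in for a real instance

data ModuleDetails = ModuleDetails
  { imports :: [ModuleName]   -- assumed field: direct imports
  , instEnv :: [Instance]     -- instances defined by this module only
  }

type DetailsMap = Map.Map ModuleName ModuleDetails

-- All modules reachable from the root, including the root itself.
downwardClosure :: DetailsMap -> ModuleName -> Set.Set ModuleName
downwardClosure details root = go Set.empty [root]
  where
    go seen []       = seen
    go seen (m : ms)
      | m `Set.member` seen = go seen ms
      | otherwise =
          let kids = maybe [] imports (Map.lookup m details)
          in  go (Set.insert m seen) (kids ++ ms)

-- The naive thing: gather the per-module instances on demand.
instEnvFor :: DetailsMap -> ModuleName -> [Instance]
instEnvFor details root =
  concat [ instEnv md
         | m <- Set.toList (downwardClosure details root)
         , Just md <- [Map.lookup m details] ]

main :: IO ()
main = do
  let details = Map.fromList
        [ ("A", ModuleDetails ["B", "C"] ["Eq T"])
        , ("B", ModuleDetails ["C"]      ["Show U"])
        , ("C", ModuleDetails []         ["Ord V"])
        , ("D", ModuleDetails []         ["Num W"]) ]
  -- "D" is not below "A", so its instance is left out.
  print (instEnvFor details "A")
\end{verbatim}
Each call recomputes the closure, which is the cost that may need
revisiting if it proves to be a performance problem.
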
%%-----------------------------------------------------------------%%
\section{Misc text looking for a home}
\subsection*{Linking}
\ToDo{All this linking stuff is now bogus.}
There's an abstract @LinkState@, which is threaded through the linkery
bits. CM can call @addpkgs@ to notify the linker of packages
required, and it can call @addmods@ to announce modules which need to
be linked. Finally, CM calls @endlink@, after which an executable
image should be ready. The linker may link incrementally, during each
call of @addpkgs@ and @addmods@, or it can just store up names and do
all the linking when @endlink@ is called.
In order that incremental linking is possible, CM should specify
packages and module groups in dependency order, i.e., from the bottom up.
\subsection*{In-memory linking of bytecode}
When being HEP-like, @compile@ will translate sources to bytecodes
in memory, with all the bytecode for a module as a contiguous lump
outside the heap. It needs to communicate the addresses of these
lumps to the linker. The linker also needs to know whether a
given module is available as in-memory bytecode, or whether it
needs to load machine code from a file.
I guess @LinkState@ needs to map module names to the base addresses
of their loaded images, together with the nature of each image and
whether or not the image has been linked.
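
One possible shape for such a @LinkState@ is sketched below. All of
the names here (@Image@, @ImageKind@, @registerImage@, @markLinked@,
@pendingModules@) are hypothetical, and the base address is modelled
as a plain @Word64@.
\begin{verbatim}
module Main where

import qualified Data.Map as Map
import Data.Word (Word64)

type ModuleName = String

-- The nature of a loaded image.
data ImageKind = InMemoryBytecode | MachineCodeFromFile FilePath
  deriving (Eq, Show)

data Image = Image
  { imageBase   :: Word64      -- base address of the loaded image
  , imageKind   :: ImageKind
  , imageLinked :: Bool        -- has this image been linked yet?
  } deriving (Eq, Show)

newtype LinkState = LinkState (Map.Map ModuleName Image)

emptyLinkState :: LinkState
emptyLinkState = LinkState Map.empty

-- Record a freshly loaded, not yet linked, image.
registerImage
  :: ModuleName -> Word64 -> ImageKind -> LinkState -> LinkState
registerImage m base kind (LinkState st) =
  LinkState (Map.insert m (Image base kind False) st)

markLinked :: ModuleName -> LinkState -> LinkState
markLinked m (LinkState st) =
  LinkState (Map.adjust (\img -> img { imageLinked = True }) m st)

-- Modules whose images still await linking.
pendingModules :: LinkState -> [ModuleName]
pendingModules (LinkState st) =
  Map.keys (Map.filter (not . imageLinked) st)

main :: IO ()
main = do
  let st = markLinked "A"
         . registerImage "B" 0x2000 (MachineCodeFromFile "B.o")
         . registerImage "A" 0x1000 InMemoryBytecode
         $ emptyLinkState
  print (pendingModules st)
\end{verbatim}
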
\subsection*{On disk linking of object code, to give an executable}
The @LinkState@ in this case is just a list of module and package
names, which @addpkgs@ and @addmods@ add to. The final @endlink@
call can invoke the system linker.
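
For the on-disk case the state really is just two lists, as the
following sketch suggests. The signatures of @addpkgs@, @addmods@ and
@endlink@ are guesses, @startLink@ is a hypothetical initial state,
and @endlink@ merely returns the names that would be handed to the
system linker rather than invoking it.
\begin{verbatim}
module Main where

import Data.List (nub)

type PackageName = String
type ModuleName  = String

-- Names accumulated so far, in the order CM announced them.
data LinkState = LinkState
  { lsPackages :: [PackageName]
  , lsModules  :: [ModuleName]
  } deriving Show

startLink :: LinkState
startLink = LinkState [] []

-- Append new names, keeping the first occurrence of each so that
-- the bottom-up order supplied by CM is preserved.
addpkgs :: [PackageName] -> LinkState -> LinkState
addpkgs ps st = st { lsPackages = nub (lsPackages st ++ ps) }

addmods :: [ModuleName] -> LinkState -> LinkState
addmods ms st = st { lsModules = nub (lsModules st ++ ms) }

-- Everything is stored up; a real endlink would now run the linker.
endlink :: LinkState -> ([PackageName], [ModuleName])
endlink st = (lsPackages st, lsModules st)

main :: IO ()
main =
  print (endlink (addmods ["Main"]
                   (addmods ["Utils"]
                     (addpkgs ["pkgA"] startLink))))
\end{verbatim}
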
\subsection{Finding out about packages, dependencies, and auxiliary
objects}
Ask the @packages.conf@ file, which currently lives with the driver.
\ToDo{policy about upward closure?}
\ToDo{record story about how in memory linking is done.}
\ToDo{linker start/stop/initialisation/persistence. Need to
say more about @LinkState@.}
\end{document}