summaryrefslogtreecommitdiff
path: root/compiler/GHC/Unit.hs
blob: 4e9710e239f27a083b43f452f3327d79af54ef7a (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
{-# LANGUAGE TypeSynonymInstances #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE DeriveFunctor #-}

-- | Units are library components from Cabal packages compiled and installed in
-- a database
module GHC.Unit
   ( module GHC.Unit.Types
   , module GHC.Unit.Info
   , module GHC.Unit.Parser
   , module GHC.Unit.State
   , module GHC.Unit.Subst
   , module GHC.Unit.Module
   )
where

import GHC.Unit.Types
import GHC.Unit.Info
import GHC.Unit.Parser
import GHC.Unit.State
import GHC.Unit.Subst
import GHC.Unit.Module

{-

Note [About Units]
~~~~~~~~~~~~~~~~~~

Haskell users are used to manipulate Cabal packages. These packages are
identified by:
   - a package name :: String
   - a package version :: Version
   - (a revision number, when they are registered on Hackage)

Cabal packages may contain several components (libraries, programs,
testsuites). In GHC we are mostly interested in libraries because those are
the components that can be depended upon by other components. Components in a
package are identified by their component name. Historically only one library
component was allowed per package, hence it didn't need a name. For this
reason, component name may be empty for one library component in each
package:
   - a component name :: Maybe String

UnitId
------

Cabal libraries can be compiled in various ways (different compiler options
or Cabal flags, different dependencies, etc.), hence using package name,
package version and component name isn't enough to identify a built library.
We use another identifier called UnitId:

  package name             \
  package version          |                       ________
  component name           | hash of all this ==> | UnitId |
  Cabal flags              |                       --------
  compiler options         |
  dependencies' UnitId     /

Fortunately GHC doesn't have to generate these UnitId: they are provided by
external build tools (e.g. Cabal) with `-this-unit-id` command-line parameter.

UnitIds are important because they are used to generate internal names
(symbols, etc.).

Wired-in units
--------------

Certain libraries (ghc-prim, base, etc.) are known to the compiler and to the
RTS as they provide some basic primitives.  Hence UnitIds of wired-in libraries
are fixed. Instead of letting Cabal chose the UnitId for these libraries, their
.cabal file uses the following stanza to force it to a specific value:

   ghc-options: -this-unit-id ghc-prim    -- taken from ghc-prim.cabal

The RTS also uses entities of wired-in units by directly referring to symbols
such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is
the UnitId of "base" unit.

Unit databases
--------------

Units are stored in databases in order to be reused by other codes:

   UnitKey ---> UnitInfo { exposed modules, package name, package version
                           component name, various file paths,
                           dependencies :: [UnitKey], etc. }

Because of the wired-in units described above, we can't exactly use UnitIds
as UnitKeys in the database: if we did this, we could only have a single unit
(compiled library) in the database for each wired-in library. As we want to
support databases containing several different units for the same wired-in
library, we do this:

   * for non wired-in units:
      * UnitId = UnitKey = Identifier (hash) computed by Cabal

   * for wired-in units:
      * UnitKey = Identifier computed by Cabal (just like for non wired-in units)
      * UnitId  = unit-id specified with -this-unit-id command-line flag

We can expose several units to GHC via the `package-id <unit-key>` command-line
parameter. We must use the UnitKeys of the units so that GHC can find them in
the database.

During unit loading, GHC replaces UnitKeys with UnitIds. It identifies wired
units by their package name (stored in their UnitInfo) and uses wired-in UnitIds
for them.

For example, knowing that "base", "ghc-prim" and "rts" are wired-in units, the
following dependency graph expressed with database UnitKeys will be transformed
into a similar graph expressed with UnitIds:

   UnitKeys
   ~~~~~~~~                      ----------> rts-1.0-hashABC <--
                                 |                             |
                                 |                             |
   foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashUVW

   UnitIds
   ~~~~~~~               ---------------> rts <--
                         |                      |
                         |                      |
   foo-2.0-hash123 --> base ---------------> ghc-prim


Note that "foo-2.0-hash123" isn't wired-in so its UnitId is the same as its UnitKey.


Module signatures / indefinite units / instantiated units
---------------------------------------------------------

GHC distinguishes two kinds of units:

   * definite units:
      * units without module holes and with definite dependencies
      * can be compiled into machine code (.o/.a/.so/.dll/...)

   * indefinite units:
      * units with some module holes or with some indefinite dependencies
      * can only be type-checked

Module holes are constrained by module signatures (.hsig files). Module
signatures are a kind of interface (similar to .hs-boot files). They are used in
place of some real code. GHC allows modules from other units to be used to fill
these module holes: the process is called "unit/module instantiation". The
instantiating module may either be a concrete module or a module signature. In
the latter case, the signatures are merged to form a new one.

You can think of this as polymorphism at the module level: module signatures
give constraints on the "type" of module that can be used to fill the hole
(where "type" means types of the exported module entitites, etc.).

Module signatures contain enough information (datatypes, abstract types, type
synonyms, classes, etc.) to typecheck modules depending on them but not
enough to compile them. As such, indefinite units found in databases only
provide module interfaces (the .hi ones this time), not object code.

To distinguish between indefinite and definite unit ids at the type level, we
respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically
wrappers over 'UnitId'.

Unit instantiation / on-the-fly instantiation
---------------------------------------------

Indefinite units can be instantiated with modules from other units. The
instantiating units can also be instantiated themselves (if there are
indefinite) and so on.

On-the-fly unit instantiation is a tricky optimization explained in
http://blog.ezyang.com/2016/08/optimizing-incremental-compilation
Here is a summary:

   1. Indefinite units can only be type-checked, not compiled into real code.
   Type-checking produces interface files (.hi) which are incomplete for code
   generation (they lack unfoldings, etc.) but enough to perform type-checking
   of units depending on them.

   2. Type-checking an instantiated unit is cheap as we only have to merge
   interface files (.hi) of the instantiated unit and of the instantiating
   units, hence it can be done on-the-fly. Interface files of the dependencies
   can be concrete or produced on-the-fly recursively.

   3. When we compile a unit, we mustn't use interfaces produced by the
   type-checker (on-the-fly or not) for the instantiated unit dependencies
   because they lack some information.

   4. When we type-check an indefinite unit, we must be consistent about the
   interfaces we use for each dependency: only those produced by the
   type-checker (on-the-fly or not) or only those produced after a full
   compilation, but not both at the same time.

   It can be tricky if we have the following kind of dependency graph:

      X (indefinite) ------> D (definite, compiled) -----> I (instantiated, definite, compiled)
      |----------------------------------------------------^

   Suppose we want to type-check unit X which depends on unit I and D:
      * I is definite and compiled: we have compiled .hi files for its modules on disk
      * I is instantiated: it is cheap to produce type-checker .hi files for its modules on-the-fly

   But we must not do:

      X (indefinite) ------> D (definite, compiled) -----> I (instantiated, definite, compiled)
      |--------------------------------------------------> I (instantiated on-the-fly)

      ==> inconsistent module interfaces for I

   Nor:

      X (indefinite) ------> D (definite, compiled) -------v
      |--------------------------------------------------> I (instantiated on-the-fly)

      ==> D's interfaces may refer to things that only exist in I's *compiled* interfaces

   An alternative would be to store both type-checked and compiled interfaces
   for every compiled non-instantiated unit (instantiated unit can be done
   on-the-fly) so that we could use type-checked interfaces of D in the
   example above. But it would increase compilation time and unit size.


The 'Unit' datatype represents a unit which may have been instantiated
on-the-fly:

   data Unit = RealUnit DefUnitId         -- use compiled interfaces on disk
             | VirtUnit InstantiatedUnit  -- use on-the-fly instantiation

'InstantiatedUnit' has two interesting fields:

   * instUnitInstanceOf :: IndefUnitId
      -- ^ the indefinite unit that is instantiated

   * instUnitInsts :: [(ModuleName,(Unit,ModuleName)]
      -- ^ a list of instantiations, where an instantiation is:
           (module hole name, (instantiating unit, instantiating module name))

A 'VirtUnit' may be indefinite or definite, it depends on whether some holes
remain in the instantiated unit OR in the instantiating units (recursively).
Having a fully instantiated (i.e. definite) virtual unit can lead to some issues
if there is a matching compiled unit in the preload closure.  See Note [VirtUnit
to RealUnit improvement]

Unit database and indefinite units
----------------------------------

We don't store partially instantiated units in the unit database.  Units in the
database are either:

   * definite (fully instantiated or without holes): in this case we have
     *compiled* module interfaces (.hi) and object codes (.o/.a/.so/.dll/...).

   * fully indefinite (not instantiated at all): in this case we only have
     *type-checked* module interfaces (.hi).

Note that indefinite units are stored as an instantiation of themselves where
each instantiating module is a module variable (see Note [Representation of
module/name variables]). E.g.

   "xyz" (UnitKey) ---> UnitInfo { instanceOf       = "xyz"
                                 , instantiatedWith = [A=<A>,B=<B>...]
                                 , ...
                                 }

Note that non-instantiated units are also stored as an instantiation of
themselves.  It is a reminiscence of previous terminology (when "instanceOf" was
"componentId"). E.g.

   "xyz" (UnitKey) ---> UnitInfo { instanceOf       = "xyz"
                                 , instantiatedWith = []
                                 , ...
                                 }

TODO: We should probably have `instanceOf :: Maybe IndefUnitId` instead.


Pretty-printing UnitId
----------------------

GHC mostly deals with UnitIds which are some opaque strings. We could display
them when we pretty-print a module origin, a name, etc. But it wouldn't be
very friendly to the user because of the hash they usually contain. E.g.

   foo-4.18.1:thelib-XYZsomeUglyHashABC

Instead when we want to pretty-print a 'UnitId' we query the database to
get the 'UnitInfo' and print something nicer to the user:

   foo-4.18.1:thelib

We do the same for wired-in units.

Currently (2020-04-06), we don't thread the database into every function that
pretty-prints a Name/Module/Unit. Instead querying the database is delayed
until the `SDoc` is transformed into a `Doc` using the database that is
active at this point in time. This is an issue because we want to be able to
unload units from the database and we also want to support several
independent databases loaded at the same time (see #14335). The alternatives
we have are:

   * threading the database into every function that pretty-prints a UnitId
   for the user (directly or indirectly).

   * storing enough info to correctly display a UnitId into the UnitId
   datatype itself. This is done in the IndefUnitId wrapper (see
   'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined
   'UnitId' for wired-in units would have empty UnitPprInfo so we need to
   find some places to update them if we want to display wired-in UnitId
   correctly. This leads to a solution similar to the first one above.

Note [VirtUnit to RealUnit improvement]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Over the course of instantiating VirtUnits on the fly while typechecking an
indefinite library, we may end up with a fully instantiated VirtUnit. I.e.
one that could be compiled and installed in the database. During
type-checking we generate a virtual UnitId for it, say "abc".

Now the question is: do we have a matching installed unit in the database?
Suppose we have one with UnitId "xyz" (provided by Cabal so we don't know how
to generate it). The trouble is that if both units end up being used in the
same type-checking session, their names won't match (e.g. "abc:M.X" vs
"xyz:M.X").

As we want them to match we just replace the virtual unit with the installed
one: for some reason this is called "improvement".

There is one last niggle: improvement based on the unit database means
that we might end up developing on a unit that is not transitively
depended upon by the units the user specified directly via command line
flags.  This could lead to strange and difficult to understand bugs if those
instantiations are out of date.  The solution is to only improve a
unit id if the new unit id is part of the 'preloadClosure'; i.e., the
closure of all the units which were explicitly specified.

Note [Representation of module/name variables]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In our ICFP'16, we use <A> to represent module holes, and {A.T} to represent
name holes.  This could have been represented by adding some new cases
to the core data types, but this would have made the existing 'moduleName'
and 'moduleUnit' partial, which would have required a lot of modifications
to existing code.

Instead, we use a fake "hole" unit:

     <A>   ===> hole:A
     {A.T} ===> hole:A.T

This encoding is quite convenient, but it is also a bit dangerous too,
because if you have a 'hole:A' you need to know if it's actually a
'Module' or just a module stored in a 'Name'; these two cases must be
treated differently when doing substitutions.  'renameHoleModule'
and 'renameHoleUnit' assume they are NOT operating on a
'Name'; 'NameShape' handles name substitutions exclusively.

-}