summaryrefslogtreecommitdiff
path: root/compiler/GHC/Unit.hs
blob: 0051aa30876f25d44cc7aa349d3ca0b375f64fbc (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
{-# LANGUAGE TypeSynonymInstances #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE DeriveFunctor #-}

-- | Units are library components from Cabal packages compiled and installed in
-- a database
module GHC.Unit
   ( module GHC.Unit.Types
   , module GHC.Unit.Info
   , module GHC.Unit.Parser
   , module GHC.Unit.State
   , module GHC.Unit.Subst
   , module GHC.Unit.Module
   )
where

import GHC.Unit.Types
import GHC.Unit.Info
import GHC.Unit.Parser
import GHC.Unit.State
import GHC.Unit.Subst
import GHC.Unit.Module

-- Note [About Units]
-- ~~~~~~~~~~~~~~~~~~
--
-- Haskell users are used to manipulate Cabal packages. These packages are
-- identified by:
--    - a package name :: String
--    - a package version :: Version
--    - (a revision number, when they are registered on Hackage)
--
-- Cabal packages may contain several components (libraries, programs,
-- testsuites). In GHC we are mostly interested in libraries because those are
-- the components that can be depended upon by other components. Components in a
-- package are identified by their component name. Historically only one library
-- component was allowed per package, hence it didn't need a name. For this
-- reason, component name may be empty for one library component in each
-- package:
--    - a component name :: Maybe String
--
-- UnitId
-- ------
--
-- Cabal libraries can be compiled in various ways (different compiler options
-- or Cabal flags, different dependencies, etc.), hence using package name,
-- package version and component name isn't enough to identify a built library.
-- We use another identifier called UnitId:
--
--   package name             \
--   package version          |                       ________
--   component name           | hash of all this ==> | UnitId |
--   Cabal flags              |                       --------
--   compiler options         |
--   dependencies' UnitId     /
--
-- Fortunately GHC doesn't have to generate these UnitId: they are provided by
-- external build tools (e.g. Cabal) with `-this-unit-id` command-line parameter.
--
-- UnitIds are important because they are used to generate internal names
-- (symbols, etc.).
--
-- Wired-in units
-- --------------
--
-- Certain libraries are known to the compiler, in that we know about certain
-- entities that reside in these libraries. The compiler needs to declare static
-- Modules and Names that refer to units built from these libraries.
--
-- Hence UnitIds of wired-in libraries are fixed. Instead of letting Cabal chose
-- the UnitId for these libraries, their .cabal file uses the following stanza to
-- force it to a specific value:
--
--    ghc-options: -this-unit-id ghc-prim    -- taken from ghc-prim.cabal
--
-- The RTS also uses entities of wired-in units by directly referring to symbols
-- such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is
-- the UnitId of "base" unit.
--
-- Unit databases
-- --------------
--
-- Units are stored in databases in order to be reused by other codes:
--
--    UnitKey ---> UnitInfo { exposed modules, package name, package version
--                            component name, various file paths,
--                            dependencies :: [UnitKey], etc. }
--
-- Because of the wired-in units described above, we can't exactly use UnitIds
-- as UnitKeys in the database: if we did this, we could only have a single unit
-- (compiled library) in the database for each wired-in library. As we want to
-- support databases containing several different units for the same wired-in
-- library, we do this:
--
--    * for non wired-in units:
--       * UnitId = UnitKey = Identifier (hash) computed by Cabal
--
--    * for wired-in units:
--       * UnitKey = Identifier computed by Cabal (just like for non wired-in units)
--       * UnitId  = unit-id specified with -this-unit-id command-line flag
--
-- We can expose several units to GHC via the `package-id <UnitKey>`
-- command-line parameter. We must use the UnitKeys of the units so that GHC can
-- find them in the database.
--
-- GHC then replaces the UnitKeys with UnitIds by taking into account wired-in
-- units: these units are detected thanks to their UnitInfo (especially their
-- package name).
--
-- For example, knowing that "base", "ghc-prim" and "rts" are wired-in packages,
-- the following dependency graph expressed with UnitKeys (as found in the
-- database) will be transformed into a similar graph expressed with UnitIds
-- (that are what matters for compilation):
--
--    UnitKeys
--    ~~~~~~~~                             ---> rts-1.0-hashABC <--
--                                         |                      |
--                                         |                      |
--    foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashABC
--
--    UnitIds
--    ~~~~~~~                              ---> rts <--
--                                         |          |
--                                         |          |
--    foo-2.0-hash123 --> base ---------------> ghc-prim
--
--
-- Module signatures / indefinite units / instantiated units
-- ---------------------------------------------------------
--
-- GHC distinguishes two kinds of units:
--
--    * definite: units for which every module has an associated code object
--    (i.e. real compiled code in a .o/.a/.so/.dll/...)
--
--    * indefinite: units for which some modules are replaced by module
--    signatures.
--
-- Module signatures are a kind of interface (similar to .hs-boot files). They
-- are used in place of some real code. GHC allows real modules from other
-- units to be used to fill these module holes. The process is called
-- "unit/module instantiation".
--
-- You can think of this as polymorphism at the module level: module signatures
-- give constraints on the "type" of module that can be used to fill the hole
-- (where "type" means types of the exported module entitites, etc.).
--
-- Module signatures contain enough information (datatypes, abstract types, type
-- synonyms, classes, etc.) to typecheck modules depending on them but not
-- enough to compile them. As such, indefinite units found in databases only
-- provide module interfaces (the .hi ones this time), not object code.
--
-- To distinguish between indefinite and finite unit ids at the type level, we
-- respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically
-- wrappers over 'UnitId'.
--
-- Unit instantiation
-- ------------------
--
-- Indefinite units can be instantiated with modules from other units. The
-- instantiating units can also be instantiated themselves (if there are
-- indefinite) and so on. The 'Unit' datatype represents a unit which may have
-- been instantiated:
--
--    data Unit = RealUnit DefUnitId
--              | VirtUnit InstantiatedUnit
--
-- 'InstantiatedUnit' has two interesting fields:
--
--    * instUnitInstanceOf :: IndefUnitId
--       -- ^ the indefinite unit that is instantiated
--
--    * instUnitInsts :: [(ModuleName,(Unit,ModuleName)]
--       -- ^ a list of instantiations, where an instantiation is:
--            (module hole name, (instantiating unit, instantiating module name))
--
-- A 'Unit' may be indefinite or definite, it depends on whether some holes
-- remain in the instantiated unit OR in the instantiating units (recursively).
--
-- Pretty-printing UnitId
-- ----------------------
--
-- GHC mostly deals with UnitIds which are some opaque strings. We could display
-- them when we pretty-print a module origin, a name, etc. But it wouldn't be
-- very friendly to the user because of the hash they usually contain. E.g.
--
--    foo-4.18.1:thelib-XYZsomeUglyHashABC
--
-- Instead when we want to pretty-print a 'UnitId' we query the database to
-- get the 'UnitInfo' and print something nicer to the user:
--
--    foo-4.18.1:thelib
--
-- We do the same for wired-in units.
--
-- Currently (2020-04-06), we don't thread the database into every function that
-- pretty-prints a Name/Module/Unit. Instead querying the database is delayed
-- until the `SDoc` is transformed into a `Doc` using the database that is
-- active at this point in time. This is an issue because we want to be able to
-- unload units from the database and we also want to support several
-- independent databases loaded at the same time (see #14335). The alternatives
-- we have are:
--
--    * threading the database into every function that pretty-prints a UnitId
--    for the user (directly or indirectly).
--
--    * storing enough info to correctly display a UnitId into the UnitId
--    datatype itself. This is done in the IndefUnitId wrapper (see
--    'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined
--    'UnitId' for wired-in units would have empty UnitPprInfo so we need to
--    find some places to update them if we want to display wired-in UnitId
--    correctly. This leads to a solution similar to the first one above.
--
-- Note [VirtUnit to RealUnit improvement]
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
-- Over the course of instantiating VirtUnits on the fly while typechecking an
-- indefinite library, we may end up with a fully instantiated VirtUnit. I.e.
-- one that could be compiled and installed in the database. During
-- type-checking we generate a virtual UnitId for it, say "abc".
--
-- Now the question is: do we have a matching installed unit in the database?
-- Suppose we have one with UnitId "xyz" (provided by Cabal so we don't know how
-- to generate it). The trouble is that if both units end up being used in the
-- same type-checking session, their names won't match (e.g. "abc:M.X" vs
-- "xyz:M.X").
--
-- As we want them to match we just replace the virtual unit with the installed
-- one: for some reason this is called "improvement".
--
-- There is one last niggle: improvement based on the package database means
-- that we might end up developing on a package that is not transitively
-- depended upon by the packages the user specified directly via command line
-- flags.  This could lead to strange and difficult to understand bugs if those
-- instantiations are out of date.  The solution is to only improve a
-- unit id if the new unit id is part of the 'preloadClosure'; i.e., the
-- closure of all the packages which were explicitly specified.

-- Note [Representation of module/name variables]
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-- In our ICFP'16, we use <A> to represent module holes, and {A.T} to represent
-- name holes.  This could have been represented by adding some new cases
-- to the core data types, but this would have made the existing 'moduleName'
-- and 'moduleUnit' partial, which would have required a lot of modifications
-- to existing code.
--
-- Instead, we use a fake "hole" unit:
--
--      <A>   ===> hole:A
--      {A.T} ===> hole:A.T
--
-- This encoding is quite convenient, but it is also a bit dangerous too,
-- because if you have a 'hole:A' you need to know if it's actually a
-- 'Module' or just a module stored in a 'Name'; these two cases must be
-- treated differently when doing substitutions.  'renameHoleModule'
-- and 'renameHoleUnit' assume they are NOT operating on a
-- 'Name'; 'NameShape' handles name substitutions exclusively.