diff options
-rw-r--r-- | compiler/GHC/Unit.hs | 565 |
1 files changed, 331 insertions, 234 deletions
diff --git a/compiler/GHC/Unit.hs b/compiler/GHC/Unit.hs index 0051aa3087..4e9710e239 100644 --- a/compiler/GHC/Unit.hs +++ b/compiler/GHC/Unit.hs @@ -21,237 +21,334 @@ import GHC.Unit.State import GHC.Unit.Subst import GHC.Unit.Module --- Note [About Units] --- ~~~~~~~~~~~~~~~~~~ --- --- Haskell users are used to manipulate Cabal packages. These packages are --- identified by: --- - a package name :: String --- - a package version :: Version --- - (a revision number, when they are registered on Hackage) --- --- Cabal packages may contain several components (libraries, programs, --- testsuites). In GHC we are mostly interested in libraries because those are --- the components that can be depended upon by other components. Components in a --- package are identified by their component name. Historically only one library --- component was allowed per package, hence it didn't need a name. For this --- reason, component name may be empty for one library component in each --- package: --- - a component name :: Maybe String --- --- UnitId --- ------ --- --- Cabal libraries can be compiled in various ways (different compiler options --- or Cabal flags, different dependencies, etc.), hence using package name, --- package version and component name isn't enough to identify a built library. --- We use another identifier called UnitId: --- --- package name \ --- package version | ________ --- component name | hash of all this ==> | UnitId | --- Cabal flags | -------- --- compiler options | --- dependencies' UnitId / --- --- Fortunately GHC doesn't have to generate these UnitId: they are provided by --- external build tools (e.g. Cabal) with `-this-unit-id` command-line parameter. --- --- UnitIds are important because they are used to generate internal names --- (symbols, etc.). --- --- Wired-in units --- -------------- --- --- Certain libraries are known to the compiler, in that we know about certain --- entities that reside in these libraries. The compiler needs to declare static --- Modules and Names that refer to units built from these libraries. --- --- Hence UnitIds of wired-in libraries are fixed. Instead of letting Cabal chose --- the UnitId for these libraries, their .cabal file uses the following stanza to --- force it to a specific value: --- --- ghc-options: -this-unit-id ghc-prim -- taken from ghc-prim.cabal --- --- The RTS also uses entities of wired-in units by directly referring to symbols --- such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is --- the UnitId of "base" unit. --- --- Unit databases --- -------------- --- --- Units are stored in databases in order to be reused by other codes: --- --- UnitKey ---> UnitInfo { exposed modules, package name, package version --- component name, various file paths, --- dependencies :: [UnitKey], etc. } --- --- Because of the wired-in units described above, we can't exactly use UnitIds --- as UnitKeys in the database: if we did this, we could only have a single unit --- (compiled library) in the database for each wired-in library. As we want to --- support databases containing several different units for the same wired-in --- library, we do this: --- --- * for non wired-in units: --- * UnitId = UnitKey = Identifier (hash) computed by Cabal --- --- * for wired-in units: --- * UnitKey = Identifier computed by Cabal (just like for non wired-in units) --- * UnitId = unit-id specified with -this-unit-id command-line flag --- --- We can expose several units to GHC via the `package-id <UnitKey>` --- command-line parameter. We must use the UnitKeys of the units so that GHC can --- find them in the database. --- --- GHC then replaces the UnitKeys with UnitIds by taking into account wired-in --- units: these units are detected thanks to their UnitInfo (especially their --- package name). --- --- For example, knowing that "base", "ghc-prim" and "rts" are wired-in packages, --- the following dependency graph expressed with UnitKeys (as found in the --- database) will be transformed into a similar graph expressed with UnitIds --- (that are what matters for compilation): --- --- UnitKeys --- ~~~~~~~~ ---> rts-1.0-hashABC <-- --- | | --- | | --- foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashABC --- --- UnitIds --- ~~~~~~~ ---> rts <-- --- | | --- | | --- foo-2.0-hash123 --> base ---------------> ghc-prim --- --- --- Module signatures / indefinite units / instantiated units --- --------------------------------------------------------- --- --- GHC distinguishes two kinds of units: --- --- * definite: units for which every module has an associated code object --- (i.e. real compiled code in a .o/.a/.so/.dll/...) --- --- * indefinite: units for which some modules are replaced by module --- signatures. --- --- Module signatures are a kind of interface (similar to .hs-boot files). They --- are used in place of some real code. GHC allows real modules from other --- units to be used to fill these module holes. The process is called --- "unit/module instantiation". --- --- You can think of this as polymorphism at the module level: module signatures --- give constraints on the "type" of module that can be used to fill the hole --- (where "type" means types of the exported module entitites, etc.). --- --- Module signatures contain enough information (datatypes, abstract types, type --- synonyms, classes, etc.) to typecheck modules depending on them but not --- enough to compile them. As such, indefinite units found in databases only --- provide module interfaces (the .hi ones this time), not object code. --- --- To distinguish between indefinite and finite unit ids at the type level, we --- respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically --- wrappers over 'UnitId'. --- --- Unit instantiation --- ------------------ --- --- Indefinite units can be instantiated with modules from other units. The --- instantiating units can also be instantiated themselves (if there are --- indefinite) and so on. The 'Unit' datatype represents a unit which may have --- been instantiated: --- --- data Unit = RealUnit DefUnitId --- | VirtUnit InstantiatedUnit --- --- 'InstantiatedUnit' has two interesting fields: --- --- * instUnitInstanceOf :: IndefUnitId --- -- ^ the indefinite unit that is instantiated --- --- * instUnitInsts :: [(ModuleName,(Unit,ModuleName)] --- -- ^ a list of instantiations, where an instantiation is: --- (module hole name, (instantiating unit, instantiating module name)) --- --- A 'Unit' may be indefinite or definite, it depends on whether some holes --- remain in the instantiated unit OR in the instantiating units (recursively). --- --- Pretty-printing UnitId --- ---------------------- --- --- GHC mostly deals with UnitIds which are some opaque strings. We could display --- them when we pretty-print a module origin, a name, etc. But it wouldn't be --- very friendly to the user because of the hash they usually contain. E.g. --- --- foo-4.18.1:thelib-XYZsomeUglyHashABC --- --- Instead when we want to pretty-print a 'UnitId' we query the database to --- get the 'UnitInfo' and print something nicer to the user: --- --- foo-4.18.1:thelib --- --- We do the same for wired-in units. --- --- Currently (2020-04-06), we don't thread the database into every function that --- pretty-prints a Name/Module/Unit. Instead querying the database is delayed --- until the `SDoc` is transformed into a `Doc` using the database that is --- active at this point in time. This is an issue because we want to be able to --- unload units from the database and we also want to support several --- independent databases loaded at the same time (see #14335). The alternatives --- we have are: --- --- * threading the database into every function that pretty-prints a UnitId --- for the user (directly or indirectly). --- --- * storing enough info to correctly display a UnitId into the UnitId --- datatype itself. This is done in the IndefUnitId wrapper (see --- 'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined --- 'UnitId' for wired-in units would have empty UnitPprInfo so we need to --- find some places to update them if we want to display wired-in UnitId --- correctly. This leads to a solution similar to the first one above. --- --- Note [VirtUnit to RealUnit improvement] --- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ --- --- Over the course of instantiating VirtUnits on the fly while typechecking an --- indefinite library, we may end up with a fully instantiated VirtUnit. I.e. --- one that could be compiled and installed in the database. During --- type-checking we generate a virtual UnitId for it, say "abc". --- --- Now the question is: do we have a matching installed unit in the database? --- Suppose we have one with UnitId "xyz" (provided by Cabal so we don't know how --- to generate it). The trouble is that if both units end up being used in the --- same type-checking session, their names won't match (e.g. "abc:M.X" vs --- "xyz:M.X"). --- --- As we want them to match we just replace the virtual unit with the installed --- one: for some reason this is called "improvement". --- --- There is one last niggle: improvement based on the package database means --- that we might end up developing on a package that is not transitively --- depended upon by the packages the user specified directly via command line --- flags. This could lead to strange and difficult to understand bugs if those --- instantiations are out of date. The solution is to only improve a --- unit id if the new unit id is part of the 'preloadClosure'; i.e., the --- closure of all the packages which were explicitly specified. - --- Note [Representation of module/name variables] --- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ --- In our ICFP'16, we use <A> to represent module holes, and {A.T} to represent --- name holes. This could have been represented by adding some new cases --- to the core data types, but this would have made the existing 'moduleName' --- and 'moduleUnit' partial, which would have required a lot of modifications --- to existing code. --- --- Instead, we use a fake "hole" unit: --- --- <A> ===> hole:A --- {A.T} ===> hole:A.T --- --- This encoding is quite convenient, but it is also a bit dangerous too, --- because if you have a 'hole:A' you need to know if it's actually a --- 'Module' or just a module stored in a 'Name'; these two cases must be --- treated differently when doing substitutions. 'renameHoleModule' --- and 'renameHoleUnit' assume they are NOT operating on a --- 'Name'; 'NameShape' handles name substitutions exclusively. +{- + +Note [About Units] +~~~~~~~~~~~~~~~~~~ + +Haskell users are used to manipulate Cabal packages. These packages are +identified by: + - a package name :: String + - a package version :: Version + - (a revision number, when they are registered on Hackage) + +Cabal packages may contain several components (libraries, programs, +testsuites). In GHC we are mostly interested in libraries because those are +the components that can be depended upon by other components. Components in a +package are identified by their component name. Historically only one library +component was allowed per package, hence it didn't need a name. For this +reason, component name may be empty for one library component in each +package: + - a component name :: Maybe String + +UnitId +------ + +Cabal libraries can be compiled in various ways (different compiler options +or Cabal flags, different dependencies, etc.), hence using package name, +package version and component name isn't enough to identify a built library. +We use another identifier called UnitId: + + package name \ + package version | ________ + component name | hash of all this ==> | UnitId | + Cabal flags | -------- + compiler options | + dependencies' UnitId / + +Fortunately GHC doesn't have to generate these UnitId: they are provided by +external build tools (e.g. Cabal) with `-this-unit-id` command-line parameter. + +UnitIds are important because they are used to generate internal names +(symbols, etc.). + +Wired-in units +-------------- + +Certain libraries (ghc-prim, base, etc.) are known to the compiler and to the +RTS as they provide some basic primitives. Hence UnitIds of wired-in libraries +are fixed. Instead of letting Cabal chose the UnitId for these libraries, their +.cabal file uses the following stanza to force it to a specific value: + + ghc-options: -this-unit-id ghc-prim -- taken from ghc-prim.cabal + +The RTS also uses entities of wired-in units by directly referring to symbols +such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is +the UnitId of "base" unit. + +Unit databases +-------------- + +Units are stored in databases in order to be reused by other codes: + + UnitKey ---> UnitInfo { exposed modules, package name, package version + component name, various file paths, + dependencies :: [UnitKey], etc. } + +Because of the wired-in units described above, we can't exactly use UnitIds +as UnitKeys in the database: if we did this, we could only have a single unit +(compiled library) in the database for each wired-in library. As we want to +support databases containing several different units for the same wired-in +library, we do this: + + * for non wired-in units: + * UnitId = UnitKey = Identifier (hash) computed by Cabal + + * for wired-in units: + * UnitKey = Identifier computed by Cabal (just like for non wired-in units) + * UnitId = unit-id specified with -this-unit-id command-line flag + +We can expose several units to GHC via the `package-id <unit-key>` command-line +parameter. We must use the UnitKeys of the units so that GHC can find them in +the database. + +During unit loading, GHC replaces UnitKeys with UnitIds. It identifies wired +units by their package name (stored in their UnitInfo) and uses wired-in UnitIds +for them. + +For example, knowing that "base", "ghc-prim" and "rts" are wired-in units, the +following dependency graph expressed with database UnitKeys will be transformed +into a similar graph expressed with UnitIds: + + UnitKeys + ~~~~~~~~ ----------> rts-1.0-hashABC <-- + | | + | | + foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashUVW + + UnitIds + ~~~~~~~ ---------------> rts <-- + | | + | | + foo-2.0-hash123 --> base ---------------> ghc-prim + + +Note that "foo-2.0-hash123" isn't wired-in so its UnitId is the same as its UnitKey. + + +Module signatures / indefinite units / instantiated units +--------------------------------------------------------- + +GHC distinguishes two kinds of units: + + * definite units: + * units without module holes and with definite dependencies + * can be compiled into machine code (.o/.a/.so/.dll/...) + + * indefinite units: + * units with some module holes or with some indefinite dependencies + * can only be type-checked + +Module holes are constrained by module signatures (.hsig files). Module +signatures are a kind of interface (similar to .hs-boot files). They are used in +place of some real code. GHC allows modules from other units to be used to fill +these module holes: the process is called "unit/module instantiation". The +instantiating module may either be a concrete module or a module signature. In +the latter case, the signatures are merged to form a new one. + +You can think of this as polymorphism at the module level: module signatures +give constraints on the "type" of module that can be used to fill the hole +(where "type" means types of the exported module entitites, etc.). + +Module signatures contain enough information (datatypes, abstract types, type +synonyms, classes, etc.) to typecheck modules depending on them but not +enough to compile them. As such, indefinite units found in databases only +provide module interfaces (the .hi ones this time), not object code. + +To distinguish between indefinite and definite unit ids at the type level, we +respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically +wrappers over 'UnitId'. + +Unit instantiation / on-the-fly instantiation +--------------------------------------------- + +Indefinite units can be instantiated with modules from other units. The +instantiating units can also be instantiated themselves (if there are +indefinite) and so on. + +On-the-fly unit instantiation is a tricky optimization explained in +http://blog.ezyang.com/2016/08/optimizing-incremental-compilation +Here is a summary: + + 1. Indefinite units can only be type-checked, not compiled into real code. + Type-checking produces interface files (.hi) which are incomplete for code + generation (they lack unfoldings, etc.) but enough to perform type-checking + of units depending on them. + + 2. Type-checking an instantiated unit is cheap as we only have to merge + interface files (.hi) of the instantiated unit and of the instantiating + units, hence it can be done on-the-fly. Interface files of the dependencies + can be concrete or produced on-the-fly recursively. + + 3. When we compile a unit, we mustn't use interfaces produced by the + type-checker (on-the-fly or not) for the instantiated unit dependencies + because they lack some information. + + 4. When we type-check an indefinite unit, we must be consistent about the + interfaces we use for each dependency: only those produced by the + type-checker (on-the-fly or not) or only those produced after a full + compilation, but not both at the same time. + + It can be tricky if we have the following kind of dependency graph: + + X (indefinite) ------> D (definite, compiled) -----> I (instantiated, definite, compiled) + |----------------------------------------------------^ + + Suppose we want to type-check unit X which depends on unit I and D: + * I is definite and compiled: we have compiled .hi files for its modules on disk + * I is instantiated: it is cheap to produce type-checker .hi files for its modules on-the-fly + + But we must not do: + + X (indefinite) ------> D (definite, compiled) -----> I (instantiated, definite, compiled) + |--------------------------------------------------> I (instantiated on-the-fly) + + ==> inconsistent module interfaces for I + + Nor: + + X (indefinite) ------> D (definite, compiled) -------v + |--------------------------------------------------> I (instantiated on-the-fly) + + ==> D's interfaces may refer to things that only exist in I's *compiled* interfaces + + An alternative would be to store both type-checked and compiled interfaces + for every compiled non-instantiated unit (instantiated unit can be done + on-the-fly) so that we could use type-checked interfaces of D in the + example above. But it would increase compilation time and unit size. + + +The 'Unit' datatype represents a unit which may have been instantiated +on-the-fly: + + data Unit = RealUnit DefUnitId -- use compiled interfaces on disk + | VirtUnit InstantiatedUnit -- use on-the-fly instantiation + +'InstantiatedUnit' has two interesting fields: + + * instUnitInstanceOf :: IndefUnitId + -- ^ the indefinite unit that is instantiated + + * instUnitInsts :: [(ModuleName,(Unit,ModuleName)] + -- ^ a list of instantiations, where an instantiation is: + (module hole name, (instantiating unit, instantiating module name)) + +A 'VirtUnit' may be indefinite or definite, it depends on whether some holes +remain in the instantiated unit OR in the instantiating units (recursively). +Having a fully instantiated (i.e. definite) virtual unit can lead to some issues +if there is a matching compiled unit in the preload closure. See Note [VirtUnit +to RealUnit improvement] + +Unit database and indefinite units +---------------------------------- + +We don't store partially instantiated units in the unit database. Units in the +database are either: + + * definite (fully instantiated or without holes): in this case we have + *compiled* module interfaces (.hi) and object codes (.o/.a/.so/.dll/...). + + * fully indefinite (not instantiated at all): in this case we only have + *type-checked* module interfaces (.hi). + +Note that indefinite units are stored as an instantiation of themselves where +each instantiating module is a module variable (see Note [Representation of +module/name variables]). E.g. + + "xyz" (UnitKey) ---> UnitInfo { instanceOf = "xyz" + , instantiatedWith = [A=<A>,B=<B>...] + , ... + } + +Note that non-instantiated units are also stored as an instantiation of +themselves. It is a reminiscence of previous terminology (when "instanceOf" was +"componentId"). E.g. + + "xyz" (UnitKey) ---> UnitInfo { instanceOf = "xyz" + , instantiatedWith = [] + , ... + } + +TODO: We should probably have `instanceOf :: Maybe IndefUnitId` instead. + + +Pretty-printing UnitId +---------------------- + +GHC mostly deals with UnitIds which are some opaque strings. We could display +them when we pretty-print a module origin, a name, etc. But it wouldn't be +very friendly to the user because of the hash they usually contain. E.g. + + foo-4.18.1:thelib-XYZsomeUglyHashABC + +Instead when we want to pretty-print a 'UnitId' we query the database to +get the 'UnitInfo' and print something nicer to the user: + + foo-4.18.1:thelib + +We do the same for wired-in units. + +Currently (2020-04-06), we don't thread the database into every function that +pretty-prints a Name/Module/Unit. Instead querying the database is delayed +until the `SDoc` is transformed into a `Doc` using the database that is +active at this point in time. This is an issue because we want to be able to +unload units from the database and we also want to support several +independent databases loaded at the same time (see #14335). The alternatives +we have are: + + * threading the database into every function that pretty-prints a UnitId + for the user (directly or indirectly). + + * storing enough info to correctly display a UnitId into the UnitId + datatype itself. This is done in the IndefUnitId wrapper (see + 'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined + 'UnitId' for wired-in units would have empty UnitPprInfo so we need to + find some places to update them if we want to display wired-in UnitId + correctly. This leads to a solution similar to the first one above. + +Note [VirtUnit to RealUnit improvement] +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Over the course of instantiating VirtUnits on the fly while typechecking an +indefinite library, we may end up with a fully instantiated VirtUnit. I.e. +one that could be compiled and installed in the database. During +type-checking we generate a virtual UnitId for it, say "abc". + +Now the question is: do we have a matching installed unit in the database? +Suppose we have one with UnitId "xyz" (provided by Cabal so we don't know how +to generate it). The trouble is that if both units end up being used in the +same type-checking session, their names won't match (e.g. "abc:M.X" vs +"xyz:M.X"). + +As we want them to match we just replace the virtual unit with the installed +one: for some reason this is called "improvement". + +There is one last niggle: improvement based on the unit database means +that we might end up developing on a unit that is not transitively +depended upon by the units the user specified directly via command line +flags. This could lead to strange and difficult to understand bugs if those +instantiations are out of date. The solution is to only improve a +unit id if the new unit id is part of the 'preloadClosure'; i.e., the +closure of all the units which were explicitly specified. + +Note [Representation of module/name variables] +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +In our ICFP'16, we use <A> to represent module holes, and {A.T} to represent +name holes. This could have been represented by adding some new cases +to the core data types, but this would have made the existing 'moduleName' +and 'moduleUnit' partial, which would have required a lot of modifications +to existing code. + +Instead, we use a fake "hole" unit: + + <A> ===> hole:A + {A.T} ===> hole:A.T + +This encoding is quite convenient, but it is also a bit dangerous too, +because if you have a 'hole:A' you need to know if it's actually a +'Module' or just a module stored in a 'Name'; these two cases must be +treated differently when doing substitutions. 'renameHoleModule' +and 'renameHoleUnit' assume they are NOT operating on a +'Name'; 'NameShape' handles name substitutions exclusively. + +-} |