Enhance Note [About units] for Backpack

author: Sylvain Henry <sylvain@haskus.fr> 2020-05-07 15:35:10 +0200
committer: Marge Bot <ben+marge-bot@smart-cactus.org> 2020-05-26 03:04:45 -0400
commit: cf772f19c06944f0fd03b4bdcd4a49e437084ba5 (patch)
tree: 90d224ac03752cb0500533dfad7ed4d9d89c58ed
parent: 6604906c8cfa37f5780a6d5c40506b751b1740db (diff)
download: haskell-cf772f19c06944f0fd03b4bdcd4a49e437084ba5.tar.gz
1 files changed, 331 insertions, 234 deletions
diff --git a/compiler/GHC/Unit.hs b/compiler/GHC/Unit.hs
index 0051aa3087..4e9710e239 100644
--- a/compiler/GHC/Unit.hs
+++ b/compiler/GHC/Unit.hs
@@ -21,237 +21,334 @@ import GHC.Unit.State
 import GHC.Unit.Subst
 import GHC.Unit.Module
 
--- Note [About Units]
--- ~~~~~~~~~~~~~~~~~~
---
--- Haskell users are used to manipulate Cabal packages. These packages are
--- identified by:
---    - a package name :: String
---    - a package version :: Version
---    - (a revision number, when they are registered on Hackage)
---
--- Cabal packages may contain several components (libraries, programs,
--- testsuites). In GHC we are mostly interested in libraries because those are
--- the components that can be depended upon by other components. Components in a
--- package are identified by their component name. Historically only one library
--- component was allowed per package, hence it didn't need a name. For this
--- reason, component name may be empty for one library component in each
--- package:
---    - a component name :: Maybe String
---
--- UnitId
--- ------
---
--- Cabal libraries can be compiled in various ways (different compiler options
--- or Cabal flags, different dependencies, etc.), hence using package name,
--- package version and component name isn't enough to identify a built library.
--- We use another identifier called UnitId:
---
---   package name             \
---   package version          |                       ________
---   component name           | hash of all this ==> | UnitId |
---   Cabal flags              |                       --------
---   compiler options         |
---   dependencies' UnitId     /
---
--- Fortunately GHC doesn't have to generate these UnitId: they are provided by
--- external build tools (e.g. Cabal) with `-this-unit-id` command-line parameter.
---
--- UnitIds are important because they are used to generate internal names
--- (symbols, etc.).
---
--- Wired-in units
--- --------------
---
--- Certain libraries are known to the compiler, in that we know about certain
--- entities that reside in these libraries. The compiler needs to declare static
--- Modules and Names that refer to units built from these libraries.
---
--- Hence UnitIds of wired-in libraries are fixed. Instead of letting Cabal chose
--- the UnitId for these libraries, their .cabal file uses the following stanza to
--- force it to a specific value:
---
---    ghc-options: -this-unit-id ghc-prim    -- taken from ghc-prim.cabal
---
--- The RTS also uses entities of wired-in units by directly referring to symbols
--- such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is
--- the UnitId of "base" unit.
---
--- Unit databases
--- --------------
---
--- Units are stored in databases in order to be reused by other codes:
---
---    UnitKey ---> UnitInfo { exposed modules, package name, package version
---                            component name, various file paths,
---                            dependencies :: [UnitKey], etc. }
---
--- Because of the wired-in units described above, we can't exactly use UnitIds
--- as UnitKeys in the database: if we did this, we could only have a single unit
--- (compiled library) in the database for each wired-in library. As we want to
--- support databases containing several different units for the same wired-in
--- library, we do this:
---
---    * for non wired-in units:
---       * UnitId = UnitKey = Identifier (hash) computed by Cabal
---
---    * for wired-in units:
---       * UnitKey = Identifier computed by Cabal (just like for non wired-in units)
---       * UnitId  = unit-id specified with -this-unit-id command-line flag
---
--- We can expose several units to GHC via the `package-id <UnitKey>`
--- command-line parameter. We must use the UnitKeys of the units so that GHC can
--- find them in the database.
---
--- GHC then replaces the UnitKeys with UnitIds by taking into account wired-in
--- units: these units are detected thanks to their UnitInfo (especially their
--- package name).
---
--- For example, knowing that "base", "ghc-prim" and "rts" are wired-in packages,
--- the following dependency graph expressed with UnitKeys (as found in the
--- database) will be transformed into a similar graph expressed with UnitIds
--- (that are what matters for compilation):
---
---    UnitKeys
---    ~~~~~~~~                             ---> rts-1.0-hashABC <--
---                                         |                      |
---                                         |                      |
---    foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashABC
---
---    UnitIds
---    ~~~~~~~                              ---> rts <--
---                                         |          |
---                                         |          |
---    foo-2.0-hash123 --> base ---------------> ghc-prim
---
---
--- Module signatures / indefinite units / instantiated units
--- ---------------------------------------------------------
---
--- GHC distinguishes two kinds of units:
---
---    * definite: units for which every module has an associated code object
---    (i.e. real compiled code in a .o/.a/.so/.dll/...)
---
---    * indefinite: units for which some modules are replaced by module
---    signatures.
---
--- Module signatures are a kind of interface (similar to .hs-boot files). They
--- are used in place of some real code. GHC allows real modules from other
--- units to be used to fill these module holes. The process is called
--- "unit/module instantiation".
---
--- You can think of this as polymorphism at the module level: module signatures
--- give constraints on the "type" of module that can be used to fill the hole
--- (where "type" means types of the exported module entitites, etc.).
---
--- Module signatures contain enough information (datatypes, abstract types, type
--- synonyms, classes, etc.) to typecheck modules depending on them but not
--- enough to compile them. As such, indefinite units found in databases only
--- provide module interfaces (the .hi ones this time), not object code.
---
--- To distinguish between indefinite and finite unit ids at the type level, we
--- respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically
--- wrappers over 'UnitId'.
---
--- Unit instantiation
--- ------------------
---
--- Indefinite units can be instantiated with modules from other units. The
--- instantiating units can also be instantiated themselves (if there are
--- indefinite) and so on. The 'Unit' datatype represents a unit which may have
--- been instantiated:
---
---    data Unit = RealUnit DefUnitId
---              | VirtUnit InstantiatedUnit
---
--- 'InstantiatedUnit' has two interesting fields:
---
---    * instUnitInstanceOf :: IndefUnitId
---       -- ^ the indefinite unit that is instantiated
---
---    * instUnitInsts :: [(ModuleName,(Unit,ModuleName)]
---       -- ^ a list of instantiations, where an instantiation is:
---            (module hole name, (instantiating unit, instantiating module name))
---
--- A 'Unit' may be indefinite or definite, it depends on whether some holes
--- remain in the instantiated unit OR in the instantiating units (recursively).
---
--- Pretty-printing UnitId
--- ----------------------
---
--- GHC mostly deals with UnitIds which are some opaque strings. We could display
--- them when we pretty-print a module origin, a name, etc. But it wouldn't be
--- very friendly to the user because of the hash they usually contain. E.g.
---
---    foo-4.18.1:thelib-XYZsomeUglyHashABC
---
--- Instead when we want to pretty-print a 'UnitId' we query the database to
--- get the 'UnitInfo' and print something nicer to the user:
---
---    foo-4.18.1:thelib
---
--- We do the same for wired-in units.
---
--- Currently (2020-04-06), we don't thread the database into every function that
--- pretty-prints a Name/Module/Unit. Instead querying the database is delayed
--- until the `SDoc` is transformed into a `Doc` using the database that is
--- active at this point in time. This is an issue because we want to be able to
--- unload units from the database and we also want to support several
--- independent databases loaded at the same time (see #14335). The alternatives
--- we have are:
---
---    * threading the database into every function that pretty-prints a UnitId
---    for the user (directly or indirectly).
---
---    * storing enough info to correctly display a UnitId into the UnitId
---    datatype itself. This is done in the IndefUnitId wrapper (see
---    'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined
---    'UnitId' for wired-in units would have empty UnitPprInfo so we need to
---    find some places to update them if we want to display wired-in UnitId
---    correctly. This leads to a solution similar to the first one above.
---
--- Note [VirtUnit to RealUnit improvement]
--- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
---
--- Over the course of instantiating VirtUnits on the fly while typechecking an
--- indefinite library, we may end up with a fully instantiated VirtUnit. I.e.
--- one that could be compiled and installed in the database. During
--- type-checking we generate a virtual UnitId for it, say "abc".
---
--- Now the question is: do we have a matching installed unit in the database?
--- Suppose we have one with UnitId "xyz" (provided by Cabal so we don't know how
--- to generate it). The trouble is that if both units end up being used in the
--- same type-checking session, their names won't match (e.g. "abc:M.X" vs
--- "xyz:M.X").
---
--- As we want them to match we just replace the virtual unit with the installed
--- one: for some reason this is called "improvement".
---
--- There is one last niggle: improvement based on the package database means
--- that we might end up developing on a package that is not transitively
--- depended upon by the packages the user specified directly via command line
--- flags.  This could lead to strange and difficult to understand bugs if those
--- instantiations are out of date.  The solution is to only improve a
--- unit id if the new unit id is part of the 'preloadClosure'; i.e., the
--- closure of all the packages which were explicitly specified.
-
--- Note [Representation of module/name variables]
--- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--- In our ICFP'16, we use <A> to represent module holes, and {A.T} to represent
--- name holes.  This could have been represented by adding some new cases
--- to the core data types, but this would have made the existing 'moduleName'
--- and 'moduleUnit' partial, which would have required a lot of modifications
--- to existing code.
---
--- Instead, we use a fake "hole" unit:
---
---      <A>   ===> hole:A
---      {A.T} ===> hole:A.T
---
--- This encoding is quite convenient, but it is also a bit dangerous too,
--- because if you have a 'hole:A' you need to know if it's actually a
--- 'Module' or just a module stored in a 'Name'; these two cases must be
--- treated differently when doing substitutions.  'renameHoleModule'
--- and 'renameHoleUnit' assume they are NOT operating on a
--- 'Name'; 'NameShape' handles name substitutions exclusively.
+{-
+
+Note [About Units]
+~~~~~~~~~~~~~~~~~~
+
+Haskell users are used to manipulating Cabal packages. These packages are
+identified by:
+   - a package name :: String
+   - a package version :: Version
+   - (a revision number, when they are registered on Hackage)
+
+Cabal packages may contain several components (libraries, programs,
+testsuites). In GHC we are mostly interested in libraries because those are
+the components that can be depended upon by other components. Components in a
+package are identified by their component name. Historically only one library
+component was allowed per package, hence it didn't need a name. For this
+reason, component name may be empty for one library component in each
+package:
+   - a component name :: Maybe String
+
+UnitId
+------
+
+Cabal libraries can be compiled in various ways (different compiler options
+or Cabal flags, different dependencies, etc.), hence using package name,
+package version and component name isn't enough to identify a built library.
+We use another identifier called UnitId:
+
+  package name             \
+  package version          |                       ________
+  component name           | hash of all this ==> | UnitId |
+  Cabal flags              |                       --------
+  compiler options         |
+  dependencies' UnitId     /
+
+Fortunately GHC doesn't have to generate these UnitIds: they are provided by
+external build tools (e.g. Cabal) with the `-this-unit-id` command-line flag.
+
+UnitIds are important because they are used to generate internal names
+(symbols, etc.).
+
+Wired-in units
+--------------
+
+Certain libraries (ghc-prim, base, etc.) are known to the compiler and to the
+RTS as they provide some basic primitives.  Hence UnitIds of wired-in libraries
+are fixed. Instead of letting Cabal choose the UnitId for these libraries, their
+.cabal file uses the following stanza to force it to a specific value:
+
+   ghc-options: -this-unit-id ghc-prim    -- taken from ghc-prim.cabal
+
+The RTS also uses entities of wired-in units by directly referring to symbols
+such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is
+the UnitId of the "base" unit.
+
+Unit databases
+--------------
+
+Units are stored in databases in order to be reused by other code:
+
+   UnitKey ---> UnitInfo { exposed modules, package name, package version
+                           component name, various file paths,
+                           dependencies :: [UnitKey], etc. }
+
+Because of the wired-in units described above, we can't exactly use UnitIds
+as UnitKeys in the database: if we did this, we could only have a single unit
+(compiled library) in the database for each wired-in library. As we want to
+support databases containing several different units for the same wired-in
+library, we do this:
+
+   * for non wired-in units:
+      * UnitId = UnitKey = Identifier (hash) computed by Cabal
+
+   * for wired-in units:
+      * UnitKey = Identifier computed by Cabal (just like for non wired-in units)
+      * UnitId  = unit-id specified with -this-unit-id command-line flag
+
+We can expose several units to GHC via the `-package-id <unit-key>`
+command-line flag. We must use the UnitKeys of the units so that GHC can find
+them in the database.
+
+During unit loading, GHC replaces UnitKeys with UnitIds. It identifies wired-in
+units by their package name (stored in their UnitInfo) and uses wired-in UnitIds
+for them.
+
+For example, knowing that "base", "ghc-prim" and "rts" are wired-in units, the
+following dependency graph expressed with database UnitKeys will be transformed
+into a similar graph expressed with UnitIds:
+
+   UnitKeys
+   ~~~~~~~~                      ----------> rts-1.0-hashABC <--
+                                 |                             |
+                                 |                             |
+   foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashUVW
+
+   UnitIds
+   ~~~~~~~               ---------------> rts <--
+                         |                      |
+                         |                      |
+   foo-2.0-hash123 --> base ---------------> ghc-prim
+
+
+Note that "foo-2.0-hash123" isn't wired-in so its UnitId is the same as its UnitKey.
+
+
+Module signatures / indefinite units / instantiated units
+---------------------------------------------------------
+
+GHC distinguishes two kinds of units:
+
+   * definite units:
+      * units without module holes and with definite dependencies
+      * can be compiled into machine code (.o/.a/.so/.dll/...)
+
+   * indefinite units:
+      * units with some module holes or with some indefinite dependencies
+      * can only be type-checked
+
+Module holes are constrained by module signatures (.hsig files). Module
+signatures are a kind of interface (similar to .hs-boot files). They are used in
+place of some real code. GHC allows modules from other units to be used to fill
+these module holes: the process is called "unit/module instantiation". The
+instantiating module may either be a concrete module or a module signature. In
+the latter case, the signatures are merged to form a new one.
+
+You can think of this as polymorphism at the module level: module signatures
+give constraints on the "type" of module that can be used to fill the hole
+(where "type" means types of the exported module entities, etc.).
+
+Module signatures contain enough information (datatypes, abstract types, type
+synonyms, classes, etc.) to typecheck modules depending on them but not
+enough to compile them. As such, indefinite units found in databases only
+provide module interfaces (the .hi ones this time), not object code.
+
+To distinguish between indefinite and definite unit ids at the type level, we
+respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically
+wrappers over 'UnitId'.
+
+Unit instantiation / on-the-fly instantiation
+---------------------------------------------
+
+Indefinite units can be instantiated with modules from other units. The
+instantiating units can also be instantiated themselves (if they are
+indefinite) and so on.
+
+On-the-fly unit instantiation is a tricky optimization explained in
+http://blog.ezyang.com/2016/08/optimizing-incremental-compilation
+Here is a summary:
+
+   1. Indefinite units can only be type-checked, not compiled into real code.
+   Type-checking produces interface files (.hi) which are incomplete for code
+   generation (they lack unfoldings, etc.) but enough to perform type-checking
+   of units depending on them.
+
+   2. Type-checking an instantiated unit is cheap as we only have to merge
+   interface files (.hi) of the instantiated unit and of the instantiating
+   units, hence it can be done on-the-fly. Interface files of the dependencies
+   can be concrete or produced on-the-fly recursively.
+
+   3. When we compile a unit, we mustn't use interfaces produced by the
+   type-checker (on-the-fly or not) for the instantiated unit dependencies
+   because they lack some information.
+
+   4. When we type-check an indefinite unit, we must be consistent about the
+   interfaces we use for each dependency: only those produced by the
+   type-checker (on-the-fly or not) or only those produced after a full
+   compilation, but not both at the same time.
+
+   It can be tricky if we have the following kind of dependency graph:
+
+      X (indefinite) ------> D (definite, compiled) -----> I (instantiated, definite, compiled)
+      |----------------------------------------------------^
+
+   Suppose we want to type-check unit X which depends on unit I and D:
+      * I is definite and compiled: we have compiled .hi files for its modules on disk
+      * I is instantiated: it is cheap to produce type-checker .hi files for its modules on-the-fly
+
+   But we must not do:
+
+      X (indefinite) ------> D (definite, compiled) -----> I (instantiated, definite, compiled)
+      |--------------------------------------------------> I (instantiated on-the-fly)
+
+      ==> inconsistent module interfaces for I
+
+   Nor:
+
+      X (indefinite) ------> D (definite, compiled) -------v
+      |--------------------------------------------------> I (instantiated on-the-fly)
+
+      ==> D's interfaces may refer to things that only exist in I's *compiled* interfaces
+
+   An alternative would be to store both type-checked and compiled interfaces
+   for every compiled non-instantiated unit (instantiated unit can be done
+   on-the-fly) so that we could use type-checked interfaces of D in the
+   example above. But it would increase compilation time and unit size.
+
+
+The 'Unit' datatype represents a unit which may have been instantiated
+on-the-fly:
+
+   data Unit = RealUnit DefUnitId         -- use compiled interfaces on disk
+             | VirtUnit InstantiatedUnit  -- use on-the-fly instantiation
+
+'InstantiatedUnit' has two interesting fields:
+
+   * instUnitInstanceOf :: IndefUnitId
+      -- ^ the indefinite unit that is instantiated
+
+   * instUnitInsts :: [(ModuleName,(Unit,ModuleName))]
+      -- ^ a list of instantiations, where an instantiation is:
+           (module hole name, (instantiating unit, instantiating module name))
+
+A 'VirtUnit' may be indefinite or definite: it depends on whether some holes
+remain in the instantiated unit OR in the instantiating units (recursively).
+Having a fully instantiated (i.e. definite) virtual unit can lead to some issues
+if there is a matching compiled unit in the preload closure.  See Note [VirtUnit
+to RealUnit improvement]
+
+Unit database and indefinite units
+----------------------------------
+
+We don't store partially instantiated units in the unit database.  Units in the
+database are either:
+
+   * definite (fully instantiated or without holes): in this case we have
+     *compiled* module interfaces (.hi) and object codes (.o/.a/.so/.dll/...).
+
+   * fully indefinite (not instantiated at all): in this case we only have
+     *type-checked* module interfaces (.hi).
+
+Note that indefinite units are stored as an instantiation of themselves where
+each instantiating module is a module variable (see Note [Representation of
+module/name variables]). E.g.
+
+   "xyz" (UnitKey) ---> UnitInfo { instanceOf       = "xyz"
+                                 , instantiatedWith = [A=<A>,B=<B>...]
+                                 , ...
+                                 }
+
+Note that non-instantiated units are also stored as an instantiation of
+themselves.  It is a remnant of previous terminology (when "instanceOf" was
+"componentId"). E.g.
+
+   "xyz" (UnitKey) ---> UnitInfo { instanceOf       = "xyz"
+                                 , instantiatedWith = []
+                                 , ...
+                                 }
+
+TODO: We should probably have `instanceOf :: Maybe IndefUnitId` instead.
+
+
+Pretty-printing UnitId
+----------------------
+
+GHC mostly deals with UnitIds which are some opaque strings. We could display
+them when we pretty-print a module origin, a name, etc. But it wouldn't be
+very friendly to the user because of the hash they usually contain. E.g.
+
+   foo-4.18.1:thelib-XYZsomeUglyHashABC
+
+Instead when we want to pretty-print a 'UnitId' we query the database to
+get the 'UnitInfo' and print something nicer to the user:
+
+   foo-4.18.1:thelib
+
+We do the same for wired-in units.
+
+Currently (2020-04-06), we don't thread the database into every function that
+pretty-prints a Name/Module/Unit. Instead querying the database is delayed
+until the `SDoc` is transformed into a `Doc` using the database that is
+active at this point in time. This is an issue because we want to be able to
+unload units from the database and we also want to support several
+independent databases loaded at the same time (see #14335). The alternatives
+we have are:
+
+   * threading the database into every function that pretty-prints a UnitId
+   for the user (directly or indirectly).
+
+   * storing enough info to correctly display a UnitId into the UnitId
+   datatype itself. This is done in the IndefUnitId wrapper (see
+   'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined
+   'UnitId' for wired-in units would have empty UnitPprInfo so we need to
+   find some places to update them if we want to display wired-in UnitId
+   correctly. This leads to a solution similar to the first one above.
+
+Note [VirtUnit to RealUnit improvement]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Over the course of instantiating VirtUnits on the fly while typechecking an
+indefinite library, we may end up with a fully instantiated VirtUnit. I.e.
+one that could be compiled and installed in the database. During
+type-checking we generate a virtual UnitId for it, say "abc".
+
+Now the question is: do we have a matching installed unit in the database?
+Suppose we have one with UnitId "xyz" (provided by Cabal so we don't know how
+to generate it). The trouble is that if both units end up being used in the
+same type-checking session, their names won't match (e.g. "abc:M.X" vs
+"xyz:M.X").
+
+As we want them to match we just replace the virtual unit with the installed
+one: for some reason this is called "improvement".
+
+There is one last niggle: improvement based on the unit database means
+that we might end up developing on a unit that is not transitively
+depended upon by the units the user specified directly via command line
+flags.  This could lead to strange and difficult to understand bugs if those
+instantiations are out of date.  The solution is to only improve a
+unit id if the new unit id is part of the 'preloadClosure'; i.e., the
+closure of all the units which were explicitly specified.
+
+Note [Representation of module/name variables]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In our ICFP'16 paper, we use <A> to represent module holes, and {A.T} to represent
+name holes.  This could have been represented by adding some new cases
+to the core data types, but this would have made the existing 'moduleName'
+and 'moduleUnit' partial, which would have required a lot of modifications
+to existing code.
+
+Instead, we use a fake "hole" unit:
+
+     <A>   ===> hole:A
+     {A.T} ===> hole:A.T
+
+This encoding is quite convenient, but it is also a bit dangerous,
+because if you have a 'hole:A' you need to know if it's actually a
+'Module' or just a module stored in a 'Name'; these two cases must be
+treated differently when doing substitutions.  'renameHoleModule'
+and 'renameHoleUnit' assume they are NOT operating on a
+'Name'; 'NameShape' handles name substitutions exclusively.
+
+-}
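
The "Unit databases" section of the Note describes how GHC turns a dependency graph keyed by UnitKeys into one keyed by UnitIds. The following is only a minimal sketch of that idea and is not GHC's implementation: the types UnitKey, UnitId, UnitInfo and UnitDatabase are simplified stand-ins, and wiredInNames, wireKey, wiredGraph and exampleDb are hypothetical names introduced for illustration. It also assumes that the UnitId of a wired-in unit equals its package name, as in the Note's example.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical, simplified stand-ins for GHC's real types.
newtype UnitKey = UnitKey String deriving (Eq, Ord, Show)
newtype UnitId  = UnitId  String deriving (Eq, Ord, Show)

-- Only the two fields needed for this sketch.
data UnitInfo = UnitInfo
  { unitPackageName :: String
  , unitDepends     :: [UnitKey]
  } deriving Show

type UnitDatabase = Map.Map UnitKey UnitInfo

-- Assumption: only the wired-in packages named in the Note's example.
wiredInNames :: [String]
wiredInNames = ["base", "ghc-prim", "rts"]

-- Map a database key to the UnitId used for compilation.
-- Wired-in units are detected by package name; for every other unit
-- (or an unknown key) the UnitId is the same string as the UnitKey.
wireKey :: UnitDatabase -> UnitKey -> UnitId
wireKey db key@(UnitKey k) =
  case Map.lookup key db of
    Just info
      | unitPackageName info `elem` wiredInNames
      -> UnitId (unitPackageName info)
    _ -> UnitId k

-- Re-express the whole dependency graph with UnitIds.
wiredGraph :: UnitDatabase -> Map.Map UnitId [UnitId]
wiredGraph db = Map.fromList
  [ (wireKey db k, map (wireKey db) (unitDepends info))
  | (k, info) <- Map.toList db
  ]

-- The database from the Note's example graph.
exampleDb :: UnitDatabase
exampleDb = Map.fromList
  [ ( UnitKey "foo-2.0-hash123"
    , UnitInfo "foo" [UnitKey "base-4.1-hashXYZ"] )
  , ( UnitKey "base-4.1-hashXYZ"
    , UnitInfo "base" [ UnitKey "rts-1.0-hashABC"
                      , UnitKey "ghc-prim-0.5.3-hashUVW" ] )
  , ( UnitKey "ghc-prim-0.5.3-hashUVW"
    , UnitInfo "ghc-prim" [UnitKey "rts-1.0-hashABC"] )
  , ( UnitKey "rts-1.0-hashABC"
    , UnitInfo "rts" [] )
  ]
```

Loading this in GHCi and evaluating `wiredGraph exampleDb` gives the second graph of the Note: "foo-2.0-hash123" keeps its key as UnitId, while the three wired-in units are renamed. Detecting wired-in units by package name is what lets a database hold several builds of the same wired-in library under different UnitKeys.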