author Sylvain Henry <sylvain@haskus.fr> 2020-05-07 15:35:10 +0200
committer Marge Bot <ben+marge-bot@smart-cactus.org> 2020-05-26 03:04:45 -0400
commit cf772f19c06944f0fd03b4bdcd4a49e437084ba5 (patch)
tree 90d224ac03752cb0500533dfad7ed4d9d89c58ed
parent 6604906c8cfa37f5780a6d5c40506b751b1740db (diff)
Enhance Note [About units] for Backpack
-rw-r--r-- compiler/GHC/Unit.hs 565
1 files changed, 331 insertions, 234 deletions
diff --git a/compiler/GHC/Unit.hs b/compiler/GHC/Unit.hs
index 0051aa3087..4e9710e239 100644
--- a/compiler/GHC/Unit.hs
+++ b/compiler/GHC/Unit.hs
@@ -21,237 +21,334 @@ import GHC.Unit.State
import GHC.Unit.Subst
import GHC.Unit.Module
--- Note [About Units]
--- ~~~~~~~~~~~~~~~~~~
---
--- Haskell users are used to manipulate Cabal packages. These packages are
--- identified by:
--- - a package name :: String
--- - a package version :: Version
--- - (a revision number, when they are registered on Hackage)
---
--- Cabal packages may contain several components (libraries, programs,
--- testsuites). In GHC we are mostly interested in libraries because those are
--- the components that can be depended upon by other components. Components in a
--- package are identified by their component name. Historically only one library
--- component was allowed per package, hence it didn't need a name. For this
--- reason, component name may be empty for one library component in each
--- package:
--- - a component name :: Maybe String
---
--- UnitId
--- ------
---
--- Cabal libraries can be compiled in various ways (different compiler options
--- or Cabal flags, different dependencies, etc.), hence using package name,
--- package version and component name isn't enough to identify a built library.
--- We use another identifier called UnitId:
---
--- package name \
--- package version | ________
--- component name | hash of all this ==> | UnitId |
--- Cabal flags | --------
--- compiler options |
--- dependencies' UnitId /
---
--- Fortunately GHC doesn't have to generate these UnitId: they are provided by
--- external build tools (e.g. Cabal) with `-this-unit-id` command-line parameter.
---
--- UnitIds are important because they are used to generate internal names
--- (symbols, etc.).
---
--- Wired-in units
--- --------------
---
--- Certain libraries are known to the compiler, in that we know about certain
--- entities that reside in these libraries. The compiler needs to declare static
--- Modules and Names that refer to units built from these libraries.
---
--- Hence UnitIds of wired-in libraries are fixed. Instead of letting Cabal chose
--- the UnitId for these libraries, their .cabal file uses the following stanza to
--- force it to a specific value:
---
--- ghc-options: -this-unit-id ghc-prim -- taken from ghc-prim.cabal
---
--- The RTS also uses entities of wired-in units by directly referring to symbols
--- such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is
--- the UnitId of "base" unit.
---
--- Unit databases
--- --------------
---
--- Units are stored in databases in order to be reused by other codes:
---
--- UnitKey ---> UnitInfo { exposed modules, package name, package version
--- component name, various file paths,
--- dependencies :: [UnitKey], etc. }
---
--- Because of the wired-in units described above, we can't exactly use UnitIds
--- as UnitKeys in the database: if we did this, we could only have a single unit
--- (compiled library) in the database for each wired-in library. As we want to
--- support databases containing several different units for the same wired-in
--- library, we do this:
---
--- * for non wired-in units:
--- * UnitId = UnitKey = Identifier (hash) computed by Cabal
---
--- * for wired-in units:
--- * UnitKey = Identifier computed by Cabal (just like for non wired-in units)
--- * UnitId = unit-id specified with -this-unit-id command-line flag
---
--- We can expose several units to GHC via the `package-id <UnitKey>`
--- command-line parameter. We must use the UnitKeys of the units so that GHC can
--- find them in the database.
---
--- GHC then replaces the UnitKeys with UnitIds by taking into account wired-in
--- units: these units are detected thanks to their UnitInfo (especially their
--- package name).
---
--- For example, knowing that "base", "ghc-prim" and "rts" are wired-in packages,
--- the following dependency graph expressed with UnitKeys (as found in the
--- database) will be transformed into a similar graph expressed with UnitIds
--- (that are what matters for compilation):
---
--- UnitKeys
--- ~~~~~~~~ ---> rts-1.0-hashABC <--
--- | |
--- | |
--- foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashABC
---
--- UnitIds
--- ~~~~~~~ ---> rts <--
--- | |
--- | |
--- foo-2.0-hash123 --> base ---------------> ghc-prim
---
---
--- Module signatures / indefinite units / instantiated units
--- ---------------------------------------------------------
---
--- GHC distinguishes two kinds of units:
---
--- * definite: units for which every module has an associated code object
--- (i.e. real compiled code in a .o/.a/.so/.dll/...)
---
--- * indefinite: units for which some modules are replaced by module
--- signatures.
---
--- Module signatures are a kind of interface (similar to .hs-boot files). They
--- are used in place of some real code. GHC allows real modules from other
--- units to be used to fill these module holes. The process is called
--- "unit/module instantiation".
---
--- You can think of this as polymorphism at the module level: module signatures
--- give constraints on the "type" of module that can be used to fill the hole
--- (where "type" means types of the exported module entitites, etc.).
---
--- Module signatures contain enough information (datatypes, abstract types, type
--- synonyms, classes, etc.) to typecheck modules depending on them but not
--- enough to compile them. As such, indefinite units found in databases only
--- provide module interfaces (the .hi ones this time), not object code.
---
--- To distinguish between indefinite and finite unit ids at the type level, we
--- respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically
--- wrappers over 'UnitId'.
---
--- Unit instantiation
--- ------------------
---
--- Indefinite units can be instantiated with modules from other units. The
--- instantiating units can also be instantiated themselves (if there are
--- indefinite) and so on. The 'Unit' datatype represents a unit which may have
--- been instantiated:
---
--- data Unit = RealUnit DefUnitId
--- | VirtUnit InstantiatedUnit
---
--- 'InstantiatedUnit' has two interesting fields:
---
--- * instUnitInstanceOf :: IndefUnitId
--- -- ^ the indefinite unit that is instantiated
---
--- * instUnitInsts :: [(ModuleName,(Unit,ModuleName)]
--- -- ^ a list of instantiations, where an instantiation is:
--- (module hole name, (instantiating unit, instantiating module name))
---
--- A 'Unit' may be indefinite or definite, it depends on whether some holes
--- remain in the instantiated unit OR in the instantiating units (recursively).
---
--- Pretty-printing UnitId
--- ----------------------
---
--- GHC mostly deals with UnitIds which are some opaque strings. We could display
--- them when we pretty-print a module origin, a name, etc. But it wouldn't be
--- very friendly to the user because of the hash they usually contain. E.g.
---
--- foo-4.18.1:thelib-XYZsomeUglyHashABC
---
--- Instead when we want to pretty-print a 'UnitId' we query the database to
--- get the 'UnitInfo' and print something nicer to the user:
---
--- foo-4.18.1:thelib
---
--- We do the same for wired-in units.
---
--- Currently (2020-04-06), we don't thread the database into every function that
--- pretty-prints a Name/Module/Unit. Instead querying the database is delayed
--- until the `SDoc` is transformed into a `Doc` using the database that is
--- active at this point in time. This is an issue because we want to be able to
--- unload units from the database and we also want to support several
--- independent databases loaded at the same time (see #14335). The alternatives
--- we have are:
---
--- * threading the database into every function that pretty-prints a UnitId
--- for the user (directly or indirectly).
---
--- * storing enough info to correctly display a UnitId into the UnitId
--- datatype itself. This is done in the IndefUnitId wrapper (see
--- 'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined
--- 'UnitId' for wired-in units would have empty UnitPprInfo so we need to
--- find some places to update them if we want to display wired-in UnitId
--- correctly. This leads to a solution similar to the first one above.
---
--- Note [VirtUnit to RealUnit improvement]
--- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
---
--- Over the course of instantiating VirtUnits on the fly while typechecking an
--- indefinite library, we may end up with a fully instantiated VirtUnit. I.e.
--- one that could be compiled and installed in the database. During
--- type-checking we generate a virtual UnitId for it, say "abc".
---
--- Now the question is: do we have a matching installed unit in the database?
--- Suppose we have one with UnitId "xyz" (provided by Cabal so we don't know how
--- to generate it). The trouble is that if both units end up being used in the
--- same type-checking session, their names won't match (e.g. "abc:M.X" vs
--- "xyz:M.X").
---
--- As we want them to match we just replace the virtual unit with the installed
--- one: for some reason this is called "improvement".
---
--- There is one last niggle: improvement based on the package database means
--- that we might end up developing on a package that is not transitively
--- depended upon by the packages the user specified directly via command line
--- flags. This could lead to strange and difficult to understand bugs if those
--- instantiations are out of date. The solution is to only improve a
--- unit id if the new unit id is part of the 'preloadClosure'; i.e., the
--- closure of all the packages which were explicitly specified.
-
--- Note [Representation of module/name variables]
--- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--- In our ICFP'16, we use <A> to represent module holes, and {A.T} to represent
--- name holes. This could have been represented by adding some new cases
--- to the core data types, but this would have made the existing 'moduleName'
--- and 'moduleUnit' partial, which would have required a lot of modifications
--- to existing code.
---
--- Instead, we use a fake "hole" unit:
---
--- <A> ===> hole:A
--- {A.T} ===> hole:A.T
---
--- This encoding is quite convenient, but it is also a bit dangerous too,
--- because if you have a 'hole:A' you need to know if it's actually a
--- 'Module' or just a module stored in a 'Name'; these two cases must be
--- treated differently when doing substitutions. 'renameHoleModule'
--- and 'renameHoleUnit' assume they are NOT operating on a
--- 'Name'; 'NameShape' handles name substitutions exclusively.
+{-
+
+Note [About Units]
+~~~~~~~~~~~~~~~~~~
+
+Haskell users are used to manipulating Cabal packages. These packages are
+identified by:
+ - a package name :: String
+ - a package version :: Version
+ - (a revision number, when they are registered on Hackage)
+
+Cabal packages may contain several components (libraries, programs,
+testsuites). In GHC we are mostly interested in libraries because those are
+the components that can be depended upon by other components. Components in a
+package are identified by their component name. Historically only one library
+component was allowed per package, hence it didn't need a name. For this
+reason, component name may be empty for one library component in each
+package:
+ - a component name :: Maybe String
+
+UnitId
+------
+
+Cabal libraries can be compiled in various ways (different compiler options
+or Cabal flags, different dependencies, etc.), hence using package name,
+package version and component name isn't enough to identify a built library.
+We use another identifier called UnitId:
+
+ package name \
+ package version | ________
+ component name | hash of all this ==> | UnitId |
+ Cabal flags | --------
+ compiler options |
+ dependencies' UnitId /
+
+Fortunately GHC doesn't have to generate these UnitIds: they are provided by
+external build tools (e.g. Cabal) with the `-this-unit-id` command-line parameter.
+
+UnitIds are important because they are used to generate internal names
+(symbols, etc.).
+
+Wired-in units
+--------------
+
+Certain libraries (ghc-prim, base, etc.) are known to the compiler and to the
+RTS as they provide some basic primitives. Hence UnitIds of wired-in libraries
+are fixed. Instead of letting Cabal choose the UnitId for these libraries, their
+.cabal file uses the following stanza to force it to a specific value:
+
+ ghc-options: -this-unit-id ghc-prim -- taken from ghc-prim.cabal
+
+The RTS also uses entities of wired-in units by directly referring to symbols
+such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is
+the UnitId of the "base" unit.
+
+Unit databases
+--------------
+
+Units are stored in databases in order to be reused by other code:
+
+ UnitKey ---> UnitInfo { exposed modules, package name, package version
+ component name, various file paths,
+ dependencies :: [UnitKey], etc. }
+
+Because of the wired-in units described above, we can't exactly use UnitIds
+as UnitKeys in the database: if we did this, we could only have a single unit
+(compiled library) in the database for each wired-in library. As we want to
+support databases containing several different units for the same wired-in
+library, we do this:
+
+ * for non wired-in units:
+ * UnitId = UnitKey = Identifier (hash) computed by Cabal
+
+ * for wired-in units:
+ * UnitKey = Identifier computed by Cabal (just like for non wired-in units)
+ * UnitId = unit-id specified with -this-unit-id command-line flag
+
+We can expose several units to GHC via the `-package-id <unit-key>` command-line
+parameter. We must use the UnitKeys of the units so that GHC can find them in
+the database.
+
+During unit loading, GHC replaces UnitKeys with UnitIds. It identifies wired-in
+units by their package name (stored in their UnitInfo) and uses wired-in UnitIds
+for them.
+
+For example, knowing that "base", "ghc-prim" and "rts" are wired-in units, the
+following dependency graph expressed with database UnitKeys will be transformed
+into a similar graph expressed with UnitIds:
+
+ UnitKeys
+ ~~~~~~~~ ----------> rts-1.0-hashABC <--
+ | |
+ | |
+ foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashUVW
+
+ UnitIds
+ ~~~~~~~ ---------------> rts <--
+ | |
+ | |
+ foo-2.0-hash123 --> base ---------------> ghc-prim
+
+
+Note that "foo-2.0-hash123" isn't wired-in so its UnitId is the same as its UnitKey.
+
+
+Module signatures / indefinite units / instantiated units
+---------------------------------------------------------
+
+GHC distinguishes two kinds of units:
+
+ * definite units:
+ * units without module holes and with definite dependencies
+ * can be compiled into machine code (.o/.a/.so/.dll/...)
+
+ * indefinite units:
+ * units with some module holes or with some indefinite dependencies
+ * can only be type-checked
+
+Module holes are constrained by module signatures (.hsig files). Module
+signatures are a kind of interface (similar to .hs-boot files). They are used in
+place of some real code. GHC allows modules from other units to be used to fill
+these module holes: the process is called "unit/module instantiation". The
+instantiating module may either be a concrete module or a module signature. In
+the latter case, the signatures are merged to form a new one.
+
+You can think of this as polymorphism at the module level: module signatures
+give constraints on the "type" of module that can be used to fill the hole
+(where "type" means types of the exported module entities, etc.).
+
+Module signatures contain enough information (datatypes, abstract types, type
+synonyms, classes, etc.) to typecheck modules depending on them but not
+enough to compile them. As such, indefinite units found in databases only
+provide module interfaces (the .hi ones this time), not object code.
+
+To distinguish between indefinite and definite unit ids at the type level, we
+respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically
+wrappers over 'UnitId'.
+
+Unit instantiation / on-the-fly instantiation
+---------------------------------------------
+
+Indefinite units can be instantiated with modules from other units. The
+instantiating units can also be instantiated themselves (if they are
+indefinite) and so on.
+
+On-the-fly unit instantiation is a tricky optimization explained in
+http://blog.ezyang.com/2016/08/optimizing-incremental-compilation
+Here is a summary:
+
+ 1. Indefinite units can only be type-checked, not compiled into real code.
+ Type-checking produces interface files (.hi) which are incomplete for code
+ generation (they lack unfoldings, etc.) but enough to perform type-checking
+ of units depending on them.
+
+ 2. Type-checking an instantiated unit is cheap as we only have to merge
+ interface files (.hi) of the instantiated unit and of the instantiating
+ units, hence it can be done on-the-fly. Interface files of the dependencies
+ can be concrete or produced on-the-fly recursively.
+
+ 3. When we compile a unit, we mustn't use interfaces produced by the
+ type-checker (on-the-fly or not) for the instantiated unit dependencies
+ because they lack some information.
+
+ 4. When we type-check an indefinite unit, we must be consistent about the
+ interfaces we use for each dependency: only those produced by the
+ type-checker (on-the-fly or not) or only those produced after a full
+ compilation, but not both at the same time.
+
+ It can be tricky if we have the following kind of dependency graph:
+
+ X (indefinite) ------> D (definite, compiled) -----> I (instantiated, definite, compiled)
+ |----------------------------------------------------^
+
+ Suppose we want to type-check unit X which depends on units I and D:
+ * I is definite and compiled: we have compiled .hi files for its modules on disk
+ * I is instantiated: it is cheap to produce type-checker .hi files for its modules on-the-fly
+
+ But we must not do:
+
+ X (indefinite) ------> D (definite, compiled) -----> I (instantiated, definite, compiled)
+ |--------------------------------------------------> I (instantiated on-the-fly)
+
+ ==> inconsistent module interfaces for I
+
+ Nor:
+
+ X (indefinite) ------> D (definite, compiled) -------v
+ |--------------------------------------------------> I (instantiated on-the-fly)
+
+ ==> D's interfaces may refer to things that only exist in I's *compiled* interfaces
+
+ An alternative would be to store both type-checked and compiled interfaces
+ for every compiled non-instantiated unit (instantiated units can be handled
+ on-the-fly) so that we could use type-checked interfaces of D in the
+ example above. But it would increase compilation time and unit size.
+
+
+The 'Unit' datatype represents a unit which may have been instantiated
+on-the-fly:
+
+ data Unit = RealUnit DefUnitId -- use compiled interfaces on disk
+ | VirtUnit InstantiatedUnit -- use on-the-fly instantiation
+
+'InstantiatedUnit' has two interesting fields:
+
+ * instUnitInstanceOf :: IndefUnitId
+ -- ^ the indefinite unit that is instantiated
+
+ * instUnitInsts :: [(ModuleName,(Unit,ModuleName))]
+ -- ^ a list of instantiations, where an instantiation is:
+ (module hole name, (instantiating unit, instantiating module name))
+
+A 'VirtUnit' may be indefinite or definite: it depends on whether some holes
+remain in the instantiated unit OR in the instantiating units (recursively).
+Having a fully instantiated (i.e. definite) virtual unit can lead to some issues
+if there is a matching compiled unit in the preload closure. See Note [VirtUnit
+to RealUnit improvement].
+
+Unit database and indefinite units
+----------------------------------
+
+We don't store partially instantiated units in the unit database. Units in the
+database are either:
+
+ * definite (fully instantiated or without holes): in this case we have
+ *compiled* module interfaces (.hi) and object code (.o/.a/.so/.dll/...).
+
+ * fully indefinite (not instantiated at all): in this case we only have
+ *type-checked* module interfaces (.hi).
+
+Note that indefinite units are stored as an instantiation of themselves where
+each instantiating module is a module variable (see Note [Representation of
+module/name variables]). E.g.
+
+ "xyz" (UnitKey) ---> UnitInfo { instanceOf = "xyz"
+ , instantiatedWith = [A=<A>,B=<B>...]
+ , ...
+ }
+
+Note that non-instantiated units are also stored as an instantiation of
+themselves. It is a remnant of previous terminology (when "instanceOf" was
+"componentId"). E.g.
+
+ "xyz" (UnitKey) ---> UnitInfo { instanceOf = "xyz"
+ , instantiatedWith = []
+ , ...
+ }
+
+TODO: We should probably have `instanceOf :: Maybe IndefUnitId` instead.
+
+
+Pretty-printing UnitId
+----------------------
+
+GHC mostly deals with UnitIds which are some opaque strings. We could display
+them when we pretty-print a module origin, a name, etc. But it wouldn't be
+very friendly to the user because of the hash they usually contain. E.g.
+
+ foo-4.18.1:thelib-XYZsomeUglyHashABC
+
+Instead when we want to pretty-print a 'UnitId' we query the database to
+get the 'UnitInfo' and print something nicer to the user:
+
+ foo-4.18.1:thelib
+
+We do the same for wired-in units.
+
+Currently (2020-04-06), we don't thread the database into every function that
+pretty-prints a Name/Module/Unit. Instead querying the database is delayed
+until the `SDoc` is transformed into a `Doc` using the database that is
+active at this point in time. This is an issue because we want to be able to
+unload units from the database and we also want to support several
+independent databases loaded at the same time (see #14335). The alternatives
+we have are:
+
+ * threading the database into every function that pretty-prints a UnitId
+ for the user (directly or indirectly).
+
+ * storing enough info to correctly display a UnitId into the UnitId
+ datatype itself. This is done in the IndefUnitId wrapper (see
+ 'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined
+ 'UnitId' for wired-in units would have empty UnitPprInfo so we need to
+ find some places to update them if we want to display wired-in UnitId
+ correctly. This leads to a solution similar to the first one above.
+
+Note [VirtUnit to RealUnit improvement]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Over the course of instantiating VirtUnits on the fly while typechecking an
+indefinite library, we may end up with a fully instantiated VirtUnit. I.e.
+one that could be compiled and installed in the database. During
+type-checking we generate a virtual UnitId for it, say "abc".
+
+Now the question is: do we have a matching installed unit in the database?
+Suppose we have one with UnitId "xyz" (provided by Cabal so we don't know how
+to generate it). The trouble is that if both units end up being used in the
+same type-checking session, their names won't match (e.g. "abc:M.X" vs
+"xyz:M.X").
+
+As we want them to match we just replace the virtual unit with the installed
+one: for some reason this is called "improvement".
+
+There is one last niggle: improvement based on the unit database means
+that we might end up developing on a unit that is not transitively
+depended upon by the units the user specified directly via command line
+flags. This could lead to strange and difficult to understand bugs if those
+instantiations are out of date. The solution is to only improve a
+unit id if the new unit id is part of the 'preloadClosure'; i.e., the
+closure of all the units which were explicitly specified.
+
+Note [Representation of module/name variables]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In our ICFP'16 paper, we use <A> to represent module holes, and {A.T} to
+represent name holes. This could have been represented by adding some new cases
+to the core data types, but this would have made the existing 'moduleName'
+and 'moduleUnit' partial, which would have required a lot of modifications
+to existing code.
+
+Instead, we use a fake "hole" unit:
+
+ <A> ===> hole:A
+ {A.T} ===> hole:A.T
+
+This encoding is quite convenient, but it is also a bit dangerous,
+because if you have a 'hole:A' you need to know if it's actually a
+'Module' or just a module stored in a 'Name'; these two cases must be
+treated differently when doing substitutions. 'renameHoleModule'
+and 'renameHoleUnit' assume they are NOT operating on a
+'Name'; 'NameShape' handles name substitutions exclusively.
+
+-}
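
The following is not part of the patch. It is a minimal, self-contained sketch of two ideas from Note [About Units]: replacing database UnitKeys with UnitIds for wired-in units, and deciding whether a 'Unit' is definite by looking recursively at its instantiations. All types here are simplified, hypothetical stand-ins rather than GHC's real definitions: the database is a plain association list, 'RealUnit' wraps a bare UnitId, and 'HoleUnit' is an assumed constructor modelling the fake "hole" unit. The dependency edges in 'exampleDb' are one possible reading of the diagram in the Note.

```haskell
module Main where

-- Hypothetical, simplified stand-ins for GHC's types (not the real ones).
newtype UnitKey = UnitKey String deriving (Eq, Show)
newtype UnitId  = UnitId  String deriving (Eq, Show)
type ModuleName = String

-- A database entry reduced to the two fields this sketch needs.
data UnitInfo = UnitInfo
  { unitPackageName :: String
  , unitDepends     :: [UnitKey]
  }

-- The unit database modelled as an association list.
type UnitDb = [(UnitKey, UnitInfo)]

-- Package names treated as wired-in for this example only.
wiredInNames :: [String]
wiredInNames = ["rts", "ghc-prim", "base"]

-- Wired-in units get their fixed UnitId (here: the package name);
-- every other key, including one missing from the database, is kept as-is.
toUnitId :: UnitDb -> UnitKey -> UnitId
toUnitId db key@(UnitKey k) =
  case lookup key db of
    Just info | unitPackageName info `elem` wiredInNames
      -> UnitId (unitPackageName info)
    _ -> UnitId k

-- The dependency graph re-expressed with UnitIds.
unitIdGraph :: UnitDb -> [(UnitId, [UnitId])]
unitIdGraph db =
  [ (toUnitId db key, map (toUnitId db) (unitDepends info))
  | (key, info) <- db ]

-- Simplified model of 'Unit'; HoleUnit is an assumption of this sketch.
data Unit
  = RealUnit UnitId
  | VirtUnit InstantiatedUnit
  | HoleUnit

data InstantiatedUnit = InstantiatedUnit
  { instUnitInstanceOf :: UnitId
  , instUnitInsts      :: [(ModuleName, (Unit, ModuleName))]
  }

-- A unit is definite when no hole remains in it or, recursively,
-- in any of its instantiating units.
isDefinite :: Unit -> Bool
isDefinite (RealUnit _)  = True
isDefinite HoleUnit      = False
isDefinite (VirtUnit iu) = all (isDefinite . fst . snd) (instUnitInsts iu)

exampleDb :: UnitDb
exampleDb =
  [ (UnitKey "foo-2.0-hash123", UnitInfo "foo" [UnitKey "base-4.1-hashXYZ"])
  , (UnitKey "base-4.1-hashXYZ",
      UnitInfo "base" [UnitKey "rts-1.0-hashABC", UnitKey "ghc-prim-0.5.3-hashUVW"])
  , (UnitKey "ghc-prim-0.5.3-hashUVW", UnitInfo "ghc-prim" [UnitKey "rts-1.0-hashABC"])
  , (UnitKey "rts-1.0-hashABC", UnitInfo "rts" [])
  ]

-- "p" instantiated with a module from a real unit "q": no hole remains.
fullyInstantiated :: Unit
fullyInstantiated =
  VirtUnit (InstantiatedUnit (UnitId "p") [("A", (RealUnit (UnitId "q"), "A"))])

-- "p" instantiated with a module variable: a hole remains.
partiallyInstantiated :: Unit
partiallyInstantiated =
  VirtUnit (InstantiatedUnit (UnitId "p") [("A", (HoleUnit, "A"))])

main :: IO ()
main = do
  mapM_ print (unitIdGraph exampleDb)
  print (isDefinite fullyInstantiated, isDefinite partiallyInstantiated)
```

In this model "foo-2.0-hash123" keeps its key as its UnitId because "foo" is not in the wired-in list, which mirrors the remark in the Note that UnitId and UnitKey coincide for non wired-in units.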