summaryrefslogtreecommitdiff
path: root/compiler/GHC/Unit.hs
diff options
context:
space:
mode:
Diffstat (limited to 'compiler/GHC/Unit.hs')
-rw-r--r--compiler/GHC/Unit.hs257
1 files changed, 257 insertions, 0 deletions
diff --git a/compiler/GHC/Unit.hs b/compiler/GHC/Unit.hs
new file mode 100644
index 0000000000..0051aa3087
--- /dev/null
+++ b/compiler/GHC/Unit.hs
@@ -0,0 +1,257 @@
+{-# LANGUAGE TypeSynonymInstances #-}
+{-# LANGUAGE FlexibleInstances #-}
+{-# LANGUAGE DeriveFunctor #-}
+
+-- | Units are library components from Cabal packages compiled and installed in
+-- a database
+module GHC.Unit
+ ( module GHC.Unit.Types
+ , module GHC.Unit.Info
+ , module GHC.Unit.Parser
+ , module GHC.Unit.State
+ , module GHC.Unit.Subst
+ , module GHC.Unit.Module
+ )
+where
+
+import GHC.Unit.Types
+import GHC.Unit.Info
+import GHC.Unit.Parser
+import GHC.Unit.State
+import GHC.Unit.Subst
+import GHC.Unit.Module
+
+-- Note [About Units]
+-- ~~~~~~~~~~~~~~~~~~
+--
+-- Haskell users are used to manipulate Cabal packages. These packages are
+-- identified by:
+-- - a package name :: String
+-- - a package version :: Version
+-- - (a revision number, when they are registered on Hackage)
+--
+-- Cabal packages may contain several components (libraries, programs,
+-- testsuites). In GHC we are mostly interested in libraries because those are
+-- the components that can be depended upon by other components. Components in a
+-- package are identified by their component name. Historically only one library
+-- component was allowed per package, hence it didn't need a name. For this
+-- reason, component name may be empty for one library component in each
+-- package:
+-- - a component name :: Maybe String
+--
+-- UnitId
+-- ------
+--
+-- Cabal libraries can be compiled in various ways (different compiler options
+-- or Cabal flags, different dependencies, etc.), hence using package name,
+-- package version and component name isn't enough to identify a built library.
+-- We use another identifier called UnitId:
+--
+-- package name \
+-- package version | ________
+-- component name | hash of all this ==> | UnitId |
+-- Cabal flags | --------
+-- compiler options |
+-- dependencies' UnitId /
+--
+-- Fortunately GHC doesn't have to generate these UnitId: they are provided by
+-- external build tools (e.g. Cabal) with `-this-unit-id` command-line parameter.
+--
+-- UnitIds are important because they are used to generate internal names
+-- (symbols, etc.).
+--
+-- Wired-in units
+-- --------------
+--
+-- Certain libraries are known to the compiler, in that we know about certain
+-- entities that reside in these libraries. The compiler needs to declare static
+-- Modules and Names that refer to units built from these libraries.
+--
+-- Hence UnitIds of wired-in libraries are fixed. Instead of letting Cabal chose
+-- the UnitId for these libraries, their .cabal file uses the following stanza to
+-- force it to a specific value:
+--
+-- ghc-options: -this-unit-id ghc-prim -- taken from ghc-prim.cabal
+--
+-- The RTS also uses entities of wired-in units by directly referring to symbols
+-- such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is
+-- the UnitId of "base" unit.
+--
+-- Unit databases
+-- --------------
+--
+-- Units are stored in databases in order to be reused by other codes:
+--
+-- UnitKey ---> UnitInfo { exposed modules, package name, package version
+-- component name, various file paths,
+-- dependencies :: [UnitKey], etc. }
+--
+-- Because of the wired-in units described above, we can't exactly use UnitIds
+-- as UnitKeys in the database: if we did this, we could only have a single unit
+-- (compiled library) in the database for each wired-in library. As we want to
+-- support databases containing several different units for the same wired-in
+-- library, we do this:
+--
+-- * for non wired-in units:
+-- * UnitId = UnitKey = Identifier (hash) computed by Cabal
+--
+-- * for wired-in units:
+-- * UnitKey = Identifier computed by Cabal (just like for non wired-in units)
+-- * UnitId = unit-id specified with -this-unit-id command-line flag
+--
+-- We can expose several units to GHC via the `package-id <UnitKey>`
+-- command-line parameter. We must use the UnitKeys of the units so that GHC can
+-- find them in the database.
+--
+-- GHC then replaces the UnitKeys with UnitIds by taking into account wired-in
+-- units: these units are detected thanks to their UnitInfo (especially their
+-- package name).
+--
+-- For example, knowing that "base", "ghc-prim" and "rts" are wired-in packages,
+-- the following dependency graph expressed with UnitKeys (as found in the
+-- database) will be transformed into a similar graph expressed with UnitIds
+-- (that are what matters for compilation):
+--
+-- UnitKeys
+-- ~~~~~~~~ ---> rts-1.0-hashABC <--
+-- | |
+-- | |
+-- foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashABC
+--
+-- UnitIds
+-- ~~~~~~~ ---> rts <--
+-- | |
+-- | |
+-- foo-2.0-hash123 --> base ---------------> ghc-prim
+--
+--
+-- Module signatures / indefinite units / instantiated units
+-- ---------------------------------------------------------
+--
+-- GHC distinguishes two kinds of units:
+--
+-- * definite: units for which every module has an associated code object
+-- (i.e. real compiled code in a .o/.a/.so/.dll/...)
+--
+-- * indefinite: units for which some modules are replaced by module
+-- signatures.
+--
+-- Module signatures are a kind of interface (similar to .hs-boot files). They
+-- are used in place of some real code. GHC allows real modules from other
+-- units to be used to fill these module holes. The process is called
+-- "unit/module instantiation".
+--
+-- You can think of this as polymorphism at the module level: module signatures
+-- give constraints on the "type" of module that can be used to fill the hole
+-- (where "type" means types of the exported module entitites, etc.).
+--
+-- Module signatures contain enough information (datatypes, abstract types, type
+-- synonyms, classes, etc.) to typecheck modules depending on them but not
+-- enough to compile them. As such, indefinite units found in databases only
+-- provide module interfaces (the .hi ones this time), not object code.
+--
+-- To distinguish between indefinite and finite unit ids at the type level, we
+-- respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically
+-- wrappers over 'UnitId'.
+--
+-- Unit instantiation
+-- ------------------
+--
+-- Indefinite units can be instantiated with modules from other units. The
+-- instantiating units can also be instantiated themselves (if there are
+-- indefinite) and so on. The 'Unit' datatype represents a unit which may have
+-- been instantiated:
+--
+-- data Unit = RealUnit DefUnitId
+-- | VirtUnit InstantiatedUnit
+--
+-- 'InstantiatedUnit' has two interesting fields:
+--
+-- * instUnitInstanceOf :: IndefUnitId
+-- -- ^ the indefinite unit that is instantiated
+--
+-- * instUnitInsts :: [(ModuleName,(Unit,ModuleName)]
+-- -- ^ a list of instantiations, where an instantiation is:
+-- (module hole name, (instantiating unit, instantiating module name))
+--
+-- A 'Unit' may be indefinite or definite, it depends on whether some holes
+-- remain in the instantiated unit OR in the instantiating units (recursively).
+--
+-- Pretty-printing UnitId
+-- ----------------------
+--
+-- GHC mostly deals with UnitIds which are some opaque strings. We could display
+-- them when we pretty-print a module origin, a name, etc. But it wouldn't be
+-- very friendly to the user because of the hash they usually contain. E.g.
+--
+-- foo-4.18.1:thelib-XYZsomeUglyHashABC
+--
+-- Instead when we want to pretty-print a 'UnitId' we query the database to
+-- get the 'UnitInfo' and print something nicer to the user:
+--
+-- foo-4.18.1:thelib
+--
+-- We do the same for wired-in units.
+--
+-- Currently (2020-04-06), we don't thread the database into every function that
+-- pretty-prints a Name/Module/Unit. Instead querying the database is delayed
+-- until the `SDoc` is transformed into a `Doc` using the database that is
+-- active at this point in time. This is an issue because we want to be able to
+-- unload units from the database and we also want to support several
+-- independent databases loaded at the same time (see #14335). The alternatives
+-- we have are:
+--
+-- * threading the database into every function that pretty-prints a UnitId
+-- for the user (directly or indirectly).
+--
+-- * storing enough info to correctly display a UnitId into the UnitId
+-- datatype itself. This is done in the IndefUnitId wrapper (see
+-- 'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined
+-- 'UnitId' for wired-in units would have empty UnitPprInfo so we need to
+-- find some places to update them if we want to display wired-in UnitId
+-- correctly. This leads to a solution similar to the first one above.
+--
+-- Note [VirtUnit to RealUnit improvement]
+-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+--
+-- Over the course of instantiating VirtUnits on the fly while typechecking an
+-- indefinite library, we may end up with a fully instantiated VirtUnit. I.e.
+-- one that could be compiled and installed in the database. During
+-- type-checking we generate a virtual UnitId for it, say "abc".
+--
+-- Now the question is: do we have a matching installed unit in the database?
+-- Suppose we have one with UnitId "xyz" (provided by Cabal so we don't know how
+-- to generate it). The trouble is that if both units end up being used in the
+-- same type-checking session, their names won't match (e.g. "abc:M.X" vs
+-- "xyz:M.X").
+--
+-- As we want them to match we just replace the virtual unit with the installed
+-- one: for some reason this is called "improvement".
+--
+-- There is one last niggle: improvement based on the package database means
+-- that we might end up developing on a package that is not transitively
+-- depended upon by the packages the user specified directly via command line
+-- flags. This could lead to strange and difficult to understand bugs if those
+-- instantiations are out of date. The solution is to only improve a
+-- unit id if the new unit id is part of the 'preloadClosure'; i.e., the
+-- closure of all the packages which were explicitly specified.
+
+-- Note [Representation of module/name variables]
+-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+-- In our ICFP'16, we use <A> to represent module holes, and {A.T} to represent
+-- name holes. This could have been represented by adding some new cases
+-- to the core data types, but this would have made the existing 'moduleName'
+-- and 'moduleUnit' partial, which would have required a lot of modifications
+-- to existing code.
+--
+-- Instead, we use a fake "hole" unit:
+--
+-- <A> ===> hole:A
+-- {A.T} ===> hole:A.T
+--
+-- This encoding is quite convenient, but it is also a bit dangerous too,
+-- because if you have a 'hole:A' you need to know if it's actually a
+-- 'Module' or just a module stored in a 'Name'; these two cases must be
+-- treated differently when doing substitutions. 'renameHoleModule'
+-- and 'renameHoleUnit' assume they are NOT operating on a
+-- 'Name'; 'NameShape' handles name substitutions exclusively.