diff options
Diffstat (limited to 'compiler/GHC/Unit.hs')
-rw-r--r-- | compiler/GHC/Unit.hs | 257 |
1 files changed, 257 insertions, 0 deletions
diff --git a/compiler/GHC/Unit.hs b/compiler/GHC/Unit.hs new file mode 100644 index 0000000000..0051aa3087 --- /dev/null +++ b/compiler/GHC/Unit.hs @@ -0,0 +1,257 @@ +{-# LANGUAGE TypeSynonymInstances #-} +{-# LANGUAGE FlexibleInstances #-} +{-# LANGUAGE DeriveFunctor #-} + +-- | Units are library components from Cabal packages compiled and installed in +-- a database +module GHC.Unit + ( module GHC.Unit.Types + , module GHC.Unit.Info + , module GHC.Unit.Parser + , module GHC.Unit.State + , module GHC.Unit.Subst + , module GHC.Unit.Module + ) +where + +import GHC.Unit.Types +import GHC.Unit.Info +import GHC.Unit.Parser +import GHC.Unit.State +import GHC.Unit.Subst +import GHC.Unit.Module + +-- Note [About Units] +-- ~~~~~~~~~~~~~~~~~~ +-- +-- Haskell users are used to manipulate Cabal packages. These packages are +-- identified by: +-- - a package name :: String +-- - a package version :: Version +-- - (a revision number, when they are registered on Hackage) +-- +-- Cabal packages may contain several components (libraries, programs, +-- testsuites). In GHC we are mostly interested in libraries because those are +-- the components that can be depended upon by other components. Components in a +-- package are identified by their component name. Historically only one library +-- component was allowed per package, hence it didn't need a name. For this +-- reason, component name may be empty for one library component in each +-- package: +-- - a component name :: Maybe String +-- +-- UnitId +-- ------ +-- +-- Cabal libraries can be compiled in various ways (different compiler options +-- or Cabal flags, different dependencies, etc.), hence using package name, +-- package version and component name isn't enough to identify a built library. +-- We use another identifier called UnitId: +-- +-- package name \ +-- package version | ________ +-- component name | hash of all this ==> | UnitId | +-- Cabal flags | -------- +-- compiler options | +-- dependencies' UnitId / +-- +-- Fortunately GHC doesn't have to generate these UnitId: they are provided by +-- external build tools (e.g. Cabal) with `-this-unit-id` command-line parameter. +-- +-- UnitIds are important because they are used to generate internal names +-- (symbols, etc.). +-- +-- Wired-in units +-- -------------- +-- +-- Certain libraries are known to the compiler, in that we know about certain +-- entities that reside in these libraries. The compiler needs to declare static +-- Modules and Names that refer to units built from these libraries. +-- +-- Hence UnitIds of wired-in libraries are fixed. Instead of letting Cabal chose +-- the UnitId for these libraries, their .cabal file uses the following stanza to +-- force it to a specific value: +-- +-- ghc-options: -this-unit-id ghc-prim -- taken from ghc-prim.cabal +-- +-- The RTS also uses entities of wired-in units by directly referring to symbols +-- such as "base_GHCziIOziException_heapOverflow_closure" where the prefix is +-- the UnitId of "base" unit. +-- +-- Unit databases +-- -------------- +-- +-- Units are stored in databases in order to be reused by other codes: +-- +-- UnitKey ---> UnitInfo { exposed modules, package name, package version +-- component name, various file paths, +-- dependencies :: [UnitKey], etc. } +-- +-- Because of the wired-in units described above, we can't exactly use UnitIds +-- as UnitKeys in the database: if we did this, we could only have a single unit +-- (compiled library) in the database for each wired-in library. As we want to +-- support databases containing several different units for the same wired-in +-- library, we do this: +-- +-- * for non wired-in units: +-- * UnitId = UnitKey = Identifier (hash) computed by Cabal +-- +-- * for wired-in units: +-- * UnitKey = Identifier computed by Cabal (just like for non wired-in units) +-- * UnitId = unit-id specified with -this-unit-id command-line flag +-- +-- We can expose several units to GHC via the `package-id <UnitKey>` +-- command-line parameter. We must use the UnitKeys of the units so that GHC can +-- find them in the database. +-- +-- GHC then replaces the UnitKeys with UnitIds by taking into account wired-in +-- units: these units are detected thanks to their UnitInfo (especially their +-- package name). +-- +-- For example, knowing that "base", "ghc-prim" and "rts" are wired-in packages, +-- the following dependency graph expressed with UnitKeys (as found in the +-- database) will be transformed into a similar graph expressed with UnitIds +-- (that are what matters for compilation): +-- +-- UnitKeys +-- ~~~~~~~~ ---> rts-1.0-hashABC <-- +-- | | +-- | | +-- foo-2.0-hash123 --> base-4.1-hashXYZ ---> ghc-prim-0.5.3-hashABC +-- +-- UnitIds +-- ~~~~~~~ ---> rts <-- +-- | | +-- | | +-- foo-2.0-hash123 --> base ---------------> ghc-prim +-- +-- +-- Module signatures / indefinite units / instantiated units +-- --------------------------------------------------------- +-- +-- GHC distinguishes two kinds of units: +-- +-- * definite: units for which every module has an associated code object +-- (i.e. real compiled code in a .o/.a/.so/.dll/...) +-- +-- * indefinite: units for which some modules are replaced by module +-- signatures. +-- +-- Module signatures are a kind of interface (similar to .hs-boot files). They +-- are used in place of some real code. GHC allows real modules from other +-- units to be used to fill these module holes. The process is called +-- "unit/module instantiation". +-- +-- You can think of this as polymorphism at the module level: module signatures +-- give constraints on the "type" of module that can be used to fill the hole +-- (where "type" means types of the exported module entitites, etc.). +-- +-- Module signatures contain enough information (datatypes, abstract types, type +-- synonyms, classes, etc.) to typecheck modules depending on them but not +-- enough to compile them. As such, indefinite units found in databases only +-- provide module interfaces (the .hi ones this time), not object code. +-- +-- To distinguish between indefinite and finite unit ids at the type level, we +-- respectively use 'IndefUnitId' and 'DefUnitId' datatypes that are basically +-- wrappers over 'UnitId'. +-- +-- Unit instantiation +-- ------------------ +-- +-- Indefinite units can be instantiated with modules from other units. The +-- instantiating units can also be instantiated themselves (if there are +-- indefinite) and so on. The 'Unit' datatype represents a unit which may have +-- been instantiated: +-- +-- data Unit = RealUnit DefUnitId +-- | VirtUnit InstantiatedUnit +-- +-- 'InstantiatedUnit' has two interesting fields: +-- +-- * instUnitInstanceOf :: IndefUnitId +-- -- ^ the indefinite unit that is instantiated +-- +-- * instUnitInsts :: [(ModuleName,(Unit,ModuleName)] +-- -- ^ a list of instantiations, where an instantiation is: +-- (module hole name, (instantiating unit, instantiating module name)) +-- +-- A 'Unit' may be indefinite or definite, it depends on whether some holes +-- remain in the instantiated unit OR in the instantiating units (recursively). +-- +-- Pretty-printing UnitId +-- ---------------------- +-- +-- GHC mostly deals with UnitIds which are some opaque strings. We could display +-- them when we pretty-print a module origin, a name, etc. But it wouldn't be +-- very friendly to the user because of the hash they usually contain. E.g. +-- +-- foo-4.18.1:thelib-XYZsomeUglyHashABC +-- +-- Instead when we want to pretty-print a 'UnitId' we query the database to +-- get the 'UnitInfo' and print something nicer to the user: +-- +-- foo-4.18.1:thelib +-- +-- We do the same for wired-in units. +-- +-- Currently (2020-04-06), we don't thread the database into every function that +-- pretty-prints a Name/Module/Unit. Instead querying the database is delayed +-- until the `SDoc` is transformed into a `Doc` using the database that is +-- active at this point in time. This is an issue because we want to be able to +-- unload units from the database and we also want to support several +-- independent databases loaded at the same time (see #14335). The alternatives +-- we have are: +-- +-- * threading the database into every function that pretty-prints a UnitId +-- for the user (directly or indirectly). +-- +-- * storing enough info to correctly display a UnitId into the UnitId +-- datatype itself. This is done in the IndefUnitId wrapper (see +-- 'UnitPprInfo' datatype) but not for every 'UnitId'. Statically defined +-- 'UnitId' for wired-in units would have empty UnitPprInfo so we need to +-- find some places to update them if we want to display wired-in UnitId +-- correctly. This leads to a solution similar to the first one above. +-- +-- Note [VirtUnit to RealUnit improvement] +-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +-- +-- Over the course of instantiating VirtUnits on the fly while typechecking an +-- indefinite library, we may end up with a fully instantiated VirtUnit. I.e. +-- one that could be compiled and installed in the database. During +-- type-checking we generate a virtual UnitId for it, say "abc". +-- +-- Now the question is: do we have a matching installed unit in the database? +-- Suppose we have one with UnitId "xyz" (provided by Cabal so we don't know how +-- to generate it). The trouble is that if both units end up being used in the +-- same type-checking session, their names won't match (e.g. "abc:M.X" vs +-- "xyz:M.X"). +-- +-- As we want them to match we just replace the virtual unit with the installed +-- one: for some reason this is called "improvement". +-- +-- There is one last niggle: improvement based on the package database means +-- that we might end up developing on a package that is not transitively +-- depended upon by the packages the user specified directly via command line +-- flags. This could lead to strange and difficult to understand bugs if those +-- instantiations are out of date. The solution is to only improve a +-- unit id if the new unit id is part of the 'preloadClosure'; i.e., the +-- closure of all the packages which were explicitly specified. + +-- Note [Representation of module/name variables] +-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +-- In our ICFP'16, we use <A> to represent module holes, and {A.T} to represent +-- name holes. This could have been represented by adding some new cases +-- to the core data types, but this would have made the existing 'moduleName' +-- and 'moduleUnit' partial, which would have required a lot of modifications +-- to existing code. +-- +-- Instead, we use a fake "hole" unit: +-- +-- <A> ===> hole:A +-- {A.T} ===> hole:A.T +-- +-- This encoding is quite convenient, but it is also a bit dangerous too, +-- because if you have a 'hole:A' you need to know if it's actually a +-- 'Module' or just a module stored in a 'Name'; these two cases must be +-- treated differently when doing substitutions. 'renameHoleModule' +-- and 'renameHoleUnit' assume they are NOT operating on a +-- 'Name'; 'NameShape' handles name substitutions exclusively. |