[backpack] Package selection

Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
author: Edward Z. Yang <ezyang@cs.stanford.edu> 2014-07-31 14:05:02 +0100
committer: Edward Z. Yang <ezyang@cs.stanford.edu> 2014-07-31 14:05:07 +0100
commit: a0ff1eb3c0745230cd70525853ca741e08a1f34d (patch)
tree: 5b85198270207395498647a760b1738f60e19c6a /docs/backpack
parent: 6fa6caad0cb4ba99b2c0b444b0583190e743dd63 (diff)
1 files changed, 412 insertions, 213 deletions
diff --git a/docs/backpack/backpack-impl.tex b/docs/backpack/backpack-impl.tex
index a921fc38df..0d4a5807ba 100644
--- a/docs/backpack/backpack-impl.tex
+++ b/docs/backpack/backpack-impl.tex
@@ -44,218 +44,222 @@
 The purpose of this document is to describe an implementation path
 for Backpack in GHC\@.
 
-We start off by outlining the current architecture of GHC, ghc-pkg and Cabal,
-which constitute the existing packaging system.  We then state what our subgoals
-are, since there are many similar sounding but different problems to solve.  Next,
-we describe the ``probably correct'' implementation plan, and finish off with
-some open design questions.  This is intended to be an evolving design document,
-so please contribute!
-
 \tableofcontents
 
-\section{Current packaging architecture}
-
-The overall architecture is described in Figure~\ref{fig:arch}.
-
-\begin{figure}[H]
-    \center{\scalebox{0.8}{\includegraphics{arch.png}}}
-\label{fig:arch}\caption{Architecture of GHC, ghc-pkg and Cabal. Green bits indicate additions from upcoming IHG work, red bits indicate additions from Backpack.  Orange indicates a Haskell library.}
-\end{figure}
-
-Here, arrows indicate dependencies from one component to another.  Color
-coding is as follows: orange components are libaries, green components
-are to be added with the IHG work, red components are to be added with
-Backpack.  (Thus, black and orange can be considered the current)
-
-\subsection{Installed package database}
-
-Starting from the bottom, we have the \emph{installed package database}
-(actually a collection of such databases), which stores information
-about what packages have been installed are thus available to be
-compiled against.  There is both a global database (for the system
-administrator) and a local database (for end users), which can be
-updated independently.  One way to think about the package database
-is as a \emph{cache of object code}.  In principle, one could compile
-any piece of code by repeatedly recompiling all of its dependencies;
-the installed package database describes when this can be bypassed.
-
-\begin{figure}[H]
-    \center{\scalebox{0.8}{\includegraphics{pkgdb.png}}}
-\label{fig:pkgdb}\caption{Anatomy of a package database.}
-\end{figure}
-
-In Figure~\ref{fig:pkgdb}, we show the structure of a package database.
-The installed package are created from a Cabal file through the process
-of dependency resolution and compilation.  In database terms, the primary key
-of a package database is the InstalledPackageId
-(Figure~\ref{fig:current-pkgid}).  This ID uniquely identifies an
-instance of an installed package.  The PackageId omits the ABI hash and
-is used to qualify linker exported symbols: the current value of this
-parameter is communicated to GHC using the \verb|-package-id| flag.
-
-In principle, packages with different PackageIds should be linkable
-together in the same compiled program, whereas packages with the same
-PackageId are not (even if they have different InstalledPackageIds).  In
-practice, GHC is currently only able to select one version of a package,
-as it clears out all old versions of the package in
-\ghcfile{compiler/main/Package.lhs}:applyPackageFlag.
-
-\begin{figure}
-    \center{\begin{tabular}{r l}
-        PackageId & package name, package version \\
-        InstalledPackageId & PackageId, ABI hash \\
-    \end{tabular}}
-\label{fig:current-pkgid}\caption{Current structure of package identifiers.}
-\end{figure}
-
-The database entry itself contains the information from the installed package ID,
-as well as information such as what dependencies it was linked against, where
-its compiled code and interface files live, its compilation flags, what modules
-it exposes, etc.  Much of this information is only relevant to Cabal; GHC
-uses a subset of the information in the package database.
-
-\subsection{GHC}
+\section{What we are trying to solve}
 
-The two programs which access the package database directly are GHC
-proper (for compilation) and ghc-pkg (which is a general purpose
-command line tool for manipulating the database.)  GHC relies on
-the package database in the following ways:
-
-\begin{itemize}
-    \item It imports the local and global package databases into
-        its runtime database, and applies modifications to the exposed
-        and trusted status of the entries via the flags \verb|-package|
-        and others (\ghcfile{compiler/main/Packages.lhs}).  The internal
-        package state can be seen at \verb|-v4| or higher.
-    \item It uses this package database to find the location of module
-        interfaces when it attempts to load the module info of an external
-        module (\ghcfile{compiler/iface/LoadIface.hs}).
-\end{itemize}
+While the current ecosystem has proved itself serviceable for many years,
+there are a number of major problems which cause significant headaches
+for many users.  Here are some of them:
 
-GHC itself performs a type checking phase, which generates an interface
-file representing the module (so that later invocations of GHC can load the type
-of a module), and then after compilation projects object files and linked archives
-for programs to use.
+\subsection{Package reinstalls are destructive}\label{sec:destructive}
 
-\paragraph{Original names} Original names are an important design pattern
-in GHC\@.
-Sometimes, a name can be exposed in an hi file even if its module
-wasn't exposed. Here is an example (compiled in package R):
+When attempting to install a new package, you might get an error like
+this:
 
 \begin{verbatim}
-module X where
-    import Internal (f)
-    g = f
-
-module Internal where
-    import Internal.Total (f)
+$ cabal install hakyll
+cabal: The following packages are likely to be broken by the reinstalls:
+pandoc-1.9.4.5
+Graphalyze-0.14.0.0
+Use --force-reinstalls if you want to install anyway.
 \end{verbatim}
 
-Then in X.hi:
+While this error message is understandable if you're really trying to
+reinstall a package, it is quite surprising that it can occur even if
+you didn't ask for any reinstalls!
+
+The underlying cause of this problem is related to an invariant Cabal
+currently enforces on a package database: there can only be one instance
+of a package for any given package name and version.  This means that it
+is not possible to install a package multiple times, compiled against
+different dependencies.  However, sometimes, reinstalling a package with
+different dependencies is the only way to fulfill version bounds of a
+package!  For example: say we have three packages \pname{a}, \pname{b}
+and \pname{c}.  \pname{b-1.0} is the only version of \pname{b}
+available, and it has been installed and compiled against \pname{c-1.0}.
+Later, the user installs an updated version \pname{c-1.1} and then
+attempts to install \pname{a}, which depends on the specific versions
+\pname{c-1.1} and \pname{b-1.0}.  We \emph{cannot} use the already
+installed version of \pname{b-1.0}, which depends on the wrong version
+of \pname{c}, so our only choice is to reinstall \pname{b-1.0} compiled
+against \pname{c-1.1}.  This will break any packages, e.g. \pname{d},
+which were built against the old version of \pname{b-1.0}.
+
+Our solution to this problem is to \emph{abolish} destructive package
+installs, and allow a package to be installed multiple times with the same
+package name and version.  However, allowing this poses some interesting
+user interface problems, since package IDs are now no longer unambiguous
+identifiers.
+
+\subsection{Version bounds are often over/under-constrained}
+
+When attempting to install a new package, Cabal might fail in this way:
 
 \begin{verbatim}
-g = <R.id, Internal.Total, f> (this is the original name)
+$ cabal install hledger-0.18
+Resolving dependencies...
+cabal: Could not resolve dependencies:
+# pile of output
 \end{verbatim}
 
-(The reason we refer to the package as R.id is because it's the
-full package ID, and not just R).
-
-\subsection{hs-boot}
-
-\verb|hs-boot| is a special mechanism used to support recursive linking
-of modules within a package, today.  Suppose I have a recursive module
-dependency between modules and A and B. I break one of\ldots
+There are a number of possible reasons why this could occur, but usually
+it's because some of the packages involved have over-constrained version
+bounds, which are resulting in an unsatisfiable set of constraints (or,
+at least, Cabal gave up backtracking before it found a solution.)  To
+add insult to injury, most of the time the bound is nonsense and removing
+it would result in a working compilation.  In fact, this situation is
+so common that Cabal has a flag \verb|--allow-newer| which lets you
+override the package upper bounds.
+
+However, the flip-side is when Cabal finds a satisfying set, but your
+compilation fails with a type error.  Here, you had an under-constrained
+set of version bounds which didn't actually reflect the compatible
+versions of a package, and Cabal picked a version of the package which
+was incompatible.
+
+Our solution to this problem is to use signatures instead of version
+numbers as the primary mechanism by which compatibility is determined:
+e.g., if it typechecks, it's a valid choice.  Version numbers can still
+be used to reflect semantic changes not seen in the types (in
+particular, ruling out buggy versions of a package is a useful
+operation), but these bounds are empirical observations and can be
+collected after-the-fact.
+
+\subsection{It is difficult to support multiple implementations of a type}
+
+This problem is perhaps best described by referring to a particular
+instance of it in Haskell's ecosystem: the \texttt{String} data type.  Haskell,
+by default, implements strings as linked lists of integers (representing
+characters).  Many libraries use \texttt{String}, because it's very
+convenient to program against.  However, this representation is also
+very \emph{slow}, so there are alternative implementations such as
+\texttt{Text} which implement efficient, UTF-8 encoded packed byte
+arrays.
+
+Now, suppose you are writing a library and you don't care if the user of
+your library is using \texttt{String} or \texttt{Text}.  However, you
+don't want to rewrite your library twice to support both data types:
+rather, you'd like to rely on some \emph{common interface} between the
+two types, and let the user instantiate the implementation.  The only
+way to do this in today's Haskell is using type classes; however, this
+necessitates rewriting all type signatures from a nice \texttt{String ->
+String} to \texttt{StringLike s => s -> s}.  The result is less readable,
+requires a large number of trivial edits to type signatures, and might
+even be less efficient, if GHC does not appropriately specialize your code
+written in this style.
+
+Our solution to this problem is to introduce a new mechanism of
+pluggability: module holes, which let us use types and functions from a
+module \texttt{Data.String} as before, but defer choosing \emph{what}
+module should be used in the implementation to some later point (or
+instantiate the code multiple times with different choices.)
+
+\subsection{Fast moving APIs are difficult to develop/develop against}
+
+Most packages that are uploaded to Hackage have package authors who pay
+some amount of attention to backwards compatibility and avoid making egregious
+breaking changes.  However, a package like the \verb|ghc-api| follows a
+very different model: the library is treated by its developers as an
+internal component of an application (GHC), and is frequently refactored
+in a way that changes its outwards facing interface.
+
+Arguably, an application like GHC should design a stable API and
+maintain backwards compatibility against it.  However, this is a lot of
+work (including refactoring) which is only being done slowly, and in the
+meantime, the dump of all the modules gives users the functionality they
+want (even if it keeps breaking every version.)
+
+One could say that the core problem is there is no way for users to
+easily communicate to GHC authors what parts of the API they rely on.  A
+developer of GHC who is refactoring an interface will often rely on the
+typechecker to let them know which parts of the codebase they need to
+follow and update, and often could say precisely how to update code to
+use the new interface.  User applications, which live out of tree, don't
+receive this level of attention.
+
+Our solution is to make it possible to typecheck the GHC API against a
+signature.  Important consumers can publish what subsets of the GHC API
+they rely on, and developers of GHC, as part of their normal build
+process, type-check against these signatures.  If the signature breaks,
+a developer can either do the refactoring differently to avoid the
+compatibility-break, or document how to update code to use the new API\@.
+
+\section{Backpack in a nutshell}
+
+For a more in-depth tutorial about Backpack's features, check out Section 2
+of the original Backpack paper.  In this section, we briefly review the
+most important points of Backpack's design.
+
+\paragraph{Thinning and renaming at the module level}
+A user can specify a build dependency which only exposes a subset of
+modules (possibly under different names.)  By itself, it's a way for the
+user to resolve ambiguous module imports at the package level, without
+having to use the \texttt{PackageImports} syntax extension.
+
+\paragraph{Holes (abstract module definitions)}  The core
+component of Backpack's support for \emph{separate modular development}
+is the ability to specify abstract module bindings, or holes, which give
+users of the module an obligation to provide an implementation which
+fulfills the signature of the hole.  In this example:
 
-(ToDo: describe how hs-boot mechanism works)
+\begin{verbatim}
+package p where
+    A :: [ ... ]
+    B = [ import A; ... ]
+\end{verbatim}
 
-\subsection{Cabal}
+\verb|p| is an \emph{indefinite package}, which cannot be compiled until
+an implementation of \m{A} is provided.  However, we can still type check
+\m{B} without any implementation of \m{A}, by type checking it against
+the signature.  Holes can be put into signature packages and included
+(depended upon) by other packages to reuse definitions of signatures.
 
-Cabal is the build system for GHC, we can think of it as parsing a Cabal
-file describing a package, and then making (possibly multiple)
-invocations to GHC to perform the appropriate compilation.  What
-information does Cabal pass onto GHC\@?  One can get an idea for this by
-looking at a prototypical command line that Cabal invokes GHC with:
+\paragraph{Filling in holes with an implementation}
+A hole in an indefinite package can be instantiated in a \emph{mix-in}
+style: namely, if a signature and an implementation have the same name,
+they are linked together:
 
 \begin{verbatim}
-ghc --make
-    -package-name myapp-0.1
-    -hide-all-packages
-    -package-id containers-0.9-ABCD
-    Module1 Module2
+package q where
+    A = [ ... ]
+    include p -- has signature A
 \end{verbatim}
 
-There are a few things going on here.  First, Cabal has to tell GHC
-what the name of the package it's compiling (otherwise, GHC can't appropriately
-generate symbols that other code referring to this package might generate).
-There are also a number of commands which configure its in-memory view of
-the package database (GHC's view of the package database may not directly
-correspond to what is on disk).  There's also an optimization here: in principle,
-GHC can compile each module one-by-one, but instead we use the \verb|--make| flag
-because this allows GHC to reuse some data structures, resulting in a nontrivial
-speedup.
-
-(ToDo: describe cabal-install/sandbox)
+Renaming is often useful for giving a module (or a hole) a new name, so that
+a signature and implementation have the same name and are linked together.
+An indefinite package can be instantiated multiple times with different
+implementations: the \emph{applicativity} of Backpack means that if
+a package is instantiated separately with the same module, the results
+are type equal:
 
-\section{Goals}
+\begin{verbatim}
+package q' where
+    A = [ ... ]
+    include p (A, B as B1)
+    include p (A, B as B2)
+    -- B1 and B2 are equivalent
+\end{verbatim}
 
-Here are some of the high-level goals which motivate our improvements to
-the module system.
+\paragraph{Combining signatures together}
+Unlike implementations, it's valid for multiple signatures with the
+same name to be in scope.
 
-\begin{itemize}
-    \item Solve \emph{Cabal hell}, a situation which occurs when conflicting
-        version ranges on a wide range of dependencies leads to a situation
-        where it is impossible to satisfy the constraints.  We're seeking
-        to solve this problem in two ways: first, we want to support
-        multiple instances of containers-2.9 in the database which are
-        compiled with different dependencies (and even link them
-        together), and second, we want to abolish (often inaccurate)
-        version ranges and move to a regime where packages depend on
-        signatures.  Version ranges may still be used to indicate important
-        semantic changes (e.g., bugs or bad behavior on the part of package
-        authors), but they should no longer drive dependency resolution
-        and often only be recorded after the fact.
-
-    \item Support \emph{hermetic builds with sharing}.  A hermetic build
-        system is one which simulates rebuilding every package whenever
-        it is built; on the other hand, actually rebuilding every time
-        is extremely inefficient (but what happens in practice with
-        Cabal sandboxes).  We seek to solve this problem with the IHG work,
-        by allowing multiple instances of a package in the database, where
-        the only difference is compilation parameters.  We don't care
-        about being able to link these together in a single program.
-
-    \item Support \emph{module-level pluggability} as an alternative to
-        existing (poor) usage of type classes.  The canonical example are
-        strings, where a library might want to either use the convenient
-        but inefficient native strings, or the efficient packed Text data
-        type, but would really like to avoid having to say \verb|StringLike s => ...|
-        in all of their type signatures.  While we do not plan on supporting
-        separate compilation, Cabal should understand how to automatically
-        recompile these ``indefinite'' packages when they are instantiated
-        with a new plugin.
-
-    \item Support \emph{separate modular development}, where a library and
-        an application using the library can be developed and type-checked
-        separately, intermediated by an interface.  The canonical example
-        here is the \verb|ghc-api|, which is a large, complex API that
-        the library writers (GHC developers) frequently change---the ability
-        for downstream projects to say, ``Here is the API I'm relying on''
-        without requiring these projects to actually be built would greatly
-        assist in keeping the API stable. This is applicable in
-        the pluggability example as well, where we want to ensure that all
-        of the $M \times N$ configurations of libraries versus applications
-        type check, by only running the typechecker $M + N$ times.  A closely
-        related concern is related toolchain support for extracting a signature
-        from an existing implementation, as current Haskell culture is averse
-        to explicitly writing separate signature files.
-
-    \item Subsume existing support for \emph{mutually recursive modules},
-        without the double vision problem.
-\end{itemize}
+\begin{verbatim}
+package a-sig where
+    A :: [ ... ]
+package a-sig2 where
+    A :: [ ... ]
+package q where
+    include a-sig
+    include a-sig2
+    B = [ import A; ... ]
+\end{verbatim}
 
-A \emph{non-goal} is to allow users to upgrade upstream libraries
-without recompiling downstream. This is an ABI concern and we're not
-going to worry about it.
+These signatures \emph{merge} together, providing the union of the
+functionality (assuming the types of individual entities are
+compatible.)  Backpack has a very simple merging algorithm: types must
+match exactly to be compatible (\emph{width} subtyping).
 
 \clearpage
 
@@ -718,38 +722,233 @@ but we must record this tree \emph{even} when our package has no holes.
 %   As a final example, the full module
 %   identity of \m{B1} in Figure~\ref{fig:granularity} may actually be $\pname{p-0.9(q-1.0[p-0.9]:A1)}$:\m{B}.
 
-
-\subsection{Implementation}
-
-In GHC's current packaging system, a single package compiles into a
-single entry in the installed package database, indexed by the package
-key.  This property is preserved by package-level granularity, as we
-assign the same package key to all modules.  Package keys provide an
-easy mechanism for sharing to occur: when an indefinite package is fully
-instantiated, we can check if we already have its package key installed
-in the installed package database.  (At the end of this section, we'll
-briefly discuss some of the problems actually implementing Paper Backpack.)
-It is also important to note that we are \emph{willing to duplicate code};
-processes like this already happen in other parts of the compiler
-(such as inlining.)
-
-\paragraph{Relaxing package selection restrictions}  As mentioned
-previously, GHC is unable to select multiple packages with the same
-package name (but different package keys).  This restriction needs to be
-lifted.  We should add a new flag \verb|-package-key|.  GHC also knows
-about version numbers and will mask out old versions of a library when
-you make another version visible; this behavior needs to be modified.
-
 \paragraph{Linker symbols} As we increase the amount of information in
 PackageId, it's important to be careful about the length of these IDs,
 as they are used for exported linker symbols (e.g.
 \verb|base_TextziReadziLex_zdwvalDig_info|).  Very long symbol names
 hurt compile and link time, object file sizes, GHCi startup time,
-dynamic linking, and make gdb hard to use.  As such, the current plan is
-to do away with full package names and versions, and instead use just a
-base-62 encoded hash, perhaps with the first four characters of the package
+dynamic linking, and make gdb hard to use.  As such, we are going to
+do away with full package names and versions and instead use just a
+base-62 encoded hash, with the first five characters of the package
 name for user-friendliness.
 
+\subsection{Package selection}
+
+When I fire up \texttt{ghci} with no arguments, GHC somehow creates
+out of thin air some consistent set of packages, whose modules I can
+load using \texttt{:m}.  This functionality is extremely handy for
+exploratory work, but actually GHC has to work quite hard in order
+to generate this set of packages, the contents of which are all
+dumped into a global namespace.  For example, GHC doesn't have access
+to Cabal's dependency solver, nor does it know \emph{which} packages
+the user is going to ask for, so it can't just run a constraint solver,
+get a set of consistent packages to offer and provide them to the user.\footnote{Some might
+argue that depending on a global environment in this fashion is wrong, because
+when you perform a build in this way, you have absolutely no idea what
+dependencies you actually ended up using.  But the fact remains that for
+end users, this functionality is very useful.}
+
+To make matters worse, while in the current design of the package database,
+a package is uniquely identified by its package name and version, in
+the Backpack design, it is \emph{mandatory} that we support multiple
+packages installed in the database with the same package name and version,
+and this can result in complications in the user model.  This further
+complicates GHC's default package selection algorithm.
+
+In this section, we describe how the current algorithm operates (including
+what invariants it tries to uphold and where it goes wrong), and how
+to replace the algorithm to handle generalization to
+multiple instances in the package database.  We'll also try to tease
+apart the relationship between package keys and installed package IDs in
+the database.
+
+\paragraph{The current algorithm} Abstractly, GHC's current package
+selection algorithm operates as follows.  For every package name, select
+the package with the latest version (recall that this is unique) which
+is also \emph{valid}.  A package is valid if:
+
+\begin{itemize}
+    \item It exists in the package database,
+    \item All of its dependencies are valid,
+    \item It is not shadowed by a package with the same package ID\footnote{Recall that currently, a package ID uniquely identifies a package in the package database} in
+        another package database (unless it is in the transitive closure
+        of a package named by \texttt{-package-id}), and
+    \item It is not ignored with \texttt{-ignore-package}.
+\end{itemize}
+
+Package validity is probably the minimal criterion for GHC to ensure
+that it can actually \emph{use} a package.  If the package is missing,
+GHC can't find the interface files or object code associated with the
+package.  Ignoring packages is a way of pretending that a package is
+missing from the database.
+
+Package validity is also a very weak criterion.  Another criterion we
+might hope holds is \emph{consistency}: when we consider the transitive
+closure of all selected packages, for any given package name, there
+should only be one instance providing that package.  It is trivially
+easy to break this property: suppose that I have packages \pname{a-1.0},
+\pname{b-1.0} compiled against \pname{a-1.0}, and \pname{a-1.1}.  GHC
+will happily load \pname{b-1.0} and \pname{a-1.1} together in the same
+interactive session (they are both valid and the latest versions), even
+though \pname{b-1.0}'s dependency is inconsistent with another package
+that was loaded.  The user will notice if they attempt to treat entities
+from \pname{a} reexported by \pname{b-1.0} and entities from
+\pname{a-1.1} as type equal.  Here is one user who had this problem:
+\url{http://stackoverflow.com/questions/12576817/}.  In some cases, the
+problem is easy to work around (there is only one offending package
+which just needs to be hidden), but if the divergence is deep in two
+separate dependency hierarchies, it is often easier to just blow away
+the package database and try again.
+
+Perversely, destructive reinstallation helps prevent these sorts of
+inconsistent databases.  While inconsistencies can arise when multiple
+versions of a package are installed, multiple versions will frequently
+lead to the necessity of reinstalls.  In the previous example, if a user
+attempts to Cabal install a package which depends on \pname{a-1.1} and
+\pname{b-1.0}, Cabal's dependency solver will propose reinstalling
+\pname{b-1.0} compiled against \pname{a-1.1}, in order to get a
+consistent set of dependencies.  If this reinstall is accepted, we
+invalidate all packages in the database which were previously installed
+against \pname{b-1.0} and \pname{a-1.0}, excluding them from GHC's
+selection process and making it more likely that the user will see a
+consistent view of the database.
+
+\paragraph{Enforcing consistent dependencies}  From the user's
+perspective, it would be desirable if GHC never loaded a set of packages
+whose dependencies were inconsistent.
+There are two ways we can go
+about doing this.  First, we can improve GHC's logic so that it doesn't
+pick an inconsistent set.  However, as a point of design, we'd like to
+keep whatever resolution GHC does as simple as possible (in an ideal
+world, we'd skip the validity checks entirely, but they ended up being
+necessary to prevent a broken database from stopping GHC from starting up
+at all). In particular, GHC should \emph{not} learn how to do
+backtracking constraint solving: that's in the domain of Cabal.  Second,
+we can modify the logic of Cabal to enforce that the package database is
+always kept in a consistent state, similar to the consistency check
+Cabal applies to sandboxes, where it refuses to install a package to a
+sandbox if the resulting dependencies would not be consistent.
+
+The second alternative is appealing, but Cabal sandboxes are currently
+designed for small, self-contained single projects, as opposed to the
+global ``universe'' that a default environment is intended to provide.
+For example, with a Cabal sandbox environment, it's impossible to
+upgrade a dependency to a new version without blowing away the sandbox
+and starting again.  To support upgrades, Cabal needs to do some work:
+when a new version is put in the default set, all of the
+reverse-dependencies of the old version are now inconsistent.  Cabal
+should offer to hide these packages or reinstall them compiled against
+the latest version.  Cabal should also be able to snapshot the older
+environment which captures the state of the universe prior to the
+installation, in case the user wants to revert back.
+
+\paragraph{Modifying the default environment}  Currently, after GHC
+calculates the default package environment, a user may further modify
+the environment by passing package flags to GHC, which can be used to
+explicitly hide or expose packages.  How do these flags interact with
+our Cabal-managed environments?  Hiding packages is simple enough,
+but exposing packages is a bit dicier.  If a user asks for a different
+version of a package than in the default set, it will probably be
+inconsistent with the rest of the dependencies.  Cabal would have to
+be consulted to figure out a maximal set of consistent packages with
+the constraints given.
+
+However, this use-case is rare.  Usually, a user exposes a package not
+because they want a specific version, but because the package is simply
+hidden by default (\pname{ghc-api} is the canonical
+example, since it dumps a lot of modules in the top level namespace).
+If we distinguish packages which are consistent but hidden, their
+loads can be handled appropriately.
+
+\paragraph{Consistency in Backpack} We have stated as an implicit
+assumption that if we have both \pname{foo-1.0} and \pname{foo-1.1}
+available, only one should be loaded at a time.  What are the
+consequences if both of these packages are loaded at the same time?  An
+import of \m{Data.Foo} provided by both packages would be ambiguous and
+the user might find some type equalities they expect to hold would not.
+However, the result is not \emph{unsound}: indeed, we might imagine a
+user purposely wanting two different versions of a library in the same
+program, renaming the modules they provided so that they could be
+referred to unambiguously.  As another example, suppose that we have an
+indefinite package with a hole that is instantiated multiple times.  In
+this case, a user absolutely may want to refer to both instantiations,
+once again renaming modules so that they have unique names.
+
+There are two consequences of this.  First, while the default package
+set may enforce consistency, a user should still be able to explicitly
+ask for a package instance, renamed so that its modules don't conflict,
+and then use it in their program.  Second, instantiated indefinite packages
+should \emph{never} be placed in the default set, since it's impossible
+to know which instantiation is the one the user prefers.  A definite package
+can reexport an instantiated module under an unambiguous name if the user
+so pleases.
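+
+As a minimal sketch of these two rules (not part of any existing tool),
+the following Python fragment models a package set.  The names
+\texttt{Pkg}, \texttt{eligible\_for\_default} and \texttt{is\_consistent}
+are hypothetical, and ``consistent'' is simplified here to mean ``at most
+one version per package name''.

```python
from typing import Dict, List, NamedTuple, Set


class Pkg(NamedTuple):
    """Hypothetical, simplified view of a package instance."""
    name: str            # e.g. "foo"
    version: str         # e.g. "1.0"
    instantiated: bool   # True for an instantiated indefinite package


def eligible_for_default(pkgs: List[Pkg]) -> List[Pkg]:
    """Instantiated indefinite packages never enter the default set."""
    return [p for p in pkgs if not p.instantiated]


def is_consistent(pkgs: List[Pkg]) -> bool:
    """Simplifying assumption: consistent = one version per package name."""
    versions: Dict[str, Set[str]] = {}
    for p in pkgs:
        versions.setdefault(p.name, set()).add(p.version)
    return all(len(vs) == 1 for vs in versions.values())
```

+A set containing both \pname{foo-1.0} and \pname{foo-1.1} fails the
+check, so it could not serve as a default set; a user could still ask
+for the second instance explicitly, with its modules renamed.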
+
+\paragraph{Shadowing, installed package IDs, ABI hashes and package
+keys} Shadowing plays an important role for maintaining the soundness of
+compilation; call this the \emph{compatibility} of the package set.  The
+problem it addresses is when there are two distinct implementations of a
+module, but because their package IDs (or package keys, in the new world
+order) are the same, they are considered type equal.  It is absolutely
+wrong for a single program to include both implementations
+simultaneously (the symbols would conflict and GHC would incorrectly
+conclude things were type equal when they're not), so \emph{shadowing}'s
+job is to ensure that only one instance is picked, and all the other
+instances are considered invalid (along with their reverse-dependencies, etc.).
+Recall that in current GHC, within a package database, a package
+instance is uniquely identified by its package ID\@; thus, shadowing
+only needs to take place between package databases.  An interesting
+corner case is when the same package ID occurs in both databases, but
+the installed package IDs are the \emph{same}.  Because the installed
+package ID is currently simply an ABI hash, we skip shadowing, because
+the packages are---in principle---interchangeable.
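+
+The cross-database shadowing rule can be sketched as follows.  This is
+an illustrative model only, not GHC or \texttt{ghc-pkg} code: it assumes
+the databases form a stack in which the topmost instance of a package ID
+wins, and that every instance records the installed package IDs it
+depends on.  \texttt{Instance} and \texttt{resolve} are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Set


class Instance(NamedTuple):
    """Hypothetical record for one entry in a package database."""
    package_id: str       # e.g. "foo-1.0"
    installed_id: str     # e.g. "foo-1.0-hash1"
    depends: List[str]    # installed package IDs of direct dependencies


def resolve(stack: List[List[Instance]]) -> Dict[str, Instance]:
    """Return the valid instances, keyed by package ID.

    stack[0] is the bottom database; later databases take precedence.
    """
    chosen: Dict[str, Instance] = {}
    seen: Set[str] = set()
    for db in stack:
        for inst in db:
            seen.add(inst.installed_id)
            chosen[inst.package_id] = inst   # topmost instance wins

    # An instance is shadowed only if a *different* installed package ID
    # won; identical installed IDs are treated as interchangeable.
    invalid = seen - {inst.installed_id for inst in chosen.values()}

    # Transitively invalidate reverse dependencies of shadowed instances.
    changed = True
    while changed:
        changed = False
        for package_id, inst in list(chosen.items()):
            if any(dep in invalid for dep in inst.depends):
                invalid.add(inst.installed_id)
                del chosen[package_id]
                changed = True
    return chosen
```

+In this model, layering a database containing \texttt{foo-1.0-hash2} over
+one containing \texttt{foo-1.0-hash1} invalidates everything built
+against \texttt{foo-1.0-hash1}, whereas layering a second copy of
+\texttt{foo-1.0-hash1} changes nothing.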
+
+There are currently a number of proposed changes to this state of affairs:
+
+\begin{itemize}
+    \item Change installed package IDs to not be based on ABI hashes.
+        ABI hashes have a number of disadvantages as identifiers for
+        packages in the database.  First, they cannot be computed until
+        after compilation, which gave the multi-instance GSoC project of a
+        few years ago some headaches.  Second, it's not really true that
+        programs with identical ABI hashes are interchangeable: a new
+        package may be ABI compatible but have different semantics.
+        Thus, ABI-hash-based installed package IDs are a poor unique
+        identifier for packages in the package database.  However,
+        because GHC does not give ABI stability guarantees, after this
+        change it would no longer be possible to assume that packages
+        with the same installed package ID are ABI compatible.
+
+    \item Relax the uniqueness constraint on package IDs.  There are
+        actually two things that could be done here.  First, since we
+        have augmented package IDs with dependency resolution
+        information to form package keys, we could simply state that
+        package keys uniquely identify a package in a database.
+        Shadowing rules can be implemented in the same way as before, by
+        preferring the instance topmost on the stack.  Second, we could
+        also allow \emph{same-database} shadowing: that is, not even
+        package keys are guaranteed to be unique in a database; instead,
+        installed package IDs are the sole unique identifier of a
+        package.  The motivation behind this architecture is to treat
+        the package database more like a cache rather than a database:
+        information about shadowing is separately maintained and used.
+\end{itemize}
+
+Edward thinks same-database shadowing is wrong.  What same-database
+shadowing implies is that there are multiple incompatible ``package
+hierarchies'' (possibly with a shared root), one of which shadows the
+other hierarchy.  It is now absolutely essential to somehow identify
+which hierarchy should be visible (the rest being shadowed).  It seems
+better to explicitly reify this hierarchy as a hierarchy of
+package databases.  For example, instead of having (installed package
+IDs) \texttt{foo-1.0-hash1} and \texttt{foo-1.0-hash2} in the same
+database, have a separate database for each, along with the respective
+dependencies which are built against that package.  (Notice that all of
+these dependencies are incompatible with one another.)  Furthermore,
+because of the precedence of shadowing, we can store one of these
+installed package IDs in the primary database, and then layer the second
+on top of it: as it takes precedence, it automatically invalidates all
+of the packages depending on \texttt{foo-1.0-hash1}, while keeping
+packages which are otherwise compatible.
+
 \section{Shapeless Backpack}\label{sec:simplifying-backpack}
 
 Backpack as currently defined always requires a \emph{shaping} pass,