\documentclass{article}
\usepackage{pifont}
\usepackage{graphicx} %[pdftex] OR [dvips]
\usepackage{fullpage}
\usepackage{wrapfig}
\usepackage{float}
\usepackage{titling}
\usepackage{hyperref}
\usepackage{tikz}
\usepackage{color}
\usepackage{footnote}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usetikzlibrary{arrows}
\usetikzlibrary{positioning}
\setlength{\droptitle}{-6em}
\input{commands-new-new.tex}
\newcommand{\nuAA}{\nu_\mathit{AA}}
\newcommand{\nuAB}{\nu_\mathit{AB}}
\newcommand{\nuGA}{\nu_\mathit{GA}}
\newcommand{\nuGB}{\nu_\mathit{GB}}
\newcommand{\betaPL}{\beta_\mathit{PL}}
\newcommand{\betaAA}{\beta_\mathit{AA}}
\newcommand{\betaAS}{\beta_\mathit{AS}}
\newcommand{\thinandalso}{\hspace{.45cm}}
\newcommand{\thinnerandalso}{\hspace{.38cm}}
\input{commands-rebindings.tex}
\newcommand{\var}[1]{\textsf{#1}}
\newcommand{\ghcfile}[1]{\textsl{#1}}
\title{Implementing Backpack}
\begin{document}

\maketitle

The purpose of this document is to describe an implementation path for Backpack in GHC\@.

\tableofcontents

\section{What we are trying to solve}

While the current ecosystem has proved itself serviceable for many years, there are a number of major problems which cause significant headaches for many users. Here are some of them:

\subsection{Package reinstalls are destructive}\label{sec:destructive}

When attempting to install a new package, you might get an error like this:

\begin{verbatim}
$ cabal install hakyll
cabal: The following packages are likely to be broken by the reinstalls:
pandoc-1.9.4.5
Graphalyze-0.14.0.0
Use --force-reinstalls if you want to install anyway.
\end{verbatim}

While this error message is understandable if you're really trying to reinstall a package, it is quite surprising that it can occur even if you didn't ask for any reinstalls!

The underlying cause of this problem is related to an invariant Cabal currently enforces on a package database: there can only be one instance of a package for any given package name and version. This means that it is not possible to install a package multiple times, compiled against different dependencies. However, sometimes, reinstalling a package with different dependencies is the only way to fulfill the version bounds of a package! For example: say we have three packages \pname{a}, \pname{b} and \pname{c}. \pname{b-1.0} is the only version of \pname{b} available, and it has been installed and compiled against \pname{c-1.0}. Later, the user installs an updated version \pname{c-1.1} and then attempts to install \pname{a}, which depends on the specific versions \pname{c-1.1} and \pname{b-1.0}. We \emph{cannot} use the already installed version of \pname{b-1.0}, which depends on the wrong version of \pname{c}, so our only choice is to reinstall \pname{b-1.0} compiled against \pname{c-1.1}. This will break any packages, e.g. \pname{d}, which were built against the old instance of \pname{b-1.0}.

Our solution to this problem is to \emph{abolish} destructive package installs, and allow a package to be installed multiple times with the same package name and version. However, allowing this poses some interesting user interface problems, since package IDs are no longer unambiguous identifiers.

\subsection{Version bounds are often over/under-constrained}

When attempting to install a new package, Cabal might fail in this way:

\begin{verbatim}
$ cabal install hledger-0.18
Resolving dependencies...
cabal: Could not resolve dependencies:
# pile of output
\end{verbatim}

There are a number of possible reasons why this could occur, but usually it's because some of the packages involved have over-constrained version bounds, which result in an unsatisfiable set of constraints (or, at least, Cabal gave up backtracking before it found a solution). To add insult to injury, most of the time the bound is nonsense and removing it would result in a working compilation. In fact, this situation is so common that Cabal has a flag \verb|--allow-newer| which lets you override the package upper bounds.

However, the flip-side is when Cabal finds a satisfying set, but your compilation fails with a type error. Here, you had an under-constrained set of version bounds which didn't actually reflect the compatible versions of a package, and Cabal picked a version of the package which was incompatible.

Our solution to this problem is to use signatures instead of version numbers as the primary mechanism by which compatibility is determined: i.e., if it typechecks, it's a valid choice. Version numbers can still be used to reflect semantic changes not seen in the types (in particular, ruling out buggy versions of a package is a useful operation), but these bounds are empirical observations and can be collected after-the-fact.

\subsection{It is difficult to support multiple implementations of a type}

This problem is perhaps best described by referring to a particular instance of it in Haskell's ecosystem: the \texttt{String} data type. Haskell, by default, implements strings as linked lists of characters. Many libraries use \texttt{String}, because it's very convenient to program against. However, this representation is also very \emph{slow}, so there are alternative implementations such as \texttt{Text}, which uses an efficient packed representation.

Now, suppose you are writing a library and you don't care if the user of your library is using \texttt{String} or \texttt{Text}. However, you don't want to write your library twice to support both data types: rather, you'd like to rely on some \emph{common interface} between the two types, and let the user instantiate the implementation. The only way to do this in today's Haskell is using type classes; however, this necessitates rewriting all type signatures from a nice \texttt{String -> String} to \texttt{StringLike s => s -> s}. The result is less readable, requires a large number of trivial edits to type signatures, and might even be less efficient, if GHC does not appropriately specialize your code written in this style.

Our solution to this problem is to introduce a new mechanism of pluggability: module holes, which let us use types and functions from a module \texttt{Data.String} as before, but defer choosing \emph{what} module should be used in the implementation to some later point (or instantiate the code multiple times with different choices.)

\subsection{Fast moving APIs are difficult to develop/develop against}

Most packages that are uploaded to Hackage have authors who pay some amount of attention to backwards compatibility and avoid making egregious breaking changes. However, a package like \verb|ghc-api| follows a very different model: the library is treated by its developers as an internal component of an application (GHC), and is frequently refactored in a way that changes its outward-facing interface.
Arguably, an application like GHC should design a stable API and maintain backwards compatibility against it. However, this is a lot of work (including refactoring) which is only being done slowly, and in the meantime, the dump of all the modules gives users the functionality they want (even if it keeps breaking every version.)

One could say that the core problem is that there is no way for users to easily communicate to GHC authors what parts of the API they rely on. A developer of GHC who is refactoring an interface will often rely on the typechecker to let them know which parts of the codebase they need to follow and update, and often could say precisely how to update code to use the new interface. User applications, which live out of tree, don't receive this level of attention.

Our solution is to make it possible to typecheck the GHC API against a signature. Important consumers can publish what subsets of the GHC API they rely on, and developers of GHC, as part of their normal build process, type-check against these signatures. If the signature breaks, a developer can either do the refactoring differently to avoid the compatibility break, or document how to update code to use the new API\@.

\section{Backpack in a nutshell}

For a more in-depth tutorial about Backpack's features, check out Section 2 of the original Backpack paper. In this section, we briefly review the most important points of Backpack's design.

\paragraph{Thinning and renaming at the module level} A user can specify a build dependency which only exposes a subset of modules (possibly under different names.) By itself, it's a way for the user to resolve ambiguous module imports at the package level, without having to use the \texttt{PackageImports} syntax extension.

\paragraph{Holes (abstract module definitions)} The core component of Backpack's support for \emph{separate modular development} is the ability to specify abstract module bindings, or holes, which give users of the module an obligation to provide an implementation which fulfills the signature of the hole. In this example:

\begin{verbatim}
package p where
    A :: [ ... ]
    B = [ import A; ... ]
\end{verbatim}

\verb|p| is an \emph{indefinite package}, which cannot be compiled until an implementation of \m{A} is provided. However, we can still type check \m{B} without any implementation of \m{A}, by type checking it against the signature. Holes can be put into signature packages and included (depended upon) by other packages to reuse definitions of signatures.

\paragraph{Filling in holes with an implementation} A hole in an indefinite package can be instantiated in a \emph{mix-in} style: namely, if a signature and an implementation have the same name, they are linked together:

\begin{verbatim}
package q where
    A = [ ... ]
    include p -- has signature A
\end{verbatim}

Renaming is often used to give a module (or a hole) the same name as a signature (or implementation), so that the two are linked together. An indefinite package can be instantiated multiple times with different implementations: the \emph{applicativity} of Backpack means that if a package is instantiated separately with the same module, the results are type equal:

\begin{verbatim}
package q' where
    A = [ ... ]
    include p (A, B as B1)
    include p (A, B as B2)
    -- B1 and B2 are equivalent
\end{verbatim}

\paragraph{Combining signatures together} Unlike implementations, it's valid for multiple signatures with the same name to be in scope.

\begin{verbatim}
package a-sig where
    A :: [ ...
    ]
package a-sig2 where
    A :: [ ... ]
package q where
    include a-sig
    include a-sig2
    B = [ import A; ... ]
\end{verbatim}

These signatures \emph{merge} together, providing the union of the functionality (assuming the types of individual entities are compatible.) Backpack has a very simple merging algorithm: types must match exactly to be compatible (\emph{width} subtyping).

\clearpage

\section{Module and package identity}

\begin{figure}[H]
\begin{tabular}{p{0.45\textwidth} p{0.45\textwidth}}
\begin{verbatim}
package p where
    A :: [ data X ]
    P = [ import A; data Y = Y X ]
package q where
    A1 = [ data X = X1 ]
    A2 = [ data X = X2 ]
    include p (A as A1, P as P1)
    include p (A as A2, P as P2)
\end{verbatim}
&
\begin{verbatim}
package p where
    A :: [ data X ]
    P = [ data T = T ] -- no A import!
package q where
    A1 = [ data X = X1 ]
    A2 = [ data X = X2 ]
    include p (A as A1, P as P1)
    include p (A as A2, P as P2)
\end{verbatim} \\
(a) Type equality must consider holes\ldots &
(b) \ldots but how do we track dependencies? \\
\end{tabular}
\caption{Two similar examples}\label{fig:simple-ex}
\end{figure}

One of the central questions one encounters when type checking Haskell code is: \emph{when are two types equal}? In ordinary Haskell, the answer is simple: ``They are equal if their \emph{original names} (i.e., where they were originally defined) are the same.'' However, in Backpack, the situation is murkier due to the presence of \emph{holes}. Consider the pair of examples in Figure~\ref{fig:simple-ex}. In Figure~\ref{fig:simple-ex}a, the types \m{P1}.\verb|Y| and \m{P2}.\verb|Y| should not be considered equal, even though na\"\i vely their original names are \pname{p}:\m{P}.\verb|Y|, since their arguments are different \verb|X|'s! On the other hand, if we instantiated \pname{p} twice with the same \m{A} (e.g., change the second include to \texttt{include p (A as A1, P as P2)}), we might consider the two resulting \verb|Y|'s equal, an \emph{applicative} semantics of instantiation. In Figure~\ref{fig:simple-ex}b, we see that even though \m{A} was instantiated differently, we might reasonably wonder if \texttt{T} should still be considered the same, since it has no dependence on the actual choice of \m{A}.

In fact, there are quite a few different choices that can be made here. Figures~\ref{fig:applicativity}~and~\ref{fig:granularity} summarize the various choices on two axes: the granularity of applicativity (under what circumstances do we consider two types equal?) and the granularity of dependency (under what circumstances do we consider two types not equal?). A \ding{52} means the design we have chosen answers the question affirmatively---\ding{54}, negatively---but all of these choices are valid points in the design space.

\subsection{The granularity of applicativity}

An applicative semantics of package instantiation states that if a package is instantiated with the ``same arguments'', then the resulting entities it defines should also be considered equal. Because Backpack uses \emph{mix-in modules}, it is very natural to consider the arguments of a package instantiation as the modules, as shown in Figure~\ref{fig:applicativity}b: the same module \m{A} is linked for both instantiations, so \m{P1} and \m{P2} are considered equal. However, if we consider the situation at a finer granularity, we might say, ``Well, for a declaration \texttt{data Y = Y X}, only the definition of type \verb|X| matters.
If they are the same, then \verb|Y| is the same.'' In that case, we might accept that in Figure~\ref{fig:applicativity}a, even though \pname{p} is instantiated with different modules, at the end of the day, the important component \verb|X| is the same in both cases, so \verb|Y| should also be the same. This is a sort of ``extreme'' view of modular development, where every declaration is desugared into a separate module. In our design, we will be a bit more conservative, and continue with module level applicativity, in the same manner as Paper Backpack. \paragraph{Implementation considerations} Compiling Figure~\ref{fig:applicativity}b to dynamic libraries poses an interesting challenge, if every package compiles to a dynamic library. When we compile package \pname{q}, the libraries we end up producing are \pname{q} and an instance of \pname{p} (instantiated with \pname{q}:\m{A}). Furthermore, \pname{q} refers to code in \pname{p} (the import in \m{Q}), and vice versa (the usage of the instantiated hole \m{A}). When building static libraries, this circular dependency doesn't matter: when we link the executable, we can resolve all of the symbols in one go. However, when the libraries in question are dynamic libraries \verb|libHSq.so| and \verb|libHSp(q:A).so|, we now have a \emph{circular dependency} between the two dynamic libraries, and most dynamic linkers will not be able to load either of these libraries. To break the circularity in Figure~\ref{fig:applicativity}b, we have to \emph{inline} the entire module \m{A} into the instance of \pname{p}. Since the code is exactly the same, we can still consider the instance of \m{A} in \pname{q} and in \pname{p} type equal. However, in Figure~\ref{fig:applicativity}c, applicativity has been done at a coarser level: although we are using Backpack's module mixin syntax, morally, this example is filling in the holes with the \emph{package} \pname{a} (rather than a module). In this case, we can achieve code sharing, since \pname{p} can refer directly to \pname{a}, breaking the circularity. \newcolumntype{C}{>{\centering\arraybackslash}p{0.3\textwidth}} \begin{savenotes} \begin{figure} \begin{tabular}{C C C} \begin{verbatim} package q where A = [ data X = X ] A1 = [ import A; x = 0 ] A2 = [ import A; x = 1 ] include p (A as A1, P as P1) include p (A as A2, P as P2) Q = [ import P1; ... ] \end{verbatim} & \begin{verbatim} package q where A = [ data X = X ] include p (A, P as P1) include p (A, P as P2) Q = [ import P1; ... ] \end{verbatim} & \begin{verbatim} package a where A = [ data X = X ] package q where include a include p (A, P as P1) include p (A, P as P2) Q = [ import P1; ... 
] \end{verbatim} \\
(a) Declaration applicativity \ding{54} &
(b) Module applicativity \ding{52} &
(c) Package applicativity \ding{52} \\
\end{tabular}
\caption{Choices of granularity of applicativity on \pname{p}: given \texttt{data Y = Y X}, is \m{P1}.\texttt{Y} equal to \m{P2}.\texttt{Y}?}\label{fig:applicativity}
\end{figure}
\end{savenotes}

\subsection{The granularity of dependency}

\begin{savenotes}
\newcolumntype{C}{>{\centering\arraybackslash}p{0.3\textwidth}}
\begin{figure}
\begin{tabular}{C C C}
\begin{verbatim}
package p(A,P) where
    A :: [ data X ]
    P = [ import A
          data T = T
          data Y = Y X ]
\end{verbatim}
&
\begin{verbatim}
package p(A,P) where
    A :: [ data X ]
    B = [ data T = T ]
    C = [ import A
          data Y = Y X ]
    P = [ import B
          import C ]
\end{verbatim}
&
\begin{verbatim}
package b where
    B = [ data T = T ]
package c where
    A :: [ data X ]
    C = [ import A
          data Y = Y X ]
package p(A,P) where
    include b; include c
    P = [ import B; import C ]
\end{verbatim} \\
(a) Declaration granularity \ding{54} &
(b) Module granularity \ding{54} &
(c) Package granularity \ding{52} \\
\end{tabular}
\caption{Choices of granularity for dependency: is the identity of \texttt{T} independent of how \m{A} is instantiated?}\label{fig:granularity}
\end{figure}
\end{savenotes}

In the previous section, we considered at \emph{what} granularity the arguments of an instantiation are compared; in this section we consider \emph{which} entities are actually counted as part of the dependencies of the declaration/module/package we're writing. Figure~\ref{fig:granularity} contains a series of examples which exemplify the choice of whether or not to collect dependencies on a per-declaration, per-module or per-package basis:

\begin{itemize}
    \item Package-level granularity states that the modules in a package are considered to depend on \emph{all} of the holes in the package, even if the hole is never imported. Figure~\ref{fig:granularity}c is factored so that \verb|T| is defined in a distinct package \pname{b} with no holes, so no matter the choice of \m{A}, \m{B}.\verb|T| will be the same. On the other hand, in Figure~\ref{fig:granularity}b, there is a hole in the package defining \m{B}, so the identity of \verb|T| will depend on the choice of \m{A}.
    \item Module-level granularity states that each module has its own dependency, computed by looking at its import statements. In this setting, \verb|T| in Figure~\ref{fig:granularity}b is independent of \m{A}, since the hole is never imported in \m{B}. But once again, in Figure~\ref{fig:granularity}a, there is an import in the module defining \verb|T|, so the identity of \verb|T| once again depends on the choice of \m{A}.
    \item Finally, at the finest level of granularity, one could chop up \pname{p} in Figure~\ref{fig:granularity}a, looking at declaration-level dependencies to suss out whether or not \verb|T| depends on \m{A}. It doesn't refer to anything in \m{A}, so it is always considered the same.
\end{itemize}

It is well worth noting that the system described by Paper Backpack tracks dependencies per module; however, we have decided that we will implement tracking per package instead: a coarser-grained granularity which accepts fewer programs. Is a finer form of granularity \emph{better}? Not necessarily! For one, we can always split packages into further subpackages (as was done in Figure~\ref{fig:granularity}c) which better reflect the internal hole dependencies, so it is always possible to rewrite a program to make it typecheck---just with more packages.
Additionally, the finer the granularity of dependency, the more work I have to do to understand what the identities of entities in a module are. In Paper Backpack, I have to understand the imports of all modules in a package; with declaration-granularity, I have to understand the entire code. This is a lot of work for the developer to think about; a coarser-grained model is easier to remember and reason about. Finally, package-level granularity is much easier to implement, as it preserves the previous compilation model, \emph{one library per package}. At a fine level of granularity, we may end up repeatedly compiling a module which actually should be considered ``the same'' as any other instance of it.

Nevertheless, finer granularity can be desirable from an end-user perspective. Usually, these circumstances arise when library-writers are forced to split their components into many separate packages, when they would much rather have written a single package. For example, if I define a data type in my library, and would like to define a \verb|Lens| instance for it, I would create a new package just for the instance, in order to avoid saddling users who aren't interested in lenses with an extra dependency. Another example is test suites, which have dependencies on various test frameworks that a user won't care about if they are not planning on testing the code. (Cabal has a special case for this, allowing the user to write effectively multiple packages in a single Cabal file.)

\subsection{Summary}

We can summarize all of the various schemes by describing the internal data types that would be defined by GHC under each regime. First, we have the shared data structures, which correspond closely to what users are used to seeing:

\begin{verbatim}
<pkg-name> ::= containers, ...
<pkg-id>   ::= <pkg-name>-<version>
<mod-name> ::= Data.Set, ...
<occ>      ::= empty, ...
\end{verbatim}

Changing the \textbf{granularity of applicativity} modifies how we represent the list of dependencies associated with an entity. With module applicativity, we list module identities (not yet defined); with declaration applicativity we actually list the original names (i.e., ids).

\begin{verbatim}
<deps> ::= <orig-name>, ...    # Declaration applicativity
<deps> ::= <mod-iden>, ...     # Module applicativity
\end{verbatim}

Changing the \textbf{granularity of dependency} affects how we compute the lists of dependencies, and what entities are well defined:

\begin{verbatim}
# Package-level granularity
<pkg-key>   ::= hash(<pkg-id> + <deps>)
<mod-iden>  ::= <pkg-key>:<mod-name>
<orig-name> ::= <mod-iden>.<occ>

# Module-level granularity
<pkg-key>   not defined
<mod-iden>  ::= hash(<pkg-id>:<mod-name> + <deps>)
<orig-name> ::= <mod-iden>.<occ>

# Declaration-level granularity
<pkg-key>   not defined
<mod-iden>  not defined
<orig-name> ::= hash(<pkg-id>:<mod-name>.<occ> + <deps>)
\end{verbatim}

Notice that as we increase the granularity, the notions of a ``package'' and a ``module'' become undefined. This is because, for example, with module-level granularity, a single ``package'' may result in several modules, each of which has a different set of dependencies. It doesn't make much sense to refer to the package as a monolithic entity, because the point of splitting up the dependencies was so that if a user relies only on a single module, it has a correspondingly restricted set of dependencies.
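To make the chosen point in the design space (package-level dependency granularity with module-level applicativity) a bit more concrete, here is a minimal Haskell sketch of what these data types could look like. The names and representations are purely illustrative and do not correspond to GHC's actual definitions; in particular, a real package key would be a hash rather than a structured value, and versions are elided.

\begin{verbatim}
type PkgName  = String   -- e.g. "containers"
type ModName  = String   -- e.g. "Data.Set"
type OccName  = String   -- e.g. "empty"
type IdentVar = String   -- a hole variable

-- A package key records the package plus how every hole was filled.
data PackageKey = PackageKey PkgName [(ModName, Module)]
    deriving (Eq, Ord, Show)

-- A module identity is either a hole variable or a module path inside
-- an instantiated package.
data Module = HoleVar IdentVar
            | Mod PackageKey ModName
    deriving (Eq, Ord, Show)

-- An original name hangs an occurrence name off a module identity;
-- two types are equal iff their original names are equal.
data OrigName = OrigName Module OccName
    deriving (Eq, Ord, Show)
\end{verbatim}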
\subsection{The new scheme, formally} \begin{wrapfigure}{R}{0.5\textwidth} \begin{myfig} \[ \begin{array}{@{}lr@{\;}c@{\;}l@{}} \text{Package Names (\texttt{PkgName})} & P &\in& \mathit{PkgNames} \\ \text{Module Path Names (\texttt{ModName})} & p &\in& \mathit{ModPaths} \\ \text{Module Identity Vars} & \alpha,\beta &\in& \mathit{IdentVars} \\ \text{Package Key (\texttt{PackageKey})} & \K &::=& P(\vec{p\mapsto\nu}) \\ \text{Module Identities (\texttt{Module})} & \nu &::=& \alpha ~|~ \K\colon\! p \\ \text{Module Identity Substs} & \phi,\theta &::=& \{\vec{\alpha \coloneqq \nu}\} \\ \end{array} \] \caption{Module Identities} \label{fig:mod-idents} \end{myfig} \end{wrapfigure} In this section, we give a formal treatment of our choice in the design space, in the same style as the Backpack paper, but omitting mutual recursion, as it follows straightforwardly. Physical module identities $\nu$, the \texttt{Module} component of \emph{original names} in GHC, are either (1) \emph{variables} $\alpha$, which are used to represent holes\footnote{In practice, these will just be fresh paths in a special package key for variables.} or (2) a concrete module $p$ defined in package $P$, with holes instantiated with other module identities (might be empty)\footnote{In Paper Backpack, we would refer to just $P$:$p$ as the identity constructor. However, we've written the subterms specifically next to $P$ to highlight the semantic difference of these terms.}. As in traditional Haskell, every package contains a number of module files at some module path $p$; within a package these paths are guaranteed to be unique.\footnote{In Paper Backpack, the module expressions themselves are used to refer to globally unique identifiers for each literal. This makes the metatheory simpler, but for implementation purposes it is convenient to conflate the \emph{original} module path that a module is defined at with its physical identity.} When we write inline module definitions, we assume that they are immediately assigned to a module path $p$ which is incorporated into their identity. A module identity $\nu$ simply augments this with subterms $\vec{p\mapsto\nu}$ representing how \emph{all} holes in the package $P$ were instantiated.\footnote{In Paper Backpack, we do not distinguish between holes/non-holes, and we consider all imports of the \emph{module}, not the package.} This naming is stable because the current Backpack surface syntax does not allow a logical path in a package to be undefined. A package key is $P(\vec{p\mapsto\nu})$. Here is the very first example from Section 2 of the original Backpack paper, \pname{ab-1}: \begin{example} \Pdef{ab-1}{ \Pmod{A}{x = True} \Pmod{B}{\Mimp{A}; y = not x} % \Pmodd{C}{\mname{A}} } \end{example} The identities of \m{A} and \m{B} are \pname{ab-1}:\mname{A} and \pname{ab-1}:\mname{B}, respectively.\footnote{In Paper Backpack, the identity for \mname{B} records its import of \mname{A}, but since it is definite, this is strictly redundant.} In a package with holes, each hole (within the package definition) gets a fresh variable as its identity, and all of the holes associated with package $P$ are recorded. 
Consider \pname{abcd-holes-1}: \begin{example} \Pdef{abcd-holes-1}{ \Psig{A}{x :: Bool} % chktex 26 \Psig{B}{y :: Bool} % chktex 26 \Pmod{C}{x = False} \Pmodbig{D}{ \Mimpq{A}\\ \Mimpq{C}\\ % \Mexp{\m{A}.x, z}\\ z = \m{A}.x \&\& \m{C}.x } } \end{example} The identities of the four modules are, in order, $\alpha_a$, $\alpha_b$, $\pname{abcd-holes-1}(\alpha_a,\alpha_b)$:\mname{C}, and $\pname{abcd-holes-1}(\alpha_a,\alpha_b)$:\mname{D}.\footnote{In Paper Backpack, the granularity is at the module level, so the subterms of \mname{C} and \mname{D} can differ.} We include both $\alpha_a$ and $\alpha_b$ in both \mname{C} and \mname{D}, regardless of the imports. When we link the package against an implementation of the hole, these variables are replaced with the identities of the modules we linked against. Shaping proceeds in the same way as in Paper Backpack, except that the shaping judgment must also accept the package key $P(\vec{p\mapsto\alpha})$ so we can create identifiers with \textsf{mkident}. This implies we must know ahead of time what the holes of a package are. \paragraph{A full Backpack comparison} If you're curious about how the rest of the Backpack examples translate, look no further than this section. First, consider the module identities in the \m{Graph} instantiations in \pname{multinst}, shown in Figure 2 of the original Backpack paper. In the definition of \pname{structures}, assume that the variables for \m{Prelude} and \m{Array} are $\alpha_P$ and $\alpha_A$ respectively. The identity of \m{Graph} is $\pname{structures}(\alpha_P, \alpha_A)$:\m{Graph}. Similarly, the identities of the two array implementations are $\nu_{AA} = \pname{arrays-a}(\alpha_P)$:\m{Array} and $\nu_{AB} = \pname{arrays-b}(\alpha_P)$:\m{Array}.\footnote{Notice that the subterms coincide with Paper Backpack! A sign that module level granularity is not necessary for many use-cases.} The package \pname{graph-a} is more interesting because it \emph{links} the packages \pname{arrays-a} and \pname{structures} together, with the implementation of \m{Array} from \pname{arrays-a} \emph{instantiating} the hole \m{Array} from \pname{structures}. This linking is reflected in the identity of the \m{Graph} module in \pname{graph-a}: whereas in \pname{structures} it was $\nu_G = \pname{structures}(\alpha_P, \alpha_A)$:\m{Graph}, in \pname{graph-a} it is $\nu_{GA} = \nu_G[\nu_{AA}/\alpha_A] = \pname{structures}(\alpha_P, \nu_{AA})$:\m{Graph}. Similarly, the identity of \m{Graph} in \pname{graph-b} is $\nu_{GB} = \nu_G[\nu_{AB}/\alpha_A] = \pname{structures}(\alpha_P, \nu_{AB})$:\m{Graph}. Thus, linking consists of substituting the variable identity of a hole by the concrete identity of the module filling that hole. Lastly, \pname{multinst} makes use of both of these \m{Graph} modules, under the aliases \m{GA} and \m{GB}, respectively. Consequently, in the \m{Client} module, \code{\m{GA}.G} and \code{\m{GB}.G} will be correctly viewed as distinct types since they originate in modules with distinct identities. As \pname{multinst} illustrates, module identities effectively encode dependency graphs at the package level.\footnote{In Paper Backpack, module identities encode dependency graphs at the module level. In both cases, however, what is being depended on is always a module.} Like in Paper Backpack, we have an \emph{applicative} semantics of instantiation, and the applicativity example in Figure 3 of the Backpack paper still type checks. 
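To spell out the substitution step in code, here is a small sketch in the same illustrative vocabulary as the summary in the previous section (again, these are not GHC's real data types, and real package keys would be hashed):

\begin{verbatim}
type PkgName  = String
type ModName  = String
type IdentVar = String

data PackageKey = PackageKey PkgName [(ModName, Module)] deriving (Eq, Show)
data Module     = HoleVar IdentVar | Mod PackageKey ModName deriving (Eq, Show)

-- Linking a hole substitutes a concrete module identity for a hole
-- variable, recursively through the subterms of package keys.
subst :: (IdentVar, Module) -> Module -> Module
subst (a, nu) (HoleVar b) = if a == b then nu else HoleVar b
subst s       (Mod k m)   = Mod (substKey s k) m

substKey :: (IdentVar, Module) -> PackageKey -> PackageKey
substKey s (PackageKey p holes) =
    PackageKey p [ (m, subst s nu) | (m, nu) <- holes ]

-- nu_G, nu_AA and nu_GA from the multinst example:
nuG, nuAA, nuGA :: Module
nuG  = Mod (PackageKey "structures"
              [("Prelude", HoleVar "aP"), ("Array", HoleVar "aA")]) "Graph"
nuAA = Mod (PackageKey "arrays-a" [("Prelude", HoleVar "aP")]) "Array"
nuGA = subst ("aA", nuAA) nuG
\end{verbatim}

The identity of \m{Graph} in \pname{graph-b} is computed the same way with $\nuAB$ substituted for $\alpha_A$; because identities are compared structurally, two instantiations with the same module yield equal identities---the applicative behaviour just described.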
However, because we are operating at a coarser granularity, modules may have spurious dependencies on holes that they don't actually depend on, which means fewer type equalities may hold.

\subsection{Cabal dependency resolution}

Currently, when we compile a Cabal package, Cabal goes ahead and resolves \verb|build-depends| entries with actual implementations, which we compile against. A planned addition to the package key, independent of Backpack, is to record the transitive dependency tree selected during this dependency resolution process, so that we can install \pname{libfoo-1.0} twice compiled against different versions of its dependencies. What is the relationship between this transitive dependency tree of \emph{packages} and the subterms of our package identities, which are \emph{modules}? Does one subsume the other? In fact, these are separate mechanisms---two levels of indirection, so to speak. To illustrate, suppose I write a Cabal file with \verb|build-depends: foobar|. A reasonable assumption is that this translates into a Backpack package which has \verb|include foobar|. However, this is not actually a Paper Backpack package: Cabal's dependency solver has to rewrite all of these package references into versioned references \verb|include foobar-0.1|. For example, this is a pre-package:

\begin{verbatim}
package foo where
    include bar
\end{verbatim}

and this is a Paper Backpack package:

\begin{verbatim}
package foo-0.3[bar-0.1[baz-0.2]] where
    include bar-0.1[baz-0.2]
\end{verbatim}

This tree is very similar to the one tracking dependencies for holes, but we must record this tree \emph{even} when our package has no holes.
% As a final example, the full module
% identity of \m{B1} in Figure~\ref{fig:granularity} may actually be $\pname{p-0.9(q-1.0[p-0.9]:A1)}$:\m{B}.

\paragraph{Linker symbols} As we increase the amount of information in PackageId, it's important to be careful about the length of these IDs, as they are used for exported linker symbols (e.g. \verb|base_TextziReadziLex_zdwvalDig_info|). Very long symbol names hurt compile and link time, object file sizes, GHCi startup time, dynamic linking, and make gdb hard to use. As such, we are going to do away with full package names and versions and instead use just a base-62 encoded hash, with the first five characters of the package name for user-friendliness.

\subsection{Package selection}

When I fire up \texttt{ghci} with no arguments, GHC somehow creates out of thin air some consistent set of packages, whose modules I can load using \texttt{:m}. This functionality is extremely handy for exploratory work, but actually GHC has to work quite hard in order to generate this set of packages, the contents of which are all dumped into a global namespace. For example, GHC doesn't have access to Cabal's dependency solver, nor does it know \emph{which} packages the user is going to ask for, so it can't just run a constraint solver, get a set of consistent packages to offer and provide them to the user.\footnote{Some might argue that depending on a global environment in this fashion is wrong, because when you perform a build in this way, you have absolutely no idea what dependencies you actually ended up using.
But the fact remains that for end users, this functionality is very useful.}

To make matters worse, while in the current design of the package database a package is uniquely identified by its package name and version, in the Backpack design it is \emph{mandatory} that we support multiple packages installed in the database with the same package name and version, and this can result in complications in the user model. This further complicates GHC's default package selection algorithm. In this section, we describe how the current algorithm operates (including what invariants it tries to uphold and where it goes wrong), and how to replace the algorithm to handle generalization to multiple instances in the package database. We'll also try to tease apart the relationship between package keys and installed package IDs in the database.

\paragraph{The current algorithm} Abstractly, GHC's current package selection algorithm operates as follows. For every package name, select the package with the latest version (recall that this is unique) which is also \emph{valid}. A package is valid if:

\begin{itemize}
    \item It exists in the package database,
    \item All of its dependencies are valid,
    \item It is not shadowed by a package with the same package ID\footnote{Recall that currently, a package ID uniquely identifies a package in the package database} in another package database (unless it is in the transitive closure of a package named by \texttt{-package-id}), and
    \item It is not ignored with \texttt{-ignore-package}.
\end{itemize}

Package validity is probably the minimal criterion for GHC to ensure that it can actually \emph{use} a package. If the package is missing, GHC can't find the interface files or object code associated with the package. Ignoring packages is a way of pretending that a package is missing from the database.

Package validity is also a very weak criterion. Another criterion we might hope holds is \emph{consistency}: when we consider the transitive closure of all selected packages, for any given package name, there should only be one instance providing that package. It is trivially easy to break this property: suppose that I have packages \pname{a-1.0}, \pname{b-1.0} compiled against \pname{a-1.0}, and \pname{a-1.1}. GHC will happily load \pname{b-1.0} and \pname{a-1.1} together in the same interactive session (they are both valid and the latest versions), even though \pname{b-1.0}'s dependency is inconsistent with another package that was loaded. The user will notice if they attempt to treat entities from \pname{a} reexported by \pname{b-1.0} and entities from \pname{a-1.1} as type equal. Here is one user who had this problem: \url{http://stackoverflow.com/questions/12576817/}. In some cases, the problem is easy to work around (there is only one offending package which just needs to be hidden), but if the divergence is deep in two separate dependency hierarchies, it is often easier to just blow away the package database and try again.

Perversely, destructive reinstallation helps prevent these sorts of inconsistent databases. While inconsistencies can arise when multiple versions of a package are installed, multiple versions will frequently lead to the necessity of reinstalls. In the previous example, if a user attempts to Cabal install a package which depends on \pname{a-1.1} and \pname{b-1.0}, Cabal's dependency solver will propose reinstalling \pname{b-1.0} compiled against \pname{a-1.1}, in order to get a consistent set of dependencies.
If this reinstall is accepted, we invalidate all packages in the database which were previously installed against \pname{b-1.0} and \pname{a-1.0}, excluding them from GHC's selection process and making it more likely that the user will see a consistent view of the database.

\paragraph{Enforcing consistent dependencies} From the user's perspective, it would be desirable if GHC never loaded a set of packages whose dependencies were inconsistent. There are two ways we can go about doing this. First, we can improve GHC's logic so that it doesn't pick an inconsistent set. However, as a point of design, we'd like to keep whatever resolution GHC does as simple as possible (in an ideal world, we'd skip the validity checks entirely, but they ended up being necessary to prevent a broken database from stopping GHC from starting up at all). In particular, GHC should \emph{not} learn how to do backtracking constraint solving: that's in the domain of Cabal. Second, we can modify the logic of Cabal to enforce that the package database is always kept in a consistent state, similar to the consistency check Cabal applies to sandboxes, where it refuses to install a package to a sandbox if the resulting dependencies would not be consistent.

The second alternative is appealing, but Cabal sandboxes are currently designed for small, self-contained single projects, as opposed to the global ``universe'' that a default environment is intended to provide. For example, with a Cabal sandbox environment, it's impossible to upgrade a dependency to a new version without blowing away the sandbox and starting again. To support upgrades, Cabal needs to do some work: when a new version is put in the default set, all of the reverse-dependencies of the old version are now inconsistent. Cabal should offer to hide these packages or reinstall them compiled against the latest version. Cabal should also be able to snapshot the older environment, which captures the state of the universe prior to the installation, in case the user wants to revert back.

\paragraph{Modifying the default environment} Currently, after GHC calculates the default package environment, a user may further modify the environment by passing package flags to GHC, which can be used to explicitly hide or expose packages. How do these flags interact with our Cabal-managed environments? Hiding packages is simple enough, but exposing packages is a bit dicier. If a user asks for a different version of a package than in the default set, it will probably be inconsistent with the rest of the dependencies. Cabal would have to be consulted to figure out a maximal set of consistent packages with the constraints given. However, this use-case is rare. Usually, the user doesn't actually want a specific version: the package is hidden simply because we're not interested in loading it by default (\pname{ghc-api} is the canonical example, since it dumps a lot of modules in the top level namespace). If we distinguish packages which are consistent but hidden, their loads can be handled appropriately.

\paragraph{Consistency in Backpack} We have stated as an implicit assumption that if we have both \pname{foo-1.0} and \pname{foo-1.1} available, only one should be loaded at a time. What are the consequences if both of these packages are loaded at the same time? An import of \m{Data.Foo} provided by both packages would be ambiguous, and the user might find some type equalities they expect to hold would not.
However, the result is not \emph{unsound}: indeed, we might imagine a user purposely wanting two different versions of a library in the same program, renaming the modules they provided so that they could be referred to unambiguously. As another example, suppose that we have an indefinite package with a hole that is instantiated multiple times. In this case, a user absolutely may want to refer to both instantiations, once again renaming modules so that they have unique names.

There are two consequences of this. First, while the default package set may enforce consistency, a user should still be able to explicitly ask for a package instance, renamed so that its modules don't conflict, and then use it in their program. Second, instantiated indefinite packages should \emph{never} be placed in the default set, since it's impossible to know which instantiation is the one the user prefers. A definite package can reexport an instantiated module under an unambiguous name if the user so pleases.

\paragraph{Shadowing, installed package IDs, ABI hashes and package keys} Shadowing plays an important role in maintaining the soundness of compilation; call this the \emph{compatibility} of the package set. The problem it addresses is when there are two distinct implementations of a module, but because their package ID (or package key, in the new world order) is the same, they are considered type equal. It is absolutely wrong for a single program to include both implementations simultaneously (the symbols would conflict and GHC would incorrectly conclude things were type equal when they're not), so \emph{shadowing}'s job is to ensure that only one instance is picked, and all the other instances are considered invalid (and their reverse-dependencies, etc.)

Recall that in current GHC, within a package database, a package instance is uniquely identified by its package ID\@; thus, shadowing only needs to take place between package databases. An interesting corner case is when the same package ID occurs in both databases, but the installed package IDs are the \emph{same}. Because the installed package ID is currently simply an ABI hash, we skip shadowing, because the packages are---in principle---interchangeable.

There are currently a number of proposed changes to this state of affairs:

\begin{itemize}
    \item Change installed package IDs to not be based on ABI hashes. ABI hashes have a number of disadvantages as identifiers for packages in the database. First, they cannot be computed until after compilation, which gave the multi-instance GSoC project a few years ago some headaches. Second, it's not really true that programs with identical ABI hashes are interchangeable: a new package may be ABI compatible but have different semantics. Thus, installed package IDs are a poor unique identifier for packages in the package database. However, because GHC does not give ABI stability guarantees, it would not be possible to assume from here that packages with the same installed package ID are ABI compatible.
    \item Relax the uniqueness constraint on package IDs. There are actually two things that could be done here. First, since we have augmented package IDs with dependency resolution information to form package keys, we could simply state that package keys uniquely identify a package in a database. Shadowing rules can be implemented in the same way as before, by preferring the instance topmost on the stack.
Second, we could also allow \emph{same-database} shadowing: that is, not even package keys are guaranteed to be unique in a database; instead, installed package IDs are the sole unique identifier of a package. The motivation behind this architecture is to treat the package database more like a cache than a database: information about shadowing is separately maintained and used.
\end{itemize}

Edward thinks same-database shadowing is wrong. What same-database shadowing implies is that there are multiple incompatible ``package hierarchies'' (possibly with a shared root), one of which shadows the other hierarchy. It is now absolutely essential to somehow identify which hierarchy should be visible (the rest being shadowed). It seems better to me to explicitly reify this hierarchy as a hierarchy of package databases. For example, instead of having (installed package IDs) \texttt{foo-1.0-hash1} and \texttt{foo-1.0-hash2} in the same database, have a separate database for each, and the respective dependencies which are built against those packages. (Notice that all of these dependencies are incompatible with one another.) Furthermore, because of the precedence of shadowing, we can store one of these installed package IDs in the primary database, and then layer the second on top of it (as it takes precedence, it automatically invalidates all of the packages depending on \texttt{foo-1.0-hash1}, while keeping packages which are otherwise compatible.)

\section{Shapeless Backpack}\label{sec:simplifying-backpack}

Backpack as currently defined always requires a \emph{shaping} pass, which calculates the shapes of all modules defined in a package. The shaping pass is critical to the solution of the double vision problem in recursive module linking, but it also presents a number of unpalatable implementation problems:

\begin{itemize}
    \item \emph{Shaping is a lot of work.} A module shape specifies the provenance of all data types and identifiers defined by a module. To calculate this, we must preprocess and parse all modules, even before we do the type-checking pass. (Fortunately, shaping doesn't require a full parse of a module, only enough to get identifiers. However, it does have to understand import statements at the same level of detail as GHC's renamer.)
    \item \emph{Shaping must be done upfront.} In the current Backpack design, all shapes must be computed before any typechecking can occur. While performing the shaping pass upfront is necessary in order to solve the double vision problem (where a module identity may be influenced by later definitions), it means that GHC must first do a shaping pass, and then revisit every module and compile them properly. Nor is it (easily) possible to skip the shaping pass when it is unnecessary, as one might expect to be the case in the absence of mutual recursion. Shaping is not a ``pay as you go'' language feature.
    \item \emph{GHC can't compile all programs shaping accepts.} Shaping accepts programs that GHC, with its current hs-boot mechanism, cannot compile. In particular, GHC requires that any data type or function in a signature actually be \emph{defined} in the module corresponding to that file (i.e., an original name can be assigned to these entities immediately.) Shaping permits unrestricted exports to implement modules; this shows up in the formalism as $\beta$ module variables.
    \item \emph{Shaping encourages inefficient program organization.} Shaping is designed to enable mutually recursive modules, but as currently implemented, mutual recursion is less efficient than code without recursive dependencies. Programmers should avoid this code organization, except when it is absolutely necessary.
    \item \emph{GHC is architecturally ill-suited for directly implementing shaping.} Shaping implies that GHC's internal concept of an ``original name'' be extended to accommodate module variables. This is an extremely invasive change to all aspects of GHC, since the original names assumption is baked quite deeply into the compiler. Plausible implementations of shaping require all these variables to be skolemized outside of GHC\@.
\end{itemize}

To be clear, the shaping pass is fundamentally necessary for some Backpack packages. Here is the example which convinced Simon:

\begin{verbatim}
package p where
    A :: [data T; f :: T -> T]
    B = [export T(MkT), h; import A(f); data T = MkT; h x = f MkT]
    A = [export T(MkT), f, h; import B; f MkT = MkT]
\end{verbatim}

The key to this example is that B \emph{may or may not typecheck} depending on the definition of A. Because A reexports B's definition T, B will typecheck; but if A defined T on its own, B would not typecheck. Thus, we \emph{cannot} typecheck B until we have done some analysis of A (the shaping analysis!)

It is therefore beneficial (from an optimization point of view) to consider a subset of Backpack for which shaping is not necessary. Here is a programming discipline which does just that, which we will call the \textbf{linking restriction}: \emph{Module implementations must be declared before signatures.} Formally, this restriction modifies the rule for merging polarized module shapes ($\widetilde{\tau}_1^{m_1} \oplus \widetilde{\tau}_2^{m_2}$) so that $\widetilde{\tau}_1^- \oplus \widetilde{\tau}_2^+$ is always undefined.\footnote{This seemed to be the crispest way of defining the restriction, although this means an error happens a bit later than I'd like it to: I'd prefer if we errored while merging logical contexts, but we don't know what is a hole at that point.}

Here is an example of the linking restriction. Consider these two packages:

\begin{verbatim}
package random where
    System.Random = [ ... ].hs

package monte-carlo where
    System.Random :: ...
    System.MonteCarlo = [ ... ].hs
\end{verbatim}

Here, random is a definite package which may have been compiled ahead of time; monte-carlo is an indefinite package with a dependency on any package which provides \verb|System.Random|. Now, to link these two packages together, only one ordering is permissible:

\begin{verbatim}
package myapp where
    include random
    include monte-carlo
\end{verbatim}

If myapp wants to provide its own random implementation, it can do so:

\begin{verbatim}
package myapp2 where
    System.Random = [ ... ].hs
    include monte-carlo
\end{verbatim}

In both cases, all of \verb|monte-carlo|'s holes have been filled in by the time it is included. The alternate ordering is not allowed.

Why does this discipline prevent mutually recursive modules? Intuitively, a hole is the mechanism by which we can refer to an implementation before it is defined; otherwise, we can only refer to definitions which precede our definition. If there are never any holes \emph{which get filled}, implementation links can only go backwards, ruling out circularity.
It's easy to see how mutual recursion can occur if we break this discipline: \begin{verbatim} package myapp2 where include monte-carlo System.Random = [ import System.MonteCarlo ].hs \end{verbatim} \subsection{Typechecking of definite modules without shaping} If we are not carrying out a shaping pass, we need to be able to calculate $\widetilde{\Xi}_{\mathsf{pkg}}$ on the fly. In the case that we are compiling a package---there will be no holes in the final package---we can show that shaping is unnecessary quite easily, since with the linking restriction, everything is definite from the get-go. Observe the following invariant: at any given step of the module bindings, the physical context $\widetilde{\Phi}$ contains no holes. We can thus conclude that there are no module variables in any type shapes. As the only time a previously calculated package shape can change is due to unification, the incrementally computed shape is in fact the true one. As far as the implementation is concerned, we never have to worry about handling module variables; we only need to do extra typechecks against (renamed) interface files. \subsection{Compiling definite packages}\label{sec:compiling} % New definitions \algnewcommand\algorithmicswitch{\textbf{switch}} \algnewcommand\algorithmiccase{\textbf{case}} \algnewcommand\algorithmicassert{\texttt{assert}} % New "environments" \algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}% \algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ ``#1''}{\algorithmicend\ \algorithmiccase}% \algtext*{EndSwitch}% \algtext*{EndCase}% \begin{algorithm} \caption{Compilation of definite packages (assume \texttt{-hide-all-packages} on all \texttt{ghc} invocations)}\label{alg:compile} \begin{algorithmic} \Procedure{Compile}{\textbf{package} $P$ \textbf{where} $\vec{B}$, $H$, $flags_P$}\Comment{}$H$ maps hole module names to identities \State$flags\gets flags_P$ \State$\mathcal{K}$ $\gets$ \Call{Hash}{$P + H$} \State$cwd\gets \textrm{fresh working directory for compilation}$ \For{$B$ \textbf{in} $\vec{B}$} \Case{$p = p\texttt{.hs}$} \State\Call{Exec}{\texttt{ghc -c} $p$\texttt{.hs} \texttt{-package-name} $\mathcal{K}$ $flags$} \EndCase% \Case{$p$ $\cc$ $p$\texttt{.hsig}} \State\Call{Exec}{\texttt{ghc -c} $p$\texttt{.hsig} \texttt{--sig-of} $H(p)$ $flags$} \EndCase% \Case{$p = p'$} \State$flags\gets flags$ \texttt{-alias} $p$ $p'$ \State\Call{Exec}{\texttt{ghc --check} $flags$} \EndCase% \Case{\Cinc{$P'$} $\langle\vec{p_H\mapsto p_H'}, \vec{p\mapsto p'} \rangle$} \State\textbf{let} $H'(p_H) = $ \Call{Exec}{\texttt{ghc --resolve-module} $p_H'$ $flags$} \State$\mathcal{K}'\gets$ \Call{Compile}{$P'$, $H'$, $flags_P$ \texttt{-package-dir} $\mathcal{K}$ $cwd$ $\langle\rangle$}\Comment{}Nota bene: not $flags$ \State$flags\gets flags$ \texttt{-package} $\mathcal{K}'$ $\langle\vec{p\mapsto p'}\rangle$ \State\Call{Exec}{\texttt{ghc --check} $flags$} \EndCase% \EndFor% \State\Call{Exec}{\texttt{ghc-pkg copy \&\& ghc-pkg register}} \Comment{Records if $P$ exports a thinning} \State\Return$\mathcal{K}$ \EndProcedure% \end{algorithmic} \end{algorithm} The full recursive procedure for compiling a Backpack package is given in Figure~\ref{alg:compile}. We recursively walk through Backpack descriptions, processing each line by invoking GHC and/or modifying our package state. 
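As a rough illustration of how the recursion unfolds, here is approximately the sequence of invocations the algorithm would perform for the \pname{myapp2} and \pname{monte-carlo} packages from the previous section. The file names and the rendering of package keys are only schematic, and the flags are the ones proposed in this document rather than existing GHC flags:

\begin{verbatim}
# Compile(myapp2, H = {}, flags = {})
K  = hash("myapp2")                      # no holes to record
ghc -c System/Random.hs -package-name K

# include monte-carlo <System.Random -> System.Random>
#   H'(System.Random) = ghc --resolve-module System.Random  =  K:System.Random
#   Compile(monte-carlo, H', -package-dir K <cwd of myapp2>)
K' = hash("monte-carlo" + {System.Random -> K:System.Random})
ghc -c System/Random.hsig --sig-of K:System.Random -package-dir K <cwd of myapp2>
ghc -c System/MonteCarlo.hs -package-name K' -package-dir K <cwd of myapp2>
ghc-pkg copy && ghc-pkg register         # register this instance of monte-carlo

# back in myapp2: flags += -package K'
ghc --check ...
ghc-pkg copy && ghc-pkg register         # register myapp2
\end{verbatim}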
Here is a more in-depth description of the algorithm, line by line:

\paragraph{The parameters} To compile a package description for package $P$, we need to know $H$, the mapping of holes $p_H$ in package $P$ to physical module identities $\nu$ which are implementing them; this mapping is used to calculate the package key $\mathcal{K}$ for the package in question. Furthermore, because we are running an actual compiler, and some of these implementations may not be in the package database, we also need an extra set of flags $flags_P$ which tells us where to find the interface and object files of any parent packages which are currently compiling.

\subsection{Compiling implementations}

We compile modules in the same way we do today, but with some extra package visibility $flags$, which let GHC know how to resolve imports and look up original names. We'll describe what the new flags are and also discuss some subtleties with module lookup.

\paragraph{New package resolution algorithm} Currently, GHC has a complicated mechanism for processing \texttt{-package} (and related flags), motivated by the fact that, in the absence of any package flags, GHC would like to make available \emph{all} packages in the package database. Here is a major simplification we can make: let's assume that \emph{no packages are available by default.} Now process each package flag in turn, building up a set of selected packages which is initially empty.

For \texttt{-package} and \texttt{-package-id}, get the set of installed packages which match the flag. (If multiple package names apply, process each in turn as if it were a separate flag.) Compute the transitive closure of dependencies for all of them, and filter out all choices which have dependencies which are inconsistent with the current set of selected packages. In the current GHC world, a dependency is consistent with the selected packages if there is no package with the same package name but a different installed package ID in the selected set. Thus, in \texttt{-package} \pname{foo-0.2} \texttt{-package} \pname{bar}, selections of \pname{bar} which depended on \pname{foo-0.1} would be excluded. In Backpack, we will make the model a little more sophisticated.

If there is still more than one choice, tiebreak by version, then by which database the package lives in, and then by time of install. (The latter tiebreaks should not be necessary unless there are multiple instances of a package with the same package ID, as might be the case when their dependencies differ.) If there are no choices, error, unless the particular package that was asked for is an older version of a package already in the set (e.g., \texttt{-package} \pname{containers-0.9} \texttt{-package} \pname{containers-0.8}).

\paragraph{Thinning and renaming on packages} We augment the syntax of \texttt{-package} to also accept a list of thinnings and renamings, e.g. \texttt{-package} \pname{containers} $\langle\m{Data.Set}, \m{Data.Map}\mapsto \m{Map}\rangle$ says to make visible for import \m{Data.Set} and \m{Map} (which is \m{Data.Map} renamed.) With thinning and renaming, we have to slightly augment the package resolution algorithm in two ways. First, as we are processing \texttt{-package} flags, we now need to concurrently maintain a mapping of visible module bindings. We either put all exposed modules in this mapping, or just the ones mentioned by the thinning and renaming. Furthermore, it makes sense to refine our notion of conflict to apply at the module level.
Rather than maintain a mapping of package names to package choices, we instead look at whether or not there are any \emph{module binding} conflicts. Thus, it would make sense to say \texttt{-package} \pname{containers-0.9} $\langle\m{Data.Set}\mapsto \m{Set09}\rangle$ \texttt{-package} \pname{containers-0.8} $\langle\m{Data.Set}\mapsto \m{Set08}\rangle$ and then use both modules concurrently, since the modules no longer are mapped onto the same name. This is different from existing GHC behavior, where two packages which export the same module can be loaded, and GHC then gives an ambiguous import error if you try to import the module. (Perhaps this could still be done here, by reporting errors lazily.) Additionally, it's important to note that two packages exporting the same module do not \emph{necessarily} cause a conflict; the modules may link together. For example, \texttt{-package} \pname{containers} $\langle\m{Data.Set}\rangle$ \texttt{-package} \pname{containers} $\langle\m{Data.Set}\rangle$ is fine, because precisely the same implementation of \m{Data.Set} is loaded in both cases. A similar situation can occur with signatures: \begin{verbatim} package p where A :: [ x :: Int ] package q include p A :: [ y :: Int ] B = [ import A; z = x + y ] -- * package r where A = [ x = 0; y = 0 ] include q \end{verbatim} Here, both \pname{p} and \pname{q} are visible when compiling the starred module, which compiles with the flags \texttt{-package} \pname{p}, but there are two interface files available: one available locally, and one from \pname{p}. Both of these interface files are \emph{forwarding} to the original implementation in \pname{r} (more on this in the ``Compiling signatures'' section), so rather than reporting an ambiguous import, we instead have to merge the two interface files together and use the result as the interface for the module. (This could be done on the fly, or we could generate merged interface files as we go along.) Note that we do not merge signatures with an implementation; we just use the implementation interface. E.g. \begin{verbatim} package p where A :: ... package q where A = ... include p B = [ import A ] -- * \end{verbatim} Here, \m{A} is available both from \pname{p} and \pname{q}, but the use in the starred module should be done with respect to the full implementation.
\paragraph{The \texttt{-alias} flag} We introduce a new flag \texttt{-alias} for aliasing modules. Aliasing is analogous to the merging that can occur when we include packages, but it also applies to modules which are locally defined. When we alias a module $p$ with $p'$, we require that $p'$ exists in the current module mapping, and then we attempt to add an entry for it at $p$. If there is no mapping for $p$, this succeeds; otherwise, we apply the same conflict resolution algorithm.
\paragraph{The \texttt{-package-dir} flag} We introduce a new flag \texttt{-package-dir} which takes a package key $\mathcal{K}$, a directory and a list of thinning/renamings as before. \texttt{-package-dir} works nearly the same as a \texttt{-package} flag, except that the package in question does \emph{not} have to already be installed in the package database. Instead, the files associated with the package are checked for in the directory in question. We will use this flag to refer to files in partially compiled packages which have not been installed to the package database.
\subsection{Compiling signatures} Signature compilation is triggered when we compile a signature file.
This mode is similar to how we process \verb|hs-boot| files, except we pass an extra flag \verb|--sig-of| which specifies what the actual implementation of the signature is (according to our $H$ mapping). This is guaranteed to exist, due to the linking restriction. (Additionally, the implementation will probably be found by looking up the package key against a \texttt{-package-dir} flag.) In this case, we output an \texttt{hisig} file which, for all declarations the signature exposes, forwards their definitions to the original implementation file. The intent is that any code in the current package which compiles against this signature will use this \texttt{hisig} file, not the original \texttt{hi} file. For example, the \texttt{hisig} file produced when compiling the starred signature points to the implementation in package \pname{q}. \begin{verbatim} package p where A :: ... -- * B = [ import A; ... ] package q where A = [ ... ] include p \end{verbatim}
\paragraph{Sometimes \texttt{hisig} is unnecessary} In the following package: \begin{verbatim} package p where P = ... P :: ... \end{verbatim} Paper Backpack specifies that we check the signature \m{P} against implementation \m{P}, but otherwise no changes are made (i.e., the signature does not narrow the implementation.) In this case, it is not necessary to generate an \texttt{hisig} file.
\paragraph{Multiple signatures} As a simplification, we assume that there is only one signature per logical name in a package. (This prevents us from expressing mutual recursion in signatures, but let's not worry about it for now.)
\paragraph{Restricted recursive modules ala hs-boot}\label{sec:hs-boot-restrict} When we compile an \texttt{hsig} file without any \texttt{--sig-of} flag (because no implementation is known), we fall back to old-style GHC mutual recursion. Na\"\i vely, a shaping pass would be necessary; so we adopt an existing constraint that already applies to hs-boot files: \emph{at the time we define a signature, we must know what the original name for all data types is}. In practice, GHC enforces this by stating that: (1) an hs-boot file must be accompanied by an implementation, and (2) the implementation must in fact define (and not reexport) all of the declarations in the signature. We can discover if a signature is intended to break a recursive module loop when we discover that $p\notin H$; in this case, we fall back to the old hs-boot behavior. (Alternatively, the user can explicitly ask for it.) Why does this not require a shaping pass? The reason is that the signature is not really polymorphic: we require that the $\alpha$ module variable be resolved to a concrete module later in the same package, and that all the $\beta$ module variables be unified with $\alpha$. Thus, we know ahead of time the original names and don't need to deal with any renaming.\footnote{This strategy doesn't completely resolve the problem of cross-package mutual recursion, because we need to first compile a bit of the first package (signatures), then the second package, and then the rest of the first package.} Compiling packages in this way gives the tantalizing possibility of true separate compilation: the only thing we don't know is what the actual package name of an indefinite package will be, and thus what the correct references should be. This is a very minor change to the assembly, so one could conceive of dynamically rewriting these references at the linking stage.
But separate compilation achieved in this fashion would not be able to take advantage of cross-module optimizations.
\subsection{Compiling aliases} Aliasing simply adds an extra \texttt{-alias} flag to the compilation flags. To eagerly report errors, we run a special command \texttt{ghc --check} which checks to make sure $flags$ is consistent (e.g., no linking conflicts.)
\subsection{Compiling includes} Includes are the most interesting part of the compilation process, as we have to calculate $H'$, describing how the holes of the subpackage we are filling in are compiled, and modify our flags to make the exports of the include visible to subsequently compiled modules. We consider the case with renaming, since includes with no renaming are straightforward. First, we assume that we know \emph{a priori} what the holes of a package $p_H$ are (either by some sort of pre-pass, or explicit declaration.) For each of their \emph{renamed targets} $p'_H$, we look up the module in the current $flags$ environment, retrieving the physical module identity by consulting GHC with the \texttt{--resolve-module} flag and storing it in $H'$. (This can be done in batch.) For example: \begin{verbatim} package p where A :: ... ... package q where A = [ ... ] B = [ ... ] include p (A as B) \end{verbatim} When computing the entry $H'(\m{A})$, we run the command \texttt{ghc --resolve-module} \m{B}. Next, we recursively call \textsc{Compile} with the computed $H'$, but registering the current working directory as the place to look for any original names containing the parent package key $\mathcal{K}$. For example, in this situation: \begin{verbatim} package p where B :: ... C = [ import B; ... ] package q where A = [ ... ] B = [ import A; ... ] include p D = [ import C; ... ] \end{verbatim} When we recursively process package \pname{p}, $H$ will refer to \pname{q}:\m{B}, and we need to know where to find it (\pname{q} is only partially processed and so not installed in the package database.) Furthermore, the interface file for \m{B} may refer to \pname{q}:\m{A}, and thus we likewise need to know how to find its interface file. Note that we do \emph{not} import any of the modules (the thinning/renaming is empty, masking all modules); we only want to register the package key in the in-memory database. Otherwise, this example would improperly compile: \begin{verbatim} package p where B = [ import A; ... ] package q where A = ... include p \end{verbatim} \pname{p} does not compile on its own, so it should not compile if it is recursively invoked from \pname{q}. However, if we add \texttt{-package-dir} \pname{q} $cwd$ to the flags without masking out all modules, \m{A} is now suddenly resolvable. Finally, once the subpackage is compiled, we can add it to our $flags$ so later modules we compile see its (appropriately thinned and renamed) modules, and like aliasing, we eagerly check it for consistency. This consistency check would catch the ambiguity in this example: \begin{verbatim} package p where A = ... package q where A = ... include p \end{verbatim} (Note that \pname{p} is a perfectly good package on its own; it is only its inclusion in \pname{q} which is problematic.) If we adopt the system where we merge signature files as we go along, this checking phase would also be a good time to merge any \texttt{hisig} files from the subpackage with any in the current $flags$ environment.
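To make the consistency check concrete, here is a small Haskell sketch of the module-level binding map that the (hypothetical) \texttt{ghc --check} step, and the \texttt{-package}/\texttt{-package-dir}/\texttt{-alias} processing described above, would maintain. It is illustrative only: the types are not GHC's, and it glosses over the fact that two signature interfaces forwarding to the same implementation would have their contents merged rather than merely deduplicated.
\begin{verbatim}
-- Hypothetical sketch of module-level conflict detection; not GHC code.
module ModuleBindingCheck where

import           Control.Monad (foldM)
import           Data.Map (Map)
import qualified Data.Map as Map

type ModName  = String
type ModIdent = String            -- original (physical) module identity
type Bindings = Map ModName ModIdent

-- Bind one visible module name, failing only if a *different* physical
-- identity is already bound at that name.  Identical identities merge,
-- which is what lets the same instantiated package be brought into
-- scope twice without a conflict.
addBinding :: ModName -> ModIdent -> Bindings -> Either String Bindings
addBinding name ident env =
  case Map.lookup name env of
    Nothing                 -> Right (Map.insert name ident env)
    Just old | old == ident -> Right env
             | otherwise    -> Left ("conflicting bindings for " ++ name ++
                                     ": " ++ old ++ " vs " ++ ident)

-- Process the (already thinned and renamed) exports of one -package,
-- -package-dir or -alias flag.
addFlag :: [(ModName, ModIdent)] -> Bindings -> Either String Bindings
addFlag exports env = foldM (\e (n, i) -> addBinding n i e) env exports

-- Example: renaming Data.Set from two versions of containers to two
-- different names is fine; rebinding the same identity is fine; binding
-- a distinct identity at an existing name would be reported as a conflict.
example :: Either String Bindings
example = addFlag [("Set09", "containers-0.9:Data.Set")] Map.empty
      >>= addFlag [("Set08", "containers-0.8:Data.Set")]
      >>= addFlag [("Set09", "containers-0.9:Data.Set")]
\end{verbatim}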
\paragraph{Absence of an \texttt{hi} file} It is important that \texttt{--resolve-module} truly looks up the \emph{implementor} of a module, and not just a signature which is providing it at some name. Sometimes, a little extra work is necessary to compute this, for example: \begin{verbatim} package p where A :: [ y :: Int ] package q where A :: [ x :: Int ] include p -- * package r where A = [ x = 0; y = 1 ] include q \end{verbatim} When computing $H'$ for the starred include, our $flags$ only include \texttt{-package-dir} \pname{r} $cwd_r$ $\langle\rangle$: with a thinning that excludes all modules! The only interface file we can pick up with these $flags$ is the local definition of \m{A}. However, we absolutely should \emph{not} set $H'(\m{A})=\pname{q}:\m{A}$; if we do so, then we will incorrectly conclude when compiling the signature in \pname{p} that the implementation doesn't export enough identifiers to fulfill the signature (\texttt{y} is not available from just the signature in \pname{q}). Instead, we have to look up the original implementor of \m{A} in \pname{r}, and use that in $H'$.
\subsection{Commentary}
\paragraph{Just because it compiled, doesn't mean the individual packages type check} The compilation mechanism described is slightly more permissive than vanilla Backpack. Here is a simple example: \begin{verbatim} package p where A :: [ data T = T ] B :: [ data T = T ] C = [ import A import B x = A.T :: B.T ] package q where A = [ data T = T ] B = A include p \end{verbatim} Here, we incorrectly decide \m{A}\verb|.T| and \m{B}\verb|.T| are type equal when typechecking \m{C}, because the \verb|hisig| files we generate for them all point to the same original implementation. However, \pname{p} should not typecheck. The problem here is that type checking asks ``does it compile with respect to all possible instantiations of the holes'', whereas compilation asks ``does it compile with respect to this particular instantiation of holes.'' It's a bit unavoidable, really.
\section{Shaped Backpack} Despite the simplicity of shapeless Backpack with the linking restriction in the absence of holes, we will find that when we have holes, it will be very difficult to do type-checking without some form of shaping. This section is very much a work in progress, but the ability to typecheck against holes, even with the linking restriction, is a very important part of modular separate development, so we will need to support it at some point.
\subsection{Efficient shaping} (These are Edward's opinions; he hasn't convinced other folks that this is the right way to do it.) In this section, I want to argue that, although shaping constitutes a pre-pass which must be run before compilation in earnest, it is only about as bad as the dependency resolution analysis that GHC already does in \verb|ghc -M| or \verb|ghc --make|. In Paper Backpack, what information does shaping compute? It looks at exports, imports, data declarations and value declarations (but not the actual expressions associated with these values.) As a matter of fact, GHC already must look at the imports associated with a package in order to determine the dependency graph, so that it can have some order to compile modules in. There is a specialized parser which just parses these statements, and then ignores the rest of the file. A bit of background: the \emph{renamer} is responsible for resolving imports and figuring out where all of these entities actually come from.
SPJ would really like to avoid having to run the renamer in order to perform a shaping pass.
\paragraph{Is it necessary to run the Renamer to do shaping?} Edward and Scott believe the answer is no, well, partially. Shaping needs to know the original names of all entities exposed by a module/signature. To determine this, it needs to know (a) which entities a module/signature defines/declares locally and (b) which entities that module/signature exports. The former, (a), can be determined by a straightforward inspection of a parse tree of the source file.\footnote{Note that no expression or type parsing is necessary. We only need names of local values, data types, and data constructors.} The latter, (b), is a bit trickier. Right now it's the Renamer that interprets imports and exports into original names, so we would still rely on that implementation. However, the Renamer does other, harder things that we don't need, so ideally we could factor out the import/export resolution from the Renamer for use in shaping. Unfortunately the Renamer's import resolution analyzes \verb|.hi| files, but for local modules, which haven't yet been typechecked, we don't have those. Instead, we could use a new file format, \verb|.hsi| files, to store the shape of a locally defined module. (Defined packages are bundled with their shapes, so included modules have \verb|.hsi| files as well.) (What about the logical vs.~physical distinction in file names?) If we refactor the import/export resolution code, could we rewrite it to generically operate on both \verb|.hi| files and \verb|.hsi| files? Alternatively, rather than storing shapes on a per-source basis, we could store (in memory) the entire package shape. Similarly, included packages could have a single shape file for the entire package. Although this approach would make shaping non-incremental, since an entire package's shape would be recomputed any time a constituent module's shape changes, we do not expect shaping to be all that expensive.
\subsection{Typechecking of indefinite modules}\label{sec:typechecking-indefinite} Recall our argument in the definite case, where we showed there are no holes in the physical context. With indefinite modules, this is no longer true. While (with the linking restriction) these holes will never be linked against a physical implementation, they may be linked against other signatures. (Note: while disallowing signature linking would solve our problem, it would disallow a wide array of useful instances of signature reuse, for example, a package mylib that implements both mylib-1x-sig and mylib-2x-sig.) With holes, we must handle module variables, and we sometimes must unify them: \begin{verbatim} package p where A :: [ data A ] package q where A :: [ data A ] package r where include p include q \end{verbatim} In this package, it is not possible to a priori assign original names to module A in p and q, because in package r, they should have the same original name. When signature linking occurs, unification may occur, which means we have to rename all relevant original names. (A similar situation occurs when a module is typechecked against a signature.) An invariant which would be nice to have is this: when typechecking a signature or including a package, we may apply renaming to the entities being brought into context. But once we've picked an original name for our entities, no further renaming should be necessary. (Formally, in the unification for semantic object shapes, apply the unifier to the second shape, but not the first one.)
However, there are plenty of counterexamples here: \begin{verbatim} package p where A :: [ data A ] B :: [ data A ] M = ... A = B \end{verbatim} In this package, does module M know that A.A and B.A are type equal? In fact, the shaping pass will have assigned equal module identities to A and B, so M \emph{equates these types}, despite the aliasing occurring after the fact. We can make this example more sophisticated, by having a later subpackage which causes the aliasing; now, the decision is not even a local one (on the other hand, the equality should be evident by inspection of the package interface associated with q): \begin{verbatim} package p where A :: [ data A ] B :: [ data A ] package q where A :: [ data A ] B = A package r where include p include q \end{verbatim} Another possibility is that it might be acceptable to do a mini-shaping pass, without parsing modules or signatures, \emph{simply} looking at names and aliases. But logical names are not the only mechanism by which unification may occur: \begin{verbatim} package p where C :: [ data A ] A = [ data A = A ] B :: [ import A(A) ] C = B \end{verbatim} It is easy to conclude that the original names of C and B are the same. But more importantly, C.A must be given the original name of p:A.A. This can only be discovered by looking at the signature definition for B. In any case, it is worth noting that this situation parallels the situation with hs-boot files (although there is no mutual recursion here). The conclusion is that you will probably, in fact, have to do real shaping in order to typecheck all of these examples. \paragraph{Hey, these signature imports are kind of tricky\ldots} When signatures and modules are interleaved, the interaction can be complex. Here is an example: \begin{verbatim} package p where C :: [ data A ] M = [ import C; ... ] A = [ import M; data A = A ] C :: [ import A(A) ] \end{verbatim} Here, the second signature for C refers to a module implementation A (this is permissible: it simply means that the original name for p:C.A is p:A.A). But wait! A relies on M, and M relies on C. Circularity? Fortunately not: a client of package p will find it impossible to have the hole C implemented in advance, since they will need to get their hands on module A\ldots but it will not be defined prior to package p. In any case, however, it would be good to emit a warning if a package cannot be compiled without mutual recursion. \subsection{Rename on entry} Consider the following example: \begin{verbatim} package p where A :: [ data T = T ] B = [ import A; x = T ] package q where C :: ... A = [ data T = T ] include p D = [ import qualified A import qualified B import C x = B.T :: A.T ] \end{verbatim} We are interested in type-checking \pname{q}, which is an indefinite package on account of the uninstantiated hole \m{C}. Furthermore, let's suppose that \pname{p} has already been independently typechecked, and its interface files installed in some global location with $\alpha_A$ used as the module identity of \m{A}. (To simplify this example, we'll assume $\beta_{AT}=\alpha_A$.) The first three lines of \pname{q} type check in the normal way, but \m{D} now poses a problem: if we load the interface file for \m{B} the normal way, we will get a reference to type \texttt{T} with the original name $\alpha_A$.\texttt{T}, whereas from \m{A} we have an original name \pname{q}:\m{A}.\texttt{T}. Let's suppose that we already have the result of a shaping pass, which maps our identity variables to their true identities. 
Let's consider the possible options here: \begin{itemize} \item We could re-typecheck \pname{p}, feeding it the correct instantiations for its variables. However, this seems wasteful: we typechecked the package already, and up-to-renaming, the interface files are exactly what we need to type check our application. \item We could make copies of all the interface files, renamed to have the right original names. This also seems wasteful: why should we have to create a new copy of every interface file in a library we depend on? \item When \emph{reading in} the interface file to GHC, we could apply the renaming according to the shaping pass and store that in memory. \end{itemize} That last solution is pretty appealing; however, there are still circumstances in which we need to create new interface files; these exactly mirror the cases described in Section~\ref{sec:compiling}.
\subsection{Incremental typechecking} We want to typecheck modules incrementally, i.e., when something changes in a package, we only want to re-typecheck the modules that care about that change. GHC already does this today.%
\footnote{\url{https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/RecompilationAvoidance}}
Is the same mechanism sufficient for Backpack? Edward and Scott think that it is, mostly. Our conjecture is that a module should be re-typechecked if the existing mechanism says it should \emph{or} if the logical shape context (which maps logical names to physical names) has changed. The latter condition is due to aliases that affect typechecking of modules. Let's look again at an example from before: \begin{verbatim} package p where A :: [ data A ] B :: [ data A ] M = [ import A; import B; ... ] \end{verbatim} Let's say that \verb|M| is typechecked successfully. Now we add an alias binding at the end of the package, \verb|A = B|. Does \verb|M| need to be re-typechecked? Yes! (Well, it seems so, but let's just assert ``yes'' for now. Certainly in the reverse case---if we remove the alias and then ask---this is true, since \verb|M| might have depended on the types \verb|A.A| and \verb|B.A| being the same.) The logical shape context changed to say that \verb|A| and \verb|B| now map to the same physical module identity. But does the existing recompilation avoidance mechanism say that \verb|M| should be re-typechecked? It's unclear. The \verb|.hi| file for \verb|M| records that it imported \verb|A| and \verb|B| with particular ABIs, but does it also know about the physical module identities (or rather, original module names) of these modules? Scott thinks this highlights the need for us to get our story straight about the connection between logical names, physical module identities, and file names!
\subsection{Installing indefinite packages}\label{sec:installing-indefinite} If an indefinite package contains no code at all, we only need to install the interface file for the signatures. However, if it includes code, we must provide all of the ingredients necessary to compile it when the holes are linked against actual implementations. (Figure~\ref{fig:pkgdb})
\paragraph{Source tarball or preprocessed source?} What representation of the source should be saved? There are a number of possible choices: \begin{itemize} \item The original tarballs downloaded from Hackage, \item Preprocessed source files, \item Some sort of internal, type-checked representation of Haskell code (maybe the output of the desugarer).
\end{itemize} Storing the tarballs is the simplest and most straightforward mechanism, but we will have to be very certain that we can recompile the module later in precisely the same way we compiled it originally, to ensure the hi files match up (fortunately, it should be simple to perform an optional sanity check before proceeding.) The appeal of saving preprocessed source, or even the IRs, is that conceptually this is exactly what an indefinite package is: we have paused the compilation process partway, intending to finish it later. However, our compilation strategy for definite packages requires us to run this step using a \emph{different} choice of original names, so it's unclear how much work could actually be reused.
\section{Surface syntax} In the Backpack paper, a brand new module language is presented, with syntax for inline modules and signatures. This syntax is probably worth implementing, because it makes it easy to specify compatibility packages, whose module definitions in general may be very short: \begin{verbatim} package ishake-0.12-shake-0.13 where include shake-0.13 Development.Shake.Sys = Development.Shake.Cmd Development.Shake = [ (**>) = (&>) ; (*>>) = (|*>)] Development.Shake.Rule = [ defaultPriority = rule . priority 0.5 ] include ishake-0.12 \end{verbatim} However, there are a few things that are less than ideal about the surface syntax proposed by Paper Backpack: \begin{itemize} \item It's completely different from the current method by which users specify packages. There's nothing wrong with this per se (one simply needs to support both formats) but the smaller the delta, the easier the new packaging format is to explain and implement. \item Sometimes order matters (relative ordering of signatures and module implementations), and other times it does not (aliases). This can be confusing for users. \item Users have to order module definitions topologically, whereas in current Cabal modules can be listed in any order, and GHC figures out an appropriate order to compile them. \end{itemize} Here is an alternative proposal, closely based on Cabal syntax. Given the following Backpack definition: \begin{verbatim} package libfoo(A, B, C, Foo) where include base -- renaming and thinning include libfoo (Foo, Bar as Baz) -- holes A :: [ a :: Bool ].hsig A2 :: [ b :: Bool ].hsig -- normal module B = [ import {-# SOURCE #-} A import Foo import Baz ... ].hs -- recursively linked pair of modules, one is private C :: [ data C ].hsig D = [ import {-# SOURCE #-} C; data D = D C ].hs C = [ import D; data C = C D ].hs -- alias A = A2 \end{verbatim} We can write the following Cabal-like syntax instead (where all of the signatures and modules are placed in appropriately named files): \begin{verbatim} package: libfoo ... build-depends: base, libfoo (Foo, Bar as Baz) holes: A A2 -- deferred for now exposed-modules: Foo B C aliases: A = A2 other-modules: D \end{verbatim} Notably, all of these lists are \emph{insensitive} to ordering! The key idea is the use of the \verb|{-# SOURCE #-}| pragma, which is enough to solve the important ordering constraint between signatures and modules. Here is how the elaboration works. For simplicity, in this algorithm description, we assume all packages being compiled have no holes (including \verb|build-depends| packages). Later, we'll discuss how to extend the algorithm to handle holes in both subpackages and the main package itself.
\begin{enumerate} \item At the top-level with \verb|package| $p$ and \verb|exposed-modules| $ms$, record \verb|package p (ms) where| \item For each package $p$ with thinning/renaming $ms$ in \verb|build-depends|, record an \verb|include p (ms)| in the Backpack package. The ordering of these includes does not matter, since none of these packages have holes. \item Take all modules $m$ in \verb|other-modules| and \verb|exposed-modules| which were not exported by build dependencies, and create a directed graph where hs and hs-boot files are nodes and imports are edges (the target of an edge is an hs file if it is a normal import, and an hs-boot file if it is a SOURCE import). Topologically sort this graph, erroring if this graph contains cycles (even with recursive modules, the cycle should have been broken by an hs-boot file). For each node, in this order, record \verb|M = [ ... ]| or \verb|M :: [ ... ]| depending on whether or not it is an hs or hs-boot. If possible, sort signatures before implementations when there is no constraint otherwise. \end{enumerate} Here is a simple example which shows how SOURCE can be used to disambiguate between two important cases. Suppose we have these modules: \begin{verbatim} -- A1.hs import {-# SOURCE #-} B -- A2.hs import B -- B.hs x = True -- B.hs-boot x :: Bool \end{verbatim} Then we translate the following packages as follows: \begin{verbatim} exposed-modules: A1 B -- translates to B :: [ x :: Bool ] A1 = [ import B ] B = [ x = True ] \end{verbatim} but \begin{verbatim} exposed-modules: A2 B -- translates to B = [ x = True ] B :: [ x :: Bool ] A2 = [ import B ] \end{verbatim} The import controls placement between signature and module, and in A1 it forces B's signature to be sorted before B's implementation (whereas in the second case, there is no constraint, so we preferentially place B's implementation first).
\paragraph{Holes in the database} In the presence of holes, \verb|build-depends| resolution becomes more complicated. First, let's consider the case where the package we are building is definite, but the package database contains indefinite packages with holes. In order to maintain the linking restriction, we now have to order packages from step (2) of the previous elaboration. We can do this by creating a directed graph, where nodes are packages and edges are from holes to the package which implements them. If there is a cycle, this indicates a mutually recursive package. In the absence of cycles, a topological sorting of this graph preserves the linking invariant. One subtlety to consider is the fact that an entry in \verb|build-depends| can affect how a hole is instantiated by another entry. This might be a bit weird to users, who might like to explicitly say how holes are filled when instantiating a package. Food for thought, surface-syntax-wise.
\paragraph{Holes in the package} Actually, this is quite simple: the ordering of includes goes as before, but some indefinite packages in the database are less constrained as their ``dependencies'' are fulfilled by the holes at the top-level of this package. It's also worth noting that some dependencies will go unresolved, since the following package is valid: \begin{verbatim} package a where A :: ...
package b where include a \end{verbatim} \paragraph{Multiple signatures} In Backpack syntax, it's possible to define a signature multiple times, which is necessary for mutually recursive signatures: \begin{verbatim} package a where A :: [ data A ] B :: [ import A; data B = B A ] A :: [ import B; data A = A B ] \end{verbatim} Critically, notice that we can see the constructors for both module B and A after the signatures are linked together. This is not possible in GHC today, but could be possible by permitting multiple hs-boot files. Now the SOURCE pragma indicating an import must \emph{disambiguate} which hs-boot file it intends to include. This might be one way of doing it: \begin{verbatim} -- A.hs-boot2 data A -- B.hs-boot import {-# SOURCE hs-boot2 #-} A -- A.hs-boot import {-# SOURCE hs-boot #-} B \end{verbatim} \paragraph{Explicit or implicit reexports} One annoying property of this proposal is that, looking at the \verb|exposed-modules| list, it is not immediately clear what source files one would expect to find in the current package. It's not obvious what the proper way to go about doing this is. \paragraph{Better syntax for SOURCE} If we enshrine the SOURCE import as a way of solving Backpack ordering problems, it would be nice to have some better syntax for it. One possibility is: \begin{verbatim} abstract import Data.Foo \end{verbatim} which makes it clear that this module is pluggable, typechecking against a signature. Note that this only indicates how type checking should be done: when actually compiling the module we will compile against the interface file for the true implementation of the module. It's worth noting that the SOURCE annotation was originally made a pragma because, in principle, it should have been possible to compile some recursive modules without needing the hs-boot file at all. But if we're moving towards boot files as signatures, this concern is less relevant. \section{Type classes and type families} \subsection{Background} Before we talk about how to support type classes in Backpack, it's first worth talking about what we are trying to achieve in the design. Most would agree that \emph{type safety} is the cardinal law that should be preserved (in the sense that segfaults should not be possible), but there are many instances of ``bad behavior'' (top level mutable state, weakening of abstraction guarantees, ambiguous instance resolution, etc) which various Haskellers may disagree on the necessity of ruling out. With this in mind, it is worth summarizing what kind of guarantees are presently given by GHC with regards to type classes and type families, as well as characterizing the \emph{cultural} expectations of the Haskell community. \paragraph{Type classes} When discussing type class systems, there are several properties that one may talk about. A set of instances is \emph{confluent} if, no matter what order constraint solving is performed, GHC will terminate with a canonical set of constraints that must be satisfied for any given use of a type class. In other words, confluence says that we won't conclude that a program doesn't type check just because we swapped in a different constraint solving algorithm. Confluence's closely related twin is \emph{coherence} (defined in ``Type classes: exploring the design space''). This property states that ``every different valid typing derivation of a program leads to a resulting program that has the same dynamic semantics.'' Why could differing typing derivations result in different dynamic semantics? 
The answer is that context reduction, which picks out type class instances, elaborates into concrete choices of dictionaries in the generated code. Confluence is a prerequisite for coherence, since one can hardly talk about the dynamic semantics of a program that doesn't type check. In the vernacular, confluence and coherence are often incorrectly used to refer to another related property: \emph{global uniqueness of instances}, which states that in a fully compiled program, for any type, there is at most one instance resolution for a given type class. Languages with local type class instances such as Scala generally do not have this property, and this assumption is frequently used for abstraction. So, what properties does GHC enforce, in practice? In the absence of any type system extensions, GHC employs a set of rules (described in ``Exploring the design space'') to ensure that type class resolution is confluent and coherent. Intuitively, it achieves this by having a very simple constraint solving algorithm (generate wanted constraints and solve wanted constraints) and then requiring the set of instances to be \emph{nonoverlapping}, ensuring there is only ever one way to solve a wanted constraint. Non-overlap is a more stringent restriction than either confluence or coherence, and via the \verb|OverlappingInstances| and \verb|IncoherentInstances| extensions, GHC allows a user to relax this restriction ``if they know what they're doing.'' Surprisingly, however, GHC does \emph{not} enforce global uniqueness of instances. Imported instances are not checked for overlap until we attempt to use them for instance resolution. Consider the following program: \begin{verbatim} -- T.hs data T = T -- A.hs import T instance Eq T where -- B.hs import T instance Eq T where -- C.hs import A import B \end{verbatim} When compiled with one-shot compilation, \verb|C| will not report overlapping instances unless we actually attempt to use the \verb|Eq| instance in \verb|C|.\footnote{When using batch compilation, GHC reuses the instance database and is actually able to detect the duplicated instance when compiling B. But if you run it again, recompilation avoidance skips A, and it finishes compiling! See this bug: \url{https://ghc.haskell.org/trac/ghc/ticket/5316}} This is by design\footnote{\url{https://ghc.haskell.org/trac/ghc/ticket/2356}}: ensuring that there are no overlapping instances requires eagerly reading all the interface files a module may depend on. We might summarize these three properties in the following manner. Culturally, the Haskell community expects \emph{global uniqueness of instances} to hold: the implicit global database of instances should be confluent and coherent. GHC, however, does not enforce uniqueness of instances: instead, it merely guarantees that the \emph{subset} of the instance database it uses when it compiles any given module is confluent and coherent. GHC does do some tests when an instance is declared to see if it would result in overlap with visible instances, but the check is by no means perfect\footnote{\url{https://ghc.haskell.org/trac/ghc/ticket/9288}}; truly, \emph{type-class constraint resolution} has the final word.
One mitigating factor is that in the absence of \emph{orphan instances}, GHC is guaranteed to eagerly notice when the instance database has overlap.\footnote{Assuming that the instance declaration checks actually worked\ldots} Clearly, the fact that GHC's lazy behavior is surprising to most Haskellers means that the lazy check is mostly good enough: a user is likely to discover overlapping instances one way or another. However, it is relatively simple to construct example programs which violate global uniqueness of instances in an observable way: \begin{verbatim} -- A.hs module A where data U = X | Y deriving (Eq, Show) -- B.hs module B where import Data.Set import A instance Ord U where compare X X = EQ compare X Y = LT compare Y X = GT compare Y Y = EQ ins :: U -> Set U -> Set U ins = insert -- C.hs module C where import Data.Set import A instance Ord U where compare X X = EQ compare X Y = GT compare Y X = LT compare Y Y = EQ ins' :: U -> Set U -> Set U ins' = insert -- D.hs module Main where import Data.Set import A import B import C test :: Set U test = ins' X $ ins X $ ins Y $ empty main :: IO () main = print test -- OUTPUT $ ghc -Wall -XSafe -fforce-recomp --make D.hs [1 of 4] Compiling A ( A.hs, A.o ) [2 of 4] Compiling B ( B.hs, B.o ) B.hs:5:10: Warning: Orphan instance: instance [safe] Ord U [3 of 4] Compiling C ( C.hs, C.o ) C.hs:5:10: Warning: Orphan instance: instance [safe] Ord U [4 of 4] Compiling Main ( D.hs, D.o ) Linking D ... $ ./D fromList [X,Y,X] \end{verbatim} Locally, all type class resolution was coherent: in the subset of instances each module had visible, type class resolution could be done unambiguously. Furthermore, the types of \verb|ins| and \verb|ins'| discharge type class resolution, so that in \verb|D|, where the database is now overlapping, no resolution occurs and the error is never found. It is easy to dismiss this example as an implementation wart in GHC, and continue pretending that global uniqueness of instances holds. However, the problem with \emph{global uniqueness of instances} is that it is inherently nonmodular: you might find yourself unable to compose two components because they accidentally defined the same type class instance, even though these instances are plumbed deep in the implementation details of the components. As it turns out, there is already another feature in Haskell which must enforce global uniqueness, to prevent segfaults. We now turn to type classes' close cousin: type families.
\paragraph{Type families} With type families, confluence is the primary property of interest. (Coherence is not of much interest because type families are elaborated into coercions, which don't have any computational content.) Rather than considering what set of constraints we reduce to, confluence for type families considers the reduction of type families. The overlap checks for type families can be quite sophisticated, especially in the case of closed type families. Unlike type classes, however, GHC \emph{does} check the non-overlap of type families eagerly. The analogous program does \emph{not} type check: \begin{verbatim} -- F.hs type family F a :: * -- A.hs import F type instance F Bool = Int -- B.hs import F type instance F Bool = Bool -- C.hs import A import B \end{verbatim} The reason is that it is \emph{unsound} to ever allow any overlap (unlike in the case of type classes where it just leads to incoherence.)
Thus, whereas one might imagine dropping the global uniqueness of instances invariant for type classes, it is absolutely necessary to perform global enforcement here. There's no way around it!
\subsection{Local type classes} Here, we say \textbf{NO} to global uniqueness. This design is perhaps best discussed in relation to modular type classes, which share many similar properties. Instances are now treated as first class objects (in MTCs, they are simply modules)---we may explicitly hide or include instances for type class resolution (in MTCs, this is done via the \verb|using| top-level declaration). This is essentially what was sketched in Section 5 of the original Backpack paper. As a simple example: \begin{verbatim} package p where A :: [ data T = T ] B = [ import A; instance Eq T where ... ] package q where A = [ data T = T; instance Eq T where ... ] include p \end{verbatim} Here, \verb|B| does not see the extra instance declared by \verb|A|, because it was thinned from its signature of \verb|A| (and thus never declared canonical.) To declare an instance without making it canonical, it must be placed in a separate (unimported) module. Like modular type classes, Backpack does not give rise to incoherence, because instance visibility can only be changed at the top-level module language, where it is already considered best practice to provide explicit signatures. Here is the example used in the Modular Type Classes paper to demonstrate the problem: \begin{verbatim} structure A = using EqInt1 in struct ...fun f x = eq(x,x)... end structure B = using EqInt2 in struct ...val y = A.f(3)... end \end{verbatim} Is the type of f \verb|int -> bool|, or does it have a type-class constraint? Because type checking proceeds over the entire program, ML could hypothetically pick either. However, ported to Haskell, the example looks like this: \begin{verbatim} EqInt1 :: [ instance Eq Int ] EqInt2 :: [ instance Eq Int ] A = [ import EqInt1 f x = x == x ] B = [ import EqInt2 import A hiding (instance Eq Int) y = f 3 ] \end{verbatim} There may be ambiguity, yes, but it can be easily resolved by the addition of a top-level type signature to \verb|f|, which is considered best-practice anyway. Additionally, Haskell users are trained to expect a particular inference for \verb|f| in any case (the polymorphic one). Here is another example which might be considered surprising: \begin{verbatim} package p where A :: [ data T = T ] B :: [ data T = T ] C = [ import qualified A import qualified B instance Show A.T where show T = "A" instance Show B.T where show T = "B" x :: String x = show A.T ++ show B.T ] \end{verbatim} In the original Backpack paper, it was implied that module \verb|C| should not type check if \verb|A.T = B.T| (failing at link time). However, if we set aside, for a moment, the issue that anyone who imports \verb|C| in such a context will now have overlapping instances, there is no reason in principle why the module itself should be problematic. Here is the example in MTCs, which I have it on good word from Derek does type check.
\begin{verbatim} signature SIG = sig type t val mk : t end signature SHOW = sig type t val show : t -> string end functor Example (A : SIG) (B : SIG) = let structure ShowA : SHOW = struct type t = A.t fun show _ = "A" end in let structure ShowB : SHOW = struct type t = B.t fun show _ = "B" end in using ShowA, ShowB in struct val x = show A.mk ++ show B.mk end : sig val x : string end \end{verbatim} The moral of the story is, even though in a later context the instances are overlapping, inside the functor, the type-class resolution is unambiguous and should be done (so \verb|x = "AB"|). Up until this point, we've argued why MTCs and this Backpack design are similar. However, there is an important sociological difference between modular type-classes and this proposed scheme for Backpack. In the presentation ``Why Applicative Functors Matter'', Derek mentions the canonical example of defining a set: \begin{verbatim} signature ORD = sig type t; val cmp : t -> t -> bool end signature SET = sig type t; type elem; val empty : t; val insert : elem -> t -> t ... end functor MkSet (X : ORD) :> SET where type elem = X.t = struct ... end \end{verbatim} This is actually very different from how sets tend to be defined in Haskell today. If we directly encoded this in Backpack, it would look like this: \begin{verbatim} package mk-set where X :: [ data T cmp :: T -> T -> Bool ] Set :: [ data Set empty :: Set insert :: T -> Set -> Set ] Set = [ import X ... ] \end{verbatim} It's also informative to consider how MTCs would encode a set as it is written today in Haskell: \begin{verbatim} signature ORD = sig type t; val cmp : t -> t -> bool end signature SET = sig type 'a t; val empty : 'a t; val insert : (X : ORD) => X.t -> X.t t -> X.t t end structure MkSet :> SET = struct ... end \end{verbatim} Here, it is easy to see that while functor instantiation occurs for the implementation, it does not occur for types. This is a big limitation with the Haskell approach, and it explains why Haskellers, in practice, find global uniqueness of instances so desirable. Implementation-wise, this requires some subtle modifications to how we do type class resolution. Type checking of indefinite modules works as before, but when we go to actually compile them against explicit implementations, we need to ``forget'' that two types are equal when doing instance resolution. This could probably be implemented by associating type class instances with the original name that was utilized when typechecking, so that we can resolve ambiguous matches against types which have the same original name now that we are compiling. As we've mentioned previously, this strategy is unsound for type families.
\subsection{Globally unique} Here, we say \textbf{YES} to global uniqueness. When we require the global uniqueness of instances (either because that's the type class design we chose, or because we're considering the problem of type families), we will need to reject declarations like the one cited above when \verb|A.T = B.T|: \begin{verbatim} A :: [ data T ] B :: [ data T ] C :: [ import qualified A import qualified B instance Show A.T where show T = "A" instance Show B.T where show T = "B" ] \end{verbatim} The paper mentions that a link-time check is sufficient to prevent this case from arising. While in the previous section we argued why this is actually unnecessary when local instances are allowed, the link-time check is a good match in the case of global instances, because any instance \emph{must} be declared in the signature.
The scheme proceeds as follows: when some instances are typechecked initially, we type check them as if all of the variable module identities were distinct. Then, when we perform linking (when we \verb|include| a package or unify some module identities), we check again to see if we've discovered some instance overlap. This linking check is akin to the eager check that is performed today for type families; it would need to be implemented for type classes as well. However, there is a twist: we are \emph{redoing} the overlap check now that some identities have been unified. As an implementation trick, one could defer the check until \verb|C| is compiled, keeping in line with GHC's lazy ``don't check for overlap until the use site.'' (Once again, unsound for type families.)
\paragraph{What about module inequalities?} An older proposal was for signatures to contain ``module inequalities'', i.e., assertions that two modules are not equal. (Technically: we need to be able to apply this assertion to $\beta$ module variables, since \verb|A != B| while \verb|A.T = B.T|). Currently, Edward thinks that module inequalities are of only marginal utility with local instances (i.e., not enough to justify the implementation cost) and not useful at all in the world of global instances! With local instances, module inequalities could be useful to statically rule out examples like \verb|show A.T ++ show B.T|. Because such uses are not necessarily reflected in the signature, it would be a violation of separate module development to try to divine the constraint from the implementation itself. I claim this is of limited utility, however, because, as we mentioned earlier, we can compile these ``incoherent'' modules perfectly coherently. With global instances, all instances must be in the signature, so while it might be aesthetically displeasing to have the signature impose extra restrictions on linking identities, we can carry this out without violating the linking restriction.
\section{Bits and bobs}
\subsection{Abstract type synonyms} In Paper Backpack, abstract type synonyms are not permitted, because GHC doesn't understand how to deal with them. The purpose of this section is to describe one particular nastiness of abstract type synonyms, by way of the occurs check: \begin{verbatim} A :: [ type T ] B :: [ import qualified A; type T = [A.T] ] \end{verbatim} At this point, it is illegal for \verb|A = B|, otherwise this type synonym would fail the occurs check. This seems like pretty bad news, since every instance of the occurs check in the type-checker could constitute a module inequality.
\section{Open questions}\label{sec:open-questions} Here are open problems about the implementation which still require hashing out. \begin{itemize} \item In Section~\ref{sec:simplifying-backpack}, we argued that we could implement Backpack without needing a shaping pass. We're pretty certain that this will work for typechecking and compiling fully definite packages with no recursive linking, but in Section~\ref{sec:typechecking-indefinite}, we described some of the prevailing difficulties of supporting signature linking. Renaming is not an insurmountable problem, but backwards flow of shaping information can be, and it is unclear how best to accommodate this. This is probably the most important problem to overcome. \item In Section~\ref{sec:installing-indefinite}, a few choices for how to store source code were pitched; however, there is no consensus on which one is best.
\item What is the impact of the multiplicity of PackageIds on dependency solving in Cabal? Old questions of what to prefer when multiple package-versions are available resurface (Cabal originally only needed to solve this between different versions of the same package, preferring the oldest version), but with signatures, there are more choices. Should there be a complex solver that does all signature solving, or a preprocessing step that puts things back into a form the original Cabal solver can handle? Authors may want to suggest a policy for what packages should actually link against signatures (so a crypto library doesn't accidentally link against a null cipher package). \end{itemize} \end{document}