diff options
Diffstat (limited to 'docs/comm/the-beast/data-types.html')
-rw-r--r-- | docs/comm/the-beast/data-types.html | 242 |
1 files changed, 242 insertions, 0 deletions
diff --git a/docs/comm/the-beast/data-types.html b/docs/comm/the-beast/data-types.html new file mode 100644 index 0000000000..fef4852d4d --- /dev/null +++ b/docs/comm/the-beast/data-types.html @@ -0,0 +1,242 @@ +<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> +<html> + <head> + <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"> + <title>The GHC Commentary - Data types and data constructors</title> + </head> + + <body BGCOLOR="FFFFFF"> + <h1>The GHC Commentary - Data types and data constructors</h1> + <p> + +This chapter was thoroughly changed Feb 2003. + +<h2>Data types</h2> + +Consider the following data type declaration: + +<pre> + data T a = MkT !(a,a) !(T a) | Nil + + f x = case x of + MkT p q -> MkT p (q+1) + Nil -> Nil +</pre> +The user's source program mentions only the constructors <tt>MkT</tt> +and <tt>Nil</tt>. However, these constructors actually <em>do</em> something +in addition to building a data value. For a start, <tt>MkT</tt> evaluates +its arguments. Secondly, with the flag <tt>-funbox-strict-fields</tt> GHC +will flatten (or unbox) the strict fields. So we may imagine that there's the +<em>source</em> constructor <tt>MkT</tt> and the <em>representation</em> constructor +<tt>MkT</tt>, and things start to get pretty confusing. +<p> +GHC now generates three unique <tt>Name</tt>s for each data constructor: +<pre> + ---- OccName ------ + String Name space Used for + --------------------------------------------------------------------------- + The "source data con" MkT DataName The DataCon itself + The "worker data con" MkT VarName Its worker Id + aka "representation data con" + The "wrapper data con" $WMkT VarName Its wrapper Id (optional) +</pre> +Recall that each occurrence name (OccName) is a pair of a string and a +name space (see <a href="names.html">The truth about names</a>), and +two OccNames are considered the same only if both components match. +That is what distinguishes the name of the name of the DataCon from +the name of its worker Id. To keep things unambiguous, in what +follows we'll write "MkT{d}" for the source data con, and "MkT{v}" for +the worker Id. (Indeed, when you dump stuff with "-ddumpXXX", if you +also add "-dppr-debug" you'll get stuff like "Foo {- d rMv -}". The +"d" part is the name space; the "rMv" is the unique key.) +<p> +Each of these three names gets a distinct unique key in GHC's name cache. + +<h2>The life cycle of a data type</h2> + +Suppose the Haskell source looks like this: +<pre> + data T a = MkT !(a,a) !Int | Nil + + f x = case x of + Nil -> Nil + MkT p q -> MkT p (q+1) +</pre> +When the parser reads it in, it decides which name space each lexeme comes +from, thus: +<pre> + data T a = MkT{d} !(a,a) !Int | Nil{d} + + f x = case x of + Nil{d} -> Nil{d} + MkT{d} p q -> MkT{d} p (q+1) +</pre> +Notice that in the Haskell source <em>all data contructors are named via the "source data con" MkT{d}</em>, +whether in pattern matching or in expressions. +<p> +In the translated source produced by the type checker (-ddump-tc), the program looks like this: +<pre> + f x = case x of + Nil{d} -> Nil{v} + MkT{d} p q -> $WMkT p (q+1) + +</pre> +Notice that the type checker replaces the occurrence of MkT by the <em>wrapper</em>, but +the occurrence of Nil by the <em>worker</em>. Reason: Nil doesn't have a wrapper because there is +nothing to do in the wrapper (this is the vastly common case). +<p> +Though they are not printed out by "-ddump-tc", behind the scenes, there are +also the following: the data type declaration and the wrapper function for MkT. +<pre> + data T a = MkT{d} a a Int# | Nil{d} + + $WMkT :: (a,a) -> T a -> T a + $WMkT p t = case p of + (a,b) -> seq t (MkT{v} a b t) +</pre> +Here, the <em>wrapper</em> <tt>$WMkT</tt> evaluates and takes apart the argument <tt>p</tt>, +evaluates the argument <tt>t</tt>, and builds a three-field data value +with the <em>worker</em> constructor <tt>MkT{v}</tt>. (There are more notes below +about the unboxing of strict fields.) The worker $WMkT is called an <em>implicit binding</em>, +because it's introduced implicitly by the data type declaration (record selectors +are also implicit bindings, for example). Implicit bindings are injected into the code +just before emitting code or External Core. +<p> +After desugaring into Core (-ddump-ds), the definition of <tt>f</tt> looks like this: +<pre> + f x = case x of + Nil{d} -> Nil{v} + MkT{d} a b r -> let { p = (a,b); q = I# r } in + $WMkT p (q+1) +</pre> +Notice the way that pattern matching has been desugared to take account of the fact +that the "real" data constructor MkT has three fields. +<p> +By the time the simplifier has had a go at it, <tt>f</tt> will be transformed to: +<pre> + f x = case x of + Nil{d} -> Nil{v} + MkT{d} a b r -> MkT{v} a b (r +# 1#) +</pre> +Which is highly cool. + + +<h2> The constructor wrapper functions </h2> + +The wrapper functions are automatically generated by GHC, and are +really emitted into the result code (albeit only after CorePre; see +<tt>CorePrep.mkImplicitBinds</tt>). +The wrapper functions are inlined very +vigorously, so you will not see many occurrences of the wrapper +functions in an optimised program, but you may see some. For example, +if your Haskell source has +<pre> + map MkT xs +</pre> +then <tt>$WMkT</tt> will not be inlined (because it is not applied to anything). +That is why we generate real top-level bindings for the wrapper functions, +and generate code for them. + + +<h2> The constructor worker functions </h2> + +Saturated applications of the constructor worker function MkT{v} are +treated specially by the code generator; they really do allocation. +However, we do want a single, shared, top-level definition for +top-level nullary constructors (like True and False). Furthermore, +what if the code generator encounters a non-saturated application of a +worker? E.g. <tt>(map Just xs)</tt>. We could declare that to be an +error (CorePrep should saturate them). But instead we currently +generate a top-level defintion for each constructor worker, whether +nullary or not. It takes the form: +<pre> + MkT{v} = \ p q r -> MkT{v} p q r +</pre> +This is a real hack. The occurrence on the RHS is saturated, so the code generator (both the +one that generates abstract C and the byte-code generator) treats it as a special case and +allocates a MkT; it does not make a recursive call! So now there's a top-level curried +version of the worker which is available to anyone who wants it. +<p> +This strange defintion is not emitted into External Core. Indeed, you might argue that +we should instead pass the list of <tt>TyCon</tt>s to the code generator and have it +generate magic bindings directly. As it stands, it's a real hack: see the code in +CorePrep.mkImplicitBinds. + + +<h2> External Core </h2> + +When emitting External Core, we should see this for our running example: + +<pre> + data T a = MkT a a Int# | Nil{d} + + $WMkT :: (a,a) -> T a -> T a + $WMkT p t = case p of + (a,b) -> seq t (MkT a b t) + + f x = case x of + Nil -> Nil + MkT a b r -> MkT a b (r +# 1#) +</pre> +Notice that it makes perfect sense as a program all by itself. Constructors +look like constructors (albeit not identical to the original Haskell ones). +<p> +When reading in External Core, the parser is careful to read it back in just +as it was before it was spat out, namely: +<pre> + data T a = MkT{d} a a Int# | Nil{d} + + $WMkT :: (a,a) -> T a -> T a + $WMkT p t = case p of + (a,b) -> seq t (MkT{v} a b t) + + f x = case x of + Nil{d} -> Nil{v} + MkT{d} a b r -> MkT{v} a b (r +# 1#) +</pre> + + +<h2> Unboxing strict fields </h2> + +If GHC unboxes strict fields (as in the first argument of <tt>MkT</tt> above), +it also transforms +source-language case expressions. Suppose you write this in your Haskell source: +<pre> + case e of + MkT p t -> ..p..t.. +</pre> +GHC will desugar this to the following Core code: +<pre> + case e of + MkT a b t -> let p = (a,b) in ..p..t.. +</pre> +The local let-binding reboxes the pair because it may be mentioned in +the case alternative. This may well be a bad idea, which is why +<tt>-funbox-strict-fields</tt> is an experimental feature. +<p> +It's essential that when importing a type <tt>T</tt> defined in some +external module <tt>M</tt>, GHC knows what representation was used for +that type, and that in turn depends on whether module <tt>M</tt> was +compiled with <tt>-funbox-strict-fields</tt>. So when writing an +interface file, GHC therefore records with each data type whether its +strict fields (if any) should be unboxed. + +<h2> Labels and info tables </h2> + +<em>Quick rough notes: SLPJ March 2003</em>. +<p> +Every data constructor <tt>C</tt>has two info tables: +<ul> +<li> The static info table (label <tt>C_static_info</tt>), used for statically-allocated constructors. + +<li> The dynamic info table (label <tt>C_con_info</tt>), used for dynamically-allocated constructors. +</ul> +Statically-allocated constructors are not moved by the garbage collector, and therefore have a different closure +type from dynamically-allocated constructors; hence they need +a distinct info table. +Both info tables share the same entry code, but since the entry code is phyiscally juxtaposed with the +info table, it must be duplicated (<tt>C_static_entry</tt> and <tt>C_con_entry</tt> respectively). + + </body> +</html> + |