notes for profiling work

author: Simon Marlow <marlowsd@gmail.com> 2011-09-16 13:38:44 +0100
committer: Simon Marlow <marlowsd@gmail.com> 2011-09-16 13:38:44 +0100
commit: 1127c0bfb4e0070bec5b3cab9d7ebd8650d0bc68 (patch)
tree: 63a14d73def2838a001cd14db5fb108adc9ce574
parent: 418140bd3d18331de697b8d27e2cf2fe762e5696 (diff)
download: haskell-1127c0bfb4e0070bec5b3cab9d7ebd8650d0bc68.tar.gz
1 files changed, 420 insertions, 0 deletions
diff --git a/PROF-NOTES b/PROF-NOTES
new file mode 100644
index 0000000000..b312b58db9
--- /dev/null
+++ b/PROF-NOTES
@@ -0,0 +1,420 @@
+TOPICS to think about:
+
+  - I think we fixed #1531 - close it (test is result001)
+
+  - First: make sure HPC still works (and that the code isn't a lot
+    worse)
+
+  - Get the entry counts right for pattern-bound variables (need to
+    generate ticks in mkSelectorBinds)
+
+  - Restore the CCS in the update code
+
+  - Check that PAPs work right
+
+  - Check performance of profiled code
+     - with/without -prof-auto-all,
+     - with/without -fno-count-entries
+     - against GHC 7.2.1
+
+  - Find out why we can't seem to get a timer resolution better than 10ms
+
+  - Make sure that we can inline values inside _scc_.  Right now in
+    OccurAnal we annotate everything occurring inside an _scc_ with
+    NoOccInfo.
+
+  - We are inlining and floating non-HNFs, which can change the shape
+    of the profile.  e.g. inlining a CAF at its single usage site
+    will change the stack from <CAF,c> to <MAIN,...,c>.  Example:
+    scc002, but put manual SCCs on yan and the lambda expression, because
+    -prof-auto-all adds an SCC to main which prevents the inlining.
+    TODO: add a test/ticket, or fix.
+
+  - how to represent recursive and mutually recursive calls.  Truncate
+    the stack on a recursive call, as now, or use the "..." notation
+    from Allwood's paper?
+
+  - getting useful costs for CAFs.  Also, keeping track of where a CAF
+    was entered from, so that we can get accurate information if a CAF
+    raises an exception. (foo = error "bar").
+
+  - profiling flags:
+    - another flag to annotate all *bindings*, not just all *functions*?
+    - we added -fno-count-entries (document it)
+    - we added -prof-auto-all, -prof-auto-top, -prof-auto-exported, -prof-no-auto
+      (document them)
+    - deprecate the old flags (fix documentation)
+
+  - Tests:
+    - test scc002 with various combinations of -prof-* flags and
+      -fprof-no-count-entries
+
+  - profiling with multiple Capabilities
+
+  - do we need to keep the XML output format?
+
+  - look into code size of profiling
+
+  - reinstate PushCC so we can compile  let x = scc "foo" (C a b)
+    properly (attribute the constructor to foo:CCCS; don't allocate
+    a thunk).
+
+  - bugs to fix: #4414 (entry counts on pat binds etc.)
+
+-----------------------------------------------------------------------------
+-- Differences between evaluation and lexical scoping
+
+      f = g . h
+
+    f will not appear in the call stack where it might be expected
+    (see profile output for clausify for some examples of this)
+
+Transformations:
+
+  (scc s (C x1..xn))  => C x1 .. xn     (both)
+
+Constant changes in cost attribution, no change in stack shape:
+
+  ((scc s e) x)   =>  scc s (e x)   (lexical scoping only)
+  (scc s \x . e)  =>  \x . scc s e  (lexical scoping only)
+
+  (scc s \x . e)  =>  \x . e        (eval scoping only)
+
+  scc c x         => x              (eval scoping, and lex scoping iff x is not
+                                     top-level)
+
+  scc c (let x = e in e')
+     => let x = scc c e in scc c e' (both)
+
+  inlining a lambda                 (eval scoping only, lex scoping can inline
+                                     top-level lambdas)
+
+  floating a lambda                 (eval scoping only)
+
+  floating non-lambda??
+
+  inlining non-lambda               (neither)
+
+-----------------------------------------------------------------------------
+
+IMPORTANT PRINCIPLE: The call stack/graph that we see should be
+independent of the evaluation order.  Adding strictness does not
+change the call stack/graph.
+   Why?  Because the compiler changes strictness and evaluation order
+   all the time, we don't want these transformations to affect the
+   call stacks, which are a property of the source code.
+
+This is reasonably straightforward to implement: in a thunk we save
+the current stack, and restore it when the thunk is evaluated.
+
+Dealing with function calls is where all the action is.
+
+
+-----------------------------------------------------------------------------
+BASIC IDEA
+
+  CALL ccs_app ccs_lam = ccs_app
+
+ie. we don't change the cost centre at all when calling a function.
+However, the function is expected to push a CC (or several CCs) on the
+stack.  A preprocessing phase copies sccs inside each lambda:
+
+  \x . e  --->  \x . scc "c1" ... scc "cn" . e
+
+where "c1" .. "cn" are the stack of sccs lexically enclosing the
+lambda IN THE ORIGINAL SOURCE CODE.
+
+We could add them on every lambda, but when there are chains of
+lambdas together it's not really necessary:
+
+  \x . scc "f" \y . scc "f" . e
+
+will behave exactly like
+
+  \x . \y . scc "f" . e
+
+because the scc "f" between the two lambdas has no effect.
+
+
+----
+Advantages
+
+ - boxing is not required, and indeed is a no-op with respect to the
+   call graph
+
+ - Entry counts are always accurate (the only operation is pushing an
+   scc).
+
+ - simple implementation: no special behaviour at a function call
+
+ - many compiler transformations are unconditionally valid
+   see Note [transformations}
+
+ - can wrap an expression in an SCC to find its costs (not possible with
+   lexical scoping alone)
+
+ - can distinguish between lexically scoped SCCs and evaluation scoped SCCs
+   (see Note [user sccs])
+
+
+
+----
+Note [auto-all]
+
+-auto-all can do something much more useful.  e.g.
+
+f xs = if b then g xs else h xs
+  where
+    g ys = ..
+    h zs = ..
+
+could translate to
+
+f = \xs . scc "f"
+          let
+              g = \ys . scc "f.g" ...
+              h = \zs . scc "f.h" ...
+          in
+          ...
+
+Since we have to put an scc on every function, we might as well make
+it a useful one ("f.g" instead of just "f").
+
+  QUESTION: should this be an option?
+     -prof               no auto sccs added (everything subsumed into callers)
+     -prof -scc-exports  exported fns only
+     -prof -scc-top      top-level fns only (copied into nested functions)
+     -prof -scc-all      top-level and nested fns get different CCs
+
+
+------
+Note [user sccs]
+
+  Maybe there should be two kinds of scc annotation:
+
+    ESCC: evaluation cost centre; the cost of evaluating the
+    expression to normal form (to the extent that it is demanded)
+
+    LSCC: lexical cost centre: the cost of evaluating the expression
+    to normal form, and the cost of evaluating any nested functions,
+    each time they are called (an LSCC would be copied into each
+    nested function).
+
+
+-----
+Note [transformations]
+
+ - inlining a lambda is valid
+ - inlining a non-lambda is only valid as long as the expression does
+   not move inside any sccs (changes cost attibution, possibly dramatically)
+
+ - floating out lambdas is valid
+ - floating out non-lambdas should collect sccs on the way out
+
+ - boxing is a no-op: it no longer matters which SCC is attached to a lambda
+
+ - scc "x" \y . e  ==>  \y . e
+  (NB. only if scc "x" is a "dupd" cost centre, that is we aren't
+  counting entries)
+  (not preserving the cost attribution, but we only change attribution
+   for the allocation of the lambda, so don't care)
+
+ - scc "x" C a b c ==> C a b c  (same as above)
+
+ - what about eta-expansion/eta-reduction??
+
+
+-----------------------------------------------------------------------------
+
+To think about: getting useful costs for CAFs.
+
+We should attribute the costs for a CAF to the site that is evaluating
+it.  This just adds more information to the profile for free.
+
+Note that the cost centres attached to lambdas in a CAF should still
+just be [CAF,...] because we don't want the evaluation site of the CAF
+showing up when we call functions from it.
+
+-----------
+
+To think about: whether we can do partial profiling.
+
+  - consistent representation of closures.  There has to be some way
+    to know for a thunk whether it is profiled or not: if profiled
+    code enters an unprofiled thunk, it sets the CCCS to "unprofiled",
+    but if unprofiled code enters a profiled thunk, the code for the
+    thunk will have to set CCCS itself (a thunk knows where it saved
+    its CCCS).
+
+-----------
+
+To think about: whether we can keep track of call stacks in
+GHCi. (related to partial profiling above).
+
+If we abstract the RTS functionality that deals with cost centres so
+that GHCi can use it, this should be feasible.
+
+
+GHCi already has ticks on every nested function - so no need for sccs,
+we just use ticks.
+
+   ** Tick behaves exactly like scc (should they be the same??!) **
+
+Interpreted thunks must save/restore the CCS.  Store the CCS in every
+thunk; save and restore the current CCS during thunk eval.
+
+So the difficulty comes when interacting with compiled code.
+
+  - entering a compiled function: do not change the current CCS
+    (do nothing)
+
+  - entering a compiled thunk: save the current CCS on the stack,
+    set the CCS to "<unknown>", eval the thunk.
+
+So this will lose information whenever we have a compiled thunk being
+evaluated from interpreted code.  e.g. suppose we call map and then
+evaluate the resulting list:
+
+  g x = error "foo"
+
+  f xs = map g xs
+
+  main = case head (f xs) of ...
+
+The backtrace could give information like:
+
+  main
+    evaluated: <unknown>
+      g
+
+by traversing the real stack (after <unkonwn> we show the CCS that was
+saved on the stack during evaluation of main).
+
+So we lose the information that the call to g was from [f,map,g], but
+we get the point in the source code that evaluated the thunk.
+
+
+-----------------------------------------------------------------------------
+[ticks and sccs]
+
+Ticks and sccs are very similar:
+
+  - They are both "side effecting", in that they both count entries.
+    Compiler transformations must preserve the entry counts.
+
+  - sccs differ in that if we float some computation outside an scc we
+    need to annotate it with that scc (but a special,
+    non-entry-counting, copy of the scc).
+
+  
+Predicates:
+
+  count-entries :: Bool
+      -- count entries accurately; True  for ticks/breakpoints
+      --                           True  for ordinary cost centres
+      --                           False for "duplicated cost centres"
+
+  cost-attrib   :: Bool
+      -- maintain cost attribution; False for ticks/breakpoints
+      --                            True  for ordinary cost centres
+      --                            True  for "duplicated cost centres"
+
+
+
+For call stacks, we want cost-attrib=True,  count-entries maybe
+    coverage,    we want cost-attrib=False, count-entries=True
+
+Breakpoints have the same requirements as coverage, except that we
+probably want to do call stacks with breakpoints too.
+
+Transformations:
+
+ Three core forms:
+
+   scc c  E    (count-entries=False, cost-attrib=True)
+   tick c E    (count-entries=True,  cost-attrib=False)
+   scctick c E (count-entries=True,  cost-attrib=True)
+
+  ------- Variables & Literals
+
+  scc c x   => x                -- a variable has no cost
+  scc c lit => lit
+
+  scctick c x   => tick x                -- a variable has no cost
+  scctick c lit => tick lit
+
+  ------- Values
+
+  scc c (\x . e)    =>  \x . e
+  scc c (C x1..xn)  =>  C x1..xn
+
+  scctick c (\x . e)    =>  tick c (\x . e)
+  scctick c (C x1..xn)  =>  tick c (C x1..xn)
+
+  ------- App
+
+  (tick e) e' => tick (e e')            -- app-of-tick
+
+  ------- Let Floating
+
+  tick c (let x = e in e')
+    => let x = e in tick c e'
+
+  scc c (let x = e in e')
+    => let x = scc c e in scc c e'
+
+  scctick c (let x = e in e')
+    => let x = scc c e in scctick c e'
+
+  ------- Case
+
+  case (tick c e) of alts               -- case-of-tick
+    => tick c (case e of alts)
+
+  case (scc c (case e of p -> e')) of alts
+    => case scc c e of p -> case scc c e' of alts
+                                        -- case-of-scc-of-case
+
+  case (scctick c (case e of p -> e')) of alts
+    => case scctick c e of p -> case scc c e' of alts
+                                        -- case-of-scctick-of-case
+
+  ------- Inlining
+
+  let x = e in e'
+    => e' [ e/x ]                       -- iff e is a value, or e does not
+                                        -- move inside an scc
+
+
+Representation:
+
+  - tick boxes contain an Int (mapping from Int->SrcLoc is stored
+    separately)
+
+  - cost centres contain a (Module,String) (String may be derived from
+    the SrcLoc when we're doing auto-sccs, otherwise it is
+    user-supplied).
+
+  - breakpoints need to hold a bunch of free variables as arguments
+
+OTOH, it will be hard to use case for cost centres, because we have to
+make the simplifier respect the scoping rules.  But we want the
+scoping rules for breakpoints too, so we can track call stacks in
+GHCi.
+
+We could make a new Id for each scc, like mkTickBoxId.  But:
+
+  - how to change an scctick into a tick or an scc?  We can't just
+    flip a flag inside the IdInfo, because then the scctick Id will
+    still have the same unique and therefore compare equal to the
+    tick/scc Id.  We could make a *new* unique for the scc/tick Id,
+    but then it won't compare equal to other scc/tick Ids generated
+    from the same scctick.
+
+  - we could pass an Int# argument to say whether to tick, scc, or
+    scctick.  A bit ugly :( What happens if the arg is not a manifest
+    constant when we need to compile it?  (very unlikely to happen,
+    but still).
+
+  - we have to generate the scc/tick Ids once and for all and put them
+    inside the scctick Id to begin with.
author	Simon Marlow <marlowsd@gmail.com>	2011-09-16 13:38:44 +0100
committer	Simon Marlow <marlowsd@gmail.com>	2011-09-16 13:38:44 +0100
commit	1127c0bfb4e0070bec5b3cab9d7ebd8650d0bc68 (patch)
tree	63a14d73def2838a001cd14db5fb108adc9ce574
parent	418140bd3d18331de697b8d27e2cf2fe762e5696 (diff)
download	haskell-1127c0bfb4e0070bec5b3cab9d7ebd8650d0bc68.tar.gz