| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Because garbage collector calls `retainerProfile()` and `heapCensus()`,
GC times normally include some of PROF times too. To fix this we have
these lines:
// heapCensus() is called by the GC, so RP and HC time are
// included in the GC stats. We therefore subtract them to
// obtain the actual GC cpu time.
stats.gc_cpu_ns -= prof_cpu;
stats.gc_elapsed_ns -= prof_elapsed;
These variables are later used for calculating GC time excluding the
final GC (which should be attributed to EXIT).
exit_gc_elapsed = stats.gc_elapsed_ns - start_exit_gc_elapsed;
The problem is if we subtract PROF times from `gc_elapsed_ns` and then
subtract `start_exit_gc_elapsed` from the result, we end up subtracting
PROF times twice, because `start_exit_gc_elapsed` also includes PROF
times.
We now subtract PROF times from GC after the calculations for EXIT and
MUT times. The existing assertion that checks
INIT + MUT + GC + EXIT = TOTAL
now holds. When we subtract PROF numbers from GC, and a new assertion
INIT + MUT + GC + PROF + EXIT = TOTAL
also holds.
Fixes #15897. New assertions added in this commit also revealed #16102,
which is also fixed by this commit.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change fixes build failure like this:
```
rts/Stats.c:1467:14: error:
error: format '%u' expects argument of type 'unsigned int',
but argument 4 has type 'long unsigned int' [-Werror=format=]
debugBelch("%51s%9" FMT_Word " %9" FMT_Word "\n",
^~~~~~~~
"",tot_live*sizeof(W_),tot_slop*sizeof(W_));
~~~~~~~~~~~~~~~~~~~
```
The fix is to cast sizeof() result to Word (W_).
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Test Plan: build for 32-bit target
Reviewers: bgamari, erikd, simonmar
Reviewed By: simonmar
Subscribers: thomie, carter
Differential Revision: https://phabricator.haskell.org/D4608
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There should be no change in the output of the '+RTS -s' (summary)
report, or
the 'RTS -t' (one-line) report.
All data shown in the summary report is now shown in the machine
readable
report.
All data in RTSStats is now shown in the machine readable report.
init times are added to RTSStats and added to GHC.Stats.
Example of the new output:
```
[("bytes allocated", "375016384")
,("num_GCs", "113")
,("average_bytes_used", "148348")
,("max_bytes_used", "206552")
,("num_byte_usage_samples", "2")
,("peak_megabytes_allocated", "6")
,("init_cpu_seconds", "0.001642")
,("init_wall_seconds", "0.001027")
,("mut_cpu_seconds", "3.020166")
,("mut_wall_seconds", "0.757244")
,("GC_cpu_seconds", "0.037750")
,("GC_wall_seconds", "0.009569")
,("exit_cpu_seconds", "0.000890")
,("exit_wall_seconds", "0.002551")
,("total_cpu_seconds", "3.060452")
,("total_wall_seconds", "0.770395")
,("major_gcs", "2")
,("allocated_bytes", "375016384")
,("max_live_bytes", "206552")
,("max_large_objects_bytes", "159344")
,("max_compact_bytes", "0")
,("max_slop_bytes", "59688")
,("max_mem_in_use_bytes", "6291456")
,("cumulative_live_bytes", "296696")
,("copied_bytes", "541024")
,("par_copied_bytes", "493976")
,("cumulative_par_max_copied_bytes", "104104")
,("cumulative_par_balanced_copied_bytes", "274456")
,("fragmentation_bytes", "2112")
,("alloc_rate", "124170795")
,("productivity_cpu_percent", "0.986838")
,("productivity_wall_percent", "0.982935")
,("bound_task_count", "1")
,("sparks_count", "5836258")
,("sparks_converted", "237")
,("sparks_overflowed", "1990408")
,("sparks_dud ", "0")
,("sparks_gcd", "3455553")
,("sparks_fizzled", "390060")
,("work_balance", "0.555606")
,("n_capabilities", "4")
,("task_count", "10")
,("peak_worker_count", "9")
,("worker_count", "9")
,("gc_alloc_block_sync_spin", "162")
,("gc_alloc_block_sync_yield", "0")
,("gc_alloc_block_sync_spin", "162")
,("gc_spin_spin", "18840855")
,("gc_spin_yield", "10355")
,("mut_spin_spin", "70331392")
,("mut_spin_yield", "61700")
,("waitForGcThreads_spin", "241")
,("waitForGcThreads_yield", "2797")
,("whitehole_gc_spin", "0")
,("whitehole_lockClosure_spin", "0")
,("whitehole_lockClosure_yield", "0")
,("whitehole_executeMessage_spin", "0")
,("whitehole_threadPaused_spin", "0")
,("any_work", "1667")
,("no_work", "1662")
,("scav_find_work", "1026")
,("gen_0_collections", "111")
,("gen_0_par_collections", "111")
,("gen_0_cpu_seconds", "0.036126")
,("gen_0_wall_seconds", "0.036126")
,("gen_0_max_pause_seconds", "0.036126")
,("gen_0_avg_pause_seconds", "0.000081")
,("gen_0_sync_spin", "21")
,("gen_0_sync_yield", "0")
,("gen_1_collections", "2")
,("gen_1_par_collections", "1")
,("gen_1_cpu_seconds", "0.001624")
,("gen_1_wall_seconds", "0.001624")
,("gen_1_max_pause_seconds", "0.001624")
,("gen_1_avg_pause_seconds", "0.000272")
,("gen_1_sync_spin", "3")
,("gen_1_sync_yield", "0")
]
```
Test Plan: Ensure that one-line and summary reports are unchanged.
Reviewers: erikd, simonmar, hvr
Subscribers: duog, carter, thomie, rwbarton
GHC Trac Issues: #14660
Differential Revision: https://phabricator.haskell.org/D4529
|
|
|
|
| |
This reverts commit 2d4bda2e4ac68816baba0afab00da6f769ea75a7.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There should be no change in the output of the '+RTS -s' (summary)
report, or the 'RTS -t' (one-line) report.
All data shown in the summary report is now shown in the machine
readable report.
All data in RTSStats is now shown in the machine readable report.
init times are added to RTSStats and added to GHC.Stats.
Example of the new output:
```
[("bytes allocated", "375016384")
,("num_GCs", "113")
,("average_bytes_used", "148348")
,("max_bytes_used", "206552")
,("num_byte_usage_samples", "2")
,("peak_megabytes_allocated", "6")
,("init_cpu_seconds", "0.001642")
,("init_wall_seconds", "0.001027")
,("mut_cpu_seconds", "3.020166")
,("mut_wall_seconds", "0.757244")
,("GC_cpu_seconds", "0.037750")
,("GC_wall_seconds", "0.009569")
,("exit_cpu_seconds", "0.000890")
,("exit_wall_seconds", "0.002551")
,("total_cpu_seconds", "3.060452")
,("total_wall_seconds", "0.770395")
,("major_gcs", "2")
,("allocated_bytes", "375016384")
,("max_live_bytes", "206552")
,("max_large_objects_bytes", "159344")
,("max_compact_bytes", "0")
,("max_slop_bytes", "59688")
,("max_mem_in_use_bytes", "6291456")
,("cumulative_live_bytes", "296696")
,("copied_bytes", "541024")
,("par_copied_bytes", "493976")
,("cumulative_par_max_copied_bytes", "104104")
,("cumulative_par_balanced_copied_bytes", "274456")
,("fragmentation_bytes", "2112")
,("alloc_rate", "124170795")
,("productivity_cpu_percent", "0.986838")
,("productivity_wall_percent", "0.982935")
,("bound_task_count", "1")
,("sparks_count", "5836258")
,("sparks_converted", "237")
,("sparks_overflowed", "1990408")
,("sparks_dud ", "0")
,("sparks_gcd", "3455553")
,("sparks_fizzled", "390060")
,("work_balance", "0.555606")
,("n_capabilities", "4")
,("task_count", "10")
,("peak_worker_count", "9")
,("worker_count", "9")
,("gc_alloc_block_sync_spin", "162")
,("gc_alloc_block_sync_yield", "0")
,("gc_alloc_block_sync_spin", "162")
,("gc_spin_spin", "18840855")
,("gc_spin_yield", "10355")
,("mut_spin_spin", "70331392")
,("mut_spin_yield", "61700")
,("waitForGcThreads_spin", "241")
,("waitForGcThreads_yield", "2797")
,("whitehole_gc_spin", "0")
,("whitehole_lockClosure_spin", "0")
,("whitehole_lockClosure_yield", "0")
,("whitehole_executeMessage_spin", "0")
,("whitehole_threadPaused_spin", "0")
,("any_work", "1667")
,("no_work", "1662")
,("scav_find_work", "1026")
,("gen_0_collections", "111")
,("gen_0_par_collections", "111")
,("gen_0_cpu_seconds", "0.036126")
,("gen_0_wall_seconds", "0.036126")
,("gen_0_max_pause_seconds", "0.036126")
,("gen_0_avg_pause_seconds", "0.000081")
,("gen_0_sync_spin", "21")
,("gen_0_sync_yield", "0")
,("gen_1_collections", "2")
,("gen_1_par_collections", "1")
,("gen_1_cpu_seconds", "0.001624")
,("gen_1_wall_seconds", "0.001624")
,("gen_1_max_pause_seconds", "0.001624")
,("gen_1_avg_pause_seconds", "0.000272")
,("gen_1_sync_spin", "3")
,("gen_1_sync_yield", "0")
]
```
Test Plan: Ensure that one-line and summary reports are unchanged.
Reviewers: bgamari, erikd, simonmar, hvr
Reviewed By: simonmar
Subscribers: rwbarton, thomie, carter
GHC Trac Issues: #14660
Differential Revision: https://phabricator.haskell.org/D4303
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The existing internal counters:
* gc_alloc_block_sync
* whitehole_spin
* gen[g].sync
* gen[1].sync
are now not shown in the -s report unless --internal-counters is also passed.
If --internal-counters is passed we now show the counters above, reformatted, as
well as several other counters. In particular, we now count the yieldThread()
calls that SpinLocks do as well as their spins.
The added counters are:
* gc_spin (spin and yield)
* mut_spin (spin and yield)
* whitehole_threadPaused (spin only)
* whitehole_executeMessage (spin only)
* whitehole_lockClosure (spin only)
* waitForGcThreadsd (spin and yield)
As well as the following, which are not SpinLock-like things:
* any_work
* do_work
* scav_find_work
See the Note for descriptions of what these counters are.
We add busy_wait_nops in these loops along with the counter increment where it
was absent.
Old internal counters output:
```
gc_alloc_block_sync: 0
whitehole_gc_spin: 0
gen[0].sync: 0
gen[1].sync: 0
```
New internal counters output:
```
Internal Counters:
Spins Yields
gc_alloc_block_sync 323 0
gc_spin 9016713 752
mut_spin 57360944 47716
whitehole_gc 0 n/a
whitehole_threadPaused 0 n/a
whitehole_executeMessage 0 n/a
whitehole_lockClosure 0 0
waitForGcThreads 2 415
gen[0].sync 6 0
gen[1].sync 1 0
any_work 2017
no_work 2014
scav_find_work 1004
```
Test Plan:
./validate
Check it builds with #define PROF_SPIN removed from includes/rts/Config.h
Reviewers: bgamari, erikd, simonmar, hvr
Reviewed By: simonmar
Subscribers: rwbarton, thomie, carter
GHC Trac Issues: #3553, #9221
Differential Revision: https://phabricator.haskell.org/D4302
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Rename to whitehole_gc_spin, in preparation for adding stats for the
whitehole busy-loop in SMPClosureOps.
Make whitehole_gc_spin volatile, and move it to be defined and
statically initialised in GC.c. This saves some #ifs, and I'm pretty
sure it should be volatile.
Test Plan: ./validate
Reviewers: bgamari, erikd, simonmar
Reviewed By: bgamari
Subscribers: rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4300
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
mut_elapsed should deduct retainer profiling and heap censuses, just as
mut_cpu does.
mutator_cpu_ns should not deduct retainer profiling or heap censuses,
since those times are included in stats.gc_cpu_ns.
Reviewers: bgamari, erikd, simonmar
Reviewed By: simonmar
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D4185
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were accumulating the gc times of the previous gc.
`stats.gc.{cpu,elappsed}_ns` were being accumulated into
`stats.gc_{cpu,elapsed}_ns` before they were set.
There is also a change in that heap profiling will no longer cause gc
events to
be emitted.
Reviewers: bgamari, erikd, simonmar
Reviewed By: bgamari
Subscribers: rwbarton, thomie
GHC Trac Issues: #14257, #14445
Differential Revision: https://phabricator.haskell.org/D4184
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An additional stat is tracked per gc: par_balanced_copied This is the
the number of bytes copied by each gc thread under the balanced lmit,
which is simply (copied_bytes / num_gc_threads). The stat is added to
all the appropriate GC structures, so is visible in the eventlog and in
GHC.Stats.
A note is added explaining how work balance is computed.
Remove some end of line whitespace
Test Plan:
./validate
experiment with the program attached to the ticket
examine code changes carefully
Reviewers: simonmar, austin, hvr, bgamari, erikd
Reviewed By: simonmar
Subscribers: Phyx, rwbarton, thomie
GHC Trac Issues: #13830
Differential Revision: https://phabricator.haskell.org/D3658
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It seems that 12ad4d417b89462ba8e19a3c7772a931b3a93f0e enabled
collection by default as its needs stats.allocated_bytes to determine
whether the program has exceeded its grace limit.
However, enabling stats also enables some potentially expensive times
checks. In general GC statistics should be cheap to compute (relative
to the GC itself), so now we always compute them. This allows us to once
again disable giveStats by default.
Fixes #13864.
Reviewers: simonmar, austin, erikd
Reviewed By: simonmar
Subscribers: rwbarton, thomie
GHC Trac Issues: #13864
Differential Revision: https://phabricator.haskell.org/D3669
|
|
|
|
| |
Our new CPP linter enforces this.
|
|
|
|
| |
The formatting strings fell out of sync with the arguments.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: ./validate
Reviewers: austin, bgamari, erikd, simonmar
Reviewed By: simonmar
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2825
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
This commit makes various improvements and addresses some issues with
Compact Regions (aka Compact Normal Forms).
This was the most important thing I wanted to fix. Compaction
previously prevented GC from running until it was complete, which
would be a problem in a multicore setting. Now, we compact using a
hand-written Cmm routine that can be interrupted at any point. When a
GC is triggered during a sharing-enabled compaction, the GC has to
traverse and update the hash table, so this hash table is now stored
in the StgCompactNFData object.
Previously, compaction consisted of a deepseq using the NFData class,
followed by a traversal in C code to copy the data. This is now done
in a single pass with hand-written Cmm (see rts/Compact.cmm). We no
longer use the NFData instances, instead the Cmm routine evaluates
components directly as it compacts.
The new compaction is about 50% faster than the old one with no
sharing, and a little faster on average with sharing (the cost of the
hash table dominates when we're doing sharing).
Static objects that don't (transitively) refer to any CAFs don't need
to be copied into the compact region. In particular this means we
often avoid copying Char values and small Int values, because these
are static closures in the runtime.
Each Compact# object can support a single compactAdd# operation at any
given time, so the Data.Compact library now enforces mutual exclusion
using an MVar stored in the Compact object.
We now get exceptions rather than killing everything with a barf()
when we encounter an object that cannot be compacted (a function, or a
mutable object). We now also detect pinned objects, which can't be
compacted either.
The Data.Compact API has been refactored and cleaned up. A new
compactSize operation returns the size (in bytes) of the compact
object.
Most of the documentation is in the Haddock docs for the compact
library, which I've expanded and improved here.
Various comments in the code have been improved, especially the main
Note [Compact Normal Forms] in rts/sm/CNF.c.
I've added a few tests, and expanded a few of the tests that were
there. We now also run the tests with GHCi, and in a new test way
that enables sanity checking (+RTS -DS).
There's a benchmark in libraries/compact/tests/compact_bench.hs for
measuring compaction speed and comparing sharing vs. no sharing.
The field totalDataW in StgCompactNFData was unnecessary.
Test Plan:
* new unit tests
* validate
* tested manually that we can compact Data.Aeson data
Reviewers: gcampax, bgamari, ezyang, austin, niteria, hvr, erikd
Subscribers: thomie, simonpj
Differential Revision: https://phabricator.haskell.org/D2751
GHC Trac Issues: #12455
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Visible API changes:
* The C struct `GCDetails` gives the stats about a single GC. This is
passed to the `gcDone()` callback if one is set via the
RtsConfig. (previously we just passed a collection of values, so this
is more extensible, at the expense of breaking the existing API)
* `RTSStats` gives cumulative stats since the start of the program,
and includes the `GCDetails` for the most recent GC. This struct
can be obtained via `getRTSStats()` (the old `getGCStats()` has been
removed, and `getGCStatsEnabled()` has been renamed to
`getRTSStatsEnabled()`)
Improvements:
* The per-GC stats and cumulative stats are now cleanly separated.
* Inside the RTS we have a top-level `RTSStats` struct to keep all our
stats in, previously this was just a collection of strangely-named
variables. This struct is mostly just copied in `getRTSStats()`, so
the implementation of that function is a lot shorter.
* Types are more consistent. We use a uint64_t byte count for all
memory values, and Time for all time values.
* Names are more consistent. We use a suffix `_bytes` for all byte
counts and `_ns` for all time values.
* We now collect information about the amount of memory in large
objects and compact objects in `GCDetails`. (the latter was the reason
I started doing this patch but it seems to have ballooned a bit!)
* I fixed a bug in the calculation of the elapsed MUT time, and added
an ASSERT to stop the calculations going wrong in the future.
For now I kept the Haskell API in `GHC.Stats` the same, by
impedence-matching with the new API. We could either break that API
and make it match the C API more closely, or we could add a new API
and deprecate the old one. Opinions welcome.
This stuff is very easy to get wrong, and it's hard to test. Reviews
welcome!
Test Plan:
manual testing
validate
Reviewers: bgamari, niteria, austin, ezyang, hvr, erikd, rwbarton, Phyx
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2756
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Validate on lots of platforms
Reviewers: erikd, simonmar, austin
Reviewed By: erikd, simonmar
Subscribers: michalt, thomie
Differential Revision: https://phabricator.haskell.org/D2699
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This exposes mblocks_allocated in the GCStats struct.
Test Plan: it builds
Reviewers: bgamari, simonmar, austin, hvr, erikd
Reviewed By: erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2429
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Reviewers: austin, erikd, simonmar, bgamari
Reviewed By: bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2241
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We can't define Stg{Int,Word} in terms of {,u}intptr_t because STG
depends on them being the exact same size as void*, and {,u}intptr_t
does not make that guarantee. Furthermore, we also need to define
StgHalf{Int,Word}, so the preprocessor if needs to stay. But we can at
least keep it in a single place instead of repeating it in various
files.
Also define STG_{INT,WORD}{8,16,32,64}_{MIN,MAX} and use it in HsFFI.h,
further reducing the need for CPP in other files.
Reviewers: austin, bgamari, simonmar, hvr, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2182
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `nat` type was an alias for `unsigned int` with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it but mark it as deprecated.
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166
|
|
|
|
|
|
|
|
|
|
|
|
| |
Was never used looking at history available in git.
While at it marked 'mut_user_time_during_RP' as 'static'.
Noticed by uselex.rb:
mut_user_time_during_heap_census: [R]: exported from:
./rts/dist/build/Stats.p_o
Signed-off-by: Sergei Trofimovich <siarheit@google.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This hasn't been used for a very long time and will soon be superceded
by perf_events support.
Test Plan: validate
Reviewers: austin, simonmar
Reviewed By: austin, simonmar
Subscribers: thomie, erikd
Differential Revision: https://phabricator.haskell.org/D1493
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Hooks rely on static linking semantics, and are broken by -Bsymbolic
which we need when using dynamic linking.
Test Plan: Built it
Reviewers: austin, hvr, tibbe
Differential Revision: https://phabricator.haskell.org/D8
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
clearNursery resets all the bd->free pointers of nursery blocks to
make the blocks empty. In profiles we've seen clearNursery taking
significant amounts of time particularly with large -N and -A values.
This patch moves the work of clearNursery to the point at which we
actually need the new block, thereby introducing an invariant that
blocks to the right of the CurrentNursery pointer still need their
bd->free pointer reset. This should make things faster overall,
because we don't need to clear blocks that we don't use.
Test Plan: validate
Reviewers: AndreasVoellmy, ezyang, austin
Subscribers: thomie, carter, ezyang, simonmar
Differential Revision: https://phabricator.haskell.org/D318
|
|
|
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
| |
This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
|
|
|
|
|
|
|
|
| |
This will hopefully help ensure some basic consistency in the forward by
overriding buffer variables. In particular, it sets the wrap length, the
offset to 4, and turns off tabs.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Today's hardware is much faster, so it makes sense to report timings
with more precision, and possibly help reduce rounding-induced
fluctuations in the nofib statistics.
This commit increases the precision of all timings previously reported
with a granularity of 10ms to 1ms. For instance, the `+RTS -S` output is
now rendered as:
Alloc Copied Live GC GC TOT TOT Page Flts
bytes bytes bytes user elap user elap
641936 59944 158120 0.000 0.000 0.013 0.001 0 0 (Gen: 0)
517672 60840 158464 0.000 0.000 0.013 0.002 0 0 (Gen: 0)
517256 58800 156424 0.005 0.005 0.019 0.007 0 0 (Gen: 1)
670208 9520 158728 0.000 0.000 0.019 0.008 0 0 (Gen: 0)
...
Tot time (elapsed) Avg pause Max pause
Gen 0 24 colls, 0 par 0.002s 0.002s 0.0001s 0.0002s
Gen 1 3 colls, 0 par 0.011s 0.011s 0.0038s 0.0055s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 0.005s ( 0.006s elapsed)
GC time 0.014s ( 0.014s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 0.032s ( 0.020s elapsed)
Note that this change also requires associated changes in the nofib
submodule.
Test Plan: tested with modified nofib
Reviewers: simonmar, nomeata, austin
Subscribers: simonmar, relrod, carter
Differential Revision: https://phabricator.haskell.org/D97
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary: Avoid unnecessary clock_gettime() syscalls in GC stats.
Test Plan: Use strace.
Reviewers: simonmar, austin
Reviewed By: simonmar, austin
Subscribers: simonmar, relrod, carter
Differential Revision: https://phabricator.haskell.org/D39
|
|
|
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We have various problems with reallocating the array of Capabilities,
due to threads in waitForReturnCapability that are already holding a
pointer to a Capability.
Rather than add more locking to make this safer, I decided it would be
easier to ensure that we never move the Capabilities at all. The
capabilities array is now an array of pointers to Capabaility. There
are extra indirections, but it rarely matters - we don't often access
Capabilities via the array, normally we already have a pointer to
one. I ran the parallel benchmarks and didn't see any difference.
|
|
|
|
|
|
|
|
|
|
|
| |
We were doing it in two different ways and asserting that the results
were the same. In most cases they were, but I found one case where
they weren't: the GC itself allocates some memory for running
finalizers, and this memory was accounted for one way but not the
other.
It was simpler to remove the old way of counting allocation that to
try to fix it up, so I did that.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
lnat was originally "long unsigned int" but we were using it when we
wanted a 64-bit type on a 64-bit machine. This broke on Windows x64,
where long == int == 32 bits. Using types of unspecified size is bad,
but what we really wanted was a type with N bits on an N-bit machine.
StgWord is exactly that.
lnat was mentioned in some APIs that clients might be using
(e.g. StackOverflowHook()), so we leave it defined but with a comment
to say that it's deprecated.
|
| |
|
| |
|
|
|
|
|
|
| |
Mostly this meant getting pointer<->int conversions to use the right
sizes. lnat is now size_t, rather than unsigned long, as that seems a
better match for how it's used.
|
|
|
|
|
|
|
|
|
| |
Quoting design rationale by dcoutts: The event indicates that we're doing
a stop-the-world GC and all other HECs should be between their GC_START
and GC_END events at that moment. We don't want to use GC_STATS_GHC
for that, because GC_STATS_GHC is for extra GHC-specific info,
not something we have to rely on to be able to match the GC pauses
across HECs to a particular global GC.
|
|
|
|
|
|
|
|
|
|
|
| |
There was a discrepancy between GC times reported in +RTS -s
and the timestamps of GC_START and GC_END events on the cap,
on which +RTS -s stats for the given GC are based.
This is fixed by posting the events with exactly the same timestamp
as generated for the stat calculation. The calls posting the events
are moved too, so that the events are emitted close to the time instant
they claim to be emitted at. The GC_STATS_GHC was moved, too, ensuring
it's emitted before the moved GC_END on all caps, which simplifies tools code.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In stat_exit we want to emit a final EVENT_HEAP_ALLOCATED for each cap
so that we get the same total allocation count as reported via +RTS -s.
To do so we need to update the per-cap total_allocated counts.
Previously we had a single calcAllocated(rtsBool) function that counted
the large allocations and optionally the nurseries for all caps. The GC
would always call it with false, and the stat_exit always with true.
The reason for these two modes is that the GC counts the nurseries via
clearNurseries() (which also updates the per-cap total_allocated
counts), so it's only the stat_exit() path that needs to count them.
We now split the calcAllocated() function into two: countLargeAllocated
and updateNurseriesStats. As the name suggests, the latter now updates
the per-cap total_allocated counts, in additon to returning a total.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
They cover much the same info as is available via the GHC.Stats module
or via the '+RTS -s' textual output, but via the eventlog and with a
better sampling frequency.
We have three new generic heap info events and two very GHC-specific
ones. (The hope is the general ones are usable by other implementations
that use the same eventlog system, or indeed not so sensitive to changes
in GHC itself.)
The general ones are:
* total heap mem allocated since prog start, on a per-HEC basis
* current size of the heap (MBlocks reserved from OS for the heap)
* current size of live data in the heap
Currently these are all emitted by GHC at GC time (live data only at
major GC).
The GHC specific ones are:
* an event giving various static heap paramaters:
* number of generations (usually 2)
* max size if any
* nursary size
* MBlock and block sizes
* a event emitted on each GC containing:
* GC generation (usually just 0,1)
* total bytes copied
* bytes lost to heap slop and fragmentation
* the number of threads in the parallel GC (1 for serial)
* the maximum number of bytes copied by any par GC thread
* the total number of bytes copied by all par GC threads
(these last three can be used to calculate an estimate of the
work balance in parallel GCs)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Also rename internal variables to make the names match what they hold.
The parallel GC work balance is calculated using the total amount of
memory copied by all GC threads, and the maximum copied by any
individual thread. You have serial GC when the max is the same as
copied, and perfectly balanced GC when total/max == n_caps.
Previously we presented this as the ratio total/max and told users
that the serial value was 1 and the ideal value N, for N caps, e.g.
Parallel GC work balance: 1.05 (4045071 / 3846774, ideal 2)
The downside of this is that the user always has to keep in mind the
number of cores being used. Our new presentation uses a normalised
scale 0--1 as a percentage. The 0% means completely serial and 100%
is perfect balance, e.g.
Parallel GC work balance: 4.56% (serial 0%, perfect 100%)
|
|
|
|
|
|
|
| |
In addition to the existing global method. For now we just do
it both ways and assert they give the same grand total. At some
stage we can simplify the global method to just take the sum of
the per-cap counters.
|