| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If startEventlog is called after the program has already started running
then quite a few useful events are missing from the eventlog because
they are only posted when the program starts. This patch adds a
mechanism to declare that an event should be reposted everytime the
startEventlog function is called.
Now in EventLog.c there is a global list of functions called
`eventlog_header_funcs` which stores a list of functions which should be
called everytime the eventlog starts.
When calling `postInitEvent`, the event will not only be immediately
posted to the eventlog but also added to the global list.
When startEventLog is called, the list is traversed and the events
reposted.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See #19357
The event reports the
* Current number of megablocks allocated
* The number that the RTS thinks it needs
* The number is managed to return to the OS
When current > need then the difference is returned to the OS, the
successful number of returned mblocks is reported by 'returned'.
In a fragmented heap current > need but returned < current - need.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This new flag embeds a lookup table from the address of an info table
to information about that info table.
The main interface for consulting the map is the `lookupIPE` C function
> InfoProvEnt * lookupIPE(StgInfoTable *info)
The `InfoProvEnt` has the following structure:
> typedef struct InfoProv_{
> char * table_name;
> char * closure_desc;
> char * ty_desc;
> char * label;
> char * module;
> char * srcloc;
> } InfoProv;
>
> typedef struct InfoProvEnt_ {
> StgInfoTable * info;
> InfoProv prov;
> struct InfoProvEnt_ *link;
> } InfoProvEnt;
The source positions are approximated in a similar way to the source
positions for DWARF debugging information. They are only approximate but
in our experience provide a good enough hint about where the problem
might be. It is therefore recommended to use this flag in conjunction
with `-g<n>` for more accurate locations.
The lookup table is also emitted into the eventlog when it is available
as it is intended to be used with the `-hi` profiling mode.
Using this flag will significantly increase the size of the resulting
object file but only by a factor of 2-3x in our experience.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously the eventlog infrastructure had a couple of races that could
pop up when using the startEventLog/endEventLog interfaces. In
particular, stopping and then later restarting logging could result in
data preceding the eventlog header, breaking the integrity of the
stream.
To fix this we rework the invariants regarding the eventlog and
generally tighten up the concurrency control surrounding starting and
stopping of logging.
We also fix an unrelated bug, wherein log events from disabled
capabilities could end up never flushed.
|
|
|
|
|
|
|
|
|
|
|
|
| |
As noted in #18043, flushTrace failed flush anything beyond the writer.
This means that a significant amount of data sitting in capability-local
event buffers may never get flushed, despite the users' pleads for us to
flush.
Fix this by making flushEventLog flush all of the event buffers before
flushing the writer.
Fixes #18043.
|
|
|
|
|
|
|
|
| |
We currently only post the entry counters, not the other global
counters as in my experience the former are more useful. We use the heap
profiler's census period to decide when to dump.
Also spruces up the documentation surrounding ticky-ticky a bit.
|
|
|
|
|
|
| |
This exposes a set of interfaces from the GHC API for configuring
EventLogWriters. These can be used by consumers like
[ghc-eventlog-socket](https://github.com/bgamari/ghc-eventlog-socket).
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This introduces a concurrent mark & sweep garbage collector to manage the old
generation. The concurrent nature of this collector typically results in
significantly reduced maximum and mean pause times in applications with large
working sets.
Due to the large and intricate nature of the change I have opted to
preserve the fully-buildable history, including merge commits, which is
described in the "Branch overview" section below.
Collector design
================
The full design of the collector implemented here is described in detail
in a technical note
> B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
> Compiler" (2018)
This document can be requested from @bgamari.
The basic heap structure used in this design is heavily inspired by
> K. Ueno & A. Ohori. "A fully concurrent garbage collector for
> functional programs on multicore processors." /ACM SIGPLAN Notices/
> Vol. 51. No. 9 (presented at ICFP 2016)
This design is intended to allow both marking and sweeping
concurrent to execution of a multi-core mutator. Unlike the Ueno design,
which requires no global synchronization pauses, the collector
introduced here requires a stop-the-world pause at the beginning and end
of the mark phase.
To avoid heap fragmentation, the allocator consists of a number of
fixed-size /sub-allocators/. Each of these sub-allocators allocators into
its own set of /segments/, themselves allocated from the block
allocator. Each segment is broken into a set of fixed-size allocation
blocks (which back allocations) in addition to a bitmap (used to track
the liveness of blocks) and some additional metadata (used also used
to track liveness).
This heap structure enables collection via mark-and-sweep, which can be
performed concurrently via a snapshot-at-the-beginning scheme (although
concurrent collection is not implemented in this patch).
Implementation structure
========================
The majority of the collector is implemented in a handful of files:
* `rts/Nonmoving.c` is the heart of the beast. It implements the entry-point
to the nonmoving collector (`nonmoving_collect`), as well as the allocator
(`nonmoving_allocate`) and a number of utilities for manipulating the heap.
* `rts/NonmovingMark.c` implements the mark queue functionality, update
remembered set, and mark loop.
* `rts/NonmovingSweep.c` implements the sweep loop.
* `rts/NonmovingScav.c` implements the logic necessary to scavenge the
nonmoving heap.
Branch overview
===============
```
* wip/gc/opt-pause:
| A variety of small optimisations to further reduce pause times.
|
* wip/gc/compact-nfdata:
| Introduce support for compact regions into the non-moving
|\ collector
| \
| \
| | * wip/gc/segment-header-to-bdescr:
| | | Another optimization that we are considering, pushing
| | | some segment metadata into the segment descriptor for
| | | the sake of locality during mark
| | |
| * | wip/gc/shortcutting:
| | | Support for indirection shortcutting and the selector optimization
| | | in the non-moving heap.
| | |
* | | wip/gc/docs:
| |/ Work on implementation documentation.
| /
|/
* wip/gc/everything:
| A roll-up of everything below.
|\
| \
| |\
| | \
| | * wip/gc/optimize:
| | | A variety of optimizations, primarily to the mark loop.
| | | Some of these are microoptimizations but a few are quite
| | | significant. In particular, the prefetch patches have
| | | produced a nontrivial improvement in mark performance.
| | |
| | * wip/gc/aging:
| | | Enable support for aging in major collections.
| | |
| * | wip/gc/test:
| | | Fix up the testsuite to more or less pass.
| | |
* | | wip/gc/instrumentation:
| | | A variety of runtime instrumentation including statistics
| | / support, the nonmoving census, and eventlog support.
| |/
| /
|/
* wip/gc/nonmoving-concurrent:
| The concurrent write barriers.
|
* wip/gc/nonmoving-nonconcurrent:
| The nonmoving collector without the write barriers necessary
| for concurrent collection.
|
* wip/gc/preparation:
| A merge of the various preparatory patches that aren't directly
| implementing the GC.
|
|
* GHC HEAD
.
.
.
```
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This introduces a few events to mark key points in the nonmoving
garbage collection cycle. These include:
* `EVENT_CONC_MARK_BEGIN`, denoting the beginning of a round of
marking. This may happen more than once in a single major collection
since we the major collector iterates until it hits a fixed point.
* `EVENT_CONC_MARK_END`, denoting the end of a round of marking.
* `EVENT_CONC_SYNC_BEGIN`, denoting the beginning of the post-mark
synchronization phase
* `EVENT_CONC_UPD_REM_SET_FLUSH`, indicating that a capability has
flushed its update remembered set.
* `EVENT_CONC_SYNC_END`, denoting that all mutators have flushed their
update remembered sets.
* `EVENT_CONC_SWEEP_BEGIN`, denoting the beginning of the sweep portion
of the major collection.
* `EVENT_CONC_SWEEP_END`, denoting the end of the sweep portion of the
major collection.
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With this change it is possible to reconstruct the timing portion of a
`.prof` file after the fact. By logging the stacks at each time point
a more precise executation trace of the program can be observed rather
than all identical cost centres being identified in the report.
There are two new events:
1. `EVENT_PROF_BEGIN` - emitted at the start of profiling to communicate
the tick interval
2. `EVENT_PROF_SAMPLE_COST_CENTRE` - emitted on each tick to communicate the
current call stack.
Fixes #17322
|
|
|
|
|
|
|
|
|
|
| |
This patch adds a new eventlog event which indicates the start of
a biographical profiler sample. These are different to normal events as
they also include the timestamp of when the census took place. This is
because the LDV profiler only emits samples at the end of the run.
Now all the different profiling modes emit consumable events to the
eventlog.
|
|
|
|
|
|
|
| |
This allows a user to observe how long a sampling period lasts so that
the time taken can be removed from the profiling output.
Fixes #16697
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds a new primop called traceBinaryEvent# that takes the length
of binary data and a pointer to the data, then emits it to the eventlog.
There is some example code that uses this primop and the new event:
* [traceBinaryEventIO][1] that calls `traceBinaryEvent#`
* [A patch to ghc-events][2] that parses the new `EVENT_USER_BINARY_MSG`
There's no corresponding issue on Trac but it was discussed at
ghc-devs [3].
[1] https://github.com/maoe/ghc-trace-events/blob
/fb226011ef1f85a97b4da7cc9d5f98f9fe6316ae/src/Debug/Trace/Binary.hs#L29)
[2] https://github.com/maoe/ghc-events/commit
/239ca77c24d18cdd10d6d85a0aef98e4a7c56ae6)
[3] https://mail.haskell.org/pipermail/ghc-devs/2018-May/015791.html
Reviewers: bgamari, erikd, simonmar
Reviewed By: bgamari
Subscribers: rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D5007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An additional stat is tracked per gc: par_balanced_copied This is the
the number of bytes copied by each gc thread under the balanced lmit,
which is simply (copied_bytes / num_gc_threads). The stat is added to
all the appropriate GC structures, so is visible in the eventlog and in
GHC.Stats.
A note is added explaining how work balance is computed.
Remove some end of line whitespace
Test Plan:
./validate
experiment with the program attached to the ticket
examine code changes carefully
Reviewers: simonmar, austin, hvr, bgamari, erikd
Reviewed By: simonmar
Subscribers: Phyx, rwbarton, thomie
GHC Trac Issues: #13830
Differential Revision: https://phabricator.haskell.org/D3658
|
|
|
|
| |
Our new CPP linter enforces this.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This both says what we mean and silences a bunch of spurious CPP linting
warnings. This pragma is supported by all CPP implementations which we
support.
Reviewers: austin, erikd, simonmar, hvr
Reviewed By: simonmar
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3482
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently eventlog data is always written to a file `progname.eventlog`.
This patch introduces the `flushEventLog` field in `RtsConfig` which
allows to customize the writing of eventlog data.
One possible scenario is the ongoing live-profile-monitor effort by
@NCrashed which slurps all eventlog data through `fluchEventLog`.
`flushEventLog` takes a buffer with eventlog data and its size and
returns `false` (0) in case eventlog data could not be procesed.
Reviewers: simonmar, austin, erikd, bgamari
Reviewed By: simonmar, bgamari
Subscribers: qnikst, thomie, NCrashed
Differential Revision: https://phabricator.haskell.org/D2934
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Try it
Reviewers: hvr, simonmar, austin, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1722
GHC Trac Issues: #11094
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `nat` type was an alias for `unsigned int` with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it but mark it as deprecated.
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166
|
|
|
|
|
| |
This has been unnecessary for quite some time due to the create/delete
capability events.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Don't call postLogMsg to post a user msg, because it truncates messages to
512 bytes.
Rename traceCap_stderr and trace_stderr to vtraceCap_stderr and trace_stderr,
to signal that they take a va_list (similar to vdebugBelch vs debugBelch).
See #3874 for the original reason behind traceFormatUserMsg.
See the commit msg in #9395 (d360d440) for a discussion about using
null-terminated strings vs strings with an explicit length.
Test Plan:
Run `cabal install ghc-events` and inspect the result of `ghc-events show` on
an eventlog file created with `ghc -eventlog Test.hs` and `./Test +RTS -l`,
where Test.hs contains:
```
import Debug.Trace
main = traceEvent (replicate 510 'a' ++ "bcd") $ return ()
```
Depends on D655.
Reviewers: austin
Reviewed By: austin
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D656
GHC Trac Issues: #8309
|
|
|
|
| |
This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
|
|
|
|
|
|
|
|
| |
This will hopefully help ensure some basic consistency in the forward by
overriding buffer variables. In particular, it sets the wrap length, the
offset to 4, and turns off tabs.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
|
| |
In time-based profiling visualisations (e.g. heap profiles and ThreadScope)
it would be useful to be able to mark particular points in the execution and
have those points in time marked in the visualisation.
The traceMarker# primop currently emits an event into the eventlog. In
principle it could be extended to do something in the heap profiling too.
|
|
|
|
|
|
|
|
|
|
|
|
| |
lnat was originally "long unsigned int" but we were using it when we
wanted a 64-bit type on a 64-bit machine. This broke on Windows x64,
where long == int == 32 bits. Using types of unspecified size is bad,
but what we really wanted was a type with N bits on an N-bit machine.
StgWord is exactly that.
lnat was mentioned in some APIs that clients might be using
(e.g. StackOverflowHook()), so we leave it defined but with a comment
to say that it's deprecated.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Based on initial patches by Mikolaj Konarski <mikolaj@well-typed.com>
These new eventlog events are to let profiling tools keep track of all
the OS threads that belong to an RTS capability at any moment in time.
In the RTS, OS threads correspond to the Task abstraction, so that is
what we track. There are events for tasks being created, migrated
between capabilities and deleted. In particular the task creation event
also records the kernel thread id which lets us match up the OS thread
with data collected by others tools (in the initial use case with
Linux's perf tool, but in principle also with DTrace).
|
|
|
|
| |
I've assumed that the definition type is right.
|
|
|
|
|
|
|
|
|
|
|
| |
There was a discrepancy between GC times reported in +RTS -s
and the timestamps of GC_START and GC_END events on the cap,
on which +RTS -s stats for the given GC are based.
This is fixed by posting the events with exactly the same timestamp
as generated for the stat calculation. The calls posting the events
are moved too, so that the events are emitted close to the time instant
they claim to be emitted at. The GC_STATS_GHC was moved, too, ensuring
it's emitted before the moved GC_END on all caps, which simplifies tools code.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
They cover much the same info as is available via the GHC.Stats module
or via the '+RTS -s' textual output, but via the eventlog and with a
better sampling frequency.
We have three new generic heap info events and two very GHC-specific
ones. (The hope is the general ones are usable by other implementations
that use the same eventlog system, or indeed not so sensitive to changes
in GHC itself.)
The general ones are:
* total heap mem allocated since prog start, on a per-HEC basis
* current size of the heap (MBlocks reserved from OS for the heap)
* current size of live data in the heap
Currently these are all emitted by GHC at GC time (live data only at
major GC).
The GHC specific ones are:
* an event giving various static heap paramaters:
* number of generations (usually 2)
* max size if any
* nursary size
* MBlock and block sizes
* a event emitted on each GC containing:
* GC generation (usually just 0,1)
* total bytes copied
* bytes lost to heap slop and fragmentation
* the number of threads in the parallel GC (1 for serial)
* the maximum number of bytes copied by any par GC thread
* the total number of bytes copied by all par GC threads
(these last three can be used to calculate an estimate of the
work balance in parallel GCs)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that we can adjust the number of capabilities on the fly, we need
this reflected in the eventlog. Previously the eventlog had a single
startup event that declared a static number of capabilities. Obviously
that's no good anymore.
For compatability we're keeping the EVENT_STARTUP but adding new
EVENT_CAP_CREATE/DELETE. The EVENT_CAP_DELETE is actually just the old
EVENT_SHUTDOWN but renamed and extended (using the existing mechanism
to extend eventlog events in a compatible way). So we now emit both
EVENT_STARTUP and EVENT_CAP_CREATE. One day we will drop EVENT_STARTUP.
Since reducing the number of capabilities at runtime does not really
delete them, it just disables them, then we also have new events for
disable/enable.
The old EVENT_SHUTDOWN was in the scheduler class of events. The new
EVENT_CAP_* events are in the unconditional class, along with the
EVENT_CAPSET_* ones. Knowing when capabilities are created and deleted
is crucial to making sense of eventlogs, you always want those events.
In any case, they're extremely low volume.
|
|
|
|
|
| |
At present the number of capabilities can only be *increased*, not
decreased. The latter presents a few more challenges!
|
|
|
|
|
|
| |
The existing GHC.Conc.labelThread will now also emit the the thread
label into the eventlog. Profiling tools like ThreadScope could then
use the thread labels rather than thread numbers.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Eventlog timestamps are elapsed times (in nanoseconds) relative to the
process start. To be able to merge eventlogs from multiple processes we
need to be able to align their timelines. If they share a clock domain
(or a user judges that their clocks are sufficiently closely
synchronised) then it is sufficient to know how the eventlog timestamps
match up with the clock.
The EVENT_WALL_CLOCK_TIME contains the clock time with (up to)
nanosecond precision. It is otherwise an ordinary event and so contains
the usual timestamp for the same moment in time. It therefore enables
us to match up all the eventlog timestamps with clock time.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Replaces the existing EVENT_RUN/STEAL_SPARK events with 7 new events
covering all stages of the spark lifcycle:
create, dud, overflow, run, steal, fizzle, gc
The sampled spark events are still available. There are now two event
classes for sparks, the sampled and the fully accurate. They can be
enabled/disabled independently. By default +RTS -l includes the sampled
but not full detail spark events. Use +RTS -lf-p to enable the detailed
'f' and disable the sampled 'p' spark.
Includes work by Mikolaj <mikolaj.konarski@gmail.com>
|
|
|
|
|
|
|
| |
A new eventlog event containing 7 spark counters/statistics: sparks
created, dud, overflowed, converted, GC'd, fizzled and remaining.
These are maintained and logged separately for each capability.
We log them at startup, on each GC (minor and major) and on shutdown.
|
|
|
|
|
| |
The process ID, parent process ID, rts name and version
The program arguments and environment.
|
|
|
|
|
|
|
| |
We trace the creation and shutdown of capabilities. All the capabilities
in the process are assigned to one capabilitiy set of OS-process type.
This is a second version of the patch. Includes work by Spencer Janssen.
|
|
|
|
| |
Rather than doing it differently for the eventlog and Dtrace cases.
|
|
|
|
|
|
|
|
| |
Coutts."
This reverts commit 58532eb46041aec8d4cbb48b054cb5b001edb43c.
Turns out it didn't work on Windows and it'll need some non-trivial changes
to make it work on Windows. We'll get it in later once that's sorted out.
|
| |
|
|
|
|
|
|
|
|
|
| |
So we can now get these in ThreadScope:
19487000: cap 1: stopping thread 6 (blocked on black hole owned by thread 4)
Note: needs an update to ghc-events. Older ThreadScopes will just
ignore the new information.
|
|
|
|
| |
Schedule.c - uses them in forkProcess.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This replaces some complicated locking schemes with message-passing
in the implementation of throwTo. The benefits are
- previously it was impossible to guarantee that a throwTo from
a thread running on one CPU to a thread running on another CPU
would be noticed, and we had to rely on the GC to pick up these
forgotten exceptions. This no longer happens.
- the locking regime is simpler (though the code is about the same
size)
- threads can be unblocked from a blocked_exceptions queue without
having to traverse the whole queue now. It's a rare case, but
replaces an O(n) operation with an O(1).
- generally we move in the direction of sharing less between
Capabilities (aka HECs), which will become important with other
changes we have planned.
Also in this patch I replaced several STM-specific closure types with
a generic MUT_PRIM closure type, which allowed a lot of code in the GC
and other places to go away, hence the line-count reduction. The
message-passing changes resulted in about a net zero line-count
difference.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
added:
primop TraceEventOp "traceEvent#" GenPrimOp
Addr# -> State# s -> State# s
{ Emits an event via the RTS tracing framework. The contents
of the event is the zero-terminated byte string passed as the first
argument. The event will be emitted either to the .eventlog file,
or to stderr, depending on the runtime RTS flags. }
and added the required RTS functionality to support it. Also a bit of
refactoring in the RTS tracing code.
|
|
|
|
|
| |
These indicate the size and time span of a sequence of events in the
event log, to make it easier to sort and navigate a large event log.
|