| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This moves all URL references to Trac Wiki to their corresponding
GitLab counterparts.
This substitution is classified as follows:
1. Automated substitution using sed with Ben's mapping rule [1]
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...
2. Manual substitution for URLs containing `#` index
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz
3. Manual substitution for strings starting with `Commentary`
Old: Commentary/XxxYyy...
New: commentary/xxx-yyy...
See also !539
[1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While debugging #15285 I realized that free block lists (free_list in
BlockAlloc.c) get corrupted when multiple scavenge threads allocate and
release blocks concurrently. Here's a picture of one such race:
Thread 2 (Thread 32573.32601):
#0 check_tail
(bd=0x940d40 <stg_TSO_info>) at rts/sm/BlockAlloc.c:860
#1 0x0000000000928ef7 in checkFreeListSanity
() at rts/sm/BlockAlloc.c:896
#2 0x0000000000928979 in freeGroup
(p=0x7e998ce02880) at rts/sm/BlockAlloc.c:721
#3 0x0000000000928a17 in freeChain
(bd=0x7e998ce02880) at rts/sm/BlockAlloc.c:738
#4 0x0000000000926911 in freeChain_sync
(bd=0x7e998ce02880) at rts/sm/GCUtils.c:80
#5 0x0000000000934720 in scavenge_capability_mut_lists
(cap=0x1acae80) at rts/sm/Scav.c:1665
#6 0x000000000092b411 in gcWorkerThread
(cap=0x1acae80) at rts/sm/GC.c:1157
#7 0x000000000090be9a in yieldCapability
(pCap=0x7f9994e69e20, task=0x7e9984000b70, gcAllowed=true) at rts/Capability.c:861
#8 0x0000000000906120 in scheduleYield
(pcap=0x7f9994e69e50, task=0x7e9984000b70) at rts/Schedule.c:673
#9 0x0000000000905500 in schedule
(initialCapability=0x1acae80, task=0x7e9984000b70) at rts/Schedule.c:293
#10 0x0000000000908d4f in scheduleWorker
(cap=0x1acae80, task=0x7e9984000b70) at rts/Schedule.c:2554
#11 0x000000000091a30a in workerStart
(task=0x7e9984000b70) at rts/Task.c:444
#12 0x00007f99937fa6db in start_thread
(arg=0x7f9994e6a700) at pthread_create.c:463
#13 0x000061654d59f88f in clone
() at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 32573.32573):
#0 checkFreeListSanity
() at rts/sm/BlockAlloc.c:887
#1 0x0000000000928979 in freeGroup
(p=0x7e998d303540) at rts/sm/BlockAlloc.c:721
#2 0x0000000000926f23 in todo_block_full
(size=513, ws=0x1aa8ce0) at rts/sm/GCUtils.c:264
#3 0x00000000009583b9 in alloc_for_copy
(size=513, gen_no=0) at rts/sm/Evac.c:80
#4 0x000000000095850d in copy_tag_nolock
(p=0x7e998c675f28, info=0x421d98 <Main_Large_con_info>, src=0x7e998d075d80, size=513,
gen_no=0, tag=1) at rts/sm/Evac.c:153
#5 0x0000000000959177 in evacuate
(p=0x7e998c675f28) at rts/sm/Evac.c:715
#6 0x0000000000932388 in scavenge_small_bitmap
(p=0x7e998c675f28, size=1, bitmap=0) at rts/sm/Scav.c:271
#7 0x0000000000934aaf in scavenge_stack
(p=0x7e998c675f28, stack_end=0x7e998c676000) at rts/sm/Scav.c:1908
#8 0x0000000000934295 in scavenge_one
(p=0x7e998c66e000) at rts/sm/Scav.c:1466
#9 0x0000000000934662 in scavenge_mutable_list
(bd=0x7e998d300440, gen=0x1b1d880) at rts/sm/Scav.c:1643
#10 0x0000000000934700 in scavenge_capability_mut_lists
(cap=0x1aaa340) at rts/sm/Scav.c:1664
#11 0x00000000009299b6 in GarbageCollect
(collect_gen=0, do_heap_census=false, gc_type=2, cap=0x1aaa340, idle_cap=0x1b38aa0)
at rts/sm/GC.c:378
#12 0x0000000000907a4a in scheduleDoGC
(pcap=0x7ffdec5b5310, task=0x1b36650, force_major=false) at rts/Schedule.c:1798
#13 0x0000000000905de7 in schedule
(initialCapability=0x1aaa340, task=0x1b36650) at rts/Schedule.c:546
#14 0x0000000000908bc4 in scheduleWaitThread
(tso=0x7e998c0067c8, ret=0x0, pcap=0x7ffdec5b5430) at rts/Schedule.c:2537
#15 0x000000000091b5a0 in rts_evalLazyIO
(cap=0x7ffdec5b5430, p=0x9c11f0, ret=0x0) at rts/RtsAPI.c:530
#16 0x000000000091ca56 in hs_main
(argc=1, argv=0x7ffdec5b5628, main_closure=0x9c11f0, rts_config=...) at rts/RtsMain.c:72
#17 0x0000000000421ea0 in main
()
In particular, dbl_link_onto() which is used to add a freed block to a
doubly-linked free list is not thread safe and corrupts the list when
called concurrently.
Note that thread 1 is to blame here as thread 2 is properly taking the
spinlock. With this patch we now take the spinlock when freeing a todo
block in GC, avoiding this race.
Test Plan:
- Tried slow validate locally: this patch does not introduce new failures.
- circleci: https://circleci.com/gh/ghc/ghc-diffs/283 The test got killed
because it took 5 hours but T7919 (which was previously failing on circleci)
passed.
Reviewers: simonmar, bgamari, erikd
Reviewed By: simonmar
Subscribers: rwbarton, carter
GHC Trac Issues: #15285
Differential Revision: https://phabricator.haskell.org/D5115
|
| |
|
|
|
|
| |
Our new CPP linter enforces this.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The C code in the RTS now gets built with `-Wundef` and the Haskell code
(stages 1 and 2 only) with `-Wcpp-undef`. We now get warnings whereever
`#if` is used on undefined identifiers.
Test Plan: Validate on Linux and Windows
Reviewers: austin, angerman, simonmar, bgamari, Phyx
Reviewed By: bgamari
Subscribers: thomie, snowleopard
Differential Revision: https://phabricator.haskell.org/D3278
|
|
|
|
|
|
|
|
| |
This is causing too much platform dependent breakage at the moment. We
will need a more rigorous testing strategy before this can be
merged again.
This reverts commit 7e340c2bbf4a56959bd1e95cdd1cfdb2b7e537c2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The C code in the RTS now gets built with `-Wundef` and the Haskell code
(stages 1 and 2 only) with `-Wcpp-undef`. We now get warnings whereever
`#if` is used on undefined identifiers.
Test Plan: Validate on Linux and Windows
Reviewers: austin, angerman, simonmar, bgamari, Phyx
Reviewed By: bgamari
Subscribers: thomie, snowleopard
Differential Revision: https://phabricator.haskell.org/D3278
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Validate on lots of platforms
Reviewers: erikd, simonmar, austin
Reviewed By: erikd, simonmar
Subscribers: michalt, thomie
Differential Revision: https://phabricator.haskell.org/D2699
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.
Linux provides a NUMA API for doing two things:
* Allocating memory local to a particular node
* Binding a thread to a particular node
When given the +RTS --numa flag, the runtime will
* Determine the number of NUMA nodes (N) by querying the OS
* Assign capabilities to nodes, so cap C is on node C%N
* Bind worker threads on a capability to the correct node
* Keep a separate free lists in the block layer for each node
* Allocate the nursery for a capability from node-local memory
* Allocate blocks in the GC from node-local memory
For example, using nofib/parallel/queens on a 24-core 2-socket machine:
```
$ ./Main 15 +RTS -N24 -s -A64m
Total time 173.960s ( 7.467s elapsed)
$ ./Main 15 +RTS -N24 -s -A64m --numa
Total time 150.836s ( 6.423s elapsed)
```
The biggest win here is expected to be allocating from node-local
memory, so that means programs using a large -A value (as here).
According to perf, on this program the number of remote memory accesses
were reduced by more than 50% by using `--numa`.
Test Plan:
* validate
* There's a new flag --debug-numa=<n> that pretends to do NUMA without
actually making the OS calls, which is useful for testing the code
on non-NUMA systems.
* TODO: I need to add some unit tests
Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2199
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `nat` type was an alias for `unsigned int` with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it but mark it as deprecated.
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Avoids contention for the block allocator lock in the GC; this can be
seen in the gc_alloc_block_sync counter emitted by +RTS -s.
I experimented with this a while ago, and there was already
commented-out code for it in GCUtils.c, but I've now improved it so that
it doesn't result in significantly worse memory usage.
* The old method of putting spare blocks on ws->part_list was wasteful,
the spare blocks are now shared between all generations and retained
between GCs.
* repeated allocGroup() results in fragmentation, so I switched to using
allocLargeChunk() instead which is fragmentation-friendly; we already
use it for the same reason in nursery allocation.
|
|
|
|
|
|
|
|
|
| |
After a parallel GC, it is possible to have a long list of blocks in
ws->part_list, if we did a lot of work stealing but didn't fill up the
blocks we stole. These blocks persist until the next load-balanced GC,
which might be a long time, and during every GC we were traversing this
list to find its size. The fix is to maintain the size all the time, so
we don't have to compute it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The fix is a bit clunky, and is perhaps not the best fix, but I'm not
sure how much work it would be to fix it the other way (see comments
for more info).
Test Plan: T7919 doesn't crash
Reviewers: austin, rwbarton, ezyang, bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1113
GHC Trac Issues: #7919
|
|
|
|
| |
This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
|
|
|
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
| |
This will hopefully help ensure some basic consistency in the forward by
overriding buffer variables. In particular, it sets the wrap length, the
offset to 4, and turns off tabs.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
| |
|
|
|
|
| |
See comments for details.
|
|
|
|
|
|
| |
As far as I can tell the bug should be harmless, apart from the
failing assertion. Since the ticket reported crashes, there might be
problems elsewhere that aren't triggered by this test case.
|
|
|
|
|
|
|
|
|
|
|
|
| |
lnat was originally "long unsigned int" but we were using it when we
wanted a 64-bit type on a 64-bit machine. This broke on Windows x64,
where long == int == 32 bits. Using types of unspecified size is bad,
but what we really wanted was a type with N bits on an N-bit machine.
StgWord is exactly that.
lnat was mentioned in some APIs that clients might be using
(e.g. StackOverflowHook()), so we leave it defined but with a comment
to say that it's deprecated.
|
|
|
|
|
|
|
|
|
|
|
| |
Based on a patch from David Terei.
Some parts are a little ugly (e.g. defining things that only ASSERTs
use only when DEBUG is defined), so we might want to tweak things a
little.
I've also turned off -Werror for didn't-inline warnings, as we now
get a few such warnings.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a port of some of the changes from my private local-GC branch
(which is still in darcs, I haven't converted it to git yet). There
are a couple of small functional differences in the GC stats: first,
per-thread GC timings should now be more accurate, and secondly we now
report average and maximum pause times. e.g. from minimax +RTS -N8 -s:
Tot time (elapsed) Avg pause Max pause
Gen 0 2755 colls, 2754 par 13.16s 0.93s 0.0003s 0.0150s
Gen 1 769 colls, 769 par 3.71s 0.26s 0.0003s 0.0059s
|
|
|
|
| |
Now that we use the per-capability mutable lists exclusively.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The GC had a two-level structure, G generations each of T steps.
Steps are for aging within a generation, mostly to avoid premature
promotion.
Measurements show that more than 2 steps is almost never worthwhile,
and 1 step is usually worse than 2. In theory fractional steps are
possible, so the ideal number of steps is somewhere between 1 and 3.
GHC's default has always been 2.
We can implement 2 steps quite straightforwardly by having each block
point to the generation to which objects in that block should be
promoted, so blocks in the nursery point to generation 0, and blocks
in gen 0 point to gen 1, and so on.
This commit removes the explicit step structures, merging generations
with steps, thus simplifying a lot of code. Performance is
unaffected. The tunable number of steps is now gone, although it may
be replaced in the future by a way to tune the aging in generation 0.
|
| |
|
|
|
|
|
|
|
| |
At the moment, this just saves a memory reference in the GC inner loop
(worth a percent or two of GC time). Later, it will hopefully let me
experiment with partial steps, and simplifying the generation/step
infrastructure.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is possible for the program to allocate single object larger than a
block, without going through the normal large-object mechanisms that
we have for arrays and threads and so on.
The GC was assuming that no object was larger than a block, but #3424
contains a program that breaks the assumption. This patch removes the
assumption. The objects in question will still be copied, that is
they don't get the normal large-object treatment, but this case is
unlikely to occur often in practice.
In the future we may improve things by generating code to allocate
them as large objects in the first place.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The first phase of this tidyup is focussed on the header files, and in
particular making sure we are exposinng publicly exactly what we need
to, and no more.
- Rts.h now includes everything that the RTS exposes publicly,
rather than a random subset of it.
- Most of the public header files have moved into subdirectories, and
many of them have been renamed. But clients should not need to
include any of the other headers directly, just #include the main
public headers: Rts.h, HsFFI.h, RtsAPI.h.
- All the headers needed for via-C compilation have moved into the
stg subdirectory, which is self-contained. Most of the headers for
the rest of the RTS APIs have moved into the rts subdirectory.
- I left MachDeps.h where it is, because it is so widely used in
Haskell code.
- I left a deprecated stub for RtsFlags.h in place. The flag
structures are now exposed by Rts.h.
- Various internal APIs are no longer exposed by public header files.
- Various bits of dead code and declarations have been removed
- More gcc warnings are turned on, and the RTS code is more
warning-clean.
- More source files #include "PosixSource.h", and hence only use
standard POSIX (1003.1c-1995) interfaces.
There is a lot more tidying up still to do, this is just the first
pass. I also intend to standardise the names for external RTS APIs
(e.g use the rts_ prefix consistently), and declare the internal APIs
as hidden for shared libraries.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Generate binary log files from the RTS containing a log of runtime
events with timestamps. The log file can be visualised in various
ways, for investigating runtime behaviour and debugging performance
problems. See for example the forthcoming ThreadScope viewer.
New GHC option:
-eventlog (link-time option) Enables event logging.
+RTS -l (runtime option) Generates <prog>.eventlog with
the binary event information.
This replaces some of the tracing machinery we already had in the RTS:
e.g. +RTS -vg for GC tracing (we should do this using the new event
logging instead).
Event logging has almost no runtime cost when it isn't enabled, though
in the future we might add more fine-grained events and this might
change; hence having a link-time option and compiling a separate
version of the RTS for event logging. There's a small runtime cost
for enabling event-logging, for most programs it shouldn't make much
difference.
(Todo: docs)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
New flag: "+RTS -qb" disables load-balancing in the parallel GC
(though this is subject to change, I think we will probably want to do
something more automatic before releasing this).
To get the "PARGC3" configuration described in the "Runtime support
for Multicore Haskell" paper, use "+RTS -qg0 -qb -RTS".
The main advantage of this is that it allows us to easily disable
load-balancing altogether, which turns out to be important in parallel
programs. Maintaining locality is sometimes more important that
spreading the work out in parallel GC. There is a side benefit in
that the parallel GC should have improved locality even when
load-balancing, because each processor prefers to take work from its
own queue before stealing from others.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
BF_EVACUATED is now set on all blocks except those that we are
copying. This means we don't need a separate test for gen>N in
evacuate(), because in generations older than N, BF_EVACUATED will be
set anyway. The disadvantage is that we have to reset the
BF_EVACUATED flag on the blocks of any generation we're collecting
before starting GC. Results in a small speed improvement.
|
| |
|
|
|
|
|
|
|
|
|
| |
- GCAux.c contains code not compiled with the gct register enabled,
it is callable from outside the GC
- marking functions are moved to their relevant subsystems, outside
the GC
- mark_root needs to save the gct register, as it is called from
outside the GC
|
| |
|
| |
|
|
|
|
|
|
|
| |
- count and report number of parallel collections
- calculate bytes scanned in addition to bytes copied per thread
- calculate "work balance factor"
- tidy up the formatting a bit
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
| |
DEBUG imposes a significant performance hit in the GC, yet we often
want some of the debugging output, so -vg gives us the cheap trace
messages without the sanity checking of DEBUG, just like -vs for the
scheduler.
|
| |
|
|
|
|
| |
don't cache a work block locally if the global queue is empty
|
| |
|
|
|
|
|
| |
avoids cache contention: bd->todo_bd->free may clash with any cache
line, so we localise it.
|