summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorÖmer Sinan Ağacan <omeragacan@gmail.com>2020-03-13 21:25:27 +0300
committerMarge Bot <ben+marge-bot@smart-cactus.org>2020-03-15 10:58:27 -0400
commitcfcc3c9a1f2e4e33bed4c40767f8e7971e331c15 (patch)
tree6e84c72e3468baee47a7f72f6f92caad28e7311f
parentc35c545d3f32f092c52052349f741739a844ec0f (diff)
downloadhaskell-cfcc3c9a1f2e4e33bed4c40767f8e7971e331c15.tar.gz
Fix global_link of TSOs for threads reachable via dead weaks
Fixes #17785 Here's how the problem occurs: - In generation 0 we have a TSO that is finished (i.e. it has no more work to do or it is killed). - The TSO only becomes reachable after collectDeadWeakPtrs(). - After collectDeadWeakPtrs() we switch to WeakDone phase where we don't move TSOs to different lists anymore (like the next gen's thread list or the resurrected_threads list). - So the TSO will never be moved to a generation's thread list, but it will be promoted to generation 1. - Generation 1 collected via mark-compact, and because the TSO is reachable it is marked, and its `global_link` field, which is bogus at this point (because the TSO is not in a list), will be threaded. - Chaos ensues. In other words, when these conditions hold: - A TSO is reachable only after collectDeadWeakPtrs() - It's finished (what_next is ThreadComplete or ThreadKilled) - It's retained by mark-compact collector (moving collector doesn't evacuate the global_list field) We end up doing random mutations on the heap because the TSO's global_list field is not valid, but it still looks like a heap pointer so we thread it during compacting GC. The fix is simple: when we traverse old_threads lists to resurrect unreachable threads the threads that won't be resurrected currently stays on the old_threads lists. Those threads will never be visited again by MarkWeak so we now reset the global_list fields. This way compacting GC does not thread pointers to nowhere. Testing ------- The reproducer in #17785 is quite large and hard to build, because of the dependencies, so I'm not adding a regression test. In my testing the reproducer would take a less than 5 seconds to run, and once in every ~5 runs would fail with a segfault or an assertion error. In other cases it also fails with a test failure. Because the tests never fail with the bug fix, assuming the code is correct, this also means that this bug can sometimes lead to incorrect runtime results. After the fix I was able to run the reproducer repeatedly for about an hour, with no runtime crashes or test failures. To run the reproducer clone the git repo: $ git clone https://github.com/osa1/streamly --branch ghc-segfault Then clone primitive and atomic-primops from their git repos and point to the clones in cabal.project.local. The project should then be buildable using GHC HEAD. Run the executable `properties` with `+RTS -c -DZ`. In addition to the reproducer above I run the test suite using: $ make slowtest EXTRA_HC_OPTS="-debug -with-rtsopts=-DS \ -with-rtsopts=-c +RTS -c -RTS" SKIPWAY='nonmoving nonmoving_thr' This enables compacting GC always in both GHC when building the test programs and when running the test programs, and also enables sanity checking when running the test programs. These set of flags are not compatible for all tests so there are some failures, but I got the same set of failures with this patch compared to GHC HEAD.
-rw-r--r--rts/sm/MarkWeak.c8
1 files changed, 8 insertions, 0 deletions
diff --git a/rts/sm/MarkWeak.c b/rts/sm/MarkWeak.c
index 7475b5e625..65b1338f10 100644
--- a/rts/sm/MarkWeak.c
+++ b/rts/sm/MarkWeak.c
@@ -213,6 +213,12 @@ static bool resurrectUnreachableThreads (generation *gen, StgTSO **resurrected_t
switch (t->what_next) {
case ThreadKilled:
case ThreadComplete:
+ // The thread was unreachable so far, but it might still end up
+ // being reachable later, e.g. after collectDeadWeakPtrs(). We don't
+ // want the global_link field to be dangling in that case, so reset
+ // it to END_TSO_QUEUE. The copying GC doesn't currently care, but
+ // the compacting GC does, see #17785.
+ t->global_link = END_TSO_QUEUE;
continue;
default:
tmp = t;
@@ -222,6 +228,8 @@ static bool resurrectUnreachableThreads (generation *gen, StgTSO **resurrected_t
flag = true;
}
}
+
+ gen->old_threads = END_TSO_QUEUE;
return flag;
}