diff options
author | Nathan Froyd <froydnj@mozilla.com> | 2018-05-30 10:29:53 -0400 |
---|---|---|
committer | Nathan Froyd <froydnj@mozilla.com> | 2018-05-30 10:29:53 -0400 |
commit | cea9c7ec22e0483a4d5d03245cd03f443aeb752b (patch) | |
tree | e87dae5a1a26b3873c54a00224f8655c80c3f7ac | |
parent | 08822cb4e64e092862a23d6017958bcf5c4fc29e (diff) | |
download | nss-hg-cea9c7ec22e0483a4d5d03245cd03f443aeb752b.tar.gz |
Bug 1461731 - wait indefinitely in nssSlot_IsTokenPresent; r=rrelyea,fkiefer
Crashes for a particular hang have been spiking in the last month, and
all the crashes are associated with macOS 10.12 and 10.13. The crashes
look like this:
Thread 1: waiting on a condition variable in nssSlot_IsTokenPresent
Thread 2: waiting on the (contended) lock in nssSlot_IsTokenPresent
Thread 3: waiting on the (contended) lock in nssSlot_IsTokenPresent
Thread 2 and 3 are waiting on the lock associated with the condition
variable that thread 1 is holding.
One would expect that thread 1 would drop the lock associated with the
condition variable when the wait occurs, and enable thread 2 or thread 3
to make progress. But the particular wait in question passes a
(relative) timeout of zero (which corresponds to what would be
PR_INTERVAL_NO_WAIT), which is unusual in NSS code and condition
variable-using programs in general.
A relative timeout of zero on macOS needs to be translated to an
absolute time for the underlying API, pthread_cond_timedwait. What
appears to be happening is that some absolute time, $NOW, is determined
before calling pthread_cond_timedwait. We then call into
pthread_cond_timedwait and do whatever work we need to do before
checking whether the specified time ($NOW) has passed. Of course it
has; we are at some time $NOW + epsilon, and so the wait times out.
The wait appears to time out without the lock ever being released; if
the lock was released, even if ever-so-shortly, presumably one of the
other threads would be able to make progress. Since the hang only
occurs on macOS 10.12 and 10.13, we are assuming that there was some
change in the condition variable code that attempts to optimize
extremely short timeouts, or treats timeouts of zero differently (even
if inadvertently). The other possibility is this is the way macOS has
always worked, and the crash data we have is only for those versions of
the operating system.
In any event, there's no need to specify a timeout of zero here. We can
specify an "infinite" wait instead (PR_INTERVAL_NO_TIMEOUT) and let
another thread make progress, waking us up when it is done.
-rw-r--r-- | lib/dev/devslot.c | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/lib/dev/devslot.c b/lib/dev/devslot.c index ef941f9ed..ebd6e6aa5 100644 --- a/lib/dev/devslot.c +++ b/lib/dev/devslot.c @@ -153,7 +153,7 @@ nssSlot_IsTokenPresent( /* set up condition so only one thread is active in this part of the code at a time */ PZ_Lock(slot->isPresentLock); while (slot->isPresentThread) { - PR_WaitCondVar(slot->isPresentCondition, 0); + PR_WaitCondVar(slot->isPresentCondition, PR_INTERVAL_NO_TIMEOUT); } /* if we were one of multiple threads here, the first thread will have * given us the answer, no need to make more queries of the token. */ |