Sentinel: abort failover when in wait-start if master is back.

Starting with this commit, a Leader Sentinel in wait-start state aborts the failover if the master comes back online. This improves how we handle a notable kind of net split: the split between Sentinels and Redis servers. It will be a very common case, because Sentinels will often be installed in the clients' network while the servers sit in a different arm of the network. While Sentinels and Redis servers are isolated from each other, the master is in ODOWN condition, since the Sentinels can still agree about this state, but the failover does not start because there are no good slaves to promote (in this specific case all the slaves are unreachable). However, when the split is resolved, Sentinels may sense a slave is back a moment before they sense the master is back, so the failover may start without a good reason (the master is actually working too). This condition is now reversible: the failover is aborted as soon as the master is detected to be working again, that is, in neither SDOWN nor ODOWN condition.
author: antirez <antirez@gmail.com> 2012-07-31 10:14:23 +0200
committer: antirez <antirez@gmail.com> 2012-07-31 10:19:34 +0200
commit: 75084e057dcbd0cefbd1ee035c367320f2257de6 (patch)
tree: 3279366f36bdba7f099e5cd88c1a5ebb77b7081d /src
parent: 7f5bdba4343cf32c8ae7d38a3f6d0d163677c14c (diff)
download: redis-75084e057dcbd0cefbd1ee035c367320f2257de6.tar.gz
1 files changed, 18 insertions, 0 deletions
diff --git a/src/sentinel.c b/src/sentinel.c
index 1048e8c72..d1c6befe2 100644
--- a/src/sentinel.c
+++ b/src/sentinel.c
@@ -2400,6 +2400,24 @@ sentinelRedisInstance *sentinelSelectSlave(sentinelRedisInstance *master) {
 
 /* ---------------- Failover state machine implementation ------------------- */
 void sentinelFailoverWaitStart(sentinelRedisInstance *ri) {
+    /* If we are in "wait start" but the master is no longer in ODOWN nor
+     * in SDOWN condition we abort the failover. This is important as it
+     * prevents a useless failover in a notable case of netsplit, where
+     * the sentinels are split from the redis instances. In this case
+     * the failover will not start while there is the split because no
+     * good slave can be reached. However when the split is resolved, we
+     * can go to wait-start if the slave is reachable a few milliseconds
+     * before the master is. In that case when the master is back online
+     * we cancel the failover. */
+    if ((ri->flags & (SRI_S_DOWN|SRI_O_DOWN)) == 0) {
+        sentinelEvent(REDIS_WARNING,"-failover-abort-master-is-back",
+            ri,"%@");
+        sentinelAbortFailover(ri);
+        return;
+    }
+
+    /* Start the failover going to the next state if enough time has
+     * elapsed. */
     if (mstime() >= ri->failover_start_time) {
         ri->failover_state = SENTINEL_FAILOVER_STATE_SELECT_SLAVE;
         ri->failover_state_change_time = mstime();
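
The check added to the wait-start state can be illustrated in isolation. The following is a minimal sketch, not code from sentinel.c: the `demo_*` names, the flag values and the struct layout are all hypothetical, and the abort branch only resets the state, which may well be less than what the real sentinelAbortFailover() does (it is not shown in this diff).

```c
/* Hypothetical stand-ins for the Sentinel flags and failover states.
 * Names and values are illustrative, not those defined in sentinel.c. */
#define DEMO_S_DOWN (1 << 0)   /* subjectively down */
#define DEMO_O_DOWN (1 << 1)   /* objectively down */

typedef enum {
    DEMO_STATE_NONE,           /* no failover in progress */
    DEMO_STATE_WAIT_START,     /* failover scheduled, not yet started */
    DEMO_STATE_SELECT_SLAVE    /* next step: pick a slave to promote */
} demo_state;

typedef struct {
    int flags;                     /* DEMO_S_DOWN / DEMO_O_DOWN bits */
    demo_state state;              /* current failover state */
    long long failover_start_time; /* milliseconds */
} demo_master;

/* One step of the wait-start state.
 * Returns 1 if the failover was aborted because the master is back,
 * 0 otherwise. */
int demo_wait_start(demo_master *m, long long now_ms) {
    /* Master is neither SDOWN nor ODOWN: cancel the pending failover.
     * Assumption: aborting is modelled as a plain state reset. */
    if ((m->flags & (DEMO_S_DOWN | DEMO_O_DOWN)) == 0) {
        m->state = DEMO_STATE_NONE;
        return 1;
    }

    /* Master still down: advance once the start time is reached. */
    if (now_ms >= m->failover_start_time)
        m->state = DEMO_STATE_SELECT_SLAVE;
    return 0;
}
```

Calling demo_wait_start() with both flags set and a start time still in the future leaves the state unchanged. Clearing the flags before the next call aborts the failover, which mirrors the "slave seen a few milliseconds before the master" case described in the commit message.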