author     Ilya Maximets <i.maximets@ovn.org>    2022-05-25 16:16:06 +0200
committer  Ilya Maximets <i.maximets@ovn.org>    2022-05-25 21:22:12 +0200
commit     c8c78a76e540b2c3542c655c72fe581247919e26
tree       8beb93d1b34cc8e1051d534944103c6cd06e6b57
parent     2809af022a47a0d7759acdd71e5f58140f49c750
download   openvswitch-c8c78a76e540b2c3542c655c72fe581247919e26.tar.gz
ovsdb: raft: Fix transaction double commit due to lost leadership.
When becoming a follower, the leader aborts all of its current
'in-flight' commands, so that the higher layers can retry the
corresponding transactions once a new leader is elected.  However, most
of these commands have already been sent to followers as append
requests, so they will actually be committed by the majority of the
cluster members, i.e. they will be treated as committed by the new
leader, unless there is an actual network problem between the servers.
The old leader, however, declines the append replies, since it is no
longer the leader and the commands have already been completed with
RAFT_CMD_LOST_LEADERSHIP status.  The new leader then replicates the
commit index back to the old leader, and the old leader retries the
previously "failed" transaction, because "cluster error"s are treated
as temporary.
If the transaction had prerequisites that prevent double committing, or
the database has other constraints (such as indexes) that do not allow
the same transaction to be committed twice, the server replies to the
client with a false-negative transaction result.
If there are no such prerequisites or constraints, the server executes
the same transaction again, this time as a follower.  In the OVN case,
for example, this may result in the creation of duplicated logical
switches / routers / load balancers, i.e. resources with the same
non-indexed name.  That causes issues later, when ovn-nbctl is unable
to add ports to these switches / routers.
The suggested solution is to not complete (abort) such commands, but to
allow them to be completed by the commit index update from the new
leader.  This mirrors how commands are completed in the reverse
scenario, when a follower becomes the leader.  That scenario was fixed
by commit 5a9b53a51ec9 ("ovsdb raft: Fix duplicated transaction
execution when leader failover.").
The code paths for the leader and the follower inside
raft_update_commit_index() were very similar, so they were refactored
into one, since we also need the ability to complete more than one
command on a follower.

A failure test was added to exercise the leadership-transfer scenario.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Fixes: 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.")
Reported-at: https://bugzilla.redhat.com/2046340
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
-rw-r--r--  ovsdb/raft.c            | 133
-rw-r--r--  tests/ovsdb-cluster.at  |  27
2 files changed, 99 insertions, 61 deletions
diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index a0474be59..3d8d549bf 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -78,6 +78,8 @@ enum raft_failure_test {
     FT_DELAY_ELECTION,
     FT_DONT_SEND_VOTE_REQUEST,
     FT_STOP_RAFT_RPC,
+    FT_TRANSFER_LEADERSHIP,
+    FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ,
 };
 
 static enum raft_failure_test failure_test;
@@ -1928,6 +1930,13 @@ raft_run(struct raft *raft)
         return;
     }
 
+    if (failure_test == FT_TRANSFER_LEADERSHIP) {
+        /* Using this function as it conveniently implements all we need and
+         * snapshotting is the main test scenario for leadership transfer. */
+        raft_notify_snapshot_recommended(raft);
+        failure_test = FT_NO_TEST;
+    }
+
     raft_waiters_run(raft);
 
     if (!raft->listener && time_msec() >= raft->listen_backoff) {
@@ -2064,7 +2073,14 @@ raft_run(struct raft *raft)
     HMAP_FOR_EACH_SAFE (cmd, next_cmd, hmap_node, &raft->commands) {
         if (cmd->timestamp
             && now - cmd->timestamp > raft->election_timer * 2) {
-            raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
+            if (cmd->index && raft->role != RAFT_LEADER) {
+                /* This server lost leadership and command didn't complete
+                 * in time.  Likely, it wasn't replicated to the majority
+                 * of servers before losing the leadership. */
+                raft_command_complete(raft, cmd, RAFT_CMD_LOST_LEADERSHIP);
+            } else {
+                raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
+            }
         }
     }
     raft_reset_ping_timer(raft);
@@ -2256,6 +2272,9 @@ raft_command_initiate(struct raft *raft,
     if (failure_test == FT_CRASH_AFTER_SEND_APPEND_REQ) {
         ovs_fatal(0, "Raft test: crash after sending append_request.");
     }
+    if (failure_test == FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ) {
+        failure_test = FT_TRANSFER_LEADERSHIP;
+    }
     raft_reset_ping_timer(raft);
 
     return cmd;
@@ -2622,7 +2641,13 @@ raft_become_follower(struct raft *raft)
      * configuration is already part of the log.  Possibly the configuration
      * log entry will not be committed, but until we know that we must use the
      * new configuration.  Our AppendEntries processing will properly update
-     * the server configuration later, if necessary. */
+     * the server configuration later, if necessary.
+     *
+     * Also we do not complete commands here, as they can still be completed
+     * if their log entries have already been replicated to other servers.
+     * If the entries were actually committed according to the new leader, our
+     * AppendEntries processing will complete the corresponding commands.
+     */
     struct raft_server *s;
     HMAP_FOR_EACH (s, hmap_node, &raft->add_servers) {
         raft_send_add_server_reply__(raft, &s->sid, s->address, false,
@@ -2636,8 +2661,6 @@ raft_become_follower(struct raft *raft)
         raft_server_destroy(raft->remove_server);
         raft->remove_server = NULL;
     }
-
-    raft_complete_all_commands(raft, RAFT_CMD_LOST_LEADERSHIP);
 }
 
 static void
@@ -2898,61 +2921,56 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
         return false;
     }
 
-    if (raft->role == RAFT_LEADER) {
-        while (raft->commit_index < new_commit_index) {
-            uint64_t index = ++raft->commit_index;
-            const struct raft_entry *e = raft_get_entry(raft, index);
-            if (e->data) {
-                struct raft_command *cmd
-                    = raft_find_command_by_eid(raft, &e->eid);
-                if (cmd) {
-                    if (!cmd->index) {
-                        VLOG_DBG("Command completed after role change from"
-                                 " follower to leader "UUID_FMT,
-                                 UUID_ARGS(&e->eid));
-                        cmd->index = index;
-                    }
-                    raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
+    while (raft->commit_index < new_commit_index) {
+        uint64_t index = ++raft->commit_index;
+        const struct raft_entry *e = raft_get_entry(raft, index);
+
+        if (e->data) {
+            struct raft_command *cmd = raft_find_command_by_eid(raft, &e->eid);
+            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
+
+            if (cmd) {
+                if (!cmd->index && raft->role == RAFT_LEADER) {
+                    VLOG_INFO_RL(&rl,
+                        "command completed after role change from "
+                        "follower to leader (eid: "UUID_FMT", "
+                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
+                } else if (!cmd->index && raft->role != RAFT_LEADER) {
+                    /* This can happen when leader fail-over before sending
+                     * execute_command_reply. */
+                    VLOG_INFO_RL(&rl,
+                        "command completed without reply (eid: "UUID_FMT", "
+                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
+                } else if (cmd->index && raft->role != RAFT_LEADER) {
+                    /* This can happen if current server lost leadership after
+                     * sending append requests to the majority of servers, but
+                     * before receiving majority of append replies. */
+                    VLOG_INFO_RL(&rl,
+                        "command completed after role change from "
+                        "leader to follower (eid: "UUID_FMT", "
+                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
+                    /* Clearing 'sid' to avoid sending cmd execution reply. */
+                    cmd->sid = UUID_ZERO;
+                } else {
+                    /* (cmd->index && raft->role == RAFT_LEADER)
+                     * Normal command completion on a leader. */
                 }
-            }
-            if (e->election_timer) {
-                VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
-                          raft->election_timer, e->election_timer);
-                raft->election_timer = e->election_timer;
-                raft->election_timer_new = 0;
-                raft_update_probe_intervals(raft);
-            }
-            if (e->servers) {
-                /* raft_run_reconfigure() can write a new Raft entry, which can
-                 * reallocate raft->entries, which would invalidate 'e', so
-                 * this case must be last, after the one for 'e->data'. */
-                raft_run_reconfigure(raft);
+                cmd->index = index;
+                raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
             }
         }
-    } else {
-        while (raft->commit_index < new_commit_index) {
-            uint64_t index = ++raft->commit_index;
-            const struct raft_entry *e = raft_get_entry(raft, index);
-            if (e->election_timer) {
-                VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
-                          raft->election_timer, e->election_timer);
-                raft->election_timer = e->election_timer;
-                raft_update_probe_intervals(raft);
-            }
+        if (e->election_timer) {
+            VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
+                      raft->election_timer, e->election_timer);
+            raft->election_timer = e->election_timer;
+            raft->election_timer_new = 0;
+            raft_update_probe_intervals(raft);
         }
-        /* Check if any pending command can be completed, and complete it.
-         * This can happen when leader fail-over before sending
-         * execute_command_reply. */
-        const struct uuid *eid = raft_get_eid(raft, new_commit_index);
-        struct raft_command *cmd = raft_find_command_by_eid(raft, eid);
-        if (cmd) {
-            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
-            VLOG_INFO_RL(&rl,
-                         "Command completed without reply (eid: "UUID_FMT", "
-                         "commit index: %"PRIu64")",
-                         UUID_ARGS(eid), new_commit_index);
-            cmd->index = new_commit_index;
-            raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
+        if (e->servers && raft->role == RAFT_LEADER) {
+            /* raft_run_reconfigure() can write a new Raft entry, which can
+             * reallocate raft->entries, which would invalidate 'e', so
+             * this case must be last, after the one for 'e->data'. */
+            raft_run_reconfigure(raft);
         }
     }
 
@@ -4962,6 +4980,11 @@ raft_unixctl_failure_test(struct unixctl_conn *conn OVS_UNUSED,
         failure_test = FT_DONT_SEND_VOTE_REQUEST;
     } else if (!strcmp(test, "stop-raft-rpc")) {
         failure_test = FT_STOP_RAFT_RPC;
+    } else if (!strcmp(test,
+                       "transfer-leadership-after-sending-append-request")) {
+        failure_test = FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ;
+    } else if (!strcmp(test, "transfer-leadership")) {
+        failure_test = FT_TRANSFER_LEADERSHIP;
     } else if (!strcmp(test, "clear")) {
         failure_test = FT_NO_TEST;
         unixctl_command_reply(conn, "test dismissed");
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index ee9c7b937..0f7076a05 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -468,6 +468,7 @@ ovsdb_cluster_failure_test () {
     if test "$crash_node" == "1"; then
         new_leader=$5
     fi
+    log_grep=$6
 
     cp $top_srcdir/vswitchd/vswitch.ovsschema schema
     schema=`ovsdb-tool schema-name schema`
@@ -488,7 +489,7 @@ ovsdb_cluster_failure_test () {
     start_server() {
         local i=$1
         printf "\ns$i: starting\n"
-        AT_CHECK([ovsdb-server -vjsonrpc -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
+        AT_CHECK([ovsdb-server -vjsonrpc -vraft -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
     }
     connect_server() {
         local i=$1
@@ -514,14 +515,23 @@ ovsdb_cluster_failure_test () {
         fi
         AT_CHECK([ovs-appctl -t "`pwd`"/s$delay_election_node cluster/failure-test delay-election], [0], [ignore])
     fi
+
+    # Initializing the database separately to avoid extra 'wait' operation
+    # in later transactions.
+    AT_CHECK([ovs-vsctl -v --db="$db" --no-leader-only --no-shuffle-remotes --no-wait init], [0], [ignore], [ignore])
+
     AT_CHECK([ovs-appctl -t "`pwd`"/s$crash_node cluster/failure-test $crash_command], [0], [ignore])
     AT_CHECK([ovs-vsctl -v --db="$db" --no-leader-only --no-shuffle-remotes --no-wait create QoS type=x], [0], [ignore], [ignore])
 
-    # Make sure that the node really crashed.
-    AT_CHECK([ls s$crash_node.ovsdb], [2], [ignore], [ignore])
-    # XXX: Client will fail if remotes contains unix socket that doesn't exist (killed).
-    if test "$remote_1" = "$crash_node"; then
-        db=unix:s$remote_2.ovsdb
+    # Make sure that the node really crashed or has specific log message.
+    if test -z "$log_grep"; then
+        AT_CHECK([ls s$crash_node.ovsdb], [2], [ignore], [ignore])
+        # XXX: Client will fail if remotes contains unix socket that doesn't exist (killed).
+        if test "$remote_1" = "$crash_node"; then
+            db=unix:s$remote_2.ovsdb
+        fi
+    else
+        OVS_WAIT_UNTIL([grep -q "$log_grep" s${crash_node}.log])
     fi
     AT_CHECK([ovs-vsctl --db="$db" --no-leader-only --no-wait --columns=type --bare list QoS], [0], [x
 ])
@@ -617,6 +627,11 @@ AT_KEYWORDS([ovsdb server negative unix cluster pending-txn])
 ovsdb_cluster_failure_test 2 2 3 crash-after-receiving-append-request-update
 AT_CLEANUP
 
+AT_SETUP([OVSDB cluster - txn on leader, leader transfers leadership after sending appendReq])
+AT_KEYWORDS([ovsdb server negative unix cluster pending-txn transfer])
+ovsdb_cluster_failure_test 1 2 1 transfer-leadership-after-sending-append-request -1 "Transferring leadership"
+AT_CLEANUP
+
 AT_SETUP([OVSDB cluster - competing candidates])
 AT_KEYWORDS([ovsdb server negative unix cluster competing-candidates])