author    William Schultz <william.schultz@mongodb.com>    2020-05-27 11:03:20 -0400
committer Evergreen Agent <no-reply@evergreen.mongodb.com> 2020-06-17 14:07:15 +0000
commit    f5b36d71a86cf3c46a32b9d2eb109b6ec760b512
tree      0afb4e63347467d8e73d566a3929e16c51893431
parent    5218ee3e883b0230e121ae13a7640e0bc4a313ae
SERVER-46723 Update replication architecture guide for safe reconfig
(cherry picked from commit 2de03a25a0a3dc8f7e675c33ff9e1b1370532d41)
 src/mongo/db/repl/README.md | 163
 1 file changed, 150 insertions(+), 13 deletions(-)
diff --git a/src/mongo/db/repl/README.md b/src/mongo/db/repl/README.md
index e3c6eaece89..7e9bae358ce 100644
--- a/src/mongo/db/repl/README.md
+++ b/src/mongo/db/repl/README.md
@@ -320,21 +320,23 @@ quadratically with the number of nodes and is the reasoning behind the 50 member
set. The data, `ReplSetHeartbeatArgsV1`, that accompanies every heartbeat is:
1. `ReplicaSetConfig` version
-2. The id of the sender in the `ReplSetConfig`
-3. Term
-4. Replica set name
-5. Sender host address
+2. `ReplicaSetConfig` term
+3. The id of the sender in the `ReplSetConfig`
+4. Term
+5. Replica set name
+6. Sender host address
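For illustration, the heartbeat payload can be pictured as the following model. The field names here are descriptive stand-ins, not the exact BSON fields used by `ReplSetHeartbeatArgsV1`:

```python
# Illustrative model of the data carried by a heartbeat; field names
# are descriptive stand-ins, not the actual BSON field names.
heartbeat_args = {
    "config_version": 2,  # ReplicaSetConfig version
    "config_term": 1,     # ReplicaSetConfig term
    "sender_id": 0,       # id of the sender in the ReplSetConfig
    "term": 1,            # sender's election term
    "set_name": "rs0",    # replica set name
    "sender_host": "mongodb0.example.net:27017",
}
```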
When the remote node receives the heartbeat, it first processes the heartbeat data, and then sends a
response back. First, the remote node makes sure the heartbeat is compatible with its replica set
-name and its `ReplicaSetConfig` version and otherwise sends an error.
+name, and sends an error if it is not.
The receiving node's `TopologyCoordinator` updates the last time it received a heartbeat from the
sending node for liveness checking in its `MemberHeartbeatData` list.
-If the sending node's config is higher than the receiving node's, then the receiving node schedules
-a heartbeat to get the config. The receiving node's `ReplicationCoordinator` also updates its
-`SlaveInfo` with the last update from the sending node and marks it as being up.
+If the sending node's config is newer than the receiving node's, then the receiving node schedules a
+heartbeat to get the config. The receiving node's `ReplicationCoordinator` also updates its
+`SlaveInfo` with the last update from the sending node and marks it as being up. See more details on
+config propagation via heartbeats in the [Reconfiguration](#Reconfiguration) section.
It then creates a `ReplSetHeartbeatResponse` object. This includes:
@@ -346,7 +348,7 @@ It then creates a `ReplSetHeartbeatResponse` object. This includes:
6. The term of the receiving node
7. The state of the receiving node
8. The receiving node's sync source
-9. The receiving node's `ReplicaSetConfig` version
+9. The receiving node's `ReplicaSetConfig` version and term
10. Whether the receiving node is primary
When the sending node receives the response to the heartbeat, it first processes its
@@ -434,8 +436,7 @@ When a node receives a `replSetUpdatePosition` command, the first thing it does
For every node’s OpTime data in the `optimes` array, the receiving node updates its view of the
replica set in the replication and topology coordinators. This updates the liveness information of
every node in the `optimes` list. If the data is about the receiving node, it ignores it. If the
-`ReplSetConfig` versions don’t match, it errors. If the receiving node is a primary and it learns
-that the commit point should be moved forward, it does so.
+receiving node is a primary and it learns that the commit point should be moved forward, it does so.
If something has changed and the receiving node itself has a sync source, it forwards its new
information to its own sync source.
@@ -1148,7 +1149,7 @@ updates its own term accordingly. The `ReplicationCoordinator` then asks the `To
if it should grant a vote. The vote is rejected if:
1. It's from an older term.
-2. The config versions do not match.
+2. The configs do not match (see more detail in [Config Ordering and Elections](#config-ordering-and-elections)).
3. The replica set name does not match.
4. The last applied OpTime that comes in the vote request is older than the voter's last applied
OpTime.
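The rejection rules above can be sketched as follows. This is a simplified model; the field and function names are illustrative, not the actual `TopologyCoordinator` interface:

```python
def vote_rejection_reason(request, voter):
    """Return the reason a vote request is rejected, or None if granted.

    `request` and `voter` are dicts modeling the vote request and the
    voter's local state; all names here are illustrative.
    """
    if request["term"] < voter["term"]:
        return "older term"
    if request["config"] != voter["config"]:
        return "config mismatch"
    if request["set_name"] != voter["set_name"]:
        return "replica set name mismatch"
    if request["last_applied_optime"] < voter["last_applied_optime"]:
        return "candidate's last applied optime is too old"
    return None
```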
@@ -1237,7 +1238,7 @@ that is okay. If you consider the minimum spanning tree on the cluster where edg
from nodes to their sync source, then as long as the primary is connected to a majority of nodes, it
will stay primary.
* Force reconfig via the `replSetReconfig` command
-* Force reconfig via heartbeat: If we learn of a newer config version through heartbeats, we will
+* Force reconfig via heartbeat: If we learn of a newer config through heartbeats, we will
schedule a replica set config change.
During unconditional stepdown, we do not check preconditions before attempting to step down. Similar
@@ -1476,6 +1477,142 @@ flag and tell the storage engine that the [`initialDataTimestamp`](#replication-
is the node's last applied OpTime. Finally, the `InitialSyncer` shuts down and the
`ReplicationCoordinator` starts steady state replication.
+# Reconfiguration
+
+MongoDB replica sets consist of a set of members, where a *member* corresponds to a single
+participant of the replica set, identified by a host name and port. The term *node* refers to the
+mongod server process that corresponds to a particular replica set member. A replica set
+*configuration* consists of a list of members in a replica set along with some member specific
+settings as well as global settings for the set. We alternately refer to a configuration as a
+*config*, for brevity. Each member of the config has a [member
+id](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/member_id.h), which is a
+unique integer identifier for that member. The schema of a config is defined in the
+[ReplSetConfig](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/repl_set_config.h#L110-L547)
+class, which is serialized as a BSON object and stored durably in the `local.system.replset`
+collection on each replica set node.
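For illustration, a minimal config document might look like the following. Real configs carry many more optional settings (see the `ReplSetConfig` schema linked above), and the host names here are placeholders:

```python
# A minimal replica set config document, as it might be stored in
# local.system.replset. Host names are placeholders.
config = {
    "_id": "rs0",   # replica set name
    "version": 1,   # incremented on each reconfig
    "term": 1,      # term of the primary that created this config
    "members": [
        {"_id": 0, "host": "mongodb0.example.net:27017"},
        {"_id": 1, "host": "mongodb1.example.net:27017"},
        {"_id": 2, "host": "mongodb2.example.net:27017"},
    ],
}
```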
+
+## Initiation
+
+When the mongod processes for members of a replica set are first started, they have no configuration
+installed and they do not communicate with each other over the network or replicate any data. To
+initialize the replica set, an initial config must be provided via the `replSetInitiate` command, so
+that nodes know who the other members of the replica set are. Upon receiving this command, which can
+be run on any node of an uninitialized set, a node validates and installs the specified config. It
+then establishes connections to and begins sending heartbeats to the other nodes of the replica set
+contained in the configuration it installed. Configurations are propagated between nodes via
+heartbeats, which is how nodes in the replica set will receive and install the initial config.
+
+## Reconfiguration Behavior
+
+To update the current configuration, a client may execute the `replSetReconfig` command with the
+new, desired config. Reconfigurations [can be run
+](https://github.com/mongodb/mongo/blob/892bce4528b2ec97d9f264b5a982d54da0e4971d/src/mongo/db/repl/repl_set_commands.cpp#L419-L421) in
+*safe* mode or in *force* mode. We alternately refer to reconfigurations as *reconfigs*, for
+brevity. Safe reconfigs, which are the default, can only be run against primary nodes and ensure the
+replication safety guarantee that majority committed writes will not be rolled back. Force reconfigs
+can be run against either a primary or secondary node and their usage may cause the rollback of
+majority committed writes. Although force reconfigs are unsafe, they exist to allow users to salvage
+or repair a replica set where a majority of nodes are no longer operational or reachable.
+
+### Safe Reconfig Protocol
+
+The safe reconfiguration protocol implemented in MongoDB shares certain conceptual similarities with
+the "single server" reconfiguration approach described in Section 4 of the [Raft PhD
+thesis](https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf), but was designed with some
+differences to integrate with the existing, heartbeat-based reconfig protocol more easily.
+
+Note that in a static configuration, the safety of the Raft protocol depends on the fact that any
+two quorums (i.e. majorities) of a replica set have at least one member in common; that is, they satisfy
+the *quorum overlap* property. For any two arbitrary configurations, however, this is not the case.
+So, extra restrictions are placed on how nodes are allowed to move between configurations. First,
+all safe reconfigs enforce a **[single node
+change](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/repl_set_config_checks.cpp#L82-L89)**
+condition, which requires that no more than a single voting node is added or removed in a single
+reconfig. Any number of non-voting nodes can be added or removed in a single reconfig. This
+constraint ensures that any adjacent configs satisfy quorum overlap. You can see a justification of
+why this is true in the Raft thesis section referenced above.
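The single node change condition can be sketched as follows. This is a simplified model in which member lists are represented as `(member_id, votes)` pairs rather than the actual `ReplSetConfig` types:

```python
def satisfies_single_node_change(old_members, new_members):
    """Check that at most one voting member is added or removed between
    two configs. Members are modeled as (member_id, votes) pairs; any
    number of non-voting members may change freely.
    """
    old_voters = {mid for mid, votes in old_members if votes > 0}
    new_voters = {mid for mid, votes in new_members if votes > 0}
    # The symmetric difference counts voting members added plus removed.
    return len(old_voters ^ new_voters) <= 1
```

Note that under this model, a member whose `votes` setting changes from 1 to 0 (or vice versa) counts as a voting-member removal (or addition).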
+
+Even though the single node change condition ensures quorum overlap between two adjacent configs,
+quorum overlap may not always be ensured between configs on all nodes of the system, so there are
+two additional constraints that must be satisfied before a primary node can install a new
+configuration:
+
+1. **[Config
+Replication](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/replication_coordinator_impl.cpp#L3531-L3534)**:
+The current config, C, must be installed on at least a majority of voting nodes in C.
+2. **[Oplog
+Commitment](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/replication_coordinator_impl.cpp#L3553-L3560)**:
+Any oplog entries that were majority committed in the previous config, C0, must be replicated to at
+least a majority of voting nodes in the current config, C1.
+
+Condition 1 ensures that any configs earlier than C can no longer independently form a quorum to
+elect a node or commit a write. Condition 2 ensures that committed writes in any older configs are
+now committed by the rules of the current configuration. This guarantees that any leaders elected in
+a subsequent configuration will contain these entries in their logs upon assuming the role of leader.
+When both conditions are satisfied, we say that the current config is *committed*.
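The two conditions can be sketched as majority checks. This is a simplified model in which optimes are integers and all names are illustrative:

```python
def majority(n):
    """Smallest majority of n voters."""
    return n // 2 + 1

def config_replicated(voting_members, installed_config, current_config_id):
    """Condition 1: the current config is installed on a majority of its
    own voting members. `installed_config` maps member id -> the id of
    the config that member has installed."""
    count = sum(1 for m in voting_members
                if installed_config.get(m) == current_config_id)
    return count >= majority(len(voting_members))

def oplog_committed(voting_members, last_applied, commit_optime):
    """Condition 2: entries majority-committed in the previous config
    (up through commit_optime) have replicated to a majority of the
    current config's voting members. Optimes are modeled as integers."""
    count = sum(1 for m in voting_members
                if last_applied.get(m, 0) >= commit_optime)
    return count >= majority(len(voting_members))
```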
+
+We wait for both of these conditions to become true at the
+[beginning](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/repl_set_commands.cpp#L421-L437)
+of the `replSetReconfig` command, before installing the new config. Satisfaction of these conditions
+before transitioning to a new config is fundamental to the safety of the reconfig protocol. After
+satisfying these conditions and installing the new config, we also wait for condition 1 to become
+true of the new config at the
+[end](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/repl_set_commands.cpp#L442-L454)
+of the reconfig command. This waiting ensures that the new config is installed on a majority of
+nodes before reconfig returns success, but it is not strictly necessary for guaranteeing safety. If
+it fails, an error will be returned, but the new config will have already been installed and can
+begin to propagate. On a subsequent reconfig, we will still ensure that both safety conditions are
+satisfied before installing the next config. By waiting for config replication at the end of the
+reconfig command, however, we can make the waiting period shorter at the beginning of the next
+reconfig, in addition to ensuring that the newly installed config will be present on a subsequent
+primary.
+
+Note that force reconfigs bypass conditions 1 and 2 entirely, and they do not enforce the single
+node change condition.
+
+### Config Ordering and Elections
+
+As mentioned above, configs are propagated between nodes via heartbeats. To do this properly, nodes
+must have some way of determining if one config is "newer" than another. Each configuration has a
+`term` and `version` field, and configs are totally ordered by the [`(version,
+term)`](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/repl_set_config.h#L50-L55)
+pair, where `term` is compared first, and then `version`, analogous to the rules for optime
+comparison. The `term` of a config is the term of the primary that originally created that config,
+and the `version` is a monotonically increasing number assigned to each config. When executing a
+reconfig, the version of the new config must be greater than the version of the current config. If
+the `(version, term)` pair of config A is greater than that of config B, then it is considered
+"newer" than config B. If a node hears about a newer config via a heartbeat from another node, it
+will [schedule a
+heartbeat](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/replication_coordinator_impl.cpp#L5019-L5036)
+to fetch the config and
+[install](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/topology_coordinator.cpp#L892-L895)
+it locally.
+
+Note that force reconfigs set the new config's term to an [uninitialized term
+value](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/optime.h#L58-L59). When we
+compare two configs, if either of them has an uninitialized term value, then we only consider config
+versions for comparison. A force reconfig also [increments the
+version](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/replication_coordinator_impl.cpp#L3227-L3232)
+of the current config by a large, random number. This makes it very likely that the force config
+will be "newer" than any other config in the system.
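Putting the two paragraphs above together, the ordering can be sketched as follows. The `UNINITIALIZED_TERM` constant and function name are illustrative, not MongoDB identifiers:

```python
UNINITIALIZED_TERM = -1  # stand-in for the uninitialized term a force reconfig assigns

def config_is_newer(a, b):
    """Return True if config a is newer than config b, where each config
    is modeled as a (version, term) pair."""
    (va, ta), (vb, tb) = a, b
    if ta == UNINITIALIZED_TERM or tb == UNINITIALIZED_TERM:
        # If either term is uninitialized, compare versions only.
        return va > vb
    # Otherwise, term is compared first, then version.
    return (ta, va) > (tb, vb)
```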
+
+Config ordering also affects voting behavior. If a replica set node is a candidate for election in
+config `(vc, tc)`, then a prospective voter with config `(v, t)` will only cast a vote for the
+candidate if `(vc, tc) = (v, t)`. For correctness, it would be acceptable for a voter to cast
+its vote whenever `(vc, tc) >= (v, t)`, but the current implementation is more restrictive. For a
+description of the complete voting behavior, see the [Elections](#Elections) section.
+
+### Formal Specification
+
+For more details on the safe reconfig protocol and its behaviors, refer to the [TLA+
+specification](https://github.com/mongodb/mongo/tree/r4.4.0-rc6/src/mongo/db/repl/tla_plus/MongoReplReconfig).
+It defines two main invariants of the protocol,
+[ElectionSafety](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/tla_plus/MongoReplReconfig/MongoReplReconfig.tla#L403-L404)
+and
+[NeverRollbackCommitted](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/tla_plus/MongoReplReconfig/MongoReplReconfig.tla#L413-L420),
+which assert, respectively, that no two leaders are elected in the same term and that majority
+committed writes are never rolled back.
+
# Startup Recovery
**Startup recovery** is a node's process for putting both the oplog and data into a consistent state