| author | William Schultz <william.schultz@mongodb.com> | 2020-05-27 11:03:20 -0400 |
| --- | --- | --- |
| committer | Evergreen Agent <no-reply@evergreen.mongodb.com> | 2020-06-17 14:07:15 +0000 |
| commit | f5b36d71a86cf3c46a32b9d2eb109b6ec760b512 | |
| tree | 0afb4e63347467d8e73d566a3929e16c51893431 | |
| parent | 5218ee3e883b0230e121ae13a7640e0bc4a313ae | |
| download | mongo-f5b36d71a86cf3c46a32b9d2eb109b6ec760b512.tar.gz | |
SERVER-46723 Update replication architecture guide for safe reconfig
(cherry picked from commit 2de03a25a0a3dc8f7e675c33ff9e1b1370532d41)
-rw-r--r-- | src/mongo/db/repl/README.md | 163
1 file changed, 150 insertions(+), 13 deletions(-)
diff --git a/src/mongo/db/repl/README.md b/src/mongo/db/repl/README.md
index e3c6eaece89..7e9bae358ce 100644
--- a/src/mongo/db/repl/README.md
+++ b/src/mongo/db/repl/README.md
@@ -320,21 +320,23 @@ quadratically with the number of nodes and is the reasoning behind the 50 member
 set.

 The data, `ReplSetHeartbeatArgsV1` that accompanies every heartbeat is:

 1. `ReplicaSetConfig` version
-2. The id of the sender in the `ReplSetConfig`
-3. Term
-4. Replica set name
-5. Sender host address
+2. `ReplicaSetConfig` term
+3. The id of the sender in the `ReplSetConfig`
+4. Term
+5. Replica set name
+6. Sender host address

 When the remote node receives the heartbeat, it first processes the heartbeat data, and then sends a
 response back. First, the remote node makes sure the heartbeat is compatible with its replica set
-name and its `ReplicaSetConfig` version and otherwise sends an error.
+name. Otherwise it sends an error.

 The receiving node's `TopologyCoordinator` updates the last time it received a heartbeat from the
 sending node for liveness checking in its `MemberHeartbeatData` list.

-If the sending node's config is higher than the receiving node's, then the receiving node schedules
-a heartbeat to get the config. The receiving node's `ReplicationCoordinator` also updates its
-`SlaveInfo` with the last update from the sending node and marks it as being up.
+If the sending node's config is newer than the receiving node's, then the receiving node schedules a
+heartbeat to get the config. The receiving node's `ReplicationCoordinator` also updates its
+`SlaveInfo` with the last update from the sending node and marks it as being up. See more details on
+config propagation via heartbeats in the [Reconfiguration](#Reconfiguration) section.

 It then creates a `ReplSetHeartbeatResponse` object. This includes:
@@ -346,7 +348,7 @@ It then creates a `ReplSetHeartbeatResponse` object. This includes:
 6. The term of the receiving node
 7. The state of the receiving node
 8. The receiving node's sync source
-9. The receiving node's `ReplicaSetConfig` version
+9. The receiving node's `ReplicaSetConfig` version and term
 10. Whether the receiving node is primary

 When the sending node receives the response to the heartbeat, it first processes its
@@ -434,8 +436,7 @@ When a node receives a `replSetUpdatePosition` command, the first thing it does
 For every node’s OpTime data in the `optimes` array, the receiving node updates its view of the
 replicaset in the replication and topology coordinators. This updates the liveness information of
 every node in the `optimes` list. If the data is about the receiving node, it ignores it. If the
-`ReplSetConfig` versions don’t match, it errors. If the receiving node is a primary and it learns
-that the commit point should be moved forward, it does so.
+receiving node is a primary and it learns that the commit point should be moved forward, it does so.

 If something has changed and the receiving node itself has a sync source, it forwards its new
 information to its own sync source.
@@ -1148,7 +1149,7 @@ updates its own term accordingly. The `ReplicationCoordinator` then asks the `To
 if it should grant a vote. The vote is rejected if:

 1. It's from an older term.
-2. The config versions do not match.
+2. The configs do not match (see more detail in [Config Ordering and Elections](#config-ordering-and-elections)).
 3. The replica set name does not match.
 4. The last applied OpTime that comes in the vote request is older than the voter's last applied
 OpTime.
@@ -1237,7 +1238,7 @@ that is okay. If you consider the minimum spanning tree on the cluster where edg
 from nodes to their sync source, then as long as the primary is connected to a majority of nodes, it
 will stay primary.
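The vote-rejection checks listed in the hunk above can be sketched in a few lines of C++. This is only an editor's illustration of the decision logic, not MongoDB's implementation; the type and function names (`VoteRequest`, `Voter`, `shouldRejectVote`) are hypothetical, and OpTimes are simplified to `(term, timestamp)` pairs.

```cpp
#include <string>
#include <tuple>

// Simplified OpTime: ordered by term first, then timestamp.
struct OpTime {
    long long term;
    long long timestamp;
    bool operator<(const OpTime& o) const {
        return std::tie(term, timestamp) < std::tie(o.term, o.timestamp);
    }
};

// Hypothetical view of a vote request and of the voter's local state.
struct VoteRequest {
    long long term;
    long long configVersion;
    long long configTerm;
    std::string setName;
    OpTime lastAppliedOpTime;
};

struct Voter {
    long long term;
    long long configVersion;
    long long configTerm;
    std::string setName;
    OpTime lastAppliedOpTime;
};

// Returns true if the voter should reject the vote request, per the four
// conditions in the text above.
bool shouldRejectVote(const Voter& v, const VoteRequest& req) {
    if (req.term < v.term) return true;                            // 1. older term
    if (req.configVersion != v.configVersion ||
        req.configTerm != v.configTerm) return true;               // 2. configs do not match
    if (req.setName != v.setName) return true;                     // 3. replica set name mismatch
    if (req.lastAppliedOpTime < v.lastAppliedOpTime) return true;  // 4. stale last applied OpTime
    return false;
}
```

Note that condition 2 compares both the config version and the config term, matching the change in this commit from a version-only check to a full config comparison.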
 * Force reconfig via the `replSetReconfig` command
-* Force reconfig via heartbeat: If we learn of a newer config version through heartbeats, we will
+* Force reconfig via heartbeat: If we learn of a newer config through heartbeats, we will
 schedule a replica set config change.

 During unconditional stepdown, we do not check preconditions before attempting to step down. Similar
@@ -1476,6 +1477,142 @@ flag and tell the storage engine that the [`initialDataTimestamp`](#replication-
 is the node's last applied OpTime. Finally, the `InitialSyncer` shuts down and the
 `ReplicationCoordinator` starts steady state replication.

+# Reconfiguration
+
+MongoDB replica sets consist of a set of members, where a *member* corresponds to a single
+participant of the replica set, identified by a host name and port. We refer to a *node* as the
+mongod server process that corresponds to a particular replica set member. A replica set
+*configuration* consists of a list of the members in a replica set along with some member-specific
+settings as well as global settings for the set. We alternately refer to a configuration as a
+*config*, for brevity. Each member of the config has a [member
+id](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/member_id.h), which is a
+unique integer identifier for that member. The schema of a config is defined in the
+[ReplSetConfig](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/repl_set_config.h#L110-L547)
+class, which is serialized as a BSON object and stored durably in the `local.system.replset`
+collection on each replica set node.
+
+## Initiation
+
+When the mongod processes for members of a replica set are first started, they have no configuration
+installed and they do not communicate with each other over the network or replicate any data. To
+initialize the replica set, an initial config must be provided via the `replSetInitiate` command, so
+that nodes know who the other members of the replica set are. Upon receiving this command, which can
+be run on any node of an uninitialized set, a node validates and installs the specified config. It
+then establishes connections to and begins sending heartbeats to the other nodes of the replica set
+contained in the configuration it installed. Configurations are propagated between nodes via
+heartbeats, which is how nodes in the replica set will receive and install the initial config.
+
+## Reconfiguration Behavior
+
+To update the current configuration, a client may execute the `replSetReconfig` command with the
+new, desired config. Reconfigurations [can be
+run](https://github.com/mongodb/mongo/blob/892bce4528b2ec97d9f264b5a982d54da0e4971d/src/mongo/db/repl/repl_set_commands.cpp#L419-L421)
+in *safe* mode or in *force* mode. We alternately refer to reconfigurations as *reconfigs*, for
+brevity. Safe reconfigs, which are the default, can only be run against primary nodes and ensure the
+replication safety guarantee that majority committed writes will not be rolled back. Force reconfigs
+can be run against either a primary or secondary node, and their usage may cause the rollback of
+majority committed writes. Although force reconfigs are unsafe, they exist to allow users to salvage
+or repair a replica set where a majority of nodes are no longer operational or reachable.
+
+### Safe Reconfig Protocol
+
+The safe reconfiguration protocol implemented in MongoDB shares certain conceptual similarities with
+the "single server" reconfiguration approach described in Section 4 of the [Raft PhD
+thesis](https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf), but was designed with some
+differences to integrate more easily with the existing, heartbeat-based reconfig protocol.
+
+Note that in a static configuration, the safety of the Raft protocol depends on the fact that any
+two quorums (i.e. majorities) of a replica set have at least one member in common, i.e. they satisfy
+the *quorum overlap* property. For any two arbitrary configurations, however, this is not the case.
+So, extra restrictions are placed on how nodes are allowed to move between configurations. First,
+all safe reconfigs enforce a **[single node
+change](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/repl_set_config_checks.cpp#L82-L89)**
+condition, which requires that no more than a single voting node is added or removed in a single
+reconfig. Any number of non-voting nodes can be added or removed in a single reconfig. This
+constraint ensures that any two adjacent configs satisfy quorum overlap. A justification of why this
+is true appears in the Raft thesis section referenced above.
+
+Even though the single node change condition ensures quorum overlap between two adjacent configs,
+quorum overlap may not always be ensured between configs on all nodes of the system, so there are
+two additional constraints that must be satisfied before a primary node can install a new
+configuration:
+
+1. **[Config
+Replication](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/replication_coordinator_impl.cpp#L3531-L3534)**:
+The current config, C, must be installed on at least a majority of voting nodes in C.
+2. **[Oplog
+Commitment](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/replication_coordinator_impl.cpp#L3553-L3560)**:
+Any oplog entries that were majority committed in the previous config, C0, must be replicated to at
+least a majority of voting nodes in the current config, C1.
+
+Condition 1 ensures that any configs earlier than C can no longer independently form a quorum to
+elect a node or commit a write. Condition 2 ensures that committed writes in any older configs are
+now committed by the rules of the current configuration. This guarantees that any leaders elected in
+a subsequent configuration will contain these entries in their logs upon assuming leadership.
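The two preconditions above are quorum checks over the voting members of the current config. As a rough sketch (an editor's illustration only; `MemberState`, `configReplicated`, and `oplogCommitted` are hypothetical names, and OpTimes are simplified to scalars):

```cpp
#include <vector>

// Hypothetical per-member state as seen by the primary running the reconfig.
struct MemberState {
    bool isVoter;                      // voting member of the current config C
    long long installedConfigVersion;  // newest config version this member has installed
    long long lastAppliedOpTime;       // simplified scalar OpTime
};

// Smallest number of voters that forms a majority.
static int votingMajority(const std::vector<MemberState>& members) {
    int voters = 0;
    for (const auto& m : members)
        if (m.isVoter) ++voters;
    return voters / 2 + 1;
}

// Condition 1 (Config Replication): config C is installed on a majority of
// voting nodes in C.
bool configReplicated(const std::vector<MemberState>& members, long long configVersion) {
    int count = 0;
    for (const auto& m : members)
        if (m.isVoter && m.installedConfigVersion >= configVersion) ++count;
    return count >= votingMajority(members);
}

// Condition 2 (Oplog Commitment): entries majority-committed in the previous
// config are replicated to a majority of voting nodes in the current config.
bool oplogCommitted(const std::vector<MemberState>& members, long long committedOpTime) {
    int count = 0;
    for (const auto& m : members)
        if (m.isVoter && m.lastAppliedOpTime >= committedOpTime) ++count;
    return count >= votingMajority(members);
}
```

Only when both predicates hold for the current config may the primary install the next one; this is what the text below means by the current config being *committed*.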
+
+When both conditions are satisfied, we say that the current config is *committed*.
+
+We wait for both of these conditions to become true at the
+[beginning](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/repl_set_commands.cpp#L421-L437)
+of the `replSetReconfig` command, before installing the new config. Satisfaction of these conditions
+before transitioning to a new config is fundamental to the safety of the reconfig protocol. After
+satisfying these conditions and installing the new config, we also wait for condition 1 to become
+true of the new config at the
+[end](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/repl_set_commands.cpp#L442-L454)
+of the reconfig command. This waiting ensures that the new config is installed on a majority of
+nodes before reconfig returns success, but it is not strictly necessary for guaranteeing safety. If
+it fails, an error will be returned, but the new config will have already been installed and can
+begin to propagate. On a subsequent reconfig, we will still ensure that both safety conditions are
+satisfied before installing the next config. By waiting for config replication at the end of the
+reconfig command, however, we can make the waiting period shorter at the beginning of the next
+reconfig, in addition to ensuring that the newly installed config will be present on a subsequent
+primary.
+
+Note that force reconfigs bypass all checks of conditions 1 and 2, and they do not enforce the
+single node change condition.
+
+### Config Ordering and Elections
+
+As mentioned above, configs are propagated between nodes via heartbeats. To do this properly, nodes
+must have some way of determining if one config is "newer" than another. Each configuration has a
+`term` and `version` field, and configs are totally ordered by the [`(version,
+term)`](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/repl_set_config.h#L50-L55)
+pair, where `term` is compared first, and then `version`, analogous to the rules for optime
+comparison. The `term` of a config is the term of the primary that originally created that config,
+and the `version` is a monotonically increasing number assigned to each config. When executing a
+reconfig, the version of the new config must be greater than the version of the current config. If
+the `(version, term)` pair of config A is greater than that of config B, then config A is considered
+"newer" than config B. If a node hears about a newer config via a heartbeat from another node, it
+will [schedule a
+heartbeat](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/replication_coordinator_impl.cpp#L5019-L5036)
+to fetch the config and
+[install](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/topology_coordinator.cpp#L892-L895)
+it locally.
+
+Note that force reconfigs set the new config's term to an [uninitialized term
+value](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/optime.h#L58-L59). When we
+compare two configs, if either of them has an uninitialized term value, then we only consider config
+versions for comparison. A force reconfig also [increments the
+version](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/replication_coordinator_impl.cpp#L3227-L3232)
+of the current config by a large, random number. This makes it very likely that the forced config
+will be "newer" than any other config in the system.
+
+Config ordering also affects voting behavior. If a replica set node is a candidate for election in
+config `(vc, tc)`, then a prospective voter with config `(v, t)` will only cast a vote for the
+candidate if `(vc, tc) = (v, t)`. For correctness, it would be acceptable for a voter to cast its
+vote whenever `(vc, tc) >= (v, t)`, but the current implementation is more restrictive. For a
+description of the complete voting behavior, see the [Elections](#Elections) section.
+
+### Formal Specification
+
+For more details on the safe reconfig protocol and its behaviors, refer to the [TLA+
+specification](https://github.com/mongodb/mongo/tree/r4.4.0-rc6/src/mongo/db/repl/tla_plus/MongoReplReconfig).
+It defines two main invariants of the protocol,
+[ElectionSafety](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/tla_plus/MongoReplReconfig/MongoReplReconfig.tla#L403-L404)
+and
+[NeverRollbackCommitted](https://github.com/mongodb/mongo/blob/r4.4.0-rc6/src/mongo/db/repl/tla_plus/MongoReplReconfig/MongoReplReconfig.tla#L413-L420),
+which assert, respectively, that no two leaders are elected in the same term and that majority
+committed writes are never rolled back.
+
 # Startup Recovery

 **Startup recovery** is a node's process for putting both the oplog and data into a consistent state
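The `(version, term)` config ordering described above, including the force-reconfig rule that an uninitialized term falls back to version-only comparison, can be sketched as follows. This is an editor's illustration, not the real `ReplSetConfig` code; `ConfigId`, `isNewer`, and the `-1` sentinel for `kUninitializedTerm` are assumptions for the sketch.

```cpp
// Illustrative stand-in for the real uninitialized-term sentinel.
constexpr long long kUninitializedTerm = -1;

// Hypothetical pair identifying a config's position in the ordering.
struct ConfigId {
    long long version;
    long long term;
};

// Returns true if config a is strictly "newer" than config b.
bool isNewer(const ConfigId& a, const ConfigId& b) {
    // Force reconfigs leave the term uninitialized; in that case only
    // versions are compared.
    if (a.term == kUninitializedTerm || b.term == kUninitializedTerm)
        return a.version > b.version;
    // Otherwise compare terms first, then versions (analogous to optimes).
    if (a.term != b.term)
        return a.term > b.term;
    return a.version > b.version;
}
```

A force reconfig also bumps the version by a large random number, so the version-only comparison above makes the forced config very likely to win against any config already in the system.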