author     Kevin Pulo <kevin.pulo@mongodb.com>                2020-09-18 16:55:26 +1000
committer  Evergreen Agent <no-reply@evergreen.mongodb.com>   2020-10-04 12:15:18 +0000
commit     800b1fdb10810290eba1f33b7455a1a3d782a6c9 (patch)
tree       5223c5fdb6a34e2a2f3b970dc2910cef54329a38 /src/mongo/db/s
parent     f9b27f554543471cf405a0448e90f934651dc626 (diff)
SERVER-48175 VectorClock architecture guide updates
Diffstat (limited to 'src/mongo/db/s')
-rw-r--r--  src/mongo/db/s/README.md  194
1 file changed, 143 insertions, 51 deletions
diff --git a/src/mongo/db/s/README.md b/src/mongo/db/s/README.md
index a2a4547f1f8..6ddd3036a6f 100644
--- a/src/mongo/db/s/README.md
+++ b/src/mongo/db/s/README.md
@@ -530,43 +530,75 @@ mergeChunks, and moveChunk all take the chunk ResourceMutex.
---
-# The logical clock and causal consistency
+# The vector clock and causal consistency
+
+The vector clock is used to manage various logical times across the distributed system, in order to provide guarantees about the ordering of distributed events (i.e. "causality").
+
+These causality guarantees are implemented by assigning certain _logical times_ to relevant _events_ in the system. These logical times are strictly monotonically increasing, and are communicated ("gossiped") between nodes on all messages (both requests and responses). This allows the order of distributed operations to be controlled in the same manner as with a Lamport clock.
+
+## Vector clock
+
+The VectorClock class manages these logical times, all of which have similar requirements (and possibly special relationships with each other). There are currently three such components of the vector clock:
+
+1. _ClusterTime_
+1. _ConfigTime_
+1. _TopologyTime_
+
+Each of these components has the type LogicalTime, which is similar to Timestamp - an unsigned 64 bit integer containing the unix epoch time in seconds (high 32 bits) and a counter (low 32 bits). Together, the LogicalTimes of all the vector clock components are known as the VectorTime. Each LogicalTime must always strictly monotonically increase, and there are two ways that this can happen:
+
+1. _Ticking_ is when a node encounters circumstances that require it to unilaterally increase the value, either by some increment (usually 1) or up to some appropriate LogicalTime value.
+1. _Advancing_ happens in response to learning about a larger value from some other node, ie. gossiping.
+
+Each component has rules regarding when (and how) it is ticked, and when (and how) it is gossiped. These define the system state that the component "tracks", what the component can be used for, and its relationship to the other components.
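+
+To make the LogicalTime representation and these two update paths concrete, here is a minimal, self-contained sketch (not the actual MongoDB types, which are linked in the code references below):
+```
+#include <algorithm>
+#include <cstdint>
+
+// Simplified stand-in for LogicalTime: unix epoch seconds in the high 32 bits,
+// a counter in the low 32 bits.
+struct LogicalTime {
+    uint64_t value = 0;
+
+    uint32_t secs() const { return static_cast<uint32_t>(value >> 32); }
+    uint32_t counter() const { return static_cast<uint32_t>(value & 0xFFFFFFFFu); }
+};
+
+// Ticking: unilaterally increase the component, here by incrementing the
+// combined 64 bit value by 1.
+LogicalTime tick(LogicalTime current) {
+    return LogicalTime{current.value + 1};
+}
+
+// Advancing: adopt a larger value learned through gossiping; never move backwards.
+LogicalTime advance(LogicalTime current, LogicalTime gossiped) {
+    return LogicalTime{std::max(current.value, gossiped.value)};
+}
+```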
+
+Since mongoses are stateless, they can never tick any vector clock component. In order to enforce this, the VectorClockMutable class (a sub-class of VectorClock that provides the ticking API) is not linked on mongos.
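+
+A rough sketch of that split (simplified; the real classes are linked in the code references below):
+```
+#include <cstdint>
+
+struct LogicalTime { uint64_t value = 0; };
+struct VectorTime { LogicalTime clusterTime, configTime, topologyTime; };
+
+class VectorClock {
+public:
+    // Reading and advancing (via gossip) are available everywhere, including on mongos.
+    VectorTime getTime() const { return _vectorTime; }
+    void advanceClusterTime(LogicalTime t) {
+        if (_vectorTime.clusterTime.value < t.value) _vectorTime.clusterTime = t;
+    }
+protected:
+    VectorTime _vectorTime;
+};
+
+// The ticking API lives in a subclass that is not linked into mongos, so a
+// stateless router can never unilaterally generate new times.
+class VectorClockMutable : public VectorClock {
+public:
+    LogicalTime tickClusterTime(uint64_t nTicks) {
+        _vectorTime.clusterTime.value += nTicks;
+        return _vectorTime.clusterTime;
+    }
+};
+```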
+
+## Component relationships
+
+As explained in more detail below, certain relationships are preserved between the vector clock components, most importantly:
+```
+ClusterTime >= ConfigTime >= TopologyTime
+```
+
+As a result, it is important to fetch times from the VectorClock correctly. The `getTime()` function returns a `VectorTime`, which contains an atomic snapshot of all components. Thus code should always be written as follows:
+```
+auto currentTime = VectorClock::get(opCtx)->getTime();
+doSomeWork(currentTime.clusterTime());
+doOtherWork(currentTime.configTime()); // Always passes a timestamp <= what was passed to doSomeWork()
+```
+
+And generally speaking, code such as the following is incorrect:
+```
+doSomeWork(VectorClock::get(opCtx)->getTime().clusterTime());
+doOtherWork(VectorClock::get(opCtx)->getTime().configTime()); // Might pass a timestamp > what was passed to doSomeWork()
+```
+because the timestamp received by `doOtherWork()` could be greater than the one received by `doSomeWork()` (ie. apparently violating the property).
+
+To discourage this incorrect pattern, it is forbidden to use the result of `getTime()` as a temporary (r-value) in this way; it must always first be stored in a variable.
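+
+One way to enforce that at compile time (a sketch of the general C++ technique, not necessarily the exact mechanism used by the real `VectorTime`) is to ref-qualify the accessors so that they cannot be called on a temporary:
+```
+#include <cstdint>
+
+struct LogicalTime { uint64_t value = 0; };
+
+class VectorTime {
+public:
+    // Accessors may only be called on an lvalue, i.e. a VectorTime stored in a variable...
+    LogicalTime clusterTime() const& { return _clusterTime; }
+    LogicalTime configTime() const& { return _configTime; }
+
+    // ...and are deleted for rvalues, so getTime().clusterTime() fails to compile.
+    LogicalTime clusterTime() const&& = delete;
+    LogicalTime configTime() const&& = delete;
+
+private:
+    LogicalTime _clusterTime;
+    LogicalTime _configTime;
+};
+```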
+
+## ClusterTime
+
Starting from v3.6 MongoDB provides session-based causal consistency. All operations in a causally
consistent session will be executed in an order that preserves causality. In particular it
means that the client of the session has the following guarantees:
+
* Read own writes
* Monotonic reads and writes
* Writes follow reads
+
The causal consistency guarantees are described in detail in the [**server
documentation**](https://docs.mongodb.com/v4.0/core/read-isolation-consistency-recency/#causal-consistency).
-Causality is implemented by assigning to all operations in the system a strictly monotonically increasing scalar number - a cluster time - and making sure that
-the operations are executed in the order of the cluster times. To achieve this in a distributed
-system MongoDB implements gossiping protocol which distributes the cluster time across all the
-nodes: mongod, mongos, drivers and mongo shell clients. Separately from gossiping the cluster time is incremented only on the nodes that can write -
-i.e. primary nodes.
-
-## Cluster time
-ClusterTime refers to the time value of the node's logical clock. Its represented as an unsigned 64
-bit integer representing a combination of unix epoch (high 32 bit) and an integer 32 bit counter (low 32 bit). It's
-incremented only when state changing events occur. As the state is represented by the oplog entries
-the oplog optime is derived from the cluster time.
-
-### Cluster time gossiping
-Every node (mongod, mongos, config servers, clients) keep track on the maximum value of the
-ClusterTime it has seen. Every node adds this value to each message it sends.
-
-### Cluster time ticking
-Every node in the cluster has a LogicalClock that keeps an in-memory version of the node's
-ClusterTime. ClusterTime can be converted to the OpTime, the time stored in MongoDB's replication oplog. OpTime
-can be used to identify entries and corresponding events in the oplog. OpTime is represented by a
-<Time><Increment><ElectionTerm> triplet. Here, the <ElectionTerm> is specific to the MongoDB
-replication protocol. It is local to the replica set and not a global state that should be included in the ClusterTime.
-To associate the ClusterTime with an oplog entry when events occur, [**MongoDB first computes the next ClusterTime
-value on the node, and then uses that value to create an OpTime (with the election term).**](https://github.com/mongodb/mongo/blob/v4.4/src/mongo/db/logical_clock.cpp#L102)
-This OpTime is what gets written to the oplog. This update does not require the OpTime format to change, remaining tied to a physical time. All the
-existing tools that use the oplog, such as backup and recovery, remain forward compatible.
-
-Example of ClusterTime gossiping and incrementing:
+
+### ClusterTime ticking
+The ClusterTime tracks the state of user databases. As such, it is ticked only when the state of user databases changes, i.e. when a mongod in PRIMARY state performs a write. (In fact, there are a small number of other situations that tick ClusterTime, such as during step-up after a mongod node has won an election.) The OpTime value used in the oplog entry for the write is obtained by converting this ticked ClusterTime to a Timestamp, and appending the current replication election term.
+
+The ticking itself is performed by first advancing the unix epoch part to the current wall-clock time (if necessary), and then incrementing the counter by 1. (Parallel insertion into the oplog will increment the counter by N, rather than 1, and allocate the resulting range of ClusterTime values to the oplog entries.)
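+
+A minimal sketch of this ticking rule (simplified; the real logic is in the `tickClusterTime` call sites linked in the code references below):
+```
+#include <cstdint>
+#include <ctime>
+
+// Simplified stand-in: unix epoch seconds plus a counter.
+struct LogicalTime {
+    uint32_t secs = 0;
+    uint32_t counter = 0;
+};
+
+// Reserve nTicks new ClusterTime values, e.g. for nTicks parallel oplog inserts.
+LogicalTime tickClusterTime(LogicalTime current, uint32_t nTicks) {
+    const uint32_t wallClockSecs = static_cast<uint32_t>(std::time(nullptr));
+    if (wallClockSecs > current.secs) {
+        // Catch the epoch part up to the wall clock and restart the counter.
+        current.secs = wallClockSecs;
+        current.counter = 0;
+    }
+    // Allocate nTicks values; the caller uses counter+1 .. counter+nTicks.
+    current.counter += nTicks;
+    return current;
+}
+```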
+
+### ClusterTime gossiping
+The ClusterTime is gossiped by all nodes in the system: mongoses, shard mongods, config server mongods, and clients such as drivers or the shell. It is gossiped with both internal clients (other mongod/mongos nodes in the cluster) and external clients (drivers, the shell). It uses the `$clusterTime` field to do this, using the `SignedComponentFormat` described below.
+
+### ClusterTime example
+Example of ClusterTime gossiping and ticking:
1. The client sends a write command to the primary; the message includes its current value of the ClusterTime: T1.
1. The primary node receives the message and advances its ClusterTime to T1, if T1 is greater than the primary
node's current ClusterTime value.
@@ -576,19 +608,23 @@ the only time a new value of ClusterTime is generated.
1. The result is returned to the client; it includes the new ClusterTime T2.
1. The client advances its ClusterTime to T2.
+### SignedComponentFormat: ClusterTime signing
-### Cluster time signing
-As shown before, nodes advance their logical clocks to the maximum ClusterTime that they receive in the client
+As explained above, nodes advance their ClusterTime to the maximum value that they receive in the client
messages. The next oplog entry will use this value in the timestamp portion of the OpTime. But a malicious client
-could modify their maximum ClusterTime sent in a message. For example, it could send the <greatest possible
-cluster time - 1> . This value, once written to the oplogs of replica set nodes, will not be incrementable and the
-nodes will be unable to accept any changes (writes against the database). The only way to recover from this situation
-would be to unload the data, clean it, and reload back with the correct OpTime. This malicious attack would take the
-affected shard offline, affecting the availability of the entire system. To mitigate this risk,
-MongoDB added a HMAC- SHA1 signature that is used to verify the value of the ClusterTime on the server. ClusterTime values can be read
-by any node, but only MongoDB processes can sign new values. The signature cannot be generated by clients.
-
-Here is an example of the document that distributes ClusterTime:
+could modify their maximum ClusterTime sent in a message. For example, it could send the `<greatest possible
+cluster time - 1>`. This value, once written to the oplogs of replica set nodes, will not be incrementable (since LogicalTimes are unsigned) and the
+nodes will then be unable to accept any changes (writes against the database). This ClusterTime
+would eventually be gossiped across the entire cluster, affecting the availability of the whole
+system. The only ways to recover from this situation involve downtime (eg. dump and restore the
+entire cluster).
+
+To mitigate this risk, an HMAC-SHA1 signature is used to verify the value of the ClusterTime on the
+server. ClusterTime values can be read by any node, but only MongoDB processes can sign new values.
+The signature cannot be generated by clients. This means that servers can trust that validly signed
+ClusterTime values supplied by (otherwise untrusted) clients must have been generated by a server.
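+
+A sketch of what signing and verification amount to, using OpenSSL directly rather than MongoDB's own crypto wrappers (the real logic is in the `SignedComponentFormat` and `LogicalTimeValidator` classes linked in the code references below):
+```
+#include <openssl/evp.h>
+#include <openssl/hmac.h>
+#include <openssl/sha.h>
+
+#include <cstdint>
+#include <vector>
+
+// Sign the 8 byte ClusterTime value with the secret key identified by some keyId.
+std::vector<unsigned char> signClusterTime(uint64_t clusterTime,
+                                           const std::vector<unsigned char>& key) {
+    unsigned char sig[SHA_DIGEST_LENGTH];
+    unsigned int sigLen = 0;
+    HMAC(EVP_sha1(),
+         key.data(), static_cast<int>(key.size()),
+         reinterpret_cast<const unsigned char*>(&clusterTime), sizeof(clusterTime),
+         sig, &sigLen);
+    return std::vector<unsigned char>(sig, sig + sigLen);
+}
+
+// A server verifies a gossiped ClusterTime by re-signing it with the key named by the
+// keyId in the message and comparing. (A real implementation would use a
+// constant-time comparison.)
+bool verifyClusterTime(uint64_t clusterTime,
+                       const std::vector<unsigned char>& key,
+                       const std::vector<unsigned char>& signature) {
+    return signClusterTime(clusterTime, key) == signature;
+}
+```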
+
+Here is an example of the document that gossips ClusterTime:
```
"$clusterTime" : {
"clusterTime" :
@@ -605,11 +641,13 @@ Every time the mongod or mongos receives a message that includes a
ClusterTime that is greater than the value of its logical clock, they will validate it by generating the signature using the key
with the keyId from the message. If the signature does not match, the message will be rejected.
-## Key management
+### Key management
To provide HMAC message verification, all nodes inside a security perimeter (i.e. mongos and mongod) need access to a secret key to generate and
verify message signatures. MongoDB maintains keys in a `system.keys` collection in the `admin`
database. In a sharded cluster this collection is located on the config server; in a Replica Set
-its on the primary node. The key document has the following format:
+it's managed by the primary node (and propagated to secondaries via normal replication).
+
+The key document has the following format:
```
{
_id: <NumberLong>,
@@ -625,17 +663,18 @@ there is always one key that is valid for the next 3 months (the default). The s
requests the key that was used for signing the message by its Id which is also stored in the
signature. Since the old keys are never deleted from the `system.keys` collection they are always
available to verify the messages signed in the past.
-As the message verification is on the critical path each node also keeps the in memory cache of the
+
+As message verification is on the critical path, each node also keeps an in-memory cache of the
valid keys.
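+
+A sketch of that lookup path (illustrative only; the real machinery is the `KeysCollectionManager` linked in the code references below):
+```
+#include <cstdint>
+#include <map>
+#include <optional>
+#include <vector>
+
+// Simplified stand-in for a document in admin.system.keys.
+struct KeyDocument {
+    int64_t keyId = 0;
+    std::vector<unsigned char> key;  // the HMAC secret
+    uint32_t expiresAtSecs = 0;      // keys rotate; old keys are kept for verification
+};
+
+class KeyCache {
+public:
+    // Periodically refreshed from the admin.system.keys collection.
+    void refresh(const std::vector<KeyDocument>& keys) {
+        for (const auto& k : keys)
+            _keysById[k.keyId] = k;
+    }
+
+    // Verification looks the key up by the keyId carried in the $clusterTime signature.
+    std::optional<KeyDocument> findByKeyId(int64_t keyId) const {
+        auto it = _keysById.find(keyId);
+        if (it == _keysById.end())
+            return std::nullopt;
+        return it->second;
+    }
+
+private:
+    std::map<int64_t, KeyDocument> _keysById;
+};
+```
+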
-## Handling operator errors
+### Handling operator errors
The risk of malicious clients affecting ClusterTime is mitigated by a signature, but it is still possible to advance the
ClusterTime to the end of time by changing the wall clock value. This may happen as a result of operator error. Once
the data with the OpTime containing the end of time timestamp is committed to the majority of nodes it cannot be
-changed. To mitigate this, we implemented a limit on the rate of change. The ClusterTime on a node cannot be advanced
-more than the number of seconds defined by the `maxAcceptableLogicalClockDriftSecs` parameter (default value is one year).
+changed. To mitigate this, there is a limit on how far ahead of a node's wall clock the (epoch part of the) ClusterTime can be
+advanced. This limit is the `maxAcceptableLogicalClockDriftSecs` parameter (default value is one year).
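+
+A sketch of that check (simplified; in reality it is part of advancing the vector clock):
+```
+#include <cstdint>
+#include <ctime>
+#include <stdexcept>
+
+// Default value of the maxAcceptableLogicalClockDriftSecs parameter: one year.
+constexpr uint32_t kMaxAcceptableLogicalClockDriftSecs = 365 * 24 * 60 * 60;
+
+// Refuse to advance to a ClusterTime whose epoch part is too far ahead of this
+// node's wall clock.
+void checkClusterTimeDrift(uint32_t newClusterTimeSecs) {
+    const uint32_t wallClockSecs = static_cast<uint32_t>(std::time(nullptr));
+    if (newClusterTimeSecs > wallClockSecs + kMaxAcceptableLogicalClockDriftSecs) {
+        throw std::runtime_error("new cluster time is too far ahead of the wall clock");
+    }
+}
+```
+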
-## Causal consistency in sessions
+### Causal consistency in sessions
When a write event is sent from a client, that client has no idea what time is associated with the write, because the time
was assigned after the message was sent. But the node that processes the write does know, as it incremented its
ClusterTime and applied the write to the oplog. To make the client aware of the write's ClusterTime, it will be included
@@ -647,17 +686,70 @@ the received `operationTime` - in the `afterClusterTime` field of the request. T
needs to return data with an associated ClusterTime greater than or equal to the requested `afterClusterTime` value.
Below is an example of causally consistent "read own write" for the products collection that is sharded and has chunks on Shards A and B.
-1. The client sends db.products.insert({_id: 10, price: 100}) to a mongos and it gets routed to Shard A.
+1. The client sends `db.products.insert({_id: 10, price: 100})` to a mongos and it gets routed to Shard A.
1. The primary on Shard A computes the ClusterTime, and ticks as described in the previous sections.
1. Shard A returns the result with the `operationTime` that was written to the oplog.
-1. The client conditionally updates its local lastOperationTime value with the returned `operationTime` value
-1. The client sends a read db.products.aggregate([{$count: "numProducts"}]) to mongos and it gets routed to all shards where this collection has chunks: i.e. Shard A and Shard B.
+1. The client conditionally updates its local `lastOperationTime` value with the returned `operationTime` value.
+1. The client sends a read `db.products.aggregate([{$count: "numProducts"}])` to mongos and it gets routed to all shards where this collection has chunks: i.e. Shard A and Shard B.
To be sure that it can "read own write", the client includes the `afterClusterTime` field in the request and passes the `operationTime` value it received from the write.
1. Shard B checks if the data with the requested OpTime is in its oplog. If not, it performs a noop write, then returns the result to mongos.
It includes the `operationTime` that was the top of the oplog at the moment the read was performed.
1. Shard A checks if the data with the requested OpTime is in its oplog and returns the result to mongos. It includes the `operationTime` that was the top of the oplog at the moment the read was performed.
1. mongos aggregates the results and returns to the client with the largest `operationTime` it has seen in the responses from shards A and B.
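+
+A sketch of the client-side bookkeeping this implies (illustrative only; real drivers implement this as part of their session support):
+```
+#include <algorithm>
+#include <cstdint>
+#include <tuple>
+
+// Simplified stand-in for the operationTime / afterClusterTime timestamps.
+struct Timestamp {
+    uint32_t secs = 0;
+    uint32_t counter = 0;
+};
+
+inline bool operator<(const Timestamp& a, const Timestamp& b) {
+    return std::tie(a.secs, a.counter) < std::tie(b.secs, b.counter);
+}
+
+struct CausallyConsistentSession {
+    Timestamp lastOperationTime;  // highest operationTime seen in any reply so far
+
+    // Called with the operationTime field of every command reply (step 4 above).
+    void advanceOperationTime(Timestamp opTime) {
+        lastOperationTime = std::max(lastOperationTime, opTime);
+    }
+
+    // The value to send as afterClusterTime on the next read (step 5 above).
+    Timestamp afterClusterTimeForNextRead() const {
+        return lastOperationTime;
+    }
+};
+```
+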
+## ConfigTime
+
+ConfigTime is similar to the legacy `configOpTime` value used for causally consistent reads from config servers, but it is a LogicalTime rather than an OpTime.
+
+### ConfigTime ticking
+The ConfigTime tracks the sharding state stored on the config servers. As such, it is ticked only by config servers when they advance their majority commit point, and is ticked by increasing to that majority commit point value. Since the majority commit point is based on oplog OpTimes, which are in turn based on the ClusterTime, the ConfigTime only ever ticks to values that the ClusterTime has already reached. It is therefore always true that ConfigTime <= ClusterTime, ie. ConfigTime "lags" ClusterTime.
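+
+A sketch of this "tick to" behaviour and the resulting lag invariant (simplified):
+```
+#include <cassert>
+#include <cstdint>
+
+struct LogicalTime { uint64_t value = 0; };
+
+struct VectorTimeState {
+    LogicalTime clusterTime;
+    LogicalTime configTime;
+
+    // Config servers call this when their majority commit point advances. The commit
+    // point's timestamp was itself produced by a ClusterTime tick, so ConfigTime can
+    // never overtake ClusterTime.
+    void tickConfigTimeTo(LogicalTime majorityCommitPointTime) {
+        assert(majorityCommitPointTime.value <= clusterTime.value);
+        if (configTime.value < majorityCommitPointTime.value) {
+            configTime = majorityCommitPointTime;
+        }
+    }
+};
+```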
+
+The ConfigTime value is then used when querying the config servers to ensure that the returned state
+is causally consistent. This is done by using the ConfigTime as the value of the `$afterOpTime`
+field of the Read Concern (with an Uninitialised term, so that it's not used in comparisons), and as
+the `minClusterTime` parameter to the read preference (to ensure that a current config server is
+targeted, if possible).
+
+### ConfigTime gossiping
+The ConfigTime is gossiped only by sharded cluster nodes: mongoses, shard mongods, and config server mongods. Clients (drivers/shell) and plain replica sets do not gossip ConfigTime. In addition, ConfigTime is only gossiped with internal clients (other mongos/mongod nodes), as identified by the kInternalClient flag (set during the `hello` command sent by mongos/mongod).
+
+It uses the `$configTime` field with the `PlainComponentFormat`, which simply represents the LogicalTime value as a Timestamp:
+```
+"$configTime" : Timestamp(1495470881, 5)
+```
+
+## TopologyTime
+
+TopologyTime is related to the "topology" of the sharded cluster, in terms of the shards present.
+
+### TopologyTime ticking
+Since the TopologyTime tracks the cluster topology, it ticks when a shard is added to or removed from the cluster. This is done by ticking TopologyTime to the ConfigTime of the write issued by the `_configsvrAddShard` or `_configsvrRemoveShard` command. Thus, the property holds that TopologyTime <= ConfigTime, ie. TopologyTime "lags" ConfigTime.
+
+The TopologyTime value is then used by the ShardRegistry to know when it needs to refresh from the config servers.
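+
+A sketch of how a cached consumer like the ShardRegistry can use this (illustrative only; the real ShardRegistry logic differs):
+```
+#include <cstdint>
+
+struct LogicalTime { uint64_t value = 0; };
+
+class ShardRegistryCache {
+public:
+    // Called with the TopologyTime from the latest vector clock snapshot. A newer
+    // TopologyTime means a shard was added or removed since the cached shard list
+    // was built, so it must be reloaded from the config servers.
+    bool needsRefresh(LogicalTime gossipedTopologyTime) const {
+        return gossipedTopologyTime.value > _topologyTimeOfCachedData.value;
+    }
+
+    void onRefreshed(LogicalTime refreshedAtTopologyTime) {
+        _topologyTimeOfCachedData = refreshedAtTopologyTime;
+    }
+
+private:
+    LogicalTime _topologyTimeOfCachedData;
+};
+```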
+
+### TopologyTime gossiping
+The TopologyTime is gossiped identically to ConfigTime, except with a field name of `$topologyTime`. (Note that this name is very similar to the unrelated `$topologyVersion` field returned by the streaming `hello` command response.)
+
+## Code references
+
+* [**Base VectorClock class**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/vector_clock.h) (contains querying, advancing, gossiping the time)
+* [**VectorClockMutable class**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/vector_clock_mutable.h) (adds ticking and persistence, not linked on mongos)
+* [**VectorClockMongoD class**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/vector_clock_mongod.cpp) (specific implementation used by mongod nodes)
+* [**VectorClockMongoS class**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/s/vector_clock_mongos.cpp) (specific implementation used by mongos nodes)
+
+* [**Definition of which components use which gossiping format**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/vector_clock.cpp#L322-L330)
+* [**PlainComponentFormat class**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/vector_clock.cpp#L125-L155) (for gossiping without signatures, and persistence formatting)
+* [**SignedComponentFormat class**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/vector_clock.cpp#L186-L320) (for signed gossiping of ClusterTime)
+* [**LogicalTimeValidator class**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/logical_time_validator.h) (generates and validates ClusterTime signatures)
+* [**KeysCollectionManager class**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/keys_collection_manager.h) (maintains the ClusterTime signing keys in `admin.system.keys`)
+
+* [**Definition of which components are gossiped internally/externally by mongod**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/vector_clock_mongod.cpp#L389-L406)
+* [**Definition of when components may be ticked by mongod**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/vector_clock_mongod.cpp#L408-L450)
+* [**Definition of which components are gossiped internally/externally by mongos**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/s/vector_clock_mongos.cpp#L71-L79)
+
+* [**Ticking ClusterTime**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/repl/local_oplog_info.cpp#L125) (main usage, search for `tickClusterTime` to find unusual cases)
+* [**Ticking ConfigTime and TopologyTime**](https://github.com/mongodb/mongo/blob/3681b03baa/src/mongo/db/s/config_server_op_observer.cpp#L252-L256)
+
---
# Logical Sessions
@@ -1015,4 +1107,4 @@ If the mongod server is primary, it will [try to step down](https://github.com/m
* [Shutdown logic](https://github.com/mongodb/mongo/blob/30f5448e95114d344e6acffa92856536885e35dd/src/mongo/s/mongos_main.cpp#L336-L354) for mongos.
### Quiesce mode on shutdown
-mongos enters quiesce mode prior to shutdown, to allow short-running operations to finish. During this time, new and existing operations are allowed to run, but `isMaster` requests return a `ShutdownInProgress` error, to indicate that clients should start routing operations to other nodes. Entering quiesce mode is considered a significant topology change in the streaming `isMaster` protocol, so mongos tracks a `TopologyVersion`, which it increments on entering quiesce mode, prompting it to respond to all waiting isMaster requests.
\ No newline at end of file
+mongos enters quiesce mode prior to shutdown, to allow short-running operations to finish. During this time, new and existing operations are allowed to run, but `isMaster` requests return a `ShutdownInProgress` error, to indicate that clients should start routing operations to other nodes. Entering quiesce mode is considered a significant topology change in the streaming `isMaster` protocol, so mongos tracks a `TopologyVersion`, which it increments on entering quiesce mode, prompting it to respond to all waiting isMaster requests.