From 800b1fdb10810290eba1f33b7455a1a3d782a6c9 Mon Sep 17 00:00:00 2001 From: Kevin Pulo Date: Fri, 18 Sep 2020 16:55:26 +1000 Subject: SERVER-48175 VectoClock architecture guide updates --- src/mongo/db/s/README.md | 194 ++++++++++++++++++++++++++++++++++------------- 1 file changed, 143 insertions(+), 51 deletions(-) (limited to 'src/mongo/db/s/README.md') diff --git a/src/mongo/db/s/README.md b/src/mongo/db/s/README.md index a2a4547f1f8..6ddd3036a6f 100644 --- a/src/mongo/db/s/README.md +++ b/src/mongo/db/s/README.md @@ -530,43 +530,75 @@ mergeChunks, and moveChunk all take the chunk ResourceMutex. --- -# The logical clock and causal consistency +# The vector clock and causal consistency + +The vector clock is used to manage various logical times and clocks across the distributed system, for the purpose of ensuring various guarantees about the ordering of distributed events (ie. "causality"). + +These causality guarantees are implemented by assigning certain _logical times_ to relevant _events_ in the system. These logical times are strictly monotonically increasing, and are communicated ("gossiped") between nodes on all messages (both requests and responses). This allows the order of distributed operations to be controlled in the same manner as with a Lamport clock. + +## Vector clock + +The VectorClock class manages these logical time, all of which have similar requirements (and possibly special relationships with each other). There are currently three such components of the vector clock: + +1. _ClusterTime_ +1. _ConfigTime_ +1. _TopologyTime_ + +Each of these has a type of LogicalTime, which is similar to Timestamp - it is an unsigned 64 bit integer representing a combination of unix epoch (high 32 bit) and an integer 32 bit counter (low 32 bit). Together, the LogicalTimes for all vector clock components are known as the VectorTime. Each LogicalTime must always strictly monotonically increase, and there are two ways that this can happen: + +1. _Ticking_ is when a node encounters circumstances that require it to unilaterally increase the value. This can either be some incremental amount (usually 1), or to some appropriate LogicalTime value. +1. _Advancing_ happens in response to learning about a larger value from some other node, ie. gossiping. + +Each component has rules regarding when (and how) it is ticked, and when (and how) it is gossiped. These define the system state that the component "tracks", what the component can be used for, and its relationship to the other components. + +Since mongoses are stateless, they can never tick any vector clock component. In order to enforce this, the VectorClockMutable class (a sub-class of VectorClock that provides the ticking API) is not linked on mongos. + +## Component relationships + +As explained in more detail below, certain relationships are preserved between between the vector clock components, most importantly: +``` +ClusterTime >= ConfigTime >= TopologyTime +``` + +As a result, it is important to ensure that times are fetched correctly from the VectorClock. The `getTime()` function returns a `VectorTime` which contains an atomic snapshot of all components. Thus code should always be written such as: +``` +auto currentTime = VectorClock::get(opCtx)->getTime(); +doSomeWork(currentTime.clusterTime()); +doOtherWork(currentTime.configTime()); // Always passes a timestamp <= what was passed to doSomeWork() +``` + +And generally speaking, code such as the following is incorrect: +``` +doSomeWork(VectorClock::get(opCtx)->getTime().clusterTime()); +doOtherWork(VectorClock::get(opCtx)->getTime().configTime()); // Might pass a timestamp > what was passed to doSomeWork() +``` +because the timestamp received by `doOtherWork()` could be greater than the one received by `doSomeWork()` (ie. apparently violating the property). + +To discourage this incorrect pattern, it is forbidden to use the result of getTime() as a temporary (r-value) in this way; it must always first be stored in a variable. + +## ClusterTime + Starting from v3.6 MongoDB provides session based causal consistency. All operations in the causally consistent session will be execute in the order that preserves the causality. In particular it means that client of the session has guarantees to + * Read own writes * Monotonic reads and writes * Writes follow reads + Causal consistency guarantees described in details in the [**server documentation**](https://docs.mongodb.com/v4.0/core/read-isolation-consistency-recency/#causal-consistency). -Causality is implemented by assigning to all operations in the system a strictly monotonically increasing scalar number - a cluster time - and making sure that -the operations are executed in the order of the cluster times. To achieve this in a distributed -system MongoDB implements gossiping protocol which distributes the cluster time across all the -nodes: mongod, mongos, drivers and mongo shell clients. Separately from gossiping the cluster time is incremented only on the nodes that can write - -i.e. primary nodes. - -## Cluster time -ClusterTime refers to the time value of the node's logical clock. Its represented as an unsigned 64 -bit integer representing a combination of unix epoch (high 32 bit) and an integer 32 bit counter (low 32 bit). It's -incremented only when state changing events occur. As the state is represented by the oplog entries -the oplog optime is derived from the cluster time. - -### Cluster time gossiping -Every node (mongod, mongos, config servers, clients) keep track on the maximum value of the -ClusterTime it has seen. Every node adds this value to each message it sends. - -### Cluster time ticking -Every node in the cluster has a LogicalClock that keeps an in-memory version of the node's -ClusterTime. ClusterTime can be converted to the OpTime, the time stored in MongoDB's replication oplog. OpTime -can be used to identify entries and corresponding events in the oplog. OpTime is represented by a -