author: Xuerui Fa <xuerui.fa@mongodb.com> 2020-05-27 14:10:50 -0400
committer: Evergreen Agent <no-reply@evergreen.mongodb.com> 2020-06-01 17:44:40 +0000
commit: f5a4a6fcf306fbd0d62ecd6c1213a1ca174f7497
tree: ea71440b40015267c314d74d02fd93f36682d4db
parent: 16d49c9b6496227aae24a0f7ead6dce5053b33d2
SERVER-47458: Update the Sync Source Selection section of the Repl Architecture Guide
-rw-r--r-- src/mongo/db/repl/README.md | 85
1 files changed, 62 insertions, 23 deletions
diff --git a/src/mongo/db/repl/README.md b/src/mongo/db/repl/README.md
index 9afd1b33166..57d84b81a3f 100644
--- a/src/mongo/db/repl/README.md
+++ b/src/mongo/db/repl/README.md
@@ -59,19 +59,50 @@ A secondary keeps its data synchronized with its sync source by fetching oplog e
source. This is done via the
[`OplogFetcher`](https://github.com/mongodb/mongo/blob/929cd5af6623bb72f05d3364942e84d053ddea0d/src/mongo/db/repl/oplog_fetcher.h).
-The `OplogFetcher` first creates a connection to the sync source. Through this connection, it will
-establish an **exhaust cursor** to fetch oplog entries. This means that after the initial `find` and
-`getMore` are sent, the sync source will keep sending all subsequent batches without needing the
-fetching node to run any additional `getMore`s.
+The `OplogFetcher` does not directly apply the operations it retrieves from the sync source.
+Rather, it puts them into a buffer (the **`OplogBuffer`**) and another thread is in charge of
+taking the operations off the buffer and applying them. That buffer uses an in-memory blocking
+queue for steady state replication; there is a similar collection-backed buffer used for initial
+sync.
+
+#### Oplog Fetcher Lifecycle
+
+The `OplogFetcher` is owned by the
+[`BackgroundSync`](https://github.com/mongodb/mongo/blob/r4.2.0/src/mongo/db/repl/bgsync.h) thread.
+The `BackgroundSync` thread runs continuously while a node is in `SECONDARY` state.
+`BackgroundSync` sits in a loop, where each iteration it first chooses a sync source with the
+`SyncSourceResolver` and then starts up the `OplogFetcher`.
+
+In steady state, the `OplogFetcher` continuously receives and processes batches of oplog entries
+from its sync source.
+
+The `OplogFetcher` could terminate because the first batch implies that a rollback is required, it
+could receive an error from the sync source, or it could just be shut down by its owner, such as
+when `BackgroundSync` itself is shut down. In addition, after every batch, the `OplogFetcher` runs
+validation checks on the documents in that batch. It then decides if it should continue syncing
+from the current sync source. If validation fails, or if the node decides to stop syncing, the
+`OplogFetcher` will shut down.
+
+When the `OplogFetcher` terminates, `BackgroundSync` restarts sync source selection, exits, or goes
+into ROLLBACK depending on the return status.
+
+#### Oplog Fetcher Implementation Details
Let’s refer to the sync source as node A and the fetching node as node B.
+After starting up, the `OplogFetcher` first creates a connection to sync source A. Through this
+connection, it will establish an **exhaust cursor** to fetch oplog entries. This means that after
+the initial `find` and `getMore` are sent, A will keep sending all subsequent batches without
+needing B to run any additional `getMore`s.
+
The `find` command that B’s `OplogFetcher` first sends to sync source A has a greater than or equal
predicate on the timestamp of the last oplog entry it has fetched. The original `find` command
-should always return at least 1 document due to the greater than or equal predicate. If it does not,
-that means that the A’s oplog is behind B's and thus A should not be B’s sync source. If it does
+should always return at least 1 document due to the greater than or equal predicate. If it does
+not, that means that A’s oplog is behind B's and thus A should not be B’s sync source. If it does
return a non-empty batch, but the first document returned does not match the last entry in B’s
-oplog, that means that B's oplog has diverged from A's and it should go into
+oplog, there are two possibilities. If the oldest entry in A's oplog is newer than B's latest
+entry, that means that B is too stale to sync from A. As a result, B blacklists A as a sync source
+candidate. Otherwise, B's oplog has diverged from A's and it should go into
[**ROLLBACK**](https://docs.mongodb.com/manual/core/replica-set-rollbacks/).
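
As an illustration only (this is not part of the README change), the handling of the first `find` response described above can be sketched in Python. The names `FirstBatchOutcome` and `classify_first_batch` are hypothetical, and plain integer timestamps stand in for the server's real oplog entry comparison, which is an assumption made to keep the sketch short.

```python
from enum import Enum
from typing import Sequence


class FirstBatchOutcome(Enum):
    """Hypothetical outcomes for B's handling of A's first batch."""
    CONTINUE = "continue"            # first document matches B's last entry
    SOURCE_BEHIND = "source_behind"  # empty batch: A's oplog is behind B's
    TOO_STALE = "too_stale"          # B is too stale; B blacklists A
    ROLLBACK = "rollback"            # B's oplog has diverged from A's


def classify_first_batch(
    last_fetched_ts: int,
    batch_ts: Sequence[int],
    source_oldest_ts: int,
) -> FirstBatchOutcome:
    """Classify the first `find` response.

    last_fetched_ts:  timestamp of the last oplog entry B has fetched
    batch_ts:         timestamps of the documents returned by A, in order
    source_oldest_ts: timestamp of the oldest entry in A's oplog
    (Integer timestamps are a simplification assumed for this sketch.)
    """
    # The >= predicate should always match at least one document;
    # an empty batch means A is behind B.
    if not batch_ts:
        return FirstBatchOutcome.SOURCE_BEHIND
    # The first document matches B's last entry: keep fetching.
    if batch_ts[0] == last_fetched_ts:
        return FirstBatchOutcome.CONTINUE
    # Mismatch, and A's oldest entry is newer than B's latest entry.
    if source_oldest_ts > last_fetched_ts:
        return FirstBatchOutcome.TOO_STALE
    # Mismatch for any other reason: the oplogs have diverged.
    return FirstBatchOutcome.ROLLBACK
```

For example, under these assumptions `classify_first_batch(100, [150, 151], 150)` classifies B as too stale, while `classify_first_batch(100, [105], 10)` indicates a rollback.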
After getting the original `find` response, secondaries check the metadata that accompanies the
@@ -80,8 +111,8 @@ not rolled back since it was chosen and that it is still ahead of them.
The `OplogFetcher` specifies `awaitData: true, tailable: true` on the cursor so that subsequent
batches block until their `maxTimeMS` expires waiting for more data instead of returning
-immediately. If there is no data to return at the end of `maxTimeMS`, the `OplogFetcher` receives an
-empty batch and will wait on the next batch.
+immediately. If there is no data to return at the end of `maxTimeMS`, the `OplogFetcher` receives
+an empty batch and will wait on the next batch.
If the `OplogFetcher` encounters any errors while trying to connect to the sync source or get a
batch, it will use `OplogFetcherRestartDecision` to check that it has enough retries left to create
@@ -91,20 +122,28 @@ errors enough times in a row to exhaust its retries, that might be an indication
something wrong with the connection or the sync source. In that case, the `OplogFetcher` will shut
down with an error status.
-The `OplogFetcher` is owned by the
-[`BackgroundSync`](https://github.com/mongodb/mongo/blob/r4.2.0/src/mongo/db/repl/bgsync.h) thread.
-The `BackgroundSync` thread runs continuously while a node is in `SECONDARY` state. `BackgroundSync`
-sits in a loop, where each iteration it first chooses a sync source with the `SyncSourceResolver`
-and then starts up the `OplogFetcher`. When the `OplogFetcher` terminates, `BackgroundSync` restarts
-sync source selection, exits, or goes into ROLLBACK depending on the return status. The
-`OplogFetcher` could terminate because the first batch implies that a rollback is required, it could
-receive an error from the sync source, or it could just be shut down by its owner, such as when
-`BackgroundSync` itself is shut down.
-
-The `OplogFetcher` does not directly apply the operations it retrieves from the sync source. Rather,
-it puts them into a buffer (the **`OplogBuffer`**) and another thread is in charge of taking the
-operations off the buffer and applying them. That buffer uses an in-memory blocking queue for steady
-state replication; there is a similar collection-backed buffer used for initial sync.
+The `OplogFetcher` may shut down for a variety of other reasons as well. After each successful
+batch, the `OplogFetcher` decides if it should continue syncing from the current sync source. If
+the `OplogFetcher` decides to continue, it will wait for the next batch to arrive and repeat. If
+not, the `OplogFetcher` will terminate, which will lead to `BackgroundSync` choosing a new sync
+source. Reasons for changing sync sources include:
+
+* If the node is no longer in the replica set configuration.
+* If the current sync source is no longer in the replica set configuration.
+* If the user has requested another sync source via the `replSetSyncFrom` command.
+* If chaining is disabled and the node is not currently syncing from the primary.
+* If the sync source is not the primary, does not have its own sync source, and is not ahead of
+ the node. This indicates that the sync source will not receive writes in a timely manner. As a
+ result, continuing to sync from it will likely cause the node to be lagged.
+* If the most recent OpTime of the sync source is more than `maxSyncSourceLagSecs` seconds behind
+ another member's latest oplog entry. This ensures that the sync source is not too far behind
+ other nodes in the set. `maxSyncSourceLagSecs` is a server parameter and has a default value of
+ 30 seconds.
+* If the node has discovered another eligible sync source that is significantly closer. A
+ significantly closer node has a ping time that is at least `changeSyncSourceThresholdMillis`
+ lower than our current sync source. This minimizes the number of nodes that have sync sources
+ located far away. `changeSyncSourceThresholdMillis` is a server parameter and has a default value
+ of 5 ms.
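
As an illustration only (this is not part of the README change), the checks in the list above can be sketched as a single Python function. The `SyncState` fields and the function name `reason_to_change_sync_source` are hypothetical. Only the two parameter names and their defaults (30 seconds and 5 ms) come from the text, and the order of the checks is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Defaults taken from the text above.
MAX_SYNC_SOURCE_LAG_SECS = 30
CHANGE_SYNC_SOURCE_THRESHOLD_MILLIS = 5


@dataclass
class SyncState:
    """Hypothetical snapshot of what the node knows after a batch."""
    node_in_config: bool = True
    source_in_config: bool = True
    other_source_requested: bool = False   # via replSetSyncFrom
    chaining_allowed: bool = True
    source_is_primary: bool = False
    source_has_sync_source: bool = True
    source_ahead_of_node: bool = True
    source_last_optime_secs: int = 0
    newest_member_optime_secs: int = 0
    source_ping_ms: int = 0
    closest_candidate_ping_ms: Optional[int] = None  # None: no other candidate


def reason_to_change_sync_source(s: SyncState) -> Optional[str]:
    """Return the first matching reason to change sync sources, else None."""
    if not s.node_in_config:
        return "node removed from config"
    if not s.source_in_config:
        return "sync source removed from config"
    if s.other_source_requested:
        return "replSetSyncFrom requested"
    if not s.chaining_allowed and not s.source_is_primary:
        return "chaining disabled and source is not primary"
    if (not s.source_is_primary and not s.source_has_sync_source
            and not s.source_ahead_of_node):
        return "source will not receive writes in a timely manner"
    # "More than maxSyncSourceLagSecs behind" is a strict comparison.
    lag = s.newest_member_optime_secs - s.source_last_optime_secs
    if lag > MAX_SYNC_SOURCE_LAG_SECS:
        return "source lags another member"
    # "At least changeSyncSourceThresholdMillis lower" is inclusive.
    if (s.closest_candidate_ping_ms is not None
            and s.source_ping_ms - s.closest_candidate_ping_ms
            >= CHANGE_SYNC_SOURCE_THRESHOLD_MILLIS):
        return "significantly closer source available"
    return None
```

With default values the function returns `None`, meaning the node would keep its current sync source and wait for the next batch.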
### Sync Source Selection