authorEric Milkie <milkie@10gen.com>2020-08-03 14:51:42 -0400
committerEvergreen Agent <no-reply@evergreen.mongodb.com>2020-08-05 14:10:58 +0000
commit0340f9679e05ca78bc7fea39808e80d7c7f80bc9 (patch)
tree240959598d0886f143de59678a08e152f485fa1f
parenta0345214d33cbc02ff9f1587d4b74757217cd8d7 (diff)
downloadmongo-0340f9679e05ca78bc7fea39808e80d7c7f80bc9.tar.gz
SERVER-47294 Concurrency Architecture Guide
-rw-r--r--  src/mongo/db/catalog/README.md | 136
1 file changed, 105 insertions(+), 31 deletions(-)
diff --git a/src/mongo/db/catalog/README.md b/src/mongo/db/catalog/README.md
index fe0d83e210a..37f5adb5cde 100644
--- a/src/mongo/db/catalog/README.md
+++ b/src/mongo/db/catalog/README.md
@@ -5,7 +5,7 @@ catalog, in-memory and on-disk, of collections and indexes. It also implements a
whatever a storage engine implements) concurrency control layer to safely modify the catalog while
sustaining correct and consistent collection and index data formatting.
-Execution faciliates reads and writes to the storage engine with various persistence guarantees,
+Execution facilitates reads and writes to the storage engine with various persistence guarantees,
builds indexes, supports replication rollback, manages oplog visibility, repairs data corruption
and inconsistencies, and much more.
@@ -29,7 +29,7 @@ index implementations found in the [**index/**][] directory; and the sorter foun
include discussion of RecordStore interface
### Index Catalog
-include discussion of SortedDataInferface interface
+include discussion of SortedDataInterface interface
### Versioning
in memory versioning (or lack thereof) is separate from on disk
@@ -173,7 +173,7 @@ changes on the collection (see
## Secondary Reads
The oplog applier applies entries out-of-order to provide parallelism for data replication. This
-exposes readers with no set read timetsamp to the possibility of seeing inconsistent states of data.
+exposes readers with no set read timestamp to the possibility of seeing inconsistent states of data.
To solve this problem, the oplog applier takes the ParallelBatchWriterMode (PBWM) lock in X mode,
and readers using no read timestamp are expected to take the PBWM lock in IS mode to avoid observing
inconsistent data mid-batch.
@@ -203,52 +203,126 @@ how index tables also get updated when a write happens, (numIndexes + 1) writes
## Vectored Insert
# Concurrency Control
-We have the catalog described above; now how do we protect it?
-## Lock Modes
+Theoretically, one could design a database that used only mutexes to maintain database consistency
+while supporting multiple simultaneous operations; however, this solution would result in poor
+performance and would strain the operating system. Therefore, databases typically use a more
+complex method of coordinating operations. This design consists of Resources (lockable entities),
+some of which may be organized in a Hierarchy, and Locks (requests for access to a resource). A Lock
+Manager is responsible for keeping track of Resources and Locks, and for managing each Resource's
+Lock Queue. The Lock Manager identifies Resources with a ResourceId.
+
+## Resource Hierarchy
+
+In MongoDB, Resources are arranged in a hierarchy, both to provide an ordering that prevents
+deadlocks when locking multiple Resources and to enable Intent Locking (an optimization for
+locking higher-level resources). The hierarchy of ResourceTypes is as follows:
+
+1. Global (three - see below)
+1. Database (one per database on the server)
+1. Collection (one per collection on the server)
-## Lock Granularity
-Different storage engines can support different levels of granularity.
+Resources must be locked from the top of the hierarchy down. Therefore, to lock a Collection
+resource, one must first lock the one Global resource, then lock the Database resource that is the
+parent of the Collection, and finally lock the Collection resource itself.
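
The top-down acquisition order can be sketched as a small invariant check. The `Locker` type and its `held` vector here are illustrative only; the server's actual RAII types are `Lock::GlobalLock`, `Lock::DBLock`, and `Lock::CollectionLock` in `src/mongo/db/concurrency/d_concurrency.h`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch of top-down lock acquisition, not the server's Locker.
enum ResourceType { Global = 0, Database = 1, Collection = 2 };

struct Locker {
    std::vector<ResourceType> held;  // resources locked so far, top-down

    // Each level may only be acquired once every ancestor level is held.
    void lock(ResourceType type) {
        assert(held.size() == static_cast<std::size_t>(type) &&
               "lock order is Global, then Database, then Collection");
        held.push_back(type);
    }
};
```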
-### Lock Acquisition Order
-discuss lock acquisition order
+In addition to these ResourceTypes, there also exists ResourceMutex, which is independent of this
+hierarchy. One can use ResourceMutex instead of a regular mutex if one desires the features of the
+lock manager, such as fair queuing and the ability to have multiple simultaneous lock holders.
-mention risk of deadlocks motivation
+## Lock Modes
+
+The lock manager keeps track of each Resource's _granted locks_ and a queue of _waiting locks_.
+Rather than the binary "locked-or-not" modes of a mutex, a MongoDB lock can have one of several
+_modes_. Modes have different _compatibilities_ with other locks for the same resource. Locks with
+compatible modes can be simultaneously granted to the same resource, while locks with modes that are
+incompatible with any currently granted lock on a resource must wait in the waiting queue for that
+resource until the conflicting granted locks are unlocked. The different types of modes are:
+1. X (exclusive): Used to perform writes and reads on the resource.
+2. S (shared): Used to perform only reads on the resource (thus, it is okay to Share with other
+ compatible locks).
+3. IX (intent-exclusive): Used to indicate that an X lock is taken at a level in the hierarchy below
+ this resource. This lock mode is used to block X or S locks on this resource.
+4. IS (intent-shared): Used to indicate that an S lock is taken at a level in the hierarchy below
+ this resource. This lock mode is used to block X locks on this resource.
+
+## Lock Compatibility Matrix
+
+This matrix answers the question: given a lock already granted on a resource in one mode, is a
+newly requested lock on that same resource in another mode compatible with it?
+
+| Requested Mode ||| Granted Mode |||
+|:---------------|:---------:|:-------:|:-------:|:------:|:------:|
+| | MODE_NONE | MODE_IS | MODE_IX | MODE_S | MODE_X |
+| MODE_IS | Y | Y | Y | Y | N |
+| MODE_IX | Y | Y | Y | N | N |
+| MODE_S | Y | Y | N | Y | N |
+| MODE_X | Y | N | N | N | N |
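
Transcribed into code, the matrix can serve as a quick reference. The enum values mirror the server's lock mode names (defined in `src/mongo/db/concurrency/lock_manager_defs.h`), but this table-lookup sketch is illustrative, not the lock manager's actual implementation:

```cpp
#include <cassert>

// Hypothetical enum mirroring the server's lock mode names.
enum LockMode { MODE_NONE, MODE_IS, MODE_IX, MODE_S, MODE_X, LockModesCount };

// Rows are the requested mode, columns the already-granted mode, exactly as
// in the table above.
static const bool kCompatibility[LockModesCount][LockModesCount] = {
    //             NONE   IS     IX     S      X
    /* NONE */    {true,  true,  true,  true,  true},
    /* IS   */    {true,  true,  true,  true,  false},
    /* IX   */    {true,  true,  true,  false, false},
    /* S    */    {true,  true,  false, true,  false},
    /* X    */    {true,  false, false, false, false},
};

bool isCompatible(LockMode requested, LockMode granted) {
    return kCompatibility[requested][granted];
}
```

For example, two writers can hold IX on the same collection simultaneously (the conflict, if any, is resolved at the document level), but a requested S lock must wait behind a granted IX.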
+
+Typically, locks are granted in the order they are queued, but some LockRequest behaviors can be
+specially selected to break this rule. One behavior is _enqueueAtFront_, which allows important lock
+acquisitions to cut to the front of the line, in order to expedite them. Currently, all mode X and S
+locks for the three Global Resources (Global, RSTL, and PBWM) automatically use this option.
+Another behavior is _compatibleFirst_, which allows compatible lock requests to cut ahead of others
+waiting in the queue and be granted immediately; note that this mode might starve queued lock
+requests indefinitely.
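
A minimal sketch of the _enqueueAtFront_ behavior, assuming a simple deque-based waiting queue; the server's actual `LockRequest` and queue structures are more involved, and the names here are invented for the example:

```cpp
#include <deque>
#include <string>
#include <utility>

// Illustrative sketch of the enqueueAtFront queueing behavior.
struct LockRequest {
    std::string opName;
    bool enqueueAtFront = false;
};

struct LockQueue {
    std::deque<LockRequest> waiting;

    void enqueue(LockRequest req) {
        // Important acquisitions (e.g. Global X/S locks) cut to the front.
        if (req.enqueueAtFront)
            waiting.push_front(std::move(req));
        else
            waiting.push_back(std::move(req));
    }
};
```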
### Replication State Transition Lock (RSTL)
+The Replication State Transition Lock is of ResourceType Global, so it must be locked prior to
+locking any Database level resource. This lock is used to synchronize replica state transitions
+(typically transitions between PRIMARY, SECONDARY, and ROLLBACK states).
+More information on the RSTL is contained in the [Replication Architecture Guide](https://github.com/mongodb/mongo/blob/b4db8c01a13fd70997a05857be17548b0adec020/src/mongo/db/repl/README.md#replication-state-transition-lock).
+
### Parallel Batch Writer Mode Lock (PBWM)
+The Parallel Batch Writer Mode lock is of ResourceType Global, so it must be locked prior to locking
+any Database level resource. This lock is used to synchronize secondary oplog application with other
+readers, so that they do not observe inconsistent snapshots of the data. Typically this is only an
+issue for readers that read with no timestamp; readers at explicit timestamps can acquire this lock
+in a mode compatible with the oplog applier and thus are not blocked while the oplog applier is
+running.
+More information on the PBWM lock is contained in the [Replication Architecture Guide](https://github.com/mongodb/mongo/blob/b4db8c01a13fd70997a05857be17548b0adec020/src/mongo/db/repl/README.md#parallel-batch-writer-mode).
+
### Global Lock
+The resource known as the Global Lock is of ResourceType Global. It is currently used to
+synchronize shutdown, so that all operations are finished with the storage engine before closing it.
+Certain types of global storage engine operations, such as recoverToStableTimestamp(), also require
+this lock to be held in exclusive mode.
+
### Database Lock
+Any resource of ResourceType Database protects certain database-wide operations such as database
+drop. These operations are being phased out, in the hopes that we can eliminate this ResourceType
+completely.
+
### Collection Lock
-### Document Level Concurrency Control
-Explain WT's optimistic concurrency control, and why we do not need document locks in the MongoDB layer.
+Any resource of ResourceType Collection protects certain collection-wide operations, and in some
+cases also protects the in-memory catalog structure consistency in the face of concurrent readers
+and writers of the catalog. Acquiring this resource with an intent lock is an indication that the
+operation is doing explicit reads (IS) or writes (IX) at the document level. There is no Document
+ResourceType, as locking at this level is done in the storage engine itself for efficiency reasons.
-### Mutexes
+### Document Level Concurrency Control
-### FCV Lock
+Each storage engine is responsible for locking at the document level. The WiredTiger storage engine
+uses MVCC (multiversion concurrency control) along with optimistic locking in order to provide
+concurrency guarantees.
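
A toy version of optimistic, version-based conflict detection at the document level might look like the following. The exception is modeled on the server's WriteConflictException; the `Document` type and `update()` function are invented for this example and do not reflect WiredTiger's API:

```cpp
#include <atomic>
#include <cstdint>
#include <stdexcept>

// Illustrative sketch of optimistic document-level concurrency.
struct WriteConflict : std::runtime_error {
    WriteConflict() : std::runtime_error("write conflict, retry the transaction") {}
};

struct Document {
    std::atomic<std::uint64_t> version{0};
    int value{0};
};

// The write succeeds only if no other writer bumped the version since the
// caller read it; otherwise the caller must retry its transaction.
void update(Document& doc, std::uint64_t versionRead, int newValue) {
    std::uint64_t expected = versionRead;
    if (!doc.version.compare_exchange_strong(expected, versionRead + 1))
        throw WriteConflict();
    doc.value = newValue;
}
```

No lock is held while the writer decides what to write; the conflict is detected only at commit of the write, which is what makes the scheme optimistic.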
## Two-Phase Locking
-We use this for transactions? Explain.
-
-## Replica Set Transaction Locking
-TBD: title of this section -- there is some confusion over what terminology will be best understood
-Stashing and unstashing locks for replica set level transactions across multiple statements.
-Read's IS locks are converted to IX locks in replica set transactions.
-
-## Locking Best Practices
-
-### Network Calls
-i.e., never hold a lock across a network call unless absolutely necessary
-### Long Running I/O
-i.e., don't hold a lock across journal flushing
+The lock manager automatically provides _two-phase locking_ for a given storage transaction.
+Two-phase locking consists of an Expanding phase where locks are acquired but not released, and a
+subsequent Shrinking phase where locks are released but not acquired. By adhering to this protocol,
+a transaction will be guaranteed to be serializable with other concurrent transactions. The
+WriteUnitOfWork class manages two-phase locking in MongoDB. This results in the somewhat unexpected
+behavior of the RAII locking types acquiring locks on resources upon their construction but not
+unlocking the lock upon their destruction when going out of scope. Instead, the responsibility of
+unlocking the locks is transferred to the WriteUnitOfWork destructor. Note this is only true for
+transactions that do writes, and therefore only for code that uses WriteUnitOfWork.
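
The expanding/shrinking phases and the hand-off of unlocks to the WriteUnitOfWork destructor can be sketched as follows. These RAII types are simplified stand-ins; the real WriteUnitOfWork and Lock types coordinate through the Locker and RecoveryUnit, not through callbacks:

```cpp
#include <functional>
#include <vector>

// Illustrative sketch of two-phase locking via deferred unlocks.
struct WriteUnitOfWork {
    std::vector<std::function<void()>> deferredUnlocks;

    ~WriteUnitOfWork() {
        // Shrinking phase: every lock is released here, in reverse order,
        // when the storage transaction ends.
        for (auto it = deferredUnlocks.rbegin(); it != deferredUnlocks.rend(); ++it)
            (*it)();
    }
};

struct ScopedLock {
    WriteUnitOfWork& wuow;
    bool* lockedFlag;

    // Expanding phase: the lock is acquired on construction.
    ScopedLock(WriteUnitOfWork& w, bool* flag) : wuow(w), lockedFlag(flag) {
        *lockedFlag = true;
    }

    // Going out of scope does NOT unlock; the unlock is handed off to the
    // WriteUnitOfWork, which performs it in its destructor.
    ~ScopedLock() {
        bool* f = lockedFlag;
        wuow.deferredUnlocks.push_back([f] { *f = false; });
    }
};
```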
-### FCV Lock Usage
# Indexes
@@ -314,7 +388,7 @@ have the following procedure:
## Hybrid Index Builds
Hybrid index builds refer to the default procedure introduced in 4.2 that produces efficient index
-data structures without blocking reads or writes for extended periods of time. This is acheived by
+data structures without blocking reads or writes for extended periods of time. This is achieved by
performing a full collection scan and bulk-loading keys (described above) while concurrently
intercepting new writes into a temporary storage engine table.
@@ -420,7 +494,7 @@ data on a collection and performed the first drain of side-writes. Voting is imp
`voteCommitIndexBuild` command, and is persisted as a write to the replicated
`config.system.indexBuilds` collection.
-While waiting for a commit decision, primaries and secondaries continue recieving and applying new
+While waiting for a commit decision, primaries and secondaries continue receiving and applying new
side writes. When a quorum is reached, the current primary, under a collection X lock, will check
all index constraints. If there are errors, it will replicate an `abortIndexBuild` oplog entry. If
the index build is successful, it will replicate a `commitIndexBuild` oplog entry.
@@ -620,7 +694,7 @@ MongoDB repair attempts to address the following forms of corruption:
configuration](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/repair_database_and_check_version.cpp#L460-L485)
if data has been or could have been modified. This [prevents a repaired node from
joining](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L486-L494)
- and threatening the consisency of its replica set.
+ and threatening the consistency of its replica set.
Additionally:
* When repair starts, it creates a temporary file, `_repair_incomplete` that is only removed when