authorEric Milkie <milkie@10gen.com>2020-08-03 14:51:42 -0400
committerEvergreen Agent <no-reply@evergreen.mongodb.com>2020-08-05 14:10:58 +0000
commit0340f9679e05ca78bc7fea39808e80d7c7f80bc9 (patch)
tree240959598d0886f143de59678a08e152f485fa1f
parenta0345214d33cbc02ff9f1587d4b74757217cd8d7 (diff)
downloadmongo-0340f9679e05ca78bc7fea39808e80d7c7f80bc9.tar.gz
SERVER-47294 Concurrency Architecture Guide
-rw-r--r--  src/mongo/db/catalog/README.md | 136
1 file changed, 105 insertions(+), 31 deletions(-)
diff --git a/src/mongo/db/catalog/README.md b/src/mongo/db/catalog/README.md
index fe0d83e210a..37f5adb5cde 100644
--- a/src/mongo/db/catalog/README.md
+++ b/src/mongo/db/catalog/README.md
@@ -5,7 +5,7 @@ catalog, in-memory and on-disk, of collections and indexes. It also implements a
whatever a storage engine implements) concurrency control layer to safely modify the catalog while
sustaining correct and consistent collection and index data formatting.
-Execution faciliates reads and writes to the storage engine with various persistence guarantees,
+Execution facilitates reads and writes to the storage engine with various persistence guarantees,
builds indexes, supports replication rollback, manages oplog visibility, repairs data corruption
and inconsistencies, and much more.
@@ -29,7 +29,7 @@ index implementations found in the [**index/**][] directory; and the sorter foun
include discussion of RecordStore interface
### Index Catalog
-include discussion of SortedDataInferface interface
+include discussion of SortedDataInterface interface
### Versioning
in memory versioning (or lack thereof) is separate from on disk
@@ -173,7 +173,7 @@ changes on the collection (see
## Secondary Reads
The oplog applier applies entries out-of-order to provide parallelism for data replication. This
-exposes readers with no set read timetsamp to the possibility of seeing inconsistent states of data.
+exposes readers with no set read timestamp to the possibility of seeing inconsistent states of data.
To solve this problem, the oplog applier takes the ParallelBatchWriterMode (PBWM) lock in X mode,
and readers using no read timestamp are expected to take the PBWM lock in IS mode to avoid observing
inconsistent data mid-batch.
@@ -203,52 +203,126 @@ how index tables also get updated when a write happens, (numIndexes + 1) writes
## Vectored Insert
# Concurrency Control
-We have the catalog described above; now how do we protect it?
-## Lock Modes
+Theoretically, one could design a database that used only mutexes to maintain database consistency
+while supporting multiple simultaneous operations; however, this solution would result in poor
+performance and would strain the operating system. Therefore, databases typically use a more
+complex method of coordinating operations. This design consists of Resources (lockable entities),
+some of which may be organized in a Hierarchy, and Locks (requests for access to a resource). A Lock
+Manager is responsible for keeping track of Resources and Locks, and for managing each Resource's
+Lock Queue. The Lock Manager identifies Resources with a ResourceId.
+
+## Resource Hierarchy
+
+In MongoDB, Resources are arranged in a hierarchy, both to provide an ordering that prevents
+deadlocks when locking multiple Resources and to enable Intent Locking (an optimization for
+locking higher-level resources). The hierarchy of ResourceTypes is as follows:
+
+1. Global (three - see below)
+1. Database (one per database on the server)
+1. Collection (one per collection on the server)
-## Lock Granularity
-Different storage engines can support different levels of granularity.
+Resources must be locked from the top of the hierarchy down. Therefore, to lock a Collection
+resource, one must first lock the one Global resource, then lock the Database resource that is the
+parent of the Collection, and finally lock the Collection resource itself.
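
The top-down acquisition order can be sketched as a small invariant check. The `Locker` type and its `held` vector here are illustrative only; the server's actual RAII types are `Lock::GlobalLock`, `Lock::DBLock`, and `Lock::CollectionLock` in `src/mongo/db/concurrency/d_concurrency.h`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch of top-down lock acquisition, not the server's Locker.
enum ResourceType { Global = 0, Database = 1, Collection = 2 };

struct Locker {
    std::vector<ResourceType> held;  // resources locked so far, top-down

    // Each level may only be acquired once every ancestor level is held.
    void lock(ResourceType type) {
        assert(held.size() == static_cast<std::size_t>(type) &&
               "lock order is Global, then Database, then Collection");
        held.push_back(type);
    }
};
```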
-### Lock Acquisition Order
-discuss lock acquisition order
+In addition to these ResourceTypes, there also exists ResourceMutex, which is independent of this
+hierarchy. One can use ResourceMutex instead of a regular mutex if one desires the features of the
+lock manager, such as fair queuing and the ability to have multiple simultaneous lock holders.
-mention risk of deadlocks motivation
+## Lock Modes
+
+The lock manager keeps track of each Resource's _granted locks_ and a queue of _waiting locks_.
+Rather than the binary "locked-or-not" modes of a mutex, a MongoDB lock can have one of several
+_modes_. Modes have different _compatibilities_ with other locks for the same resource. Locks with
+compatible modes can be simultaneously granted to the same resource, while locks with modes that are
+incompatible with any currently granted lock on a resource must wait in the waiting queue for that
+resource until the conflicting granted locks are unlocked. The different types of modes are:
+1. X (exclusive): Used to perform writes and reads on the resource.
+2. S (shared): Used to perform only reads on the resource (thus, it is okay to Share with other
+ compatible locks).
+3. IX (intent-exclusive): Used to indicate that an X lock is taken at a level in the hierarchy below
+ this resource. This lock mode is used to block X or S locks on this resource.
+4. IS (intent-shared): Used to indicate that an S lock is taken at a level in the hierarchy below
+ this resource. This lock mode is used to block X locks on this resource.
+
+## Lock Compatibility Matrix
+
+This matrix answers the question: given a lock already granted on a resource in one mode, is a
+newly requested lock on that same resource in another mode compatible with it?
+
+| Requested Mode ||| Granted Mode |||
+|:---------------|:---------:|:-------:|:-------:|:------:|:------:|
+| | MODE_NONE | MODE_IS | MODE_IX | MODE_S | MODE_X |
+| MODE_IS | Y | Y | Y | Y | N |
+| MODE_IX | Y | Y | Y | N | N |
+| MODE_S | Y | Y | N | Y | N |
+| MODE_X | Y | N | N | N | N |
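
Transcribed into code, the matrix can serve as a quick reference. The enum values mirror the server's lock mode names (defined in `src/mongo/db/concurrency/lock_manager_defs.h`), but this table-lookup sketch is illustrative, not the lock manager's actual implementation:

```cpp
#include <cassert>

// Hypothetical enum mirroring the server's lock mode names.
enum LockMode { MODE_NONE, MODE_IS, MODE_IX, MODE_S, MODE_X, LockModesCount };

// Rows are the requested mode, columns the already-granted mode, exactly as
// in the table above.
static const bool kCompatibility[LockModesCount][LockModesCount] = {
    //             NONE   IS     IX     S      X
    /* NONE */    {true,  true,  true,  true,  true},
    /* IS   */    {true,  true,  true,  true,  false},
    /* IX   */    {true,  true,  true,  false, false},
    /* S    */    {true,  true,  false, true,  false},
    /* X    */    {true,  false, false, false, false},
};

bool isCompatible(LockMode requested, LockMode granted) {
    return kCompatibility[requested][granted];
}
```

For example, two writers can hold IX on the same collection simultaneously (the conflict, if any, is resolved at the document level), but a requested S lock must wait behind a granted IX.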
+
+Typically, locks are granted in the order they are queued, but some LockRequest behaviors can be
+specially selected to break this rule. One behavior is _enqueueAtFront_, which allows important lock
+acquisitions to cut to the front of the line, in order to expedite them. Currently, all mode X and S
+locks for the three Global Resources (Global, RSTL, and PBWM) automatically use this option.
+Another behavior is _compatibleFirst_, which allows compatible lock requests to cut ahead of others
+waiting in the queue and be granted immediately; note that this mode might starve queued lock
+requests indefinitely.
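
A minimal sketch of the _enqueueAtFront_ behavior, assuming a simple deque-based waiting queue; the server's actual `LockRequest` and queue structures are more involved, and the names here are invented for the example:

```cpp
#include <deque>
#include <string>
#include <utility>

// Illustrative sketch of the enqueueAtFront queueing behavior.
struct LockRequest {
    std::string opName;
    bool enqueueAtFront = false;
};

struct LockQueue {
    std::deque<LockRequest> waiting;

    void enqueue(LockRequest req) {
        // Important acquisitions (e.g. Global X/S locks) cut to the front.
        if (req.enqueueAtFront)
            waiting.push_front(std::move(req));
        else
            waiting.push_back(std::move(req));
    }
};
```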
### Replication State Transition Lock (RSTL)
+The Replication State Transition Lock is of ResourceType Global, so it must be locked prior to
+locking any Database level resource. This lock is used to synchronize replica state transitions
+(typically transitions between PRIMARY, SECONDARY, and ROLLBACK states).
+More information on the RSTL is contained in the [Replication Architecture Guide](https://github.com/mongodb/mongo/blob/b4db8c01a13fd70997a05857be17548b0adec020/src/mongo/db/repl/README.md#replication-state-transition-lock).
+
### Parallel Batch Writer Mode Lock (PBWM)
+The Parallel Batch Writer Mode lock is of ResourceType Global, so it must be locked prior to locking
+any Database level resource. This lock is used to synchronize secondary oplog application with other
+readers, so that they do not observe inconsistent snapshots of the data. Typically this is only an
+issue for readers that read with no timestamp; readers at explicit timestamps can acquire this lock
+in a mode compatible with the oplog applier and thus are not blocked while the oplog applier is
+running.
+More information on the PBWM lock is contained in the [Replication Architecture Guide](https://github.com/mongodb/mongo/blob/b4db8c01a13fd70997a05857be17548b0adec020/src/mongo/db/repl/README.md#parallel-batch-writer-mode).
+
### Global Lock
+The resource known as the Global Lock is of ResourceType Global. It is currently used to
+synchronize shutdown, so that all operations are finished with the storage engine before closing it.
+Certain types of global storage engine operations, such as recoverToStableTimestamp(), also require
+this lock to be held in exclusive mode.
+
### Database Lock
+Any resource of ResourceType Database protects certain database-wide operations such as database
+drop. These operations are being phased out, in the hopes that we can eliminate this ResourceType
+completely.
+
### Collection Lock
-### Document Level Concurrency Control
-Explain WT's optimistic concurrency control, and why we do not need document locks in the MongoDB layer.
+Any resource of ResourceType Collection protects certain collection-wide operations, and in some
+cases also protects the in-memory catalog structure consistency in the face of concurrent readers
+and writers of the catalog. Acquiring this resource with an intent lock is an indication that the
+operation is doing explicit reads (IS) or writes (IX) at the document level. There is no Document
+ResourceType, as locking at this level is done in the storage engine itself for efficiency reasons.
-### Mutexes
+### Document Level Concurrency Control
-### FCV Lock
+Each storage engine is responsible for locking at the document level. The WiredTiger storage engine
+uses MVCC (multiversion concurrency control) along with optimistic locking in order to provide
+concurrency guarantees.
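
A toy version of optimistic, version-based conflict detection at the document level might look like the following. The exception is modeled on the server's WriteConflictException; the `Document` type and `update()` function are invented for this example and do not reflect WiredTiger's API:

```cpp
#include <atomic>
#include <cstdint>
#include <stdexcept>

// Illustrative sketch of optimistic document-level concurrency.
struct WriteConflict : std::runtime_error {
    WriteConflict() : std::runtime_error("write conflict, retry the transaction") {}
};

struct Document {
    std::atomic<std::uint64_t> version{0};
    int value{0};
};

// The write succeeds only if no other writer bumped the version since the
// caller read it; otherwise the caller must retry its transaction.
void update(Document& doc, std::uint64_t versionRead, int newValue) {
    std::uint64_t expected = versionRead;
    if (!doc.version.compare_exchange_strong(expected, versionRead + 1))
        throw WriteConflict();
    doc.value = newValue;
}
```

No lock is held while the writer decides what to write; the conflict is detected only at commit of the write, which is what makes the scheme optimistic.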
## Two-Phase Locking
-We use this for transactions? Explain.
-
-## Replica Set Transaction Locking
-TBD: title of this section -- there is some confusion over what terminology will be best understood
-Stashing and unstashing locks for replica set level transactions across multiple statements.
-Read's IS locks are converted to IX locks in replica set transactions.
-
-## Locking Best Practices
-
-### Network Calls
-i.e., never hold a lock across a network call unless absolutely necessary
-### Long Running I/O
-i.e., don't hold a lock across journal flushing
+The lock manager automatically provides _two-phase locking_ for a given storage transaction.
+Two-phase locking consists of an Expanding phase where locks are acquired but not released, and a
+subsequent Shrinking phase where locks are released but not acquired. By adhering to this protocol,
+a transaction will be guaranteed to be serializable with other concurrent transactions. The
+WriteUnitOfWork class manages two-phase locking in MongoDB. This results in the somewhat unexpected
+behavior of the RAII locking types acquiring locks on resources upon their construction but not
+unlocking the lock upon their destruction when going out of scope. Instead, the responsibility of
+unlocking the locks is transferred to the WriteUnitOfWork destructor. Note this is only true for
+transactions that do writes, and therefore only for code that uses WriteUnitOfWork.
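
The expanding/shrinking phases and the hand-off of unlocks to the WriteUnitOfWork destructor can be sketched as follows. These RAII types are simplified stand-ins; the real WriteUnitOfWork and Lock types coordinate through the Locker and RecoveryUnit, not through callbacks:

```cpp
#include <functional>
#include <vector>

// Illustrative sketch of two-phase locking via deferred unlocks.
struct WriteUnitOfWork {
    std::vector<std::function<void()>> deferredUnlocks;

    ~WriteUnitOfWork() {
        // Shrinking phase: every lock is released here, in reverse order,
        // when the storage transaction ends.
        for (auto it = deferredUnlocks.rbegin(); it != deferredUnlocks.rend(); ++it)
            (*it)();
    }
};

struct ScopedLock {
    WriteUnitOfWork& wuow;
    bool* lockedFlag;

    // Expanding phase: the lock is acquired on construction.
    ScopedLock(WriteUnitOfWork& w, bool* flag) : wuow(w), lockedFlag(flag) {
        *lockedFlag = true;
    }

    // Going out of scope does NOT unlock; the unlock is handed off to the
    // WriteUnitOfWork, which performs it in its destructor.
    ~ScopedLock() {
        bool* f = lockedFlag;
        wuow.deferredUnlocks.push_back([f] { *f = false; });
    }
};
```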
-### FCV Lock Usage
# Indexes
@@ -314,7 +388,7 @@ have the following procedure:
## Hybrid Index Builds
Hybrid index builds refer to the default procedure introduced in 4.2 that produces efficient index
-data structures without blocking reads or writes for extended periods of time. This is acheived by
+data structures without blocking reads or writes for extended periods of time. This is achieved by
performing a full collection scan and bulk-loading keys (described above) while concurrently
intercepting new writes into a temporary storage engine table.
@@ -420,7 +494,7 @@ data on a collection and performed the first drain of side-writes. Voting is imp
`voteCommitIndexBuild` command, and is persisted as a write to the replicated
`config.system.indexBuilds` collection.
-While waiting for a commit decision, primaries and secondaries continue recieving and applying new
+While waiting for a commit decision, primaries and secondaries continue receiving and applying new
side writes. When a quorum is reached, the current primary, under a collection X lock, will check
all index constraints. If there are errors, it will replicate an `abortIndexBuild` oplog entry. If
the index build is successful, it will replicate a `commitIndexBuild` oplog entry.
@@ -620,7 +694,7 @@ MongoDB repair attempts to address the following forms of corruption:
configuration](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/repair_database_and_check_version.cpp#L460-L485)
if data has been or could have been modified. This [prevents a repaired node from
joining](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L486-L494)
- and threatening the consisency of its replica set.
+ and threatening the consistency of its replica set.
Additionally:
* When repair starts, it creates a temporary file, `_repair_incomplete` that is only removed when