-rw-r--r-- | src/mongo/db/catalog/README.md | 136 |
1 files changed, 105 insertions, 31 deletions
diff --git a/src/mongo/db/catalog/README.md b/src/mongo/db/catalog/README.md index fe0d83e210a..37f5adb5cde 100644 --- a/src/mongo/db/catalog/README.md +++ b/src/mongo/db/catalog/README.md @@ -5,7 +5,7 @@ catalog, in-memory and on-disk, of collections and indexes. It also implements a whatever a storage engine implements) concurrency control layer to safely modify the catalog while sustaining correct and consistent collection and index data formatting. -Execution faciliates reads and writes to the storage engine with various persistence guarantees, +Execution facilitates reads and writes to the storage engine with various persistence guarantees, builds indexes, supports replication rollback, manages oplog visibility, repairs data corruption and inconsistencies, and much more. @@ -29,7 +29,7 @@ index implementations found in the [**index/**][] directory; and the sorter foun include discussion of RecordStore interface ### Index Catalog -include discussion of SortedDataInferface interface +include discussion of SortedDataInterface interface ### Versioning in memory versioning (or lack thereof) is separate from on disk @@ -173,7 +173,7 @@ changes on the collection (see ## Secondary Reads The oplog applier applies entries out-of-order to provide parallelism for data replication. This -exposes readers with no set read timetsamp to the possibility of seeing inconsistent states of data. +exposes readers with no set read timestamp to the possibility of seeing inconsistent states of data. To solve this problem, the oplog applier takes the ParallelBatchWriterMode (PBWM) lock in X mode, and readers using no read timestamp are expected to take the PBWM lock in IS mode to avoid observing inconsistent data mid-batch. @@ -203,52 +203,126 @@ how index tables also get updated when a write happens, (numIndexes + 1) writes ## Vectored Insert # Concurrency Control -We have the catalog described above; now how do we protect it? 
-## Lock Modes +Theoretically, one could design a database that used only mutexes to maintain database consistency +while supporting multiple simultaneous operations; however, this solution would result in poor +performance and would be a strain on the operating system. Therefore, databases typically use a more +complex method of coordinating operations. This design consists of Resources (lockable entities), +some of which may be organized in a Hierarchy, and Locks (requests for access to a resource). A Lock +Manager is responsible for keeping track of Resources and Locks, and for managing each Resource's +Lock Queue. The Lock Manager identifies Resources with a ResourceId. + +## Resource Hierarchy + +In MongoDB, Resources are arranged in a hierarchy, in order to provide an ordering to prevent +deadlocks when locking multiple Resources, and also as an implementation of Intent Locking (an +optimization for locking higher level resources). The hierarchy of ResourceTypes is as follows: + +1. Global (three - see below) +1. Database (one per database on the server) +1. Collection (one per collection on the server) -## Lock Granularity -Different storage engines can support different levels of granularity. +Each resource must be locked in order from the top. Therefore, to lock a Collection resource, one +must first lock the one Global resource, and then lock the Database resource that is the parent of +the Collection. Finally, the Collection resource is locked. -### Lock Acquisition Order -discuss lock acquisition order +In addition to these ResourceTypes, there also exists ResourceMutex, which is independent of this +hierarchy. One can use ResourceMutex instead of a regular mutex if one desires the features of the +lock manager, such as fair queuing and the ability to have multiple simultaneous lock holders. 
-mention risk of deadlocks motivation +## Lock Modes + +The lock manager keeps track of each Resource's _granted locks_ and a queue of _waiting locks_. +Rather than the binary "locked-or-not" modes of a mutex, a MongoDB lock can have one of several +_modes_. Modes have different _compatibilities_ with other locks for the same resource. Locks with +compatible modes can be simultaneously granted to the same resource, while locks with modes that are +incompatible with any currently granted lock on a resource must wait in the waiting queue for that +resource until the conflicting granted locks are unlocked. The different types of modes are: +1. X (exclusive): Used to perform writes and reads on the resource. +2. S (shared): Used to perform only reads on the resource (thus, it is okay to Share with other + compatible locks). +3. IX (intent-exclusive): Used to indicate that an X lock is taken at a level in the hierarchy below + this resource. This lock mode is used to block X or S locks on this resource. +4. IS (intent-shared): Used to indicate that an S lock is taken at a level in the hierarchy below + this resource. This lock mode is used to block X locks on this resource. + +## Lock Compatibility Matrix + +This matrix answers the question, given a granted lock on a resource with the mode given, is a +requested lock on that same resource with the given mode compatible? + +| Requested Mode ||| Granted Mode ||| +|:---------------|:---------:|:-------:|:-------:|:------:|:------:| +| | MODE_NONE | MODE_IS | MODE_IX | MODE_S | MODE_X | +| MODE_IS | Y | Y | Y | Y | N | +| MODE_IX | Y | Y | Y | N | N | +| MODE_S | Y | Y | N | Y | N | +| MODE_X | Y | N | N | N | N | + +Typically, locks are granted in the order they are queued, but some LockRequest behaviors can be +specially selected to break this rule. One behavior is _enqueueAtFront_, which allows important lock +acquisitions to cut to the front of the line, in order to expedite them. 
Currently, all mode X and S +locks for the three Global Resources (Global, RSTL, and PBWM) automatically use this option. +Another behavior is _compatibleFirst_, which allows compatible lock requests to cut ahead of others +waiting in the queue and be granted immediately; note that this mode might starve queued lock +requests indefinitely. ### Replication State Transition Lock (RSTL) +The Replication State Transition Lock is of ResourceType Global, so it must be locked prior to +locking any Database level resource. This lock is used to synchronize replica state transitions +(typically transitions between PRIMARY, SECONDARY, and ROLLBACK states). +More information on the RSTL is contained in the [Replication Architecture Guide](https://github.com/mongodb/mongo/blob/b4db8c01a13fd70997a05857be17548b0adec020/src/mongo/db/repl/README.md#replication-state-transition-lock). + ### Parallel Batch Writer Mode Lock (PBWM) +The Parallel Batch Writer Mode lock is of ResourceType Global, so it must be locked prior to locking +any Database level resource. This lock is used to synchronize secondary oplog application with other +readers, so that they do not observe inconsistent snapshots of the data. Typically this is only an +issue with readers that read with no timestamp; readers at explicit timestamps can acquire this lock +in a compatible mode with the oplog applier and thus are not blocked when the oplog applier is +running. +More information on the PBWM lock is contained in the [Replication Architecture Guide](https://github.com/mongodb/mongo/blob/b4db8c01a13fd70997a05857be17548b0adec020/src/mongo/db/repl/README.md#parallel-batch-writer-mode). + ### Global Lock +The resource known as the Global Lock is of ResourceType Global. It is currently used to +synchronize shutdown, so that all operations are finished with the storage engine before closing it. 
+Certain types of global storage engine operations, such as recoverToStableTimestamp(), also require +this lock to be held in exclusive mode. + ### Database Lock +Any resource of ResourceType Database protects certain database-wide operations such as database +drop. These operations are being phased out, in the hopes that we can eliminate this ResourceType +completely. + ### Collection Lock -### Document Level Concurrency Control -Explain WT's optimistic concurrency control, and why we do not need document locks in the MongoDB layer. +Any resource of ResourceType Collection protects certain collection-wide operations, and in some +cases also protects the in-memory catalog structure consistency in the face of concurrent readers +and writers of the catalog. Acquiring this resource with an intent lock is an indication that the +operation is doing explicit reads (IS) or writes (IX) at the document level. There is no Document +ResourceType, as locking at this level is done in the storage engine itself for efficiency reasons. -### Mutexes +### Document Level Concurrency Control -### FCV Lock +Each storage engine is responsible for locking at the document level. The WiredTiger storage engine +uses MVCC (multiversion concurrency control) along with optimistic locking in order to provide +concurrency guarantees. ## Two-Phase Locking -We use this for transactions? Explain. - -## Replica Set Transaction Locking -TBD: title of this section -- there is some confusion over what terminology will be best understood -Stashing and unstashing locks for replica set level transactions across multiple statements. -Read's IS locks are converted to IX locks in replica set transactions. - -## Locking Best Practices - -### Network Calls -i.e., never hold a lock across a network call unless absolutely necessary -### Long Running I/O -i.e., don't hold a lock across journal flushing +The lock manager automatically provides _two-phase locking_ for a given storage transaction. 
+Two-phase locking consists of an Expanding phase where locks are acquired but not released, and a +subsequent Shrinking phase where locks are released but not acquired. By adhering to this protocol, +a transaction will be guaranteed to be serializable with other concurrent transactions. The +WriteUnitOfWork class manages two-phase locking in MongoDB. This results in the somewhat unexpected +behavior of the RAII locking types acquiring locks on resources upon their construction but not +unlocking the lock upon their destruction when going out of scope. Instead, the responsibility of +unlocking the locks is transferred to the WriteUnitOfWork destructor. Note that this is only true for +transactions that do writes, and therefore only for code that uses WriteUnitOfWork. -### FCV Lock Usage # Indexes @@ -314,7 +388,7 @@ have the following procedure: ## Hybrid Index Builds Hybrid index builds refer to the default procedure introduced in 4.2 that produces efficient index -data structures without blocking reads or writes for extended periods of time. This is acheived by +data structures without blocking reads or writes for extended periods of time. This is achieved by performing a full collection scan and bulk-loading keys (described above) while concurrently intercepting new writes into a temporary storage engine table. @@ -420,7 +494,7 @@ data on a collection and performed the first drain of side-writes. Voting is imp `voteCommitIndexBuild` command, and is persisted as a write to the replicated `config.system.indexBuilds` collection. -While waiting for a commit decision, primaries and secondaries continue receiving and applying new +While waiting for a commit decision, primaries and secondaries continue receiving and applying new side writes. When a quorum is reached, the current primary, under a collection X lock, will check all index constraints. If there are errors, it will replicate an `abortIndexBuild` oplog entry. 
If the index build is successful, it will replicate a `commitIndexBuild` oplog entry. @@ -620,7 +694,7 @@ MongoDB repair attempts to address the following forms of corruption: configuration](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/repair_database_and_check_version.cpp#L460-L485) if data has been or could have been modified. This [prevents a repaired node from joining](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/repl/replication_coordinator_impl.cpp#L486-L494) - and threatening the consisency of its replica set. + and threatening the consistency of its replica set. Additionally: * When repair starts, it creates a temporary file, `_repair_incomplete` that is only removed when