diff options
author | Louis Williams <louis.williams@mongodb.com> | 2020-06-10 13:48:14 -0400 |
---|---|---|
committer | Evergreen Agent <no-reply@evergreen.mongodb.com> | 2020-06-16 18:54:40 +0000 |
commit | 02eb9bc441386fa2d6fd1999c2f713dccacb9f24 (patch) | |
tree | d60fa5ae04ee28a12547b6070c7fd52c7d36cfaf | |
parent | 27f03afb0680736eca3a849f4cd00743ff2623f3 (diff) | |
download | mongo-02eb9bc441386fa2d6fd1999c2f713dccacb9f24.tar.gz |
SERVER-47289 Execution Architecture Guide: Index Builds
-rw-r--r-- | src/mongo/db/catalog/README.md | 226 |
1 files changed, 214 insertions, 12 deletions
diff --git a/src/mongo/db/catalog/README.md b/src/mongo/db/catalog/README.md index 92a38d62480..811ea7e5e44 100644 --- a/src/mongo/db/catalog/README.md +++ b/src/mongo/db/catalog/README.md @@ -16,8 +16,37 @@ in memory versioning (or lack thereof) is separate from on disk #### The Minimum Visible Snapshot -## On-Disk Catalog -Discuss what the catalog looks like on disk -- e.g. just another WT data table we structure specially +## Durable Catalog +Discuss what the catalog looks like on disk -- e.g. just another WT data table we structure +specially + +**Example**: an entry in the durable catalog for a collection `test.employees` with an in-progress +index build on `{lastName: 1}`: + +``` + {'ident': 'collection-0--2147780727179663754', + 'idxIdent': {'_id_': 'index-1--2147780727179663754', + 'lastName_1': 'index-2--2147780727179663754'}, + 'md': {'indexes': [{'backgroundSecondary': False, + 'multikey': False, + 'multikeyPaths': {'_id': Binary('\x00', 0)}, + 'ready': True, + 'spec': {'key': {'_id': 1}, + 'name': '_id_', + 'v': 2}}, + {'backgroundSecondary': False, + 'multikey': False, + 'multikeyPaths': {'_id': Binary('\x00', 0)}, + 'ready': False, + 'buildUUID': UUID('d86e8657-1060-4efd-b891-0034d28c3078'), + 'spec': {'key': {'lastName': 1}, + 'name': 'lastName_1', + 'v': 2}}], + 'ns': 'test.employees', + 'options': {'uuid': UUID('795453e9-867b-4804-a432-43637f500cf7')}}, + 'ns': 'test.employees'} +``` + ### Catalog Data Formats What do the catalog documents look like? Are there catalog concepts constructed only in-memory and not on disk? @@ -116,23 +145,196 @@ i.e., don't hold a lock across journal flushing ### FCV Lock Usage -# Index Builds -How do indexs work? +# Indexes -Read collection table, sort in-memory, write to index table. +An index is a storage engine data structure that provides efficient lookup on fields in a +collection's data set. Indexes map document fields, keys, to documents such that a full collection +scan is not required when querying on a specific field. -## Hybrid Index Build -KeyString has its own section below. Index builds discussion may need to reference it +All user collections have a unique index on the `_id` field, which is required. The oplog and some +system collections do not have an _id index. -### Temporary Side Table For New Writes +Also see [MongoDB Manual - Indexes](https://docs.mongodb.com/manual/indexes/). + +## Index Constraints + +### Unique indexes + +A unique index maintains a constraint such that duplicate values are not allowed on the indexed +field(s). + +### Multikey Indexes -### Temporary Table For Transient Conflicts +An index is considered "multikey" if there are multiple keys that map to the same record. That is, +there are indexed fields with array values. For example, with an index on `{a: 1}`, the document +`{a: [1, 2, 3]}` automatically makes the index multikey. If an index is flagged as multikey, queries +change behavior when reading from the index. It makes reads less efficient because queries can no +longer assume that after reading an index entry, no further entries will have the same key values. -## Index Locks +When the first multikey document is inserted into an index, a `multikey: true` flag is set on the +index in the durable catalog entry for the collection. Since this catalog entry is a document shared +across the entire collection, allowing any writer to modify the catalog entry would result in +excessive WriteConflictExceptions for other writers. + +To solve this problem, the multikey state is tracked in memory, and only persisted when it changes +to `true`. Once `true`, an index is always multikey. + +See +[MultiKeyPaths](https://github.com/mongodb/mongo/blob/r4.4.0-rc9/src/mongo/db/index/multikey_paths.h#L57), +[IndexCatalogEntryImpl::setMultikey](https://github.com/mongodb/mongo/blob/r4.4.0-rc9/src/mongo/db/catalog/index_catalog_entry_impl.cpp#L184), +and [Multikey Indexes - MongoDB Manual](https://docs.mongodb.com/manual/core/index-multikey/). + +# Index Builds + +Indexes are built by performing a full scan of collection data. To be considered consistent, an +index must correctly map keys to all documents. + +At a high level, omitting details that will be elaborated upon in further sections, index builds +have the following procedure: +* While holding a collection X lock, write a new index entry to the array of indexes included as + part of a durable catalog entry. This entry has a `ready: false` component. See [Durable + Catalog](#durable-catalog). +* Downgrade to a collection IX lock. +* Scan all documents on the collection to be indexed + * Generate [KeyString](#keystring) keys for the indexed fields for each document + * Periodically yield locks and storage engine snapshots + * Insert the generated keys into the [external sorter](#the-external-sorter) +* Read the sorted keys from the external sorter and [bulk + load](http://source.wiredtiger.com/3.2.1/tune_bulk_load.html) into the storage engine index. + Bulk-loading requires keys to be inserted in sorted order, but builds a B-tree structure that is + more efficiently filled than with random insertion. +* While holding a collection X lock, make a final `ready: true` write to the durable catalog. + + +## Hybrid Index Builds + +Hybrid index builds refer to the default procedure introduced in 4.2 that produces efficient index +data structures without blocking reads or writes for extended periods of time. This is acheived by +performing a full collection scan and bulk-loading keys (described above) while concurrently +intercepting new writes into a temporary storage engine table. + +### Temporary Side Table For New Writes -## Multikey Indexes +During an index build, new writes (i.e. inserts, updates, and deletes) are applied to the collection +as usual. However, instead of writing directly into the index table as a normal write would, index +keys for documents are generated and intercepted by inserting into a temporary _side-writes_ table. +Writes are intercepted for the duration of the index build, from before the collection scan begins +until the build is completed. + +Both inserted and removed keys are recorded in the _side-writes_ table. For example, during an index +build on `{a: 1}`, an update on a document from `{_id: 0, a: 1}` to `{_id: 0, a: 2}` is recorded as +a deletion of the key `1` and an insertion of the key `2`. + +Once the collection scan and bulk-load phases of the index build are complete, these intercepted +keys are applied directly to the index in three phases: +* While holding a collection IX lock to allow concurrent reads and writes + * Because writes are still accepted, new keys may appear at the end of the _side-writes_ table. + They will be applied in subsequent steps. +* While holding a collection S lock to block concurrent writes, but not reads +* While holding a collection X lock to block all reads and writes + +See +[IndexBuildInterceptor::sideWrite](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/index/index_build_interceptor.cpp#L403) +and +[IndexBuildInterceptor::drainWritesIntoIndex](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/index/index_build_interceptor.cpp#L135). + +### Temporary Table For Duplicate Key Violations + +Unique indexes created with `{unique: true}` enforce a constraint that there are no duplicate keys +in an index. The hybrid index procedure makes it challenging to detect duplicates because keys are +split between the bulk-loaded index and the side-writes table. Additionally, during the lifetime of +an index build, concurrent writes may introduce and resolve duplicate key conflicts on the index. + +For those reasons, during an index build we temporarily allow duplicate key violations, and record +any detected violations in a temporary table, the _duplicate key table_. At the conclusion of the +index build, under a collection X lock, [duplicate keys are +re-checked](https://github.com/mongodb/mongo/blob/r4.4.0-rc9/src/mongo/db/index_builds_coordinator.cpp#L2312). +If there are still constraint violations, an error is thrown. + +See +[DuplicateKeyTracker](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/index/duplicate_key_tracker.h#L48). + +### Temporary Table For Key Generation Errors + +In addition to uniqueness constraints, indexes may have per-key constraints. For example, a compound +index may not be built on documents with parallel arrays. An index build on `{a: 1, b: 1}` will fail +to generate a key for `{a: [1, 2, 3], b: [4, 5, 6]}`. + +On a primary under normal circumstances, we could fail an index build immediately after encountering +a key generation error. Since secondaries apply oplog entries [out of +order](../repl/README.md#oplog-entry-application), however, spurious key generation errors may be +encountered on otherwise consistent data. To solve this problem, we can relax key constraints and +suppress key generation errors on secondaries. + +With the introduction of simultaneous index builds, an index build may be started on a secondary +node, but complete while it is a primary after a state transition. If we ignored constraints while +in the secondary state, we would not be able to commit the index build and guarantee its consistency +since we may have suppressed valid key generation errors. + +To solve this problem, on both primaries and secondaries, the records associated with key generation +errors are skipped and recorded in a temporary table, the _skipped record table_. Like duplicate key +constraints, but only on primaries at the conclusion of the index build, the keys for the [skipped +records are +re-generated](https://github.com/mongodb/mongo/blob/r4.4.0-rc9/src/mongo/db/index_builds_coordinator.cpp#L2294) +and re-inserted under a collection X lock. If there are still constraint violations, an error is +thrown. Secondaries rely on the primary's decision to commit as assurance that skipped records do +not need to be checked. + +See +[SkippedRecordTracker](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/index/skipped_record_tracker.h#L45). + +## Replica Set Index Builds + +Also referred to as "simultaneous index builds" and "two-phase index builds". + +As of 4.4, index builds in a replica set use a two-phase commit protocol. When a primary starts an +index build, it spawns a background thread and replicates a `startIndexBuild` oplog entry. Secondary +nodes will start the index build in the background as soon as they apply that oplog entry. When a +primary is done with its indexing, it will decide to replicate either an `abortIndexBuild` or +`commitIndexBuild` oplog entry. + +Simultaneous index builds are resilient to replica set state transitions. The node that starts an +index build does not need to be the same node that decides to commit it. + +See [Index Builds in Replicated Environments - MongoDB +Manual](https://docs.mongodb.com/master/core/index-creation/#index-builds-in-replicated-environments). + +### Commit Quorum + +A primary will not commit an index build until a minimum number of data-bearing nodes have completed +the index build and are ready to commit. This threshold is called the _commit quorum_. + +A `commitQuorum` option can be provided to the `createIndexes` command and specifies the number of +nodes, including itself, a primary must wait to be ready before committing. The `commitQuorum` +option accepts the same range of values as the writeConcern `"w"` option. This can be an integer +specifying the number of nodes, `"majority"`, `"votingMembers"`, or a replica set tag. The default value +is `"votingMembers"`, or all voting data-bearing nodes. + +Nodes (both primary and secondary) submit votes to the primary when they have finished scanning all +data on a collection and performed the first drain of side-writes. Voting is implemented by a +`voteCommitIndexBuild` command, and is persisted as a write to the replicated +`config.system.indexBuilds` collection. + +While waiting for a commit decision, primaries and secondaries continue receiving and applying new +side writes. When a quorum is reached, the current primary, under a collection X lock, will check +all index constraints. If there are errors, it will replicate an `abortIndexBuild` oplog entry. If +the index build is successful, it will replicate a `commitIndexBuild` oplog entry. + +Secondaries that were not included in the commit quorum and recieve a `commitIndexBuild` oplog entry +will block replication until their index build is complete. + +See +[IndexBuildsCoordinator::_waitForNextIndexBuildActionAndCommit](https://github.com/mongodb/mongo/blob/r4.4.0-rc9/src/mongo/db/index_builds_coordinator_mongod.cpp#L632). + +## Single-Phase Index Builds + +Index builds on empty collections replicate a `createIndexes` oplog entry. This oplog entry was used +before FCV 4.4 for all index builds, but continues to be used in FCV 4.4 only for index builds that +are considered "single-phase" and do not need to run in the background. Unlike two-phase index +builds, the `createIndexes` oplog entry is always applied synchronously on secondaries during batch +application. -## Cross-Replica Set Index Builds +See [createIndexForApplyOps](https://github.com/mongodb/mongo/blob/6ea7d1923619b600ea0f16d7ea6e82369f288fd4/src/mongo/db/repl/oplog.cpp#L176-L183). # KeyString |