summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorLouis Williams <louis.williams@mongodb.com>2020-06-10 13:48:14 -0400
committerEvergreen Agent <no-reply@evergreen.mongodb.com>2020-06-16 18:54:40 +0000
commit02eb9bc441386fa2d6fd1999c2f713dccacb9f24 (patch)
treed60fa5ae04ee28a12547b6070c7fd52c7d36cfaf
parent27f03afb0680736eca3a849f4cd00743ff2623f3 (diff)
downloadmongo-02eb9bc441386fa2d6fd1999c2f713dccacb9f24.tar.gz
SERVER-47289 Execution Architecture Guide: Index Builds
-rw-r--r--src/mongo/db/catalog/README.md226
1 files changed, 214 insertions, 12 deletions
diff --git a/src/mongo/db/catalog/README.md b/src/mongo/db/catalog/README.md
index 92a38d62480..811ea7e5e44 100644
--- a/src/mongo/db/catalog/README.md
+++ b/src/mongo/db/catalog/README.md
@@ -16,8 +16,37 @@ in memory versioning (or lack thereof) is separate from on disk
#### The Minimum Visible Snapshot
-## On-Disk Catalog
-Discuss what the catalog looks like on disk -- e.g. just another WT data table we structure specially
+## Durable Catalog
+Discuss what the catalog looks like on disk -- e.g. just another WT data table we structure
+specially
+
+**Example**: an entry in the durable catalog for a collection `test.employees` with an in-progress
+index build on `{lastName: 1}`:
+
+```
+ {'ident': 'collection-0--2147780727179663754',
+ 'idxIdent': {'_id_': 'index-1--2147780727179663754',
+ 'lastName_1': 'index-2--2147780727179663754'},
+ 'md': {'indexes': [{'backgroundSecondary': False,
+ 'multikey': False,
+ 'multikeyPaths': {'_id': Binary('\x00', 0)},
+ 'ready': True,
+ 'spec': {'key': {'_id': 1},
+ 'name': '_id_',
+ 'v': 2}},
+ {'backgroundSecondary': False,
+ 'multikey': False,
+ 'multikeyPaths': {'_id': Binary('\x00', 0)},
+ 'ready': False,
+ 'buildUUID': UUID('d86e8657-1060-4efd-b891-0034d28c3078'),
+ 'spec': {'key': {'lastName': 1},
+ 'name': 'lastName_1',
+ 'v': 2}}],
+ 'ns': 'test.employees',
+ 'options': {'uuid': UUID('795453e9-867b-4804-a432-43637f500cf7')}},
+ 'ns': 'test.employees'}
+```
+
### Catalog Data Formats
What do the catalog documents look like? Are there catalog concepts constructed only in-memory and not on disk?
@@ -116,23 +145,196 @@ i.e., don't hold a lock across journal flushing
### FCV Lock Usage
-# Index Builds
-How do indexs work?
+# Indexes
-Read collection table, sort in-memory, write to index table.
+An index is a storage engine data structure that provides efficient lookup on fields in a
+collection's data set. Indexes map document fields, keys, to documents such that a full collection
+scan is not required when querying on a specific field.
-## Hybrid Index Build
-KeyString has its own section below. Index builds discussion may need to reference it
+All user collections have a unique index on the `_id` field, which is required. The oplog and some
+system collections do not have an _id index.
-### Temporary Side Table For New Writes
+Also see [MongoDB Manual - Indexes](https://docs.mongodb.com/manual/indexes/).
+
+## Index Constraints
+
+### Unique indexes
+
+A unique index maintains a constraint such that duplicate values are not allowed on the indexed
+field(s).
+
+### Multikey Indexes
-### Temporary Table For Transient Conflicts
+An index is considered "multikey" if there are multiple keys that map to the same record. That is,
+there are indexed fields with array values. For example, with an index on `{a: 1}`, the document
+`{a: [1, 2, 3]}` automatically makes the index multikey. If an index is flagged as multikey, queries
+change behavior when reading from the index. It makes reads less efficient because queries can no
+longer assume that after reading an index entry, no further entries will have the same key values.
-## Index Locks
+When the first multikey document is inserted into an index, a `multikey: true` flag is set on the
+index in the durable catalog entry for the collection. Since this catalog entry is a document shared
+across the entire collection, allowing any writer to modify the catalog entry would result in
+excessive WriteConflictExceptions for other writers.
+
+To solve this problem, the multikey state is tracked in memory, and only persisted when it changes
+to `true`. Once `true`, an index is always multikey.
+
+See
+[MultiKeyPaths](https://github.com/mongodb/mongo/blob/r4.4.0-rc9/src/mongo/db/index/multikey_paths.h#L57),
+[IndexCatalogEntryImpl::setMultikey](https://github.com/mongodb/mongo/blob/r4.4.0-rc9/src/mongo/db/catalog/index_catalog_entry_impl.cpp#L184),
+and [Multikey Indexes - MongoDB Manual](https://docs.mongodb.com/manual/core/index-multikey/).
+
+# Index Builds
+
+Indexes are built by performing a full scan of collection data. To be considered consistent, an
+index must correctly map keys to all documents.
+
+At a high level, omitting details that will be elaborated upon in further sections, index builds
+have the following procedure:
+* While holding a collection X lock, write a new index entry to the array of indexes included as
+ part of a durable catalog entry. This entry has a `ready: false` component. See [Durable
+ Catalog](#durable-catalog).
+* Downgrade to a collection IX lock.
+* Scan all documents on the collection to be indexed
+ * Generate [KeyString](#keystring) keys for the indexed fields for each document
+ * Periodically yield locks and storage engine snapshots
+ * Insert the generated keys into the [external sorter](#the-external-sorter)
+* Read the sorted keys from the external sorter and [bulk
+ load](http://source.wiredtiger.com/3.2.1/tune_bulk_load.html) into the storage engine index.
+ Bulk-loading requires keys to be inserted in sorted order, but builds a B-tree structure that is
+ more efficiently filled than with random insertion.
+* While holding a collection X lock, make a final `ready: true` write to the durable catalog.
+
+
+## Hybrid Index Builds
+
+Hybrid index builds refer to the default procedure introduced in 4.2 that produces efficient index
+data structures without blocking reads or writes for extended periods of time. This is acheived by
+performing a full collection scan and bulk-loading keys (described above) while concurrently
+intercepting new writes into a temporary storage engine table.
+
+### Temporary Side Table For New Writes
-## Multikey Indexes
+During an index build, new writes (i.e. inserts, updates, and deletes) are applied to the collection
+as usual. However, instead of writing directly into the index table as a normal write would, index
+keys for documents are generated and intercepted by inserting into a temporary _side-writes_ table.
+Writes are intercepted for the duration of the index build, from before the collection scan begins
+until the build is completed.
+
+Both inserted and removed keys are recorded in the _side-writes_ table. For example, during an index
+build on `{a: 1}`, an update on a document from `{_id: 0, a: 1}` to `{_id: 0, a: 2}` is recorded as
+a deletion of the key `1` and an insertion of the key `2`.
+
+Once the collection scan and bulk-load phases of the index build are complete, these intercepted
+keys are applied directly to the index in three phases:
+* While holding a collection IX lock to allow concurrent reads and writes
+ * Because writes are still accepted, new keys may appear at the end of the _side-writes_ table.
+ They will be applied in subsequent steps.
+* While holding a collection S lock to block concurrent writes, but not reads
+* While holding a collection X lock to block all reads and writes
+
+See
+[IndexBuildInterceptor::sideWrite](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/index/index_build_interceptor.cpp#L403)
+and
+[IndexBuildInterceptor::drainWritesIntoIndex](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/index/index_build_interceptor.cpp#L135).
+
+### Temporary Table For Duplicate Key Violations
+
+Unique indexes created with `{unique: true}` enforce a constraint that there are no duplicate keys
+in an index. The hybrid index procedure makes it challenging to detect duplicates because keys are
+split between the bulk-loaded index and the side-writes table. Additionally, during the lifetime of
+an index build, concurrent writes may introduce and resolve duplicate key conflicts on the index.
+
+For those reasons, during an index build we temporarily allow duplicate key violations, and record
+any detected violations in a temporary table, the _duplicate key table_. At the conclusion of the
+index build, under a collection X lock, [duplicate keys are
+re-checked](https://github.com/mongodb/mongo/blob/r4.4.0-rc9/src/mongo/db/index_builds_coordinator.cpp#L2312).
+If there are still constraint violations, an error is thrown.
+
+See
+[DuplicateKeyTracker](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/index/duplicate_key_tracker.h#L48).
+
+### Temporary Table For Key Generation Errors
+
+In addition to uniqueness constraints, indexes may have per-key constraints. For example, a compound
+index may not be built on documents with parallel arrays. An index build on `{a: 1, b: 1}` will fail
+to generate a key for `{a: [1, 2, 3], b: [4, 5, 6]}`.
+
+On a primary under normal circumstances, we could fail an index build immediately after encountering
+a key generation error. Since secondaries apply oplog entries [out of
+order](../repl/README.md#oplog-entry-application), however, spurious key generation errors may be
+encountered on otherwise consistent data. To solve this problem, we can relax key constraints and
+suppress key generation errors on secondaries.
+
+With the introduction of simultaneous index builds, an index build may be started on a secondary
+node, but complete while it is a primary after a state transition. If we ignored constraints while
+in the secondary state, we would not be able to commit the index build and guarantee its consistency
+since we may have suppressed valid key generation errors.
+
+To solve this problem, on both primaries and secondaries, the records associated with key generation
+errors are skipped and recorded in a temporary table, the _skipped record table_. Like duplicate key
+constraints, but only on primaries at the conclusion of the index build, the keys for the [skipped
+records are
+re-generated](https://github.com/mongodb/mongo/blob/r4.4.0-rc9/src/mongo/db/index_builds_coordinator.cpp#L2294)
+and re-inserted under a collection X lock. If there are still constraint violations, an error is
+thrown. Secondaries rely on the primary's decision to commit as assurance that skipped records do
+not need to be checked.
+
+See
+[SkippedRecordTracker](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/index/skipped_record_tracker.h#L45).
+
+## Replica Set Index Builds
+
+Also referred to as "simultaneous index builds" and "two-phase index builds".
+
+As of 4.4, index builds in a replica set use a two-phase commit protocol. When a primary starts an
+index build, it spawns a background thread and replicates a `startIndexBuild` oplog entry. Secondary
+nodes will start the index build in the background as soon as they apply that oplog entry. When a
+primary is done with its indexing, it will decide to replicate either an `abortIndexBuild` or
+`commitIndexBuild` oplog entry.
+
+Simultaneous index builds are resilient to replica set state transitions. The node that starts an
+index build does not need to be the same node that decides to commit it.
+
+See [Index Builds in Replicated Environments - MongoDB
+Manual](https://docs.mongodb.com/master/core/index-creation/#index-builds-in-replicated-environments).
+
+### Commit Quorum
+
+A primary will not commit an index build until a minimum number of data-bearing nodes have completed
+the index build and are ready to commit. This threshold is called the _commit quorum_.
+
+A `commitQuorum` option can be provided to the `createIndexes` command and specifies the number of
+nodes, including itself, a primary must wait to be ready before committing. The `commitQuorum`
+option accepts the same range of values as the writeConcern `"w"` option. This can be an integer
+specifying the number of nodes, `"majority"`, `"votingMembers"`, or a replica set tag. The default value
+is `"votingMembers"`, or all voting data-bearing nodes.
+
+Nodes (both primary and secondary) submit votes to the primary when they have finished scanning all
+data on a collection and performed the first drain of side-writes. Voting is implemented by a
+`voteCommitIndexBuild` command, and is persisted as a write to the replicated
+`config.system.indexBuilds` collection.
+
+While waiting for a commit decision, primaries and secondaries continue receiving and applying new
+side writes. When a quorum is reached, the current primary, under a collection X lock, will check
+all index constraints. If there are errors, it will replicate an `abortIndexBuild` oplog entry. If
+the index build is successful, it will replicate a `commitIndexBuild` oplog entry.
+
+Secondaries that were not included in the commit quorum and recieve a `commitIndexBuild` oplog entry
+will block replication until their index build is complete.
+
+See
+[IndexBuildsCoordinator::_waitForNextIndexBuildActionAndCommit](https://github.com/mongodb/mongo/blob/r4.4.0-rc9/src/mongo/db/index_builds_coordinator_mongod.cpp#L632).
+
+## Single-Phase Index Builds
+
+Index builds on empty collections replicate a `createIndexes` oplog entry. This oplog entry was used
+before FCV 4.4 for all index builds, but continues to be used in FCV 4.4 only for index builds that
+are considered "single-phase" and do not need to run in the background. Unlike two-phase index
+builds, the `createIndexes` oplog entry is always applied synchronously on secondaries during batch
+application.
-## Cross-Replica Set Index Builds
+See [createIndexForApplyOps](https://github.com/mongodb/mongo/blob/6ea7d1923619b600ea0f16d7ea6e82369f288fd4/src/mongo/db/repl/oplog.cpp#L176-L183).
# KeyString