From 134c3ee37576de7ec36456fe3420305d6871b874 Mon Sep 17 00:00:00 2001
From: Louis Williams <louis.williams@mongodb.com>
Date: Fri, 31 Jul 2020 16:17:02 -0400
Subject: SERVER-47287 Execution Architecture Guide: Read Operations

---
 src/mongo/db/catalog/README.md | 70 +++++++++++++++++++++++++++++++++++++-----
 src/mongo/db/repl/README.md    |  7 +++--
 2 files changed, 67 insertions(+), 10 deletions(-)

(limited to 'src')

diff --git a/src/mongo/db/catalog/README.md b/src/mongo/db/catalog/README.md
index 55135fa67f3..1544263788a 100644
--- a/src/mongo/db/catalog/README.md
+++ b/src/mongo/db/catalog/README.md
@@ -126,15 +126,71 @@ Maybe include a discussion of how MongoDB read concerns translate into particula
 ## MongoDB Point-in-Time Read
 
 # Read Operations
-How does a read work?
 
-## Collection Read
-how it works, what tables
+All read operations on collections and indexes are required to take collection locks. Storage
+engines that provide document-level concurrency require all operations to hold at least a collection
+IS lock. With the WiredTiger storage engine, the MongoDB integration layer implicitly starts a
+storage transaction on the first attempt to read from a collection or index. Unless a read operation
+is part of a larger write operation, the transaction is rolled-back automatically when the last
+GlobalLock is released, explicitly during query yielding, or from a call to abandonSnapshot();
 
-## Index Read
-_could pull out index reads and writes into its own section, if preferable_
-
-how it works, goes from index table to collection table -- two lookups
+See
+[WiredTigerCursor](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/wiredtiger/wiredtiger_cursor.cpp#L48),
+[WiredTigerRecoveryUnit::getSession](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp#L303-L305),
+[GlobalLock dtor](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/concurrency/d_concurrency.h#L228-L239),
+[PlanYieldPolicy::_yieldAllLocks](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/query/plan_yield_policy.cpp#L182),
+[RecoveryUnit::abandonSnapshot](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/recovery_unit.h#L217).
+
+## Collection Reads
+
+Collection reads act directly on a
+[RecordStore](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/record_store.h#L202)
+or
+[RecordCursor](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/record_store.h#L102).
+The Collection object also provides [higher-level
+accessors](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/catalog/collection.h#L279)
+to the RecordStore.
+
+## Index Reads
+
+Index reads act directly on a
+[SortedDataInterface::Cursor](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/sorted_data_interface.h#L214).
+Most readers create cursors rather than interacting with indexes through the
+[IndexAccessMethod](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/index/index_access_method.h#L142).
+
+## AutoGetCollectionForRead 
+
+The
+[AutoGetCollectionForRead](https://github.com/mongodb/mongo/blob/58283ca178782c4d1c4a4d2acd4313f6f6f86fd5/src/mongo/db/db_raii.cpp#L89)
+(AGCFR) RAII type is used by most client read operations. In addition to acquiring all necessary
+locks in the hierarchy, it ensures that operations reading at points in time are respecting the
+visibility rules of collection data and metadata.
+
+AGCFR ensures that operations reading at a timestamp do not read at times later than metadata
+changes on the collection (see
+[here](https://github.com/mongodb/mongo/blob/58283ca178782c4d1c4a4d2acd4313f6f6f86fd5/src/mongo/db/db_raii.cpp#L158)).
+
+## Secondary Reads
+
+The oplog applier applies entries out-of-order to provide parallelism for data replication. This
+exposes readers with no set read timetsamp to the possibility of seeing inconsistent states of data.
+To solve this problem, the oplog applier takes the ParallelBatchWriterMode (PBWM) lock in X mode,
+and readers using no read timestamp are expected to take the PBWM lock in IS mode to avoid observing
+inconsistent data mid-batch.
+
+Reads on secondaries are able to opt-out of taking the PBWM lock and read at replication's
+[lastApplied](../repl/README.md#replication-timestamp-glossary) optime instead (see
+[SERVER-34192](https://jira.mongodb.org/browse/SERVER-34192)). LastApplied is used because on
+secondaries it is only updated after each oplog batch, which is a known consistent state of data.
+This allows operations to avoid taking the PBWM lock, and thus not conflict with oplog application.
+
+AGCFR provides the mechanism for secondary reads. This is implemented by [opting-out of the
+ParallelBatchWriterMode
+lock](https://github.com/mongodb/mongo/blob/58283ca178782c4d1c4a4d2acd4313f6f6f86fd5/src/mongo/db/db_raii.cpp#L98)
+and switching the ReadSource of [eligible
+readers](https://github.com/mongodb/mongo/blob/58283ca178782c4d1c4a4d2acd4313f6f6f86fd5/src/mongo/db/storage/snapshot_helper.cpp#L106)
+to read at
+[kLastApplied](https://github.com/mongodb/mongo/blob/58283ca178782c4d1c4a4d2acd4313f6f6f86fd5/src/mongo/db/storage/recovery_unit.h#L411).
 
 # Write Operations
 an overview of how writes (insert, update, delete) are processed
diff --git a/src/mongo/db/repl/README.md b/src/mongo/db/repl/README.md
index 08f58e5723b..a2d94f8e230 100644
--- a/src/mongo/db/repl/README.md
+++ b/src/mongo/db/repl/README.md
@@ -1814,9 +1814,10 @@ Unstable checkpoints simply open a transaction and read all data that is current
 time the transaction is opened. They read a consistent snapshot of data, but the snapshot they read
 from is not associated with any particular timestamp.
 
-**`lastApplied`**: In-memory record of the latest applied oplog entry optime. It may lag behind the
-optime of the newest oplog entry that is visible in the storage engine because it is updated after
-a storage transaction commits.
+**`lastApplied`**: In-memory record of the latest applied oplog entry optime. On primaries, it may
+lag behind the optime of the newest oplog entry that is visible in the storage engine because it is
+updated after a storage transaction commits. On secondaries, lastApplied is only updated at the
+completion of an oplog batch.
 
 **`lastCommittedOpTime`**: A node’s local view of the latest majority committed optime. Every time
 we update this optime, we also recalculate the `stable_timestamp`. Note that the
-- 
cgit v1.2.1