From 8593afe076b6035c92f52da0426a3f09534ffc5f Mon Sep 17 00:00:00 2001
From: Spencer T Brody <spencer@mongodb.com>
Date: Wed, 30 Sep 2020 14:48:43 -0400
Subject: SERVER-50786 Add architecture guide section on PrimaryOnlyService

---
 docs/primary_only_service.md | 106 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 docs/primary_only_service.md

(limited to 'docs')

diff --git a/docs/primary_only_service.md b/docs/primary_only_service.md
new file mode 100644
index 00000000000..36660b141a1
--- /dev/null
+++ b/docs/primary_only_service.md
@@ -0,0 +1,106 @@
+# PrimaryOnlyService
+
+The PrimaryOnlyService machinery provides a way to register tasks that should run only when the
+current node is Primary, and that should be driven to completion on the new Primary across replica
+set failovers. It is intended for tasks that can be modeled as a state machine with a single
+MongoDB document containing the current state, which newly-elected Primaries can use to rebuild the
+state of the task after failover and pick up where the old Primary left off.
+
+## Classes
+
+There are three main classes/interfaces that make up the PrimaryOnlyService machinery.
+
+### PrimaryOnlyServiceRegistry
+
+The PrimaryOnlyServiceRegistry is a singleton that is installed as a decoration on the
+ServiceContext at startup and lives for the lifetime of the mongod process.  During mongod global
+startup, all PrimaryOnlyServices must be registered against the PrimaryOnlyServiceRegistry before
+the ReplicationCoordinator is started, because ReplicationCoordinator startup is what starts the
+registered PrimaryOnlyServices. Specific PrimaryOnlyServices can be looked up from the registry
+at runtime, and are handed out by raw pointer, which is safe since the set of registered
+PrimaryOnlyServices does not change during runtime.  The PrimaryOnlyServiceRegistry is itself a
+[ReplicaSetAwareService](../src/mongo/db/repl/README.md#ReplicaSetAwareService-interface), which is
+how it receives notifications about changes in and out of Primary state.
+
+### PrimaryOnlyService
+
+The PrimaryOnlyService interface is used to define a new Primary Only Service.  A PrimaryOnlyService
+is a grouping of tasks (Instances) that run only when the node is Primary and are resumed after
+failover.  Each PrimaryOnlyService must declare a unique, replicated collection (most likely in the
+admin or config databases), where the state documents for all Instances of the service will be
+persisted.  At stepUp, each PrimaryOnlyService will create and launch Instance objects for each
+document found in this collection. This is how PrimaryOnlyService tasks get resumed after failover.
+
+
+### PrimaryOnlyService::Instance/TypedInstance
+
+The PrimaryOnlyService::Instance interface is used to contain the state and core logic for running a
+single task belonging to a PrimaryOnlyService. The Instance interface includes a virtual run()
+method, which is given an executor that runs all work done on behalf of the Instance.
+Implementations should not extend PrimaryOnlyService::Instance directly; instead, they
+should extend PrimaryOnlyService::TypedInstance, which allows individual Instances to be looked up
+and returned as pointers to the proper Instance sub-type. The InstanceID for an Instance is the _id
+field of its state document.
+
+
+## Defining a new PrimaryOnlyService
+
+To define a new PrimaryOnlyService one must add corresponding subclasses of both PrimaryOnlyService
+and PrimaryOnlyService::TypedInstance.  The PrimaryOnlyService subclass exists only to specify which
+collection stores the state documents for this service, and to hand out corresponding Instances
+of the proper type.  Most of the work of a new PrimaryOnlyService will be implemented in the
+PrimaryOnlyService::Instance subclass. PrimaryOnlyService::Instance subclasses will be responsible
+for running the work they need to perform to complete their task, as well as for managing and
+synchronizing their own in-memory and on-disk state. No part of the PrimaryOnlyService **machinery**
+ever performs writes to the PrimaryOnlyService state document collections.  All writes to a given
+Instance's state document (including creating it initially and deleting it when the work has been
+completed) are performed by Instance implementations.  This means that for the majority of
+PrimaryOnlyServices, the first step of the Instance's run() method will be to insert an initial
+state document into the state document collection, to ensure that the Instance is now persisted and
+will be resumed after failover.  When an Instance is resumed after failover, it is provided the
+current version of the state document as it exists in the state document collection.  That document
+can be used to rebuild the in-memory state for this Instance so that when run() is called it knows
+what state it is in and thus what work still needs to be performed, and what work has already been
+completed by the previous Primary.
+
+To see an example bare-bones PrimaryOnlyService implementation to use as a reference, check out the
+TestService defined in this unit test: https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/primary_only_service_test.cpp
+
+
+## Behavior during state transitions
+
+At stepUp, each PrimaryOnlyService queries its state document collection, and for each document
+found, creates and launches a PrimaryOnlyService::Instance initialized from the state
+document. This happens asynchronously relative to the core replication stepUp process; there is no
+guarantee that the PrimaryOnlyServices have finished rebuilding all their Instances by the time
+stepUp completes and the RSTL is released. At stepDown all Instances are interrupted, but the threads
+running their work are not joined, and the Instance objects containing their in-memory state are not
+released, until the next stepUp. This is done to reduce the likelihood of blocking within the state
+transition process and delaying it for the entire node. This behavior does, however, guarantee that
+there will never be two Instances of the same PrimaryOnlyService with the same InstanceID running at
+the same time on the same node.
+
+### Interrupting Instances at stepDown
+
+At stepDown, Instances are interrupted in three main ways, which together guarantee that no more
+work is performed on behalf of any PrimaryOnlyService.  The first is that the executor given to each
+Instance's run() method gets shut down, preventing any more work from being scheduled on behalf of
+that Instance.  The second is that all OperationContexts created on threads (Clients) that are part
+of an Executor owned by a PrimaryOnlyService get interrupted. The third is that each individual
+Instance is explicitly interrupted, so that it can unblock any work that depends on a signal from
+that Instance but runs on threads that are *not* part of an executor owned by the
+PrimaryOnlyService (e.g. commands that are waiting on the Instance to reach a certain state). Currently
+this happens via a call to an interrupt() method that each Instance must override, but in the future
+this is likely to change to signaling a CancelationToken owned by the Instance instead.
+
+## Instance lifetime
+
+Instances are held by shared_ptr in their parent PrimaryOnlyService. Each PrimaryOnlyService
+releases all Instance shared_ptrs it owns on stepDown.  Additionally, a PrimaryOnlyService will
+release an Instance shared_ptr when the state document for that Instance is deleted (via an
+OpObserver).  Since it is generally logic in an Instance's run() method that deletes its state
+document, such logic must be careful: from the moment the state document is deleted, the
+corresponding PrimaryOnlyService no longer keeps that Instance alive.
+If an Instance has any additional logic or internal state to update after deleting its state
+document, it must extend its own lifetime by capturing a shared_ptr to itself by calling
+shared_from_this() before deleting its state document.
\ No newline at end of file
-- 
cgit v1.2.1