2 files changed, 131 insertions, 1 deletions
diff --git a/docs/primary_only_service.md b/docs/primary_only_service.md
new file mode 100644
index 00000000000..36660b141a1
--- /dev/null
+++ b/docs/primary_only_service.md
@@ -0,0 +1,106 @@
+# PrimaryOnlyService
+
+The PrimaryOnlyService machinery provides a way to register tasks that should run only when the
+current node is Primary, and should be driven to completion across replica set failovers on the new
+Primary. It is intended to be used by tasks that can be modeled as a state machine with a single
+MongoDB document containing the current state, which newly-elected Primaries can use to rebuild the
+state of the task after failover and pick up where the old Primary left off.
+
+## Classes
+
+There are three main classes/interfaces that make up the PrimaryOnlyService machinery.
+
+### PrimaryOnlyServiceRegistry
+
+The PrimaryOnlyServiceRegistry is a singleton that is installed as a decoration on the
+ServiceContext at startup and lives for the lifetime of the mongod process.  During mongod global
+startup, all PrimaryOnlyServices must be registered against the PrimaryOnlyServiceRegistry before
+the ReplicationCoordinator is started up (as it is the ReplicationCoordinator startup that starts up
+the registered PrimaryOnlyServices). Specific PrimaryOnlyServices can be looked up from the registry
+at runtime, and are handed out by raw pointer, which is safe since the set of registered
+PrimaryOnlyServices does not change during runtime.  The PrimaryOnlyServiceRegistry is itself a
+[ReplicaSetAwareService](../src/mongo/db/repl/README.md#ReplicaSetAwareService-interface), which is
+how it receives notifications about changes in and out of Primary state.
+
+### PrimaryOnlyService
+
+The PrimaryOnlyService interface is used to define a new Primary Only Service.  A PrimaryOnlyService
+is a grouping of tasks (Instances) that run only when the node is Primary and are resumed after
+failover.  Each PrimaryOnlyService must declare a unique, replicated collection (most likely in the
+admin or config databases), where the state documents for all Instances of the service will be
+persisted.  At stepUp, each PrimaryOnlyService will create and launch Instance objects for each
+document found in this collection. This is how PrimaryOnlyService tasks get resumed after failover.
+
+
+### PrimaryOnlyService::Instance/TypedInstance
+
+The PrimaryOnlyService::Instance interface is used to contain the state and core logic for running a
+single task belonging to a PrimaryOnlyService. The Instance interface includes a "run()" virtual
+method which is provided an executor which is used to run all work that is done on behalf of the
+Instance. Implementations should not extend PrimaryOnlyService::Instance directly; instead, they
+should extend PrimaryOnlyService::TypedInstance, which allows individual Instances to be looked up
+and returned as pointers to the proper Instance sub-type. The InstanceID for an Instance is the _id
+field of its state document.
+
+
+## Defining a new PrimaryOnlyService
+
+To define a new PrimaryOnlyService one must add corresponding subclasses of both PrimaryOnlyService
+and PrimaryOnlyService::TypedInstance.  The PrimaryOnlyService subclass just exists to specify what
+collection state documents for this service are stored in, and to hand out corresponding Instances
+of the proper type.  Most of the work of a new PrimaryOnlyService will be implemented in the
+PrimaryOnlyService::Instance subclass. PrimaryOnlyService::Instance subclasses will be responsible
+for running the work they need to perform to complete their task, as well as for managing and
+synchronizing their own in-memory and on-disk state. No part of the PrimaryOnlyService **machinery**
+ever performs writes to the PrimaryOnlyService state document collections.  All writes to a given
+Instance's state document (including creating it initially and deleting it when the work has been
+completed) are performed by Instance implementations.  This means that for the majority of
+PrimaryOnlyServices, the first step of an Instance's run() method will be to insert an initial
+state document into the state document collection, to ensure that the Instance is now persisted and
+will be resumed after failover.  When an Instance is resumed after failover, it is provided the
+current version of the state document as it exists in the state document collection.  That document
+can be used to rebuild the in-memory state for this Instance so that when run() is called it knows
+what state it is in and thus what work still needs to be performed, and what work has already been
+completed by the previous Primary.
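+
+The resume flow described above can be sketched with a toy model. This is not the real API: every
+name below (`ToyInstance`, `StateDoc`, `Collection`, the `State` values, `demoResumeAfterFailover`)
+is hypothetical, and a `std::map` stands in for the replicated state document collection. The
+sketch only illustrates that the Instance, not the machinery, performs every write to its state
+document, and that a resumed Instance skips work the previous Primary already completed.
+
+```cpp
+#include <cassert>
+#include <map>
+#include <optional>
+#include <string>
+#include <utility>
+#include <vector>
+
+// Hypothetical stand-ins: none of these names are part of the real machinery.
+enum class State { kStarted, kCopying, kDone };
+
+struct StateDoc {
+    std::string id;  // plays the role of the _id field / InstanceID
+    State state;
+};
+
+// Stand-in for the replicated state document collection.
+using Collection = std::map<std::string, StateDoc>;
+
+class ToyInstance {
+public:
+    // 'resumed' is the state document found at stepUp, or nullopt for a brand-new Instance.
+    ToyInstance(std::string id, std::optional<StateDoc> resumed)
+        : _id(std::move(id)),
+          _persisted(resumed.has_value()),
+          _state(resumed ? resumed->state : State::kStarted) {}
+
+    // Performs at most 'maxSteps' transitions and returns the names of the steps it performed.
+    std::vector<std::string> run(Collection& coll, int maxSteps) {
+        std::vector<std::string> work;
+        if (!_persisted) {
+            // First step for a new Instance: persist the initial state document.
+            coll.emplace(_id, StateDoc{_id, _state});
+            _persisted = true;
+            work.push_back("insert");
+        }
+        assert(coll.count(_id) == 1);
+        for (int i = 0; i < maxSteps; ++i) {
+            switch (_state) {
+                case State::kStarted:
+                    _state = State::kCopying;
+                    coll.at(_id).state = _state;  // the Instance persists its own transition
+                    work.push_back("copy");
+                    break;
+                case State::kCopying:
+                    _state = State::kDone;
+                    coll.at(_id).state = _state;
+                    work.push_back("finish");
+                    break;
+                case State::kDone:
+                    coll.erase(_id);  // the Instance deletes its own state document
+                    work.push_back("delete");
+                    return work;
+            }
+        }
+        return work;
+    }
+
+private:
+    std::string _id;
+    bool _persisted;
+    State _state;
+};
+
+// Simulates a failover after one transition, then resumes from the persisted document.
+std::vector<std::string> demoResumeAfterFailover() {
+    Collection coll;
+    ToyInstance first("task-1", std::nullopt);
+    first.run(coll, 1);  // performs "insert" and "copy", then the old Primary steps down
+
+    ToyInstance resumed("task-1", coll.at("task-1"));  // rebuilt from the state document
+    return resumed.run(coll, 10);
+}
+```
+
+In this sketch `demoResumeAfterFailover()` returns only the remaining steps, "finish" and "delete",
+because the resumed Instance starts from the persisted `kCopying` state rather than from scratch.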
+
+To see an example bare-bones PrimaryOnlyService implementation to use as a reference, check out the
+TestService defined in this unit test: https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/primary_only_service_test.cpp
+
+
+## Behavior during state transitions
+
+At stepUp, each PrimaryOnlyService queries its state document collection, and for each document
+found, creates and launches a PrimaryOnlyService::Instance initialized off of the state
+document. This happens asynchronously relative to the core replication stepUp process - there is no
+guarantee that when stepUp completes and the RSTL lock is dropped that the PrimaryOnlyServices have
+finished rebuilding all their Instances. At stepDown all Instances are interrupted, but the threads
+running their work are not joined, and the Instance objects containing their in-memory state are not
+released, until the next stepUp. This is done to reduce the likelihood of blocking within the state
+transition process and delaying it for the entire node. This behavior does, however, guarantee that
+there will never be two Instances of the same PrimaryOnlyService with the same InstanceID running at
+the same time on the same node.
+
+### Interrupting Instances at stepDown
+
+At stepDown, Instances are interrupted in three main ways, which together guarantee that no more work
+is performed on behalf of any PrimaryOnlyService.  The first is that the executor provided to each
+Instance's run() method gets shut down, preventing any more work from being scheduled on behalf of
+that Instance.  The second is that all OperationContexts created on threads (Clients) that are part
+of an Executor owned by a PrimaryOnlyService get interrupted. The third is that each individual
+Instance is explicitly interrupted, so that it can unblock any work running on threads that are
+*not* a part of an executor owned by the PrimaryOnlyService that are dependent on that Instance
+signaling them (e.g. commands that are waiting on the Instance to reach a certain state). Currently
+this happens via a call to an interrupt() method that each Instance must override, but in the future
+this is likely to change to signaling a CancelationToken owned by the Instance instead.
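+
+As a rough illustration of the third mechanism, the sketch below shows an interrupt() that unblocks
+a thread waiting on an Instance. It is a toy model, not the real interface: `ToyInstance`,
+`waitUntilReady`, `markReady`, and `demoInterruptUnblocksWaiter` are hypothetical names, and a
+plain mutex and condition variable stand in for whatever signaling a real Instance uses.
+
+```cpp
+#include <condition_variable>
+#include <mutex>
+#include <thread>
+
+// Hypothetical stand-in for an Instance that other threads (e.g. commands) wait on.
+class ToyInstance {
+public:
+    // Blocks until the Instance becomes ready or is interrupted.
+    // Returns true only if the ready state was reached.
+    bool waitUntilReady() {
+        std::unique_lock<std::mutex> lk(_mutex);
+        _cv.wait(lk, [this] { return _ready || _interrupted; });
+        return _ready;
+    }
+
+    // Called by the Instance's own work when it reaches the awaited state.
+    void markReady() {
+        {
+            std::lock_guard<std::mutex> lk(_mutex);
+            _ready = true;
+        }
+        _cv.notify_all();
+    }
+
+    // Called at stepDown: wakes every waiter so no thread stays blocked on this Instance.
+    void interrupt() {
+        {
+            std::lock_guard<std::mutex> lk(_mutex);
+            _interrupted = true;
+        }
+        _cv.notify_all();
+    }
+
+private:
+    std::mutex _mutex;
+    std::condition_variable _cv;
+    bool _ready = false;
+    bool _interrupted = false;
+};
+
+// A waiter thread that is not part of any service-owned executor gets unblocked by interrupt().
+bool demoInterruptUnblocksWaiter() {
+    ToyInstance instance;
+    bool reachedReady = true;
+    std::thread waiter([&] { reachedReady = instance.waitUntilReady(); });
+    instance.interrupt();
+    waiter.join();
+    return !reachedReady;  // the waiter returned without the Instance ever becoming ready
+}
+```
+
+The predicate passed to `wait` makes the outcome independent of thread scheduling: the waiter
+returns whether it started waiting before or after interrupt() was called.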
+
+## Instance lifetime
+
+Instances are held by shared_ptr in their parent PrimaryOnlyService. Each PrimaryOnlyService
+releases all Instance shared_ptrs it owns on stepDown.  Additionally, a PrimaryOnlyService will
+release an Instance shared_ptr when the state document for that Instance is deleted (via an
+OpObserver).  Because it is generally logic in an Instance's run() method that deletes its state
+document, that logic must be careful: the moment the state document is deleted, the corresponding
+PrimaryOnlyService is no longer keeping that Instance alive.
+If an Instance has any additional logic or internal state to update after deleting its state
+document, it must extend its own lifetime by capturing a shared_ptr to itself by calling
+shared_from_this() before deleting its state document.
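+
+The lifetime-extension pattern can be sketched in isolation. The classes below are hypothetical
+stand-ins (`ToyService`, `ToyInstance`, `onStateDocumentDeleted`, `demoLifetimeExtension` are not
+real names), with a direct method call playing the role of the OpObserver notification. Only
+`std::enable_shared_from_this` and `shared_from_this()` are the actual standard library facilities.
+
+```cpp
+#include <map>
+#include <memory>
+#include <string>
+#include <utility>
+
+class ToyInstance;
+
+// Hypothetical stand-in for a PrimaryOnlyService that owns its Instances by shared_ptr.
+class ToyService {
+public:
+    std::shared_ptr<ToyInstance> launch(const std::string& id);
+
+    // Stand-in for the OpObserver hook: drops the service's reference to the Instance.
+    void onStateDocumentDeleted(const std::string& id) { _instances.erase(id); }
+
+    bool owns(const std::string& id) const { return _instances.count(id) != 0; }
+
+private:
+    std::map<std::string, std::shared_ptr<ToyInstance>> _instances;
+};
+
+class ToyInstance : public std::enable_shared_from_this<ToyInstance> {
+public:
+    ToyInstance(ToyService* service, std::string id) : _service(service), _id(std::move(id)) {}
+
+    // Final step of the task: delete the state document, then update in-memory state.
+    bool finish() {
+        // Take a reference to ourselves before the delete, so this object outlives the
+        // service releasing its shared_ptr.
+        auto self = shared_from_this();
+        _service->onStateDocumentDeleted(_id);  // the service's reference is gone now
+        _done = true;                           // safe only because 'self' keeps us alive
+        return _done;
+    }
+
+private:
+    ToyService* _service;
+    std::string _id;
+    bool _done = false;
+};
+
+inline std::shared_ptr<ToyInstance> ToyService::launch(const std::string& id) {
+    auto instance = std::make_shared<ToyInstance>(this, id);
+    _instances[id] = instance;
+    return instance;
+}
+
+// Runs finish() while the service is the sole owner of the Instance.
+bool demoLifetimeExtension() {
+    ToyService service;
+    std::weak_ptr<ToyInstance> observer = service.launch("task-1");
+    ToyInstance* raw = observer.lock().get();  // no owning reference is kept here
+    bool finished = raw->finish();             // 'raw' must not be used after this call
+    return finished && !service.owns("task-1") && observer.expired();
+}
+```
+
+Without the `self` reference in `finish()`, the write to `_done` would touch an object that was
+destroyed when the service erased its shared_ptr.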
+\ No newline at end of file
diff --git a/src/mongo/db/repl/README.md b/src/mongo/db/repl/README.md
index d8db94f17ae..58370de7a2f 100644
--- a/src/mongo/db/repl/README.md
+++ b/src/mongo/db/repl/README.md
@@ -1975,4 +1975,28 @@ special case during rollback it is possible for the `stableTimestamp` to move ba
 
 The calculation of this value in the replication layer occurs [here](https://github.com/mongodb/mongo/blob/00fbc981646d9e6ebc391f45a31f4070d4466753/src/mongo/db/repl/replication_coordinator_impl.cpp#L4824-L4881).
 The replication layer will [skip setting the stable timestamp](https://github.com/mongodb/mongo/blob/00fbc981646d9e6ebc391f45a31f4070d4466753/src/mongo/db/repl/replication_coordinator_impl.cpp#L4907-L4921) if it is earlier than the
-`initialDataTimestamp`, since data earlier than that timestamp may be inconsistent.
-\ No newline at end of file
+`initialDataTimestamp`, since data earlier than that timestamp may be inconsistent.
+
+# Non-replication subsystems dependent on replication state transitions
+
+The replication machinery provides two different APIs for mongod subsystems to receive notifications
+about replication state transitions. The first, simpler API is the ReplicaSetAwareService interface.
+The second, more sophisticated but also more prescriptive API is the PrimaryOnlyService interface.
+
+## ReplicaSetAwareService interface
+
+The ReplicaSetAwareService interface provides simple hooks to receive notifications on transitions
+into and out of the Primary state. By extending ReplicaSetAwareService and overriding its virtual
+methods, it is possible to get notified every time the current mongod node steps up or steps down.
+Because the onStepUp and onStepDown methods of ReplicaSetAwareServices are called inline as part of
+the stepUp and stepDown processes, while the RSTL is held, ReplicaSetAwareService subclasses should
+strive to do as little work as possible in the bodies of these methods, and should avoid performing
+blocking I/O, as all work performed in these methods delays the replica set state transition for the
+entire node, which can result in longer periods of write unavailability for the replica set.
+
+## PrimaryOnlyService interface
+
+The PrimaryOnlyService interface is more sophisticated than the ReplicaSetAwareService interface and
+is designed specifically for services built on persistent state machines that must be driven to
+conclusion by the Primary node of the replica set, even across failovers.  Check out [this
+document](../../../../docs/primary_only_service.md) for more information about PrimaryOnlyServices.
+\ No newline at end of file