docs/primary_only_service.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106

# PrimaryOnlyService

The PrimaryOnlyService machinery provides a way to register tasks that should run only when current
node is Primary, and should be driven to completion across replica set failovers on the new
Primary. It is intended to be used by tasks that can be modeled as a state machine with a single
MongoDB document containing the current state, which newly-elected Primaries can use to rebuild the
state of the task after failover and pick up where the old Primary left off.

## Classes

There are three main classes/interfaces that make up the PrimaryOnlyService machinery.

### PrimaryOnlyServiceRegistry

The PrimaryOnlyServiceRegistry is a singleton that is installed as a decoration on the
ServiceContext at startup and lives for the lifetime of the mongod process.  During mongod global
startup, all PrimaryOnlyServices must be registered against the PrimaryOnlyServiceRegistry before
the ReplicationCoordinator is started up (as it is the ReplicationCoordinator startup that starts up
the registered PrimaryOnlyServices). Specific PrimaryOnlyServices can be looked up from the registry
at runtime, and are handed out by raw pointer, which is safe since the set of registered
PrimaryOnlyServices does not change during runtime.  The PrimaryOnlyServiceRegistry is itself a
[ReplicaSetAwareService](../src/mongo/db/repl/README.md#ReplicaSetAwareService-interface), which is
how it receives notifications about changes in and out of Primary state.

### PrimaryOnlyService

The PrimaryOnlyService interface is used to define a new Primary Only Service.  A PrimaryOnlyService
is a grouping of tasks (Instances) that run only when the node is Primary and are resumed after
failover.  Each PrimaryOnlyService must declare a unique, replicated collection (most likely in the
admin or config databases), where the state documents for all Instances of the service will be
persisted.  At stepUp, each PrimaryOnlyService will create and launch Instance objects for each
document found in this collection. This is how PrimaryOnlyService tasks get resumed after failover.


### PrimaryOnlyService::Instance/TypedInstance

The PrimaryOnlyService::Instance interface is used to contain the state and core logic for running a
single task belonging to a PrimaryOnlyService. The Instance interface includes a "run()" virtual
method which is provided an executor which is used to run all work that is done on behalf of the
Instance. Implementations should not extend PrimaryOnlyService::Instance directly, instead they
should extend PrimaryOnlyService::TypedInstance, which allows individual Instances to be looked up
and returned as pointers to the proper Instance sub-type. The InstanceID for an Instance is the _id
field of its state document.


## Defining a new PrimaryOnlyService

To define a new PrimaryOnlyService one must add corresponding subclasses of both PrimaryOnlyService
and PrimaryOnlyService::TypedInstance.  The PrimaryOnlyService subclass just exists to specify what
collection state documents for this service are stored in, and to hand out corresponding Instances
of the proper type.  Most of the work of a new PrimaryOnlyService will be implemented in the
PrimaryOnlyService::Instance subclass. PrimaryOnlyService::Instance subclasses will be responsible
for running the work they need to perform to complete their task, as well as for managing and
synchronizing their own in-memory and on-disk state. No part of the PrimaryOnlyService **machinery**
ever performs writes to the PrimaryOnlyService state document collections.  All writes to a given
Instance's state document (including creating it initially and deleting it when the work has been
completed) are performed by Instance implementations.  This means that for the majority of
PrimaryOnlyServices, the first step of its Instance's run() method will be to insert an initial
state document into the state document collection, to ensure that the Instance is now persisted and
will be resumed after failover.  When an Instance is resumed after failover, it is provided the
current version of the state document as it exists in the state document collection.  That document
can be used to rebuild the in-memory state for this Instance so that when run() is called it knows
what state it is in and thus what work still needs to be performed, and what work has already been
completed by the previous Primary.

To see an example bare-bones PrimaryOnlyService implementation to use as a reference, check out the
TestService defined in this unit test: https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/primary_only_service_test.cpp


## Behavior during state transitions

At stepUp, each PrimaryOnlyService queries its state document collection, and for each document
found, creates and launches a PrimaryOnlyService::Instance initialized off of the state
document. This happens asynchronously relative to the core replication stepUp process - there is no
guarantee that when stepUp completes and the RSTL lock is dropped that the PrimaryOnlyServices have
finished rebuilding all their Instances. At stepDown all Instances are interrupted, but the threads
running their work are not joined, and the Instance objects containing their in-memory state are not
released, until the next stepUp. This is done to reduce the likelihood of blocking within the state
transition process and delaying it for the entire node. This behavior does, however, guarantee that
there will never be two Instances of the same PrimaryOnlyService with the same InstanceID running at
the same time on the same node.

### Interrupting Instances at stepDown

At stepDown, there are 3 main ways that Instances are interrupted and we guarantee that no more work
is performed on behalf of any PrimaryOnlyServices.  The first is that the executor provided to each
Instance's run() method gets shut down, preventing any more work from being scheduled on behalf of
that Instance.  The second is that all OperationContexts created on threads (Clients) that are part
of an Executor owned by a PrimaryOnlyService get interrupted. The third is that each individual
Instance is explicitly interrupted, so that it can unblock any work running on threads that are
*not* a part of an executor owned by the PrimaryOnlyService that are dependent on that Instance
signaling them (e.g. commands that are waiting on the Instance to reach a certain state). Currently
this happens via a call to an interrupt() method that each Instance must override, but in the future
this is likely to change to signaling a CancellationToken owned by the Instance instead.

## Instance lifetime

Instances are held by shared_ptr in their parent PrimaryOnlyService. Each PrimaryOnlyService
releases all Instance shared_ptrs it owns on stepDown.  Additionally, a PrimaryOnlyService will
release an Instance shared_ptr when the state document for that Instance is deleted (via an
OpObserver).  Since generally speaking it is logic from an Instance's run() method that will be
responsible for deleting its state document, such logic needs to be careful as the moment the state
document is deleted, the corresponding PrimaryOnlyService is no longer keeping that Instance alive.
If an Instance has any additional logic or internal state to update after deleting its state
document, it must extend its own lifetime by capturing a shared_ptr to itself by calling
shared_from_this() before deleting its state document.