/**
* Copyright (C) 2018-present MongoDB, Inc.
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the Server Side Public License, version 1,
* as published by MongoDB, Inc.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* Server Side Public License for more details.
*
* You should have received a copy of the Server Side Public License
* along with this program. If not, see
* <http://www.mongodb.com/licensing/server-side-public-license>.
*
* As a special exception, the copyright holders give permission to link the
* code of portions of this program with the OpenSSL library under certain
* conditions as described in each individual source file and distribute
* linked combinations including the program with the OpenSSL library. You
* must comply with the Server Side Public License in all respects for
* all of the code used other than as permitted herein. If you modify file(s)
* with this exception, you may extend this exception to your version of the
* file(s), but you are not obligated to do so. If you do not wish to do so,
* delete this exception statement from your version. If you delete this
* exception statement from all source files in the program, then also delete
* it in the license file.
*/
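#pragma once

#include <boost/optional.hpp>
#include <exception>
#include <map>
#include <set>
#include <string>
#include <utility>
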
#include "mongo/base/status.h"
#include "mongo/db/jsobj.h"
#include "mongo/db/namespace_string.h"
#include "mongo/db/record_id.h"
#include "mongo/db/repl/optime.h"
#include "mongo/stdx/functional.h"
#include "mongo/util/time_support.h"
#include "mongo/util/uuid.h"
/**
* This rollback algorithm is used when the storage engine does not support recovering to a stable
* timestamp, or if the forceRollbackViaRefetch parameter is set to true.
*
* Rollback via Refetch Overview:
*
* Rollback occurs when a node's oplog diverges from its sync source's oplog and needs to regain
* consistency with the sync source's oplog.
*
* R and S are defined below to represent two nodes involved in rollback.
*
* R = The node whose oplog has diverged from its sync source and is rolling back.
* S = The sync source of node R.
*
* The rollback algorithm is designed to keep S's data and to make node R consistent with node S.
* One could argue that keeping R's data has some merit; however, in most cases S will have
* significantly more data. Also note that S's oplog may be a proper subset of R's oplog if
* S received no subsequent writes. Our goal is to get R back in sync with S.
*
* A visualization of what happens in the oplogs of the sync source and node that is rolling back
* is shown below. On the left side of each example are the oplog entries of the nodes before
* rollback occurs and on the right are the oplog entries of the node after rollback occurs.
* During rollback only the oplog entries of R change.
*
* #1: Status of R after operations e, f, and g are rolled back to the common point [d].
*     Since there were no other writes to node S after [d], we do not need to apply
*     more writes to node R after rolling back.
*
*         R : a b c d e f g    -> a b c d
*         S : a b c d
*
* #2: In this case, we first roll back to [d], and since S has written q to z oplog entries,
*     we need to replay these oplog operations onto R after it has rolled back to the common
*     point.
*
*         R : a b c d e f g    -> a b c d q r s t u v w x z
*         S : a b c d q r s t u v w x z
*
* Rollback via Refetch Algorithm:
*
* We will continue to use the notation of R as the node whose oplog is inconsistent with
* its sync source and S as the sync source of R. We will also represent the common point
* as point C.
*
* 1.  Increment rollback ID of node R.
* 2.  Find the most recent common oplog entry, which we will say is point C. In the above
*     example, the common point was oplog entry 'd'.
* 3.  Undo all the oplog entries that occurred after point C on the node R.
*     a.  Consider how to revert the oplog entries (e.g. for a create collection, drop the
*         collection) and place that information into a FixUpInfo struct.
*     b.  Cancel out unnecessary operations (e.g. if dropping a collection, there is no need
*         to do dropIndex if the index is within the collection that will eventually be
*         dropped).
*     c.  Undo all operations done on node R back to point C. We attempt to revert all data
*         and metadata until point C. However, if we need to refetch data from the sync
*         source, the data on node R will not be completely consistent with what it was
*         previously at point C, as some of the data may have been modified by the sync
*         source after the common point.
*         i.    Refetch any documents from node S that are needed for the rollback.
*         ii.   Find minValid, which is the last OpTime of node S, i.e. the last oplog
*               entry of node S at the time that rollback occurs.
*         iii.  Resync collection data and metadata.
*         iv.   Update minValid if necessary, as more fetching may have occurred from
*               the sync source requiring that minValid is updated to an even later
*               point.
*         v.    Drop all collections that were created after point C.
*         vi.   Drop all indexes that were created after point C.
*         vii.  Delete, update and insert necessary documents that were modified after
*               point C.
*         viii. Truncate the oplog to point C.
* 4.  After rolling back to point C, node R transitions from ROLLBACK to RECOVERING mode.
*
* Steps 5 and 6 occur in ordinary replication code and are not done in this file.
*
* 5.  Retrieve the oplog entries from node S until reaching the minValid oplog entry.
*     a.  Fetch the oplog entries from node S.
*     b.  Apply the oplog entries of node S to node R starting from point C up until
*         the minValid oplog entry.
* 6.  Transition node R from RECOVERING to SECONDARY state.
*/
namespace mongo {
class DBClientConnection;
class OperationContext;
namespace repl {
class OplogInterface;
class ReplicationCoordinator;
class ReplicationProcess;
class RollbackSource;
/**
* Entry point to rollback process.
* Set state to ROLLBACK while we are in this function. This prevents serving reads, even from
* the oplog. This can fail if we are elected PRIMARY, in which case we must not roll back.
* If we successfully enter ROLLBACK, we will only exit this function fatally or after
* transition to RECOVERING.
*
* 'sleepSecsFn' is an optional testing-only argument for overriding mongo::sleepsecs().
*/
void rollback(OperationContext* opCtx,
              const OplogInterface& localOplog,
              const RollbackSource& rollbackSource,
              int requiredRBID,
              ReplicationCoordinator* replCoord,
              ReplicationProcess* replicationProcess,
              stdx::function<void(int)> sleepSecsFn = [](int secs) { sleepsecs(secs); });

/**
* Initiates the rollback process after transition to ROLLBACK.
* This function assumes the preconditions for undertaking rollback have already been met;
* we have ops in our oplog that our sync source does not have, and we are not currently
* PRIMARY.
*
* This function can throw exceptions on failures.
* This function runs a command on the sync source to detect if the sync source rolls back
* while our rollback is in progress.
*
* @param opCtx: Used to read and write from this node's databases.
* @param localOplog: Reads the oplog on this server.
* @param rollbackSource: Interface for the sync source. Provides the oplog and
*                        supports fetching documents and copying collections.
* @param requiredRBID: Rollback ID we are required to have throughout rollback.
* @param replCoord: Used to track the rollback ID and to change the follower state.
* @param replicationProcess: Used to update minValid.
*
* If requiredRBID is supplied, we error if the upstream node has a different RBID (i.e. it rolled
* back) after fetching any information from it.
*
* Failures: If a Status with code UnrecoverableRollbackError is returned, the caller must exit
* fatally. All other errors should be considered recoverable regardless of whether reported as a
* status or exception.
*/
Status syncRollback(OperationContext* opCtx,
                    const OplogInterface& localOplog,
                    const RollbackSource& rollbackSource,
                    int requiredRBID,
                    ReplicationCoordinator* replCoord,
                    ReplicationProcess* replicationProcess);

/*
Rollback function flowchart:

1.  rollback() called.
    a.  syncRollback() called by rollback().
        i.  _syncRollback() called by syncRollback().
            I.   syncRollbackLocalOperations() called by _syncRollback().
                 A.  processOperationFixUp() called by syncRollbackLocalOperations().
                     1.  updateFixUpInfoFromLocalOplogEntry() called by
                         processOperationFixUp().
            II.  removeRedundantOperations() called by _syncRollback().
            III. syncFixUp() called by _syncRollback().
                 1.  Retrieves documents to refetch.
                 2.  Checks the rollback ID and updates minValid.
                 3.  Resyncs collection data and metadata.
                 4.  Checks the rollback ID and updates minValid.
                 5.  Drops collections.
                 6.  Drops indexes.
                 7.  Deletes, updates and inserts individual documents.
                 8.  Truncates the oplog.
            IV.  Returns to syncRollback().
        ii. Returns to rollback().
    b.  Rollback ends.
*/

/**
* This namespace contains internal details of the rollback system. It is only exposed in a header
* for unit testing. Nothing here should be used outside of rs_rollback.cpp or its unit test.
*/
namespace rollback_internal {
struct DocID {
    BSONObj ownedObj;
    StringData ns;
    BSONElement _id;
    UUID uuid;

    DocID(BSONObj obj, BSONElement id, UUID ui)
        : ownedObj(obj), ns(obj.getStringField("ns")), _id(id), uuid(ui) {}

    bool operator<(const DocID& other) const;
    bool operator==(const DocID& other) const;

    static DocID minFor(UUID uuid) {
        auto obj = BSON("" << MINKEY);
        return DocID(obj, obj.firstElement(), uuid);
    }

    static DocID maxFor(UUID uuid) {
        auto obj = BSON("" << MAXKEY);
        return DocID(obj, obj.firstElement(), uuid);
    }
};

struct RenameCollectionInfo {
    // The renameFrom and renameTo fields are the fields necessary to roll back the original
    // renameCollection operation. For example, if the original command was test.x -> test.y,
    // renameFrom would be "test.y" and renameTo would be "test.x".
    NamespaceString renameFrom;
    NamespaceString renameTo;
};

struct FixUpInfo {
    // Note this is a set -- if there are many $inc's on a single document we need to roll back,
    // we only need to refetch it once.
    std::set<DocID> docsToRefetch;

    // UUID of collections that need to be dropped.
    stdx::unordered_set<UUID, UUID::Hash> collectionsToDrop;

    // Key is the UUID of the collection. Value is the set of index names to drop for each
    // collection.
    stdx::unordered_map<UUID, std::set<std::string>, UUID::Hash> indexesToDrop;

    // Key is the UUID of the collection. Value is a map from indexName to indexSpec for the index.
    stdx::unordered_map<UUID, std::map<std::string, BSONObj>, UUID::Hash> indexesToCreate;

    // UUIDs of collections that need to have their metadata resynced from the sync source.
    stdx::unordered_set<UUID, UUID::Hash> collectionsToResyncMetadata;

    // Map of collections to rename. The key is the UUID of the collection and
    // the value is a RenameCollectionInfo that contains the current namespace of the
    // collection and the namespace to rename it to. Among the collections that need to
    // be renamed, it is possible that a collection was 2-phase dropped and needs to be
    // renamed from its drop pending namespace to its original namespace.
    stdx::unordered_map<UUID, RenameCollectionInfo, UUID::Hash> collectionsToRename;

    // When collections are dropped, they are added to a list of drop-pending collections. We keep
    // the OpTime and the namespace of the collection because the DropPendingCollectionReaper
    // does not store the original name or UUID of the collection.
    stdx::unordered_map<UUID, std::pair<OpTime, NamespaceString>, UUID::Hash>
        collectionsToRemoveFromDropPendingCollections;

    // The UUID of the transactions collection. Set at the beginning of rollback.
    boost::optional<UUID> transactionTableUUID = boost::none;

    // True if rollback requires re-fetching documents in the session transaction table. If true,
    // after rollback the in-memory transaction table is cleared.
    bool refetchTransactionDocs = false;

    OpTime commonPoint;
    RecordId commonPointOurDiskloc;

    /**
     * Remote server's current rollback id. Keeping track of this
     * allows us to determine if the sync source has rolled back, in which case
     * we can terminate the rollback of the local node, as we cannot
     * roll back against a sync source that has also rolled back.
     */
    int rbid;

    /**
     * Removes all documents in the docsToRefetch set that are in
     * the collection passed into the function.
     */
    void removeAllDocsToRefetchFor(UUID uuid);

    /**
     * Removes any redundant operations that may have happened during
     * the period of time that the rolling back node was out of sync
     * with its sync source. For example, if a collection is dropped, there is
     * no need to also drop the indexes that are part of the collection. This
     * function removes any operations that were recorded that are unnecessary
     * because the collection that the operation is part of is either going
     * to be dropped, or fully resynced.
     */
    void removeRedundantOperations();

    /**
     * Removes any redundant index commands. For example, if we create an index with name "a_1" and
     * then later proceed to drop that index, we can ignore the first index creation. We return true
     * if a redundant index command was removed and false if it was not.
     */
    bool removeRedundantIndexCommands(UUID uuid, std::string indexName);

    /**
     * We roll back all collection drops in two steps. We need to rename the collection
     * from its drop pending namespace back to its original namespace. Additionally,
     * we need to remove the collection from the list of drop-pending collections
     * in the DropPendingCollectionReaper. This function records the necessary information
     * into the fixUpInfo struct to undo the drop collection command.
     */
    void recordRollingBackDrop(const NamespaceString& nss, OpTime opTime, UUID uuid);

    /**
     * When we roll back a renameCollection that had previously set dropTarget to a UUID,
     * we record the necessary information for undoing the collection drop by adding
     * the collection which was dropped into collectionsToRemoveFromDropPendingCollections
     * and collectionsToRename by calling recordRollingBackDrop().
     */
    Status recordDropTargetInfo(const BSONElement& dropTarget, const BSONObj& obj, OpTime opTime);
};

// Indicates that rollback cannot complete and the server must abort.
class RSFatalException : public std::exception {
public:
    RSFatalException(std::string m = "replica set fatal exception") : msg(m) {}
    virtual const char* what() const throw() {
        return msg.c_str();
    }

private:
    std::string msg;
};

/**
* This function goes through a single oplog document of the node and records the necessary
* information in order to undo the given oplog entry. The data is placed into a FixUpInfo
* struct that holds all the necessary information to undo all of the oplog entries of the
* rolling back node from after the common point. "ourObj" is the oplog document that needs
* to be reverted.
*/
Status updateFixUpInfoFromLocalOplogEntry(FixUpInfo& fixUpInfo,
                                          const BSONObj& ourObj,
                                          bool isNestedApplyOpsCommand);

/**
* This function uses the FixUpInfo struct to undo all of the operations that occurred after the
* common point on the rolling back node, checking the rollback ID and updating minValid as
* necessary. This includes refetching, updating, and deleting individual documents, resyncing
* collection data and metadata, and dropping and creating collections and indexes. Truncates the
* oplog and triggers necessary in-memory refreshes before returning.
*/
void syncFixUp(OperationContext* opCtx,
               const FixUpInfo& fixUpInfo,
               const RollbackSource& rollbackSource,
               ReplicationCoordinator* replCoord,
               ReplicationProcess* replicationProcess);

} // namespace rollback_internal
} // namespace repl
} // namespace mongo
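
The two oplog diagrams in the overview comment (examples #1 and #2) can be reproduced with a toy model. What follows is a minimal sketch and not the code in rs_rollback.cpp. It assumes each oplog is a string whose characters stand for individual entries, and that the two oplogs agree up to the common point. `ToyOplog`, `commonPrefixLength`, and `rollBackAndCatchUp` are hypothetical names introduced only for illustration. According to the comment above, the real algorithm also refetches documents and resyncs collections from the sync source, which this model ignores.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical toy model: each character stands for one oplog entry.
using ToyOplog = std::string;

// Returns how many leading entries R and S share. In this simplified model the
// entry just before that index plays the role of the common point C.
std::size_t commonPrefixLength(const ToyOplog& r, const ToyOplog& s) {
    auto firstDifference = std::mismatch(r.begin(), r.end(), s.begin(), s.end());
    return static_cast<std::size_t>(firstDifference.first - r.begin());
}

// Mimics steps 3 through 5 of the overview on the toy model: truncate R to the
// common point, then replay every entry S wrote after that point.
ToyOplog rollBackAndCatchUp(ToyOplog r, const ToyOplog& s) {
    const std::size_t common = commonPrefixLength(r, s);
    r.resize(common);                         // Drop the diverged entries (e, f, g).
    r.append(s, common, std::string::npos);   // Apply S's entries after point C.
    return r;
}
```

With R = "abcdefg" and S = "abcd" the function returns "abcd", matching example #1. With S = "abcdqrstuvwxz" it returns S's full oplog, matching example #2.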
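
DocID::minFor() and DocID::maxFor() build sentinel keys from MINKEY and MAXKEY, which suggests a range erase over the ordered docsToRefetch set when removeAllDocsToRefetchFor() is called. The sketch below illustrates that general technique under stated assumptions; it is not the actual implementation, which this header does not show. `ToyDocId`, its integer fields, and `removeAllFor` are hypothetical stand-ins, and the ordering (collection first, then id) is an assumption, since the header only declares operator<.

```cpp
#include <cstddef>
#include <iterator>
#include <limits>
#include <set>
#include <tuple>

// Hypothetical stand-in for DocID: an int replaces the collection UUID and
// another int replaces the BSON _id element.
struct ToyDocId {
    int collection;
    int id;

    // Assumed ordering: by collection first, then by id within the collection.
    bool operator<(const ToyDocId& other) const {
        return std::tie(collection, id) < std::tie(other.collection, other.id);
    }

    // Sentinels that sort at or before / at or after every id in a collection.
    static ToyDocId minFor(int collection) {
        return {collection, std::numeric_limits<int>::min()};
    }
    static ToyDocId maxFor(int collection) {
        return {collection, std::numeric_limits<int>::max()};
    }
};

// Erases every entry that belongs to the given collection and returns how many
// entries were removed. Both bounds are found in logarithmic time.
std::size_t removeAllFor(std::set<ToyDocId>& docs, int collection) {
    auto first = docs.lower_bound(ToyDocId::minFor(collection));
    auto last = docs.upper_bound(ToyDocId::maxFor(collection));
    const auto removed = static_cast<std::size_t>(std::distance(first, last));
    docs.erase(first, last);
    return removed;
}
```

Because the set is sorted, all entries for one collection are contiguous, so the sentinels bracket exactly that collection's range without scanning the rest of the set.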