<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Master leases</title>
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
<link rel="up" href="rep.html" title="Chapter 12. Berkeley DB Replication" />
<link rel="prev" href="rep_trans.html" title="Transactional guarantees" />
<link rel="next" href="rep_ryw.html" title="Read your writes consistency" />
</head>
<body>
<div xmlns="" class="navheader">
<div class="libver">
<p>Library Version 12.1.6.1</p>
</div>
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">Master leases</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href="rep_trans.html">Prev</a> </td>
<th width="60%" align="center">Chapter 12. Berkeley DB Replication </th>
<td width="20%" align="right"> <a accesskey="n" href="rep_ryw.html">Next</a></td>
</tr>
</table>
<hr />
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="rep_lease"></a>Master leases</h2>
</div>
</div>
</div>
<div class="toc">
<dl>
<dt>
<span class="sect2">
<a href="rep_lease.html#masterlease_change_groupsize">Changing group
size</a>
</span>
</dt>
</dl>
</div>
<p>
Some applications have strict requirements about the
consistency of data read on a master site. Berkeley DB
provides a mechanism called master leases to provide such
consistency. Without master leases, unfortunate scheduling
can sometimes cause Berkeley DB to return old data to an
application even though newer data is available, as
illustrated below:
</p>
<div class="orderedlist">
<ol type="1">
<li><span class="bold"><strong>Application on master
site</strong></span>: Read data item
<span class="emphasis"><em>foo</em></span> via Berkeley DB <a href="../api_reference/C/dbget.html" class="olink">DB->get()</a> or
<a href="../api_reference/C/dbcget.html" class="olink">DBC->get()</a> call.
</li>
<li><span class="bold"><strong>Application on master
site</strong></span>: sleep, get descheduled, etc.
</li>
<li><span class="bold"><strong>System</strong></span>: Master changes
role, becomes a client.
</li>
<li><span class="bold"><strong>System</strong></span>: New site is
elected master.
</li>
<li><span class="bold"><strong>System</strong></span>: New master
modifies data item <span class="emphasis"><em>foo</em></span>.
</li>
<li><span class="bold"><strong>Application</strong></span>: Berkeley DB
returns old data for <span class="emphasis"><em>foo</em></span> to
application.
</li>
</ol>
</div>
<p>
By using master leases, Berkeley DB can provide guarantees
about the consistency of data read on a master site. The
master site can be considered a recognized authority for the
data and consequently can provide authoritative reads. Clients
grant master leases to a master site. By doing so, clients
acknowledge the right of that site to retain the role of
master for a period of time. During that period of time,
clients cannot elect a new master, become master, or grant
their lease to another site.
</p>
<p>
By holding a collection of granted leases, a master site
can guarantee to the application that the data returned is the
current, authoritative value. As a master performs operations,
it continually requests updated grants from the clients. When
a read operation is required, the master guarantees that it
holds a valid collection of lease grants from clients before
returning data to the application. By holding leases, Berkeley
DB provides several guarantees to the application:
</p>
<div class="orderedlist">
<ol type="1">
<li>
Authoritative reads: A guarantee that the data
being read by the application is the current value.
</li>
<li>
<p>
Durability from rollback: A guarantee that the data
being written or read by the application is permanent
across a majority of client sites and will never be
rolled back.
</p>
<p>
The rollback guarantee also depends on the
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag. The guarantee is effective as
long as there isn't a failure of half of the
replication group while clients have granted leases
but are holding the updates in their cache. The
application must weigh the performance impact of
synchronous transactions against the risk of the
failure of at least half of the replication group. If
clients grant a lease while holding updated data in
cache, and failure occurs, then the data is no longer
present on the clients and rollback can occur if a
sufficient number of other sites also crash.
</p>
<p>
The guarantee that data will not be rolled back
applies only to data successfully committed on a
master. Data read on a client, or read while ignoring
leases, can be rolled back.
</p>
</li>
<li>
<p>
Freshness: A guarantee that the data being read by
the application on the <span class="emphasis"><em>master</em></span> is
up-to-date and has not been modified or removed during
the read.
</p>
<p>
Read authority exists only on the master. Read
operations on a client always ignore leases and
can consequently return stale data.
</p>
</li>
<li>
<p>
Master viability: A guarantee that a current master
with valid leases cannot encounter a duplicate master
situation.
</p>
<p>
Leases remove the possibility of a duplicate master
situation that forces the current master to downgrade
to a client. However, it is still possible that old
masters with expired leases can discover a later
master and return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_DUPMASTER" class="olink">DB_REP_DUPMASTER</a> to the
application.
</p>
</li>
</ol>
</div>
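<p>
The lease check described above can be pictured with a small
model. This is a minimal sketch under stated assumptions, not
Berkeley DB's internal implementation: the
<code class="literal">lease_grant</code> type and both functions
are hypothetical names, times are plain microsecond counters,
and the master is assumed to count itself toward the majority.
</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not Berkeley DB internals: one entry per client
 * site, holding the time (in microseconds) at which its grant expires. */
typedef struct {
    uint64_t expires_at_usec;   /* 0 means no lease has been granted yet */
} lease_grant;

/* Count the client grants that are still unexpired at now_usec. */
size_t valid_grants(const lease_grant *grants, size_t nclients,
                    uint64_t now_usec)
{
    size_t valid = 0;
    for (size_t i = 0; i < nclients; i++) {
        if (grants[i].expires_at_usec > now_usec)
            valid++;
    }
    return valid;
}

/* Assumption: a read is authoritative only when the master plus the
 * clients with unexpired grants form a strict majority of nsites. */
bool read_is_authoritative(const lease_grant *grants, size_t nclients,
                           size_t nsites, uint64_t now_usec)
{
    assert(nsites > 0 && nclients < nsites);
    size_t holders = 1 + valid_grants(grants, nclients, now_usec);
    return holders > nsites / 2;
}
```

<p>
In this model a five-site group needs two unexpired client
grants before the master may return data as authoritative.
Once the grants expire, the check fails until the clients
refresh them.
</p>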
<p>
There are several requirements of the application using
leases:
</p>
<div class="orderedlist">
<ol type="1">
<li>
Replication Manager applications must configure a
majority (or larger) acknowledgement policy via the
<a href="../api_reference/C/repmgrset_ack_policy.html" class="olink">DB_ENV->repmgr_set_ack_policy()</a> method. Base API applications must
implement and enforce such a policy on their own.
</li>
<li>
Base API applications must return an error from the
send callback function when the majority acknowledgement
policy is not met for permanent records marked with
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>. Note that the Replication Manager
automatically fulfills this requirement.
</li>
<li>
Base API applications must set the number of sites
in the group using the <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV->rep_set_nsites()</a> method before starting
replication and cannot change it during operation.
</li>
<li>
Using leases in a replication group is all or none.
Behavior is undefined when some sites configure leases and
others do not. Use the <a href="../api_reference/C/repconfig.html" class="olink">DB_ENV->rep_set_config()</a> method to turn on
leases.
</li>
<li>
The configured lease timeout value must be the same
on all sites in a replication group, set via the
<a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV->rep_set_timeout()</a> method.
</li>
<li>
The configured clock skew ratio must be the same on
all sites in a replication group. This value defaults to
no skew, but can be set via the <a href="../api_reference/C/repclockskew.html" class="olink">DB_ENV->rep_set_clockskew()</a> method.
</li>
<li>
Applications that care about read guarantees must
perform all read operations on the master. Reading on a
client does not guarantee freshness.
</li>
<li>
The application must use elections to choose a
master site. It must never simply declare a master without
having won an election (as is allowed without master
leases).
</li>
<li>
Unelectable (zero priority) sites never grant
leases and cannot be used to guarantee data durability. A
majority of sites in the replication group must be
electable in order to meet the requirement of getting
lease grants from a majority of sites. Minimizing the
number of unelectable sites improves replication group
availability.
</li>
</ol>
</div>
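<p>
The majority and electability requirements above reduce to
simple arithmetic. The helpers below are a hedged sketch:
all function names are hypothetical, and the code assumes
that the master counts itself toward the majority and that
unelectable sites never contribute a grant.
</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Smallest number of sites that forms a strict majority of the group. */
unsigned majority_of(unsigned nsites)
{
    assert(nsites > 0);
    return nsites / 2 + 1;
}

/* Client grants the master needs, assuming it counts itself as one site. */
unsigned client_grants_needed(unsigned nsites)
{
    return majority_of(nsites) - 1;
}

/* True if the electable sites alone can still form a majority. */
bool electable_majority(unsigned nsites, unsigned unelectable)
{
    assert(unelectable <= nsites);
    return nsites - unelectable >= majority_of(nsites);
}

/* Electable sites that can be down while a lease majority is reachable. */
unsigned tolerated_failures(unsigned nsites, unsigned unelectable)
{
    if (!electable_majority(nsites, unelectable))
        return 0;
    return nsites - unelectable - majority_of(nsites);
}
```

<p>
For a five-site group, the model tolerates two failed sites
when every site is electable, but none when two of the five
are unelectable. This is why keeping the number of
unelectable sites small improves availability.
</p>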
<p>
Master leases are based on timeouts. Berkeley DB assumes
that time always runs forward. Users who change the system
clock on either client or master sites when leases are in use
void all guarantees and can get undefined behavior. See the
<a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV->rep_set_timeout()</a> method for more information.
</p>
<p>
Applications using master leases should be prepared to
handle <code class="literal">DB_REP_LEASE_EXPIRED</code> errors from
read operations on a master and from the <a href="../api_reference/C/txncommit.html" class="olink">DB_TXN->commit()</a> method.
</p>
<p>
Read operations on a master that should not be subject to
leases can use the <a href="../api_reference/C/dbget.html#get_DB_IGNORE_LEASE" class="olink">DB_IGNORE_LEASE</a> flag to the <a href="../api_reference/C/dbget.html" class="olink">DB->get()</a>
method. Read operations on a client always ignore leases.
</p>
<p>
Master lease checks cannot succeed until a majority of
sites have completed client synchronization. Read operations
on a master performed before this condition is met can use the
<a href="../api_reference/C/dbget.html#get_DB_IGNORE_LEASE" class="olink">DB_IGNORE_LEASE</a> flag to avoid errors.
</p>
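<p>
One way an application might structure its reads around
these rules is sketched below. Everything here is
illustrative: <code class="literal">read_status</code>,
<code class="literal">read_fn</code>,
<code class="literal">read_with_lease_policy</code> and
<code class="literal">stub_read</code> are hypothetical names,
and the callback stands in for a real read call. In a real
application the lease failure would be reported as
<code class="literal">DB_REP_LEASE_EXPIRED</code>, and the
fallback read would pass <code class="literal">DB_IGNORE_LEASE</code>.
</p>

```c
#include <stddef.h>

/* Hypothetical status codes standing in for a real read's return value. */
typedef enum { READ_OK = 0, READ_LEASE_EXPIRED = 1, READ_FAILED = 2 } read_status;

/* Hypothetical callback: performs one read. A nonzero ignore_lease
 * corresponds to passing DB_IGNORE_LEASE to the real read call. */
typedef read_status (*read_fn)(void *ctx, int ignore_lease);

typedef struct {
    read_status status;
    int authoritative;   /* 1 only if the read passed the lease check */
} read_result;

/* Try a lease-checked read up to max_retries + 1 times. If leases stay
 * expired and allow_stale is set, fall back to a read that ignores leases.
 * A real application would normally pause between attempts. */
read_result read_with_lease_policy(read_fn do_read, void *ctx,
                                   unsigned max_retries, int allow_stale)
{
    read_result r = { READ_LEASE_EXPIRED, 0 };
    for (unsigned attempt = 0; attempt <= max_retries; attempt++) {
        r.status = do_read(ctx, 0);
        if (r.status == READ_OK) {
            r.authoritative = 1;
            return r;
        }
        if (r.status != READ_LEASE_EXPIRED)
            return r;   /* unrelated failure: do not retry */
    }
    if (allow_stale) {
        r.status = do_read(ctx, 1);
        r.authoritative = 0;
    }
    return r;
}

/* Demonstration stub: fails the lease check *ctx times, then succeeds.
 * Reads that ignore leases always succeed. */
read_status stub_read(void *ctx, int ignore_lease)
{
    unsigned *failures_left = ctx;
    if (ignore_lease)
        return READ_OK;
    if (*failures_left > 0) {
        (*failures_left)--;
        return READ_LEASE_EXPIRED;
    }
    return READ_OK;
}
```

<p>
The result records whether the data passed the lease check,
so a caller can tell an authoritative read from one that
might be stale.
</p>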
<p>
Clients are forbidden from participating in elections while
they have an outstanding lease granted to a master. Therefore,
if the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method is called on such a client, the call
blocks until the client's lease grant expires before
participating in any election. While it waits, the client
attempts to contact the current master. If the client finds a
current master, then it returns from the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method.
When leases are configured and the lease has never yet been
granted (on start-up), clients must wait a full lease timeout
before participating in an election.
</p>
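<p>
The waiting rule above can be expressed as a small
calculation. This is a sketch only: the
<code class="literal">client_lease_state</code> type and both
functions are hypothetical, and measuring the start-up wait
from the moment replication started on the client is an
assumption made for illustration.
</p>

```c
#include <stdint.h>

/* Hypothetical per-client view of lease state; times are microseconds. */
typedef struct {
    int      has_granted;         /* nonzero once any lease has been granted */
    uint64_t grant_expires_usec;  /* expiry of the most recent grant */
    uint64_t started_usec;        /* assumed: when replication started here */
} client_lease_state;

/* Earliest time at which this client may take part in an election. */
uint64_t earliest_election_usec(const client_lease_state *c,
                                uint64_t lease_timeout_usec)
{
    if (c->has_granted)
        return c->grant_expires_usec;
    /* No grant yet: wait one full lease timeout after start-up. */
    return c->started_usec + lease_timeout_usec;
}

/* How long an election call made at now_usec would wait; 0 if not at all. */
uint64_t election_wait_usec(const client_lease_state *c,
                            uint64_t lease_timeout_usec, uint64_t now_usec)
{
    uint64_t earliest = earliest_election_usec(c, lease_timeout_usec);
    return earliest > now_usec ? earliest - now_usec : 0;
}
```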
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="masterlease_change_groupsize"></a>Changing group
size</h3>
</div>
</div>
</div>
<p>
If you are using master leases and you change the size
of your replication group, there is a remote possibility
that you can lose some data previously thought to be
durable. This applies only to users of the Base API.
</p>
<p>
The problem can arise if you are removing sites from
your replication group. (You might be increasing the size
of your group overall, but if you remove all of the wrong
sites you can lose data.)
</p>
<p>
Suppose you have a replication group with five sites
(A, B, C, D and E) and you are using a quorum
acknowledgement policy. Then:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Master A replicates a transaction to replicas B
and C. Those sites acknowledge the write activity.
</p>
</li>
<li>
<p>
Sites D and E do not receive the transaction.
However, B and C have acknowledged the
transaction, which means the acknowledgement
policy is met and so the transaction is considered
durable.
</p>
</li>
<li>
<p>
You shut down sites B and C. Now only A has the
transaction.
</p>
</li>
<li>
<p>
You decrease the size of your replication group
to 3 using <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV->rep_set_nsites()</a>.
</p>
</li>
<li>
<p>
You shut down or otherwise lose site A.
</p>
</li>
<li>
<p>
Sites D and E hold an election. Because the
size of the replication group is 3, they have
enough sites to successfully hold an election.
However, neither site has the transaction in
question. In this way, the transaction can become
lost.
</p>
</li>
</ol>
</div>
<p>
An alternative scenario exists where you do not change
the size of your replication group, or you actually
increase the size of your replication group, but in the
process you happen to remove the exact wrong sites:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Master A replicates a transaction to replicas B
and C. Those sites acknowledge the write activity.
</p>
</li>
<li>
<p>
Sites D and E do not receive the transaction.
However, B and C have acknowledged the
transaction, which means the acknowledgement
policy is met and so the transaction is considered
durable.
</p>
</li>
<li>
<p>
You shut down sites B and C. Now only A has the
transaction.
</p>
</li>
<li>
<p>
You add three new sites to your replication
group: F, G and H, increasing the size of your
replication group to 6 using <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV->rep_set_nsites()</a>.
</p>
</li>
<li>
<p>
You shut down or otherwise lose site A before
F, G and H can be fully populated with data.
</p>
</li>
<li>
<p>
Sites D, E, F, G and H hold an election.
Because the size of the replication group is 6,
they have enough sites to successfully hold an
election. However, none of these sites has the
transaction in question. In this way, the
transaction can become lost.
</p>
</li>
</ol>
</div>
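<p>
Both scenarios can be checked with a small model that tracks
which sites hold the transaction and which sites are still
running. The code is a hedged illustration rather than
anything Berkeley DB provides: the site bit masks and
function names are hypothetical, and it assumes an election
succeeds when a strict majority of the configured group size
takes part.
</p>

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per site, named after the sites in the scenarios above. */
enum {
    SITE_A = 1 << 0, SITE_B = 1 << 1, SITE_C = 1 << 2, SITE_D = 1 << 3,
    SITE_E = 1 << 4, SITE_F = 1 << 5, SITE_G = 1 << 6, SITE_H = 1 << 7
};

/* Number of sites present in a bit mask. */
unsigned count_sites(unsigned mask)
{
    unsigned n = 0;
    while (mask != 0) {
        n += mask & 1u;
        mask >>= 1;
    }
    return n;
}

/* Assumption: an election needs a strict majority of nsites to take part. */
bool election_can_succeed(unsigned live, unsigned nsites)
{
    assert(nsites > 0);
    return count_sites(live) >= nsites / 2 + 1;
}

/* The transaction can be lost when the running sites are able to elect
 * a master and none of them holds the transaction. */
bool transaction_can_be_lost(unsigned holders, unsigned live, unsigned nsites)
{
    return election_can_succeed(live, nsites) && (holders & live) == 0;
}
```

<p>
With the transaction held by A, B and C and only D and E
running, the model reports possible loss for a group size of
3 but not for the original size of 5, because two sites
cannot form a majority of five. In the second scenario, D, E,
F, G and H form a majority of six without holding the
transaction.
</p>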
<p>
This scenario represents a race condition that would be
highly unlikely to be seen outside of a lab environment.
To minimize the chance of this race condition occurring,
do one or more of the following when
using master leases with the Base API:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Require all sites to acknowledge transaction
commits.
</p>
</li>
<li>
<p>
Never change the size of your replication group
unless all sites in the group are running and
communicating normally with one another.
</p>
</li>
<li>
<p>
Don't remove (or replace) a large percentage of
your sites from your replication group unless all
sites in the group are running and communicating
normally with one another. If you are going to
remove a large percentage of your sites from your
replication group, try removing just one site at a
time, pausing in between each removal to give the
replication group a chance to fully distribute all
writes before removing the next site.
</p>
</li>
</ol>
</div>
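<p>
An application could guard group changes with a check along
the following lines. This is a minimal sketch with
hypothetical names (<code class="literal">site_status</code>,
<code class="literal">safe_to_change_group</code>), and it
simplifies a site's log position to a single number for
illustration. How an application learns each site's status is
outside the scope of the example.
</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status record the application keeps for each site. */
typedef struct {
    bool     connected;    /* site is running and reachable */
    uint64_t applied_pos;  /* simplified stand-in for its log position */
} site_status;

/* Allow a group change only when every site is connected and has
 * applied everything up to the master's current position. */
bool safe_to_change_group(const site_status *sites, size_t nsites,
                          uint64_t master_pos)
{
    for (size_t i = 0; i < nsites; i++) {
        if (!sites[i].connected || sites[i].applied_pos < master_pos)
            return false;
    }
    return true;
}
```

<p>
When removing several sites, the same check can be repeated
before each individual removal so that the group has
distributed all writes before the next site goes away.
</p>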
</div>
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href="rep_trans.html">Prev</a> </td>
<td width="20%" align="center">
<a accesskey="u" href="rep.html">Up</a>
</td>
<td width="40%" align="right"> <a accesskey="n" href="rep_ryw.html">Next</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Transactional guarantees </td>
<td width="20%" align="center">
<a accesskey="h" href="index.html">Home</a>
</td>
<td width="40%" align="right" valign="top"> Read your writes consistency</td>
</tr>
</table>
</div>
</body>
</html>