<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Transactional guarantees</title>
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
<link rel="up" href="rep.html" title="Chapter 12. Berkeley DB Replication" />
<link rel="prev" href="rep_bulk.html" title="Bulk transfer" />
<link rel="next" href="rep_lease.html" title="Master leases" />
</head>
<body>
<div xmlns="" class="navheader">
<div class="libver">
<p>Library Version 12.1.6.1</p>
</div>
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">Transactional guarantees</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href="rep_bulk.html">Prev</a> </td>
<th width="60%" align="center">Chapter 12. Berkeley DB Replication </th>
<td width="20%" align="right"> <a accesskey="n" href="rep_lease.html">Next</a></td>
</tr>
</table>
<hr />
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="rep_trans"></a>Transactional guarantees</h2>
</div>
</div>
</div>
<p>
It is important to consider replication in the context of
the overall database environment's transactional guarantees.
To briefly review, transactional guarantees in a
non-replicated application are based on the writing of log
file records to "stable storage", usually a disk drive. If the
application or system then fails, the Berkeley DB logging
information is reviewed during recovery, and the databases are
updated so that all changes made as part of committed
transactions appear, and all changes made as part of
uncommitted transactions do not appear. In this case, no
information will have been lost.
</p>
<p>
If a database environment does not require the log be
flushed to stable storage on transaction commit (using the
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag to increase performance at the cost of
sacrificing transactional durability), Berkeley DB recovery
will only be able to restore the system to the state of the
last commit found on stable storage. In this case, information
may have been lost (for example, the changes made by some
committed transactions may not appear in the databases after
recovery).
</p>
<p>
Further, if there is database or log file loss or corruption
(for example, if a disk drive fails), then catastrophic
recovery is necessary, and Berkeley DB recovery will only be
able to restore the system to the state of the last archived
log file. In this case, information may also have been
lost.
</p>
<p>
Replicating the database environment extends this model, by
adding a new component to "stable storage": the client's
replicated information. If a database environment is
replicated, there is no lost information in the case of
database or log file loss, because the replicated system can
be configured to contain a complete set of databases and log
records up to the point of failure. A database environment
that loses a disk drive can have the drive replaced, and it
can then rejoin the replication group.
</p>
<p>
Because of this new component of stable storage, specifying
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> in a replicated environment no longer
sacrifices durability, as long as one or more clients have
acknowledged receipt of the messages sent by the master. Since
network connections are often faster than local synchronous
disk writes, replication becomes a way for applications to
significantly improve their performance as well as their
reliability.
</p>
<p>
Applications using Replication Manager are free to use
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> at the master and/or clients as they see fit.
The behavior of the <span class="bold"><strong>send</strong></span>
function that Replication Manager provides on the
application's behalf is determined by an "acknowledgement
policy", which is configured by the <a href="../api_reference/C/repmgrset_ack_policy.html" class="olink">DB_ENV->repmgr_set_ack_policy()</a> method.
Clients always send acknowledgements for <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>
messages (unless the acknowledgement policy in effect
indicates that the master doesn't care about them). For a
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> message, the master blocks the sending
thread until either it receives the proper number of
acknowledgements, or the <a href="../api_reference/C/repset_timeout.html#set_timeout_DB_REP_ACK_TIMEOUT" class="olink">DB_REP_ACK_TIMEOUT</a> expires. In the
case of timeout, Replication Manager returns an error code
from the <span class="bold"><strong>send</strong></span> function,
causing Berkeley DB to flush the transaction log before
returning to the application, as previously described. The
default acknowledgement policy is <a href="../api_reference/C/repmgrset_ack_policy.html#ackspolicy_DB_REPMGR_ACKS_QUORUM" class="olink">DB_REPMGR_ACKS_QUORUM</a>,
which ensures that the effect of a permanent record remains
durable following an election.
</p>
<p>
The rest of this section discusses transactional guarantee
considerations in Base API applications.
</p>
<p>
The return status from the application's <span class="bold"><strong>send</strong></span> function must be set by the
application to ensure the transactional guarantees the
application wants to provide. Whenever the <span class="bold"><strong>send</strong></span> function returns failure, the
local database environment's log is flushed as necessary to
ensure that any information critical to database integrity is
not lost. Because this flush is an expensive operation in
terms of database performance, applications should avoid
returning an error from the <span class="bold"><strong>send</strong></span> function,
if at all possible.
</p>
<p>
The only messages that matter for replication
transactional guarantees are those for which the application's <span class="bold"><strong>send</strong></span> function was called with the
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag specified. There is no reason for the
<span class="bold"><strong>send</strong></span> function to ever
return failure unless the <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag was
specified -- messages without the <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag do
not make visible changes to databases, and the <span class="bold"><strong>send</strong></span> function can return success to
Berkeley DB as soon as the message has been sent to the
client(s) or even just copied to local application memory in
preparation for being sent.
</p>
<p>
When a client receives a <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> message, the
client will flush its log to stable storage before returning
(unless the client environment has been configured with the
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> option). If the client is unable to flush a
complete transactional record to disk for any reason (for
example, there is a missing log record before the flagged
message), the call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method on the client
will return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> and return the LSN of this record
to the application in the <span class="bold"><strong>ret_lsnp</strong></span> parameter.
The application's client or master message handling loops should take proper
action to ensure the correct transactional guarantees in this case. When
missing records arrive and allow subsequent processing of
previously stored permanent records, the call to the
<a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method on the client will return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a>
and return the largest LSN of the permanent records that were
flushed to disk. Client applications can use these LSNs to
know definitively if any particular LSN is permanently stored
or not.
</p>
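<p>
As a minimal sketch of the bookkeeping described above, a client
could remember the largest LSN reported as permanent and compare
other LSNs against it. Everything here is a hypothetical stand-in
so that the fragment compiles without the Berkeley DB headers:
sketch_lsn models the file and offset pair of an LSN, and the
sketch_result codes model the outcomes a real loop would read from
the return value of DB_ENV->rep_process_message().
</p>

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a Berkeley DB LSN: a log file number plus an offset. */
typedef struct {
    uint32_t file;
    uint32_t offset;
} sketch_lsn;

/* Hypothetical outcome codes; a real message loop would test the value
 * returned by DB_ENV->rep_process_message() instead. */
typedef enum { SKETCH_OK, SKETCH_ISPERM, SKETCH_NOTPERM } sketch_result;

/* Order two LSNs: negative, zero, or positive, in the style of strcmp(). */
int sketch_lsn_cmp(const sketch_lsn *a, const sketch_lsn *b)
{
    if (a->file != b->file)
        return a->file < b->file ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Remember the largest LSN that has been reported as permanent so far.
 * A "not permanent" outcome leaves the high-water mark untouched. */
void sketch_note_result(sketch_lsn *max_perm, sketch_result res,
                        const sketch_lsn *ret_lsn)
{
    if (res == SKETCH_ISPERM && sketch_lsn_cmp(ret_lsn, max_perm) > 0)
        *max_perm = *ret_lsn;
}

/* A record is known to be on stable storage once max_perm has reached it. */
int sketch_is_permanent(const sketch_lsn *max_perm, const sketch_lsn *lsn)
{
    return sketch_lsn_cmp(lsn, max_perm) <= 0;
}
```

<p>
A client would start with a zeroed sketch_lsn, call
sketch_note_result() after each processed message, and consult
sketch_is_permanent() before acknowledging a given record.
</p>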
<p>
An application relying on a client's ability to become a
master and guarantee that no data has been lost will need to
write the <span class="bold"><strong>send</strong></span> function to
return an error whenever it cannot guarantee that the site that
will win the next election has the record. Applications not
requiring this level of transactional guarantees need not have
the <span class="bold"><strong>send</strong></span> function return
failure (unless the master's database environment has been
configured with <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>), as any information critical
to database integrity has already been flushed to the local
log before <span class="bold"><strong>send</strong></span> was
called.
</p>
<p>
To sum up, the only reason for the <span class="bold"><strong>send</strong></span> function
to return failure is when the
master database environment has been configured to not
synchronously flush the log on transaction commit (that is,
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> was configured on the master), the
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag is specified for the message, and the
<span class="bold"><strong>send</strong></span> function was unable
to determine that some number of clients have received the
current message (and all messages preceding the current
message). How many clients need to receive the message before
the <span class="bold"><strong>send</strong></span> function can return
success is an application choice (and may depend less on a
specific number of clients reporting success than on one or
more geographically distributed clients doing so).
</p>
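<p>
The rule summarized above can be written down as a small decision
function. This is only a sketch under stated assumptions:
SKETCH_REP_PERMANENT is a hypothetical stand-in for the
DB_REP_PERMANENT bit, and the sketch_send_state structure and its
fields are illustrative names for facts the application's own
transport layer would have to track.
</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the DB_REP_PERMANENT bit; a real send
 * function would test the flag defined by the Berkeley DB headers. */
#define SKETCH_REP_PERMANENT 0x1u

/* Facts the application's send function has gathered for one message. */
typedef struct {
    uint32_t flags;         /* flags passed to the send function */
    int master_nosync;      /* nonzero if the master uses DB_TXN_NOSYNC */
    unsigned acks_received; /* clients known to hold this message and
                             * every message preceding it */
    unsigned acks_required; /* threshold chosen by the application */
} sketch_send_state;

/* Return 0 for success and -1 for failure. Failure is reported only when
 * all three conditions hold: the message is permanent, the master does
 * not flush its log on commit, and too few clients have acknowledged. */
int sketch_send_status(const sketch_send_state *s)
{
    if ((s->flags & SKETCH_REP_PERMANENT) == 0)
        return 0; /* not a permanent message */
    if (!s->master_nosync)
        return 0; /* the master's log was already flushed at commit */
    if (s->acks_received >= s->acks_required)
        return 0; /* enough clients hold the record */
    return -1;    /* the library will flush the master's log */
}
```

<p>
Keeping the decision in one function makes the application's
durability policy easy to review separately from the networking
code that counts acknowledgements.
</p>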
<p>
If, however, the application does require on-disk durability
on the master, the master should be configured to
synchronously flush the log on commit. If clients are not
configured to synchronously flush the log, that is, if a
client is running with <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> configured, then it is
up to the application to reconfigure that client appropriately
when it becomes a master. That is, the application must
explicitly call <a href="../api_reference/C/envset_flags.html" class="olink">DB_ENV->set_flags()</a> to disable asynchronous log
flushing as part of re-configuring the client as the new
master.
</p>
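<p>
The reconfiguration step described above might look like the
following sketch. To stay compilable without the Berkeley DB
headers it models the environment with a hypothetical sketch_env
structure; sketch_set_flags() only imitates the flags and on/off
arguments of DB_ENV->set_flags(), and SKETCH_TXN_NOSYNC is a
stand-in for DB_TXN_NOSYNC.
</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the DB_TXN_NOSYNC flag bit. */
#define SKETCH_TXN_NOSYNC 0x1u

/* Toy environment holding only the flag word this sketch needs. */
typedef struct {
    uint32_t flags;
} sketch_env;

/* Imitates the (flags, onoff) shape of DB_ENV->set_flags(): set the
 * given bits when onoff is nonzero, clear them otherwise. */
int sketch_set_flags(sketch_env *env, uint32_t flags, int onoff)
{
    if (onoff)
        env->flags |= flags;
    else
        env->flags &= ~flags;
    return 0;
}

/* Called when a client wins an election. If the application requires
 * on-disk durability on the master, turn asynchronous flushing off. */
int sketch_promote_to_master(sketch_env *env, int need_on_disk_durability)
{
    if (need_on_disk_durability)
        return sketch_set_flags(env, SKETCH_TXN_NOSYNC, 0);
    return 0;
}
```

<p>
In a real application the equivalent call would be made on the
DB_ENV handle as part of handling the site's change of role.
</p>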
<p>
Of course, it is important to ensure that the replicated
master and client environments are truly independent of each
other. For example, it does not help matters that a client has
acknowledged receipt of a message if both master and clients
are on the same power supply, as the failure of the power
supply will still potentially lose information.
</p>
<p>
Configuring a Base API application to achieve the proper mix
of performance and transactional guarantees can be complex. In
brief, there are a few controls an application can set to
configure the guarantees it makes: specification of
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> for the master environment, specification of
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> for the client environment, the priorities of
different sites participating in an election, and the behavior
of the application's <span class="bold"><strong>send</strong></span>
function.
</p>
<p>
First, it is rarely useful to write and synchronously flush
the log when a transaction commits on a replication client. It
may be useful where systems share resources and multiple
systems commonly fail at the same time. By default, all
Berkeley DB database environments, whether master or client,
synchronously flush the log on transaction commit or prepare.
Generally, replication masters and clients turn log flush off
for transaction commit using the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag.
</p>
<p>
Consider two systems connected by a network interface. One
acts as the master, the other as a read-only client. The
client takes over as master if the master crashes, and the
master rejoins the replication group after such a failure.
Both master and client are configured to not synchronously
flush the log on transaction commit (that is, <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>
was configured on both systems). The application's <span class="bold"><strong>send</strong></span> function never returns failure
to the Berkeley DB library, simply forwarding messages to the
client (perhaps over a broadcast mechanism), and always
returning success. On the client, any <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> returns
from the client's <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method are ignored, as well.
This system configuration has excellent performance, but may
lose data in some failure modes.
</p>
<p>
If both the master and the client crash at once, it is
possible to lose committed transactions, that is,
transactional durability is not being maintained. Reliability
can be increased by providing separate power supplies for the
systems and placing them in separate physical
locations.
</p>
<p>
If the connection between the two machines fails (or some
messages are simply lost), and the master subsequently
crashes, it is possible to lose committed transactions. Again,
transactional durability is not being maintained. Reliability
can be improved in a couple of ways:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Use a reliable network protocol (for example,
TCP/IP instead of UDP).
</p>
</li>
<li>
<p>
Increase the number of clients and network paths to
make it less likely that a message will be lost. In
this case, it is important to also make sure a client
that did receive the message wins any subsequent
election. If a client that did not receive the message
wins a subsequent election, data can still be lost.
</p>
</li>
</ol>
</div>
<p>
Further, systems may want to guarantee message delivery to
the client(s) (for example, to prevent a network connection
from simply discarding messages). Some systems may want to
ensure clients never return out-of-date information, that is,
once a transaction commit returns success on the master, no
client will return old information to a read-only query. Some
of the following changes to a Base API application may be used
to address these issues:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Write the application's <span class="bold"><strong>send</strong></span> function
to not return to Berkeley DB until one or more clients have
acknowledged receipt of the message. The number of
clients chosen will be dependent on the application:
you will want to consider likely network partitions
(ensure that a client at each physical site receives
the message) and geographical diversity (ensure that a
client on each coast receives the message).
</p>
</li>
<li>
<p>
Write the client's message processing loop to not
acknowledge receipt of the message until a call to the
<a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method has returned success. Messages
resulting in a return of <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> from the
<a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method mean the message could not be
flushed to the client's disk. If the client does not
acknowledge receipt of such messages to the master
until a subsequent call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method
returns <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a> and the LSN returned is at
least as large as this message's LSN, then the
master's <span class="bold"><strong>send</strong></span>
function will not return success to the Berkeley DB
library. This means the thread committing the
transaction on the master will not be allowed to
proceed based on the transaction having committed
until the selected set of clients have received the
message and consider it complete.
</p>
<p>
Alternatively, the client's message processing loop
could acknowledge the message to the master, but with
an error code indicating that the application's
<span class="bold"><strong>send</strong></span> function
should not return to the Berkeley DB library until a
subsequent acknowledgement from the same client
indicates success.
</p>
<p>
The application's send callback function invoked by
Berkeley DB is passed the LSN of the record being sent
(if appropriate for that record). When the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a>
method returns an indication that a permanent record has
been written, it also returns the maximum LSN of
the permanent records written.
</p>
</li>
</ol>
</div>
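<p>
Seen from the master, the acknowledgement handling in the list
above amounts to checking how many clients have acknowledged an LSN
at or beyond that of the message being sent. The sketch below
assumes the application keeps, for each client, the largest
permanent LSN that client has acknowledged; the sketch_lsn type and
all function names are hypothetical and not part of the Berkeley DB
API.
</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a Berkeley DB LSN: a log file number plus an offset. */
typedef struct {
    uint32_t file;
    uint32_t offset;
} sketch_lsn;

/* Nonzero when LSN a is at or beyond LSN b. */
int sketch_lsn_reached(const sketch_lsn *a, const sketch_lsn *b)
{
    if (a->file != b->file)
        return a->file > b->file;
    return a->offset >= b->offset;
}

/* acked[i] is the largest permanent LSN that client i has acknowledged.
 * Count the clients whose acknowledgement covers msg_lsn. */
size_t sketch_count_covering(const sketch_lsn *acked, size_t nclients,
                             const sketch_lsn *msg_lsn)
{
    size_t i, n = 0;

    for (i = 0; i < nclients; i++)
        if (sketch_lsn_reached(&acked[i], msg_lsn))
            n++;
    return n;
}

/* The send function may report success once enough clients cover the
 * message; "required" is the application's chosen threshold. */
int sketch_may_return_success(const sketch_lsn *acked, size_t nclients,
                              const sketch_lsn *msg_lsn, size_t required)
{
    return sketch_count_covering(acked, nclients, msg_lsn) >= required;
}
```

<p>
Because an acknowledged LSN covers every earlier record, a single
high-water mark per client is enough; the master does not need to
track acknowledgements message by message.
</p>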
<p>
There is one final pair of failure scenarios to consider.
First, it is not possible to abort transactions after the
application's <span class="bold"><strong>send</strong></span> function
has been called, as the master may have already written the
commit log records to disk, and so abort is no longer an
option. Second, a related problem is that even though the
master will attempt to flush the local log if the <span class="bold"><strong>send</strong></span> function returns failure, that
flush may fail (for example, when the local disk is full).
Again, the transaction cannot be aborted as one or more
clients may have committed the transaction even if <span class="bold"><strong>send</strong></span> returns failure. Rare
applications may not be able to tolerate these unlikely
failure modes. In that case the application may want
to:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Configure the master to always do local synchronous
commits (turning off the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>
configuration). This will decrease performance
significantly, of course (one of the reasons to use
replication is to avoid local disk writes). In this
configuration, failure to write the local log will
cause the transaction to abort in all cases.
</p>
</li>
<li>
<p>
Do not return from the application's <span class="bold"><strong>send</strong></span> function under any
conditions, until the selected set of clients has
acknowledged the message. Until the <span class="bold"><strong>send</strong></span> function returns to
the Berkeley DB library, the thread committing the
transaction on the master will wait, and so no
application will be able to act on the knowledge that
the transaction has committed.
</p>
</li>
</ol>
</div>
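<p>
The second option in the list above, never returning from send
until the selected clients have acknowledged, can be sketched as a
loop with no timeout and no failure path. The sketch_poll_fn
callback is a hypothetical hook into the application's transport
layer, and sketch_counting_poll() is a toy implementation included
only so that the loop can be exercised.
</p>

```c
#include <assert.h>

/* Hypothetical callback: waits for network activity (or a poll interval)
 * and returns how many selected clients have acknowledged so far. */
typedef unsigned (*sketch_poll_fn)(void *ctx);

/* Keep the committing thread inside the send function until enough
 * clients have acknowledged. There is deliberately no timeout and no
 * failure return, so the caller never sees an unacknowledged commit. */
int sketch_send_wait(sketch_poll_fn poll_acks, void *ctx,
                     unsigned acks_required)
{
    while (poll_acks(ctx) < acks_required)
        ; /* still waiting for acknowledgements */
    return 0;
}

/* Toy poll function for trying the loop: each call reports one more
 * acknowledgement than the previous call. */
unsigned sketch_counting_poll(void *ctx)
{
    unsigned *acks = ctx;

    return ++*acks;
}
```

<p>
The cost of this design is that a lost client can stall commits on
the master indefinitely, so it suits only applications that cannot
tolerate the failure modes described above.
</p>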
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href="rep_bulk.html">Prev</a> </td>
<td width="20%" align="center">
<a accesskey="u" href="rep.html">Up</a>
</td>
<td width="40%" align="right"> <a accesskey="n" href="rep_lease.html">Next</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Bulk transfer </td>
<td width="20%" align="center">
<a accesskey="h" href="index.html">Home</a>
</td>
<td width="40%" align="right" valign="top"> Master leases</td>
</tr>
</table>
</div>
</body>
</html>