It is important to consider replication in the context of the overall -database environment's transactional guarantees. To briefly review, -transactional guarantees in a non-replicated application are based on -the writing of log file records to "stable storage", usually a disk -drive. If the application or system then fails, the Berkeley DB logging -information is reviewed during recovery, and the databases are updated -so that all changes made as part of committed transactions appear, and -all changes made as part of uncommitted transactions do not appear. In -this case, no information will have been lost.

Prev	Chapter 12. - Berkeley DB Replication -	Chapter 12. Berkeley DB Replication	Next

If a database environment does not require the log be flushed to -stable storage on transaction commit (using the DB_TXN_NOSYNC -flag to increase performance at the cost of sacrificing transactional -durability), Berkeley DB recovery will only be able to restore the system to -the state of the last commit found on stable storage. In this case, -information may have been lost (for example, the changes made by some -committed transactions may not appear in the databases after recovery).

Further, if there is database or log file loss or corruption (for -example, if a disk drive fails), then catastrophic recovery is -necessary, and Berkeley DB recovery will only be able to restore the system -to the state of the last archived log file. In this case, information -may also have been lost.

Replicating the database environment extends this model, by adding a -new component to "stable storage": the client's replicated information. -If a database environment is replicated, there is no lost information -in the case of database or log file loss, because the replicated system -can be configured to contain a complete set of databases and log records -up to the point of failure. A database environment that loses a disk -drive can have the drive replaced, and it can then rejoin the -replication group.

Because of this new component of stable storage, specifying -DB_TXN_NOSYNC in a replicated environment no longer sacrifices -durability, as long as one or more clients have acknowledged receipt of -the messages sent by the master. Since network connections are often -faster than local synchronous disk writes, replication becomes a way -for applications to significantly improve their performance as well as -their reliability.

The return status from the application's send function must be -set by the application to ensure the transactional guarantees the -application wants to provide. Whenever the send function -returns failure, the local database environment's log is flushed as -necessary to ensure that any information critical to database integrity -is not lost. Because this flush is an expensive operation in terms of -database performance, applications should avoid returning an error from -the send function, if at all possible.

The only interesting message type for replication transactional -guarantees is when the application's send function was called -with the DB_REP_PERMANENT flag specified. There is no reason -for the send function to ever return failure unless the -DB_REP_PERMANENT flag was specified -- messages without the -DB_REP_PERMANENT flag do not make visible changes to databases, -and the send function can return success to Berkeley DB as soon as -the message has been sent to the client(s) or even just copied to local -application memory in preparation for being sent.

When a client receives a DB_REP_PERMANENT message, the client -will flush its log to stable storage before returning (unless the client -environment has been configured with the DB_TXN_NOSYNC option). -If the client is unable to flush a complete transactional record to disk -for any reason (for example, there is a missing log record before the -flagged message), the call to the DB_ENV->rep_process_message() method on the client -will return DB_REP_NOTPERM and return the LSN of this record -to the application in the ret_lsnp parameter. -The application's client or master -message handling loops should take proper action to ensure the correct -transactional guarantees in this case. When missing records arrive -and allow subsequent processing of previously stored permanent -records, the call to the DB_ENV->rep_process_message() method on the client will -return DB_REP_ISPERM and return the largest LSN of the -permanent records that were flushed to disk. Client applications -can use these LSNs to know definitively if any particular LSN is -permanently stored or not.

An application relying on a client's ability to become a master and -guarantee that no data has been lost will need to write the send -function to return an error whenever it cannot guarantee the site that -will win the next election has the record. Applications not requiring -this level of transactional guarantees need not have the send -function return failure (unless the master's database environment has -been configured with DB_TXN_NOSYNC), as any information critical -to database integrity has already been flushed to the local log before -send was called.

To sum up, the only reason for the send function to return -failure is when the master database environment has been configured to -not synchronously flush the log on transaction commit (that is, -DB_TXN_NOSYNC was configured on the master), the -DB_REP_PERMANENT flag is specified for the message, and the -send function was unable to determine that some number of -clients have received the current message (and all messages preceding -the current message). How many clients need to receive the message -before the send function can return success is an application -choice (and may not depend as much on a specific number of clients -reporting success as one or more geographically distributed clients).

If, however, the application does require on-disk durability on the master, -the master should be configured to synchronously flush the log on commit. -If clients are not configured to synchronously flush the log, -that is, if a client is running with DB_TXN_NOSYNC configured, -then it is up to the application to reconfigure that client -appropriately when it becomes a master. That is, the -application must explicitly call DB_ENV->set_flags() to -disable asynchronous log flushing as part of re-configuring -the client as the new master.

Of course, it is important to ensure that the replicated master and -client environments are truly independent of each other. For example, -it does not help matters that a client has acknowledged receipt of a -message if both master and clients are on the same power supply, as the -failure of the power supply will still potentially lose information.

Configuring your replication-based application to achieve the proper -mix of performance and transactional guarantees can be complex. In -brief, there are a few controls an application can set to configure the -guarantees it makes: specification of DB_TXN_NOSYNC for the -master environment, specification of DB_TXN_NOSYNC for the -client environment, the priorities of different sites participating in -an election, and the behavior of the application's send -function.

Applications using Replication Manager are free to use -DB_TXN_NOSYNC at the master and/or clients as they see fit. The -behavior of the send function that Replication Manager provides -on the application's behalf is determined by an "acknowledgement -policy", which is configured by the DB_ENV->repmgr_set_ack_policy() method. -Clients always send acknowledgements for DB_REP_PERMANENT -messages (unless the acknowledgement policy in effect indicates that the -master doesn't care about them). For a DB_REP_PERMANENT -message, the master blocks the sending thread until either it receives -the proper number of acknowledgements, or the DB_REP_ACK_TIMEOUT -expires. In the case of timeout, Replication Manager returns an error -code from the send function, causing Berkeley DB to flush the -transaction log before returning to the application, as previously -described. The default acknowledgement policy is -DB_REPMGR_ACKS_QUORUM, which ensures that the effect of a -permanent record remains durable following an election.

First, it is rarely useful to write and synchronously flush the log when -a transaction commits on a replication client. It may be useful where -systems share resources and multiple systems commonly fail at the same -time. By default, all Berkeley DB database environments, whether master or -client, synchronously flush the log on transaction commit or prepare. -Generally, replication masters and clients turn log flush off for -transaction commit using the DB_TXN_NOSYNC flag.

Consider two systems connected by a network interface. One acts as the -master, the other as a read-only client. The client takes over as -master if the master crashes and the master rejoins the replication -group after such a failure. Both master and client are configured to -not synchronously flush the log on transaction commit (that is, -DB_TXN_NOSYNC was configured on both systems). The -application's send function never returns failure to the Berkeley DB -library, simply forwarding messages to the client (perhaps over a -broadcast mechanism), and always returning success. On the client, any -DB_REP_NOTPERM returns from the client's DB_ENV->rep_process_message() method -are ignored, as well. This system configuration has excellent -performance, but may lose data in some failure modes.

If both the master and the client crash at once, it is possible to lose -committed transactions, that is, transactional durability is not being -maintained. Reliability can be increased by providing separate power -supplies for the systems and placing them in separate physical locations.

If the connection between the two machines fails (or just some number -of messages are lost), and subsequently the master crashes, it is -possible to lose committed transactions. Again, transactional -durability is not being maintained. Reliability can be improved in a -couple of ways:

+ It is important to consider replication in the context of + the overall database environment's transactional guarantees. + To briefly review, transactional guarantees in a + non-replicated application are based on the writing of log + file records to "stable storage", usually a disk drive. If the + application or system then fails, the Berkeley DB logging + information is reviewed during recovery, and the databases are + updated so that all changes made as part of committed + transactions appear, and all changes made as part of + uncommitted transactions do not appear. In this case, no + information will have been lost. +

+ If a database environment does not require the log be + flushed to stable storage on transaction commit (using the + DB_TXN_NOSYNC flag to increase performance at the cost of + sacrificing transactional durability), Berkeley DB recovery + will only be able to restore the system to the state of the + last commit found on stable storage. In this case, information + may have been lost (for example, the changes made by some + committed transactions may not appear in the databases after + recovery). +

+ Further, if there is database or log file loss or corruption + (for example, if a disk drive fails), then catastrophic + recovery is necessary, and Berkeley DB recovery will only be + able to restore the system to the state of the last archived + log file. In this case, information may also have been + lost. +

+ Replicating the database environment extends this model, by + adding a new component to "stable storage": the client's + replicated information. If a database environment is + replicated, there is no lost information in the case of + database or log file loss, because the replicated system can + be configured to contain a complete set of databases and log + records up to the point of failure. A database environment + that loses a disk drive can have the drive replaced, and it + can then rejoin the replication group. +

+ Because of this new component of stable storage, specifying + DB_TXN_NOSYNC in a replicated environment no longer + sacrifices durability, as long as one or more clients have + acknowledged receipt of the messages sent by the master. Since + network connections are often faster than local synchronous + disk writes, replication becomes a way for applications to + significantly improve their performance as well as their + reliability. +

+ Applications using Replication Manager are free to use + DB_TXN_NOSYNC at the master and/or clients as they see fit. + The behavior of the send + function that Replication Manager provides on the + application's behalf is determined by an "acknowledgement + policy", which is configured by the DB_ENV->repmgr_set_ack_policy() method. + Clients always send acknowledgements for DB_REP_PERMANENT + messages (unless the acknowledgement policy in effect + indicates that the master doesn't care about them). For a + DB_REP_PERMANENT message, the master blocks the sending + thread until either it receives the proper number of + acknowledgements, or the DB_REP_ACK_TIMEOUT expires. In the + case of timeout, Replication Manager returns an error code + from the send function, + causing Berkeley DB to flush the transaction log before + returning to the application, as previously described. The + default acknowledgement policy is DB_REPMGR_ACKS_QUORUM, + which ensures that the effect of a permanent record remains + durable following an election. +

+ The rest of this section discusses transactional guarantee + considerations in Base API applications. +

+ The return status from the application's send function must be set by the + application to ensure the transactional guarantees the + application wants to provide. Whenever the send function returns failure, the + local database environment's log is flushed as necessary to + ensure that any information critical to database integrity is + not lost. Because this flush is an expensive operation in + terms of database performance, applications should avoid + returning an error from the send function, + if at all possible. +

+ The only interesting message type for replication + transactional guarantees is when the application's send function was called with the + DB_REP_PERMANENT flag specified. There is no reason for the + send function to ever + return failure unless the DB_REP_PERMANENT flag was + specified -- messages without the DB_REP_PERMANENT flag do + not make visible changes to databases, and the send function can return success to + Berkeley DB as soon as the message has been sent to the + client(s) or even just copied to local application memory in + preparation for being sent. +

+ When a client receives a DB_REP_PERMANENT message, the + client will flush its log to stable storage before returning + (unless the client environment has been configured with the + DB_TXN_NOSYNC option). If the client is unable to flush a + complete transactional record to disk for any reason (for + example, there is a missing log record before the flagged + message), the call to the DB_ENV->rep_process_message() method on the client + will return DB_REP_NOTPERM and return the LSN of this record + to the application in the ret_lsnp parameter. + The application's client or master message handling loops should take proper + action to ensure the correct transactional guarantees in this case. When + missing records arrive and allow subsequent processing of + previously stored permanent records, the call to the + DB_ENV->rep_process_message() method on the client will return DB_REP_ISPERM + and return the largest LSN of the permanent records that were + flushed to disk. Client applications can use these LSNs to + know definitively if any particular LSN is permanently stored + or not. +

+ An application relying on a client's ability to become a + master and guarantee that no data has been lost will need to + write the send function to + return an error whenever it cannot guarantee the site that + will win the next election has the record. Applications not + requiring this level of transactional guarantees need not have + the send function return + failure (unless the master's database environment has been + configured with DB_TXN_NOSYNC), as any information critical + to database integrity has already been flushed to the local + log before send was + called. +

+ To sum up, the only reason for the send function + to return failure is when the + master database environment has been configured to not + synchronously flush the log on transaction commit (that is, + DB_TXN_NOSYNC was configured on the master), the + DB_REP_PERMANENT flag is specified for the message, and the + send function was unable + to determine that some number of clients have received the + current message (and all messages preceding the current + message). How many clients need to receive the message before + the send function can return + success is an application choice (and may not depend as much + on a specific number of clients reporting success as one or + more geographically distributed clients). +

+ If, however, the application does require on-disk durability + on the master, the master should be configured to + synchronously flush the log on commit. If clients are not + configured to synchronously flush the log, that is, if a + client is running with DB_TXN_NOSYNC configured, then it is + up to the application to reconfigure that client appropriately + when it becomes a master. That is, the application must + explicitly call DB_ENV->set_flags() to disable asynchronous log + flushing as part of re-configuring the client as the new + master. +

+ Of course, it is important to ensure that the replicated + master and client environments are truly independent of each + other. For example, it does not help matters that a client has + acknowledged receipt of a message if both master and clients + are on the same power supply, as the failure of the power + supply will still potentially lose information. +

+ Configuring a Base API application to achieve the proper mix + of performance and transactional guarantees can be complex. In + brief, there are a few controls an application can set to + configure the guarantees it makes: specification of + DB_TXN_NOSYNC for the master environment, specification of + DB_TXN_NOSYNC for the client environment, the priorities of + different sites participating in an election, and the behavior + of the application's send + function. +

+ First, it is rarely useful to write and synchronously flush + the log when a transaction commits on a replication client. It + may be useful where systems share resources and multiple + systems commonly fail at the same time. By default, all + Berkeley DB database environments, whether master or client, + synchronously flush the log on transaction commit or prepare. + Generally, replication masters and clients turn log flush off + for transaction commit using the DB_TXN_NOSYNC flag. +

+ Consider two systems connected by a network interface. One + acts as the master, the other as a read-only client. The + client takes over as master if the master crashes and the + master rejoins the replication group after such a failure. + Both master and client are configured to not synchronously + flush the log on transaction commit (that is, DB_TXN_NOSYNC + was configured on both systems). The application's send function never returns failure + to the Berkeley DB library, simply forwarding messages to the + client (perhaps over a broadcast mechanism), and always + returning success. On the client, any DB_REP_NOTPERM returns + from the client's DB_ENV->rep_process_message() method are ignored, as well. + This system configuration has excellent performance, but may + lose data in some failure modes. +

+ If both the master and the client crash at once, it is + possible to lose committed transactions, that is, + transactional durability is not being maintained. Reliability + can be increased by providing separate power supplies for the + systems and placing them in separate physical + locations. +

+ If the connection between the two machines fails (or just + some number of messages are lost), and subsequently the master + crashes, it is possible to lose committed transactions. Again, + transactional durability is not being maintained. Reliability + can be improved in a couple of ways: +

- Use a reliable network protocol (for example, TCP/IP instead of UDP). -
+ Use a reliable network protocol (for example, + TCP/IP instead of UDP). +
- Increase the number of clients and network paths to make it - less likely that a message will be lost. In this case, it is - important to also make sure a client that did receive the - message wins any subsequent election. If a client that did not - receive the message wins a subsequent election, data can still - be lost. -
+ Increase the number of clients and network paths to + make it less likely that a message will be lost. In + this case, it is important to also make sure a client + that did receive the message wins any subsequent + election. If a client that did not receive the message + wins a subsequent election, data can still be lost. +

Further, systems may want to guarantee message delivery to the client(s) -(for example, to prevent a network connection from simply discarding -messages). Some systems may want to ensure clients never return -out-of-date information, that is, once a transaction commit returns -success on the master, no client will return old information to a -read-only query. Some of the following changes to a Base API application -may be used to address these issues:

+ Further, systems may want to guarantee message delivery to + the client(s) (for example, to prevent a network connection + from simply discarding messages). Some systems may want to + ensure clients never return out-of-date information, that is, + once a transaction commit returns success on the master, no + client will return old information to a read-only query. Some + of the following changes to a Base API application may be used + to address these issues: +

-
- Write the application's send - function to not return to Berkeley DB until one or more clients - have acknowledged receipt of the message. The number of - clients chosen will be dependent on the application: you will - want to consider likely network partitions (ensure that a - client at each physical site receives the message) and - geographical diversity (ensure that a client on each coast - receives the message). -
+
+ Write the application's send function + to not return to Berkeley DB until one or more clients have + acknowledged receipt of the message. The number of + clients chosen will be dependent on the application: + you will want to consider likely network partitions + (ensure that a client at each physical site receives + the message) and geographical diversity (ensure that a + client on each coast receives the message). +
- Write the client's message processing loop to not acknowledge - receipt of the message until a call to the DB_ENV->rep_process_message() method - has returned success. Messages resulting in a return of - DB_REP_NOTPERM from the DB_ENV->rep_process_message() method mean the message - could not be flushed to the client's disk. If the client does - not acknowledge receipt of such messages to the master until a - subsequent call to the DB_ENV->rep_process_message() method returns - DB_REP_ISPERM and the LSN returned is at least as large as - this message's LSN, then the master's send function will not return - success to the Berkeley DB library. This means the thread - committing the transaction on the master will not be allowed to - proceed based on the transaction having committed until the - selected set of clients have received the message and consider - it complete. -
+ Write the client's message processing loop to not + acknowledge receipt of the message until a call to the + DB_ENV->rep_process_message() method has returned success. Messages + resulting in a return of DB_REP_NOTPERM from the + DB_ENV->rep_process_message() method mean the message could not be + flushed to the client's disk. If the client does not + acknowledge receipt of such messages to the master + until a subsequent call to the DB_ENV->rep_process_message() method + returns DB_REP_ISPERM and the LSN returned is at + least as large as this message's LSN, then the + master's send + function will not return success to the Berkeley DB + library. This means the thread committing the + transaction on the master will not be allowed to + proceed based on the transaction having committed + until the selected set of clients have received the + message and consider it complete. +

- Alternatively, the client's message processing loop could - acknowledge the message to the master, but with an error code - indicating that the application's send function should not return to - the Berkeley DB library until a subsequent acknowledgement from - the same client indicates success. -
+ Alternatively, the client's message processing loop + could acknowledge the message to the master, but with + an error code indicating that the application's + send function + should not return to the Berkeley DB library until a + subsequent acknowledgement from the same client + indicates success. +

- The application send callback function invoked by Berkeley DB - contains an LSN of the record being sent (if appropriate for - that record). When DB_ENV->rep_process_message() method returns indicators that - a permanent record has been written then it also returns the - maximum LSN of the permanent record written. -
+ The application send callback function invoked by + Berkeley DB contains an LSN of the record being sent + (if appropriate for that record). When DB_ENV->rep_process_message() + method returns indicators that a permanent record has + been written then it also returns the maximum LSN of + the permanent record written. +

There is one final pair of failure scenarios to consider. First, it is -not possible to abort transactions after the application's send -function has been called, as the master may have already written the -commit log records to disk, and so abort is no longer an option. -Second, a related problem is that even though the master will attempt -to flush the local log if the send function returns failure, -that flush may fail (for example, when the local disk is full). Again, -the transaction cannot be aborted as one or more clients may have -committed the transaction even if send returns failure. Rare -applications may not be able to tolerate these unlikely failure modes. -In that case the application may want to:

+ There is one final pair of failure scenarios to consider. + First, it is not possible to abort transactions after the + application's send function + has been called, as the master may have already written the + commit log records to disk, and so abort is no longer an + option. Second, a related problem is that even though the + master will attempt to flush the local log if the send function returns failure, that + flush may fail (for example, when the local disk is full). + Again, the transaction cannot be aborted as one or more + clients may have committed the transaction even if send returns failure. Rare + applications may not be able to tolerate these unlikely + failure modes. In that case the application may want + to: +

-
- Configure the master to do always local synchronous commits - (turning off the DB_TXN_NOSYNC configuration). This will - decrease performance significantly, of course (one of the - reasons to use replication is to avoid local disk writes.) In - this configuration, failure to write the local log will cause - the transaction to abort in all cases. -
+
+ Configure the master to do always local synchronous + commits (turning off the DB_TXN_NOSYNC + configuration). This will decrease performance + significantly, of course (one of the reasons to use + replication is to avoid local disk writes.) In this + configuration, failure to write the local log will + cause the transaction to abort in all cases. +
-
- Do not return from the application's send function under any conditions, - until the selected set of clients has acknowledged the message. - Until the send function - returns to the Berkeley DB library, the thread committing the - transaction on the master will wait, and so no application will - be able to act on the knowledge that the transaction has - committed. -
+
+ Do not return from the application's send function under any + conditions, until the selected set of clients has + acknowledged the message. Until the send function returns to + the Berkeley DB library, the thread committing the + transaction on the master will wait, and so no + application will be able to act on the knowledge that + the transaction has committed. +