From 780b92ada9afcf1d58085a83a0b9e6bc982203d1 Mon Sep 17 00:00:00 2001
From: Lorry Tar Creator It is important to consider replication in the context of the overall
-database environment's transactional guarantees. To briefly review,
-transactional guarantees in a non-replicated application are based on
-the writing of log file records to "stable storage", usually a disk
-drive. If the application or system then fails, the Berkeley DB logging
-information is reviewed during recovery, and the databases are updated
-so that all changes made as part of committed transactions appear, and
-all changes made as part of uncommitted transactions do not appear. In
-this case, no information will have been lost. If a database environment does not require the log be flushed to
-stable storage on transaction commit (using the DB_TXN_NOSYNC
-flag to increase performance at the cost of sacrificing transactional
-durability), Berkeley DB recovery will only be able to restore the system to
-the state of the last commit found on stable storage. In this case,
-information may have been lost (for example, the changes made by some
-committed transactions may not appear in the databases after recovery). Further, if there is database or log file loss or corruption (for
-example, if a disk drive fails), then catastrophic recovery is
-necessary, and Berkeley DB recovery will only be able to restore the system
-to the state of the last archived log file. In this case, information
-may also have been lost. Replicating the database environment extends this model, by adding a
-new component to "stable storage": the client's replicated information.
-If a database environment is replicated, there is no lost information
-in the case of database or log file loss, because the replicated system
-can be configured to contain a complete set of databases and log records
-up to the point of failure. A database environment that loses a disk
-drive can have the drive replaced, and it can then rejoin the
-replication group. Because of this new component of stable storage, specifying
-DB_TXN_NOSYNC in a replicated environment no longer sacrifices
-durability, as long as one or more clients have acknowledged receipt of
-the messages sent by the master. Since network connections are often
-faster than local synchronous disk writes, replication becomes a way
-for applications to significantly improve their performance as well as
-their reliability. The return status from the application's send function must be
-set by the application to ensure the transactional guarantees the
-application wants to provide. Whenever the send function
-returns failure, the local database environment's log is flushed as
-necessary to ensure that any information critical to database integrity
-is not lost. Because this flush is an expensive operation in terms of
-database performance, applications should avoid returning an error from
-the send function, if at all possible. The only interesting message type for replication transactional
-guarantees is when the application's send function was called
-with the DB_REP_PERMANENT flag specified. There is no reason
-for the send function to ever return failure unless the
-DB_REP_PERMANENT flag was specified -- messages without the
-DB_REP_PERMANENT flag do not make visible changes to databases,
-and the send function can return success to Berkeley DB as soon as
-the message has been sent to the client(s) or even just copied to local
-application memory in preparation for being sent. When a client receives a DB_REP_PERMANENT message, the client
-will flush its log to stable storage before returning (unless the client
-environment has been configured with the DB_TXN_NOSYNC option).
-If the client is unable to flush a complete transactional record to disk
-for any reason (for example, there is a missing log record before the
-flagged message), the call to the DB_ENV->rep_process_message() method on the client
-will return DB_REP_NOTPERM and return the LSN of this record
-to the application in the ret_lsnp parameter.
-The application's client or master
-message handling loops should take proper action to ensure the correct
-transactional guarantees in this case. When missing records arrive
-and allow subsequent processing of previously stored permanent
-records, the call to the DB_ENV->rep_process_message() method on the client will
-return DB_REP_ISPERM and return the largest LSN of the
-permanent records that were flushed to disk. Client applications
-can use these LSNs to know definitively if any particular LSN is
-permanently stored or not. An application relying on a client's ability to become a master and
-guarantee that no data has been lost will need to write the send
-function to return an error whenever it cannot guarantee the site that
-will win the next election has the record. Applications not requiring
-this level of transactional guarantees need not have the send
-function return failure (unless the master's database environment has
-been configured with DB_TXN_NOSYNC), as any information critical
-to database integrity has already been flushed to the local log before
-send was called. To sum up, the only reason for the send function to return
-failure is when the master database environment has been configured to
-not synchronously flush the log on transaction commit (that is,
-DB_TXN_NOSYNC was configured on the master), the
-DB_REP_PERMANENT flag is specified for the message, and the
-send function was unable to determine that some number of
-clients have received the current message (and all messages preceding
-the current message). How many clients need to receive the message
-before the send function can return success is an application
-choice (and may not depend as much on a specific number of clients
-reporting success as one or more geographically distributed clients). If, however, the application does require on-disk durability on the master,
-the master should be configured to synchronously flush the log on commit.
-If clients are not configured to synchronously flush the log,
-that is, if a client is running with DB_TXN_NOSYNC configured,
-then it is up to the application to reconfigure that client
-appropriately when it becomes a master. That is, the
-application must explicitly call DB_ENV->set_flags() to
-disable asynchronous log flushing as part of re-configuring
-the client as the new master. Of course, it is important to ensure that the replicated master and
-client environments are truly independent of each other. For example,
-it does not help matters that a client has acknowledged receipt of a
-message if both master and clients are on the same power supply, as the
-failure of the power supply will still potentially lose information. Configuring your replication-based application to achieve the proper
-mix of performance and transactional guarantees can be complex. In
-brief, there are a few controls an application can set to configure the
-guarantees it makes: specification of DB_TXN_NOSYNC for the
-master environment, specification of DB_TXN_NOSYNC for the
-client environment, the priorities of different sites participating in
-an election, and the behavior of the application's send
-function. Applications using Replication Manager are free to use
-DB_TXN_NOSYNC at the master and/or clients as they see fit. The
-behavior of the send function that Replication Manager provides
-on the application's behalf is determined by an "acknowledgement
-policy", which is configured by the DB_ENV->repmgr_set_ack_policy() method.
-Clients always send acknowledgements for DB_REP_PERMANENT
-messages (unless the acknowledgement policy in effect indicates that the
-master doesn't care about them). For a DB_REP_PERMANENT
-message, the master blocks the sending thread until either it receives
-the proper number of acknowledgements, or the DB_REP_ACK_TIMEOUT
-expires. In the case of timeout, Replication Manager returns an error
-code from the send function, causing Berkeley DB to flush the
-transaction log before returning to the application, as previously
-described. The default acknowledgement policy is
-DB_REPMGR_ACKS_QUORUM, which ensures that the effect of a
-permanent record remains durable following an election. First, it is rarely useful to write and synchronously flush the log when
-a transaction commits on a replication client. It may be useful where
-systems share resources and multiple systems commonly fail at the same
-time. By default, all Berkeley DB database environments, whether master or
-client, synchronously flush the log on transaction commit or prepare.
-Generally, replication masters and clients turn log flush off for
-transaction commit using the DB_TXN_NOSYNC flag. Consider two systems connected by a network interface. One acts as the
-master, the other as a read-only client. The client takes over as
-master if the master crashes and the master rejoins the replication
-group after such a failure. Both master and client are configured to
-not synchronously flush the log on transaction commit (that is,
-DB_TXN_NOSYNC was configured on both systems). The
-application's send function never returns failure to the Berkeley DB
-library, simply forwarding messages to the client (perhaps over a
-broadcast mechanism), and always returning success. On the client, any
-DB_REP_NOTPERM returns from the client's DB_ENV->rep_process_message() method
-are ignored, as well. This system configuration has excellent
-performance, but may lose data in some failure modes. If both the master and the client crash at once, it is possible to lose
-committed transactions, that is, transactional durability is not being
-maintained. Reliability can be increased by providing separate power
-supplies for the systems and placing them in separate physical locations. If the connection between the two machines fails (or just some number
-of messages are lost), and subsequently the master crashes, it is
-possible to lose committed transactions. Again, transactional
-durability is not being maintained. Reliability can be improved in a
-couple of ways:
+ It is important to consider replication in the context of
+ the overall database environment's transactional guarantees.
+ To briefly review, transactional guarantees in a
+ non-replicated application are based on the writing of log
+ file records to "stable storage", usually a disk drive. If the
+ application or system then fails, the Berkeley DB logging
+ information is reviewed during recovery, and the databases are
+ updated so that all changes made as part of committed
+ transactions appear, and all changes made as part of
+ uncommitted transactions do not appear. In this case, no
+ information will have been lost.
+
+ If a database environment does not require the log be
+ flushed to stable storage on transaction commit (using the
+ DB_TXN_NOSYNC flag to increase performance at the cost of
+ sacrificing transactional durability), Berkeley DB recovery
+ will only be able to restore the system to the state of the
+ last commit found on stable storage. In this case, information
+ may have been lost (for example, the changes made by some
+ committed transactions may not appear in the databases after
+ recovery).
+
+ Further, if there is database or log file loss or corruption
+ (for example, if a disk drive fails), then catastrophic
+ recovery is necessary, and Berkeley DB recovery will only be
+ able to restore the system to the state of the last archived
+ log file. In this case, information may also have been
+ lost.
+
+ Replicating the database environment extends this model, by
+ adding a new component to "stable storage": the client's
+ replicated information. If a database environment is
+ replicated, there is no lost information in the case of
+ database or log file loss, because the replicated system can
+ be configured to contain a complete set of databases and log
+ records up to the point of failure. A database environment
+ that loses a disk drive can have the drive replaced, and it
+ can then rejoin the replication group.
+
+ Because of this new component of stable storage, specifying
+ DB_TXN_NOSYNC in a replicated environment no longer
+ sacrifices durability, as long as one or more clients have
+ acknowledged receipt of the messages sent by the master. Since
+ network connections are often faster than local synchronous
+ disk writes, replication becomes a way for applications to
+ significantly improve their performance as well as their
+ reliability.
+
+ Applications using Replication Manager are free to use
+ DB_TXN_NOSYNC at the master and/or clients as they see fit.
+ The behavior of the send
+ function that Replication Manager provides on the
+ application's behalf is determined by an "acknowledgement
+ policy", which is configured by the DB_ENV->repmgr_set_ack_policy() method.
+ Clients always send acknowledgements for DB_REP_PERMANENT
+ messages (unless the acknowledgement policy in effect
+ indicates that the master doesn't care about them). For a
+ DB_REP_PERMANENT message, the master blocks the sending
+ thread until either it receives the proper number of
+ acknowledgements, or the DB_REP_ACK_TIMEOUT expires. In the
+ case of timeout, Replication Manager returns an error code
+ from the send function,
+ causing Berkeley DB to flush the transaction log before
+ returning to the application, as previously described. The
+ default acknowledgement policy is DB_REPMGR_ACKS_QUORUM,
+ which ensures that the effect of a permanent record remains
+ durable following an election.
+
+ The rest of this section discusses transactional guarantee
+ considerations in Base API applications.
+
+ The return status from the application's send function must be set by the
+ application to ensure the transactional guarantees the
+ application wants to provide. Whenever the send function returns failure, the
+ local database environment's log is flushed as necessary to
+ ensure that any information critical to database integrity is
+ not lost. Because this flush is an expensive operation in
+ terms of database performance, applications should avoid
+ returning an error from the send function,
+ if at all possible.
+
+ The only interesting message type for replication
+ transactional guarantees is when the application's send function was called with the
+ DB_REP_PERMANENT flag specified. There is no reason for the
+ send function to ever
+ return failure unless the DB_REP_PERMANENT flag was
+ specified -- messages without the DB_REP_PERMANENT flag do
+ not make visible changes to databases, and the send function can return success to
+ Berkeley DB as soon as the message has been sent to the
+ client(s) or even just copied to local application memory in
+ preparation for being sent.
+
+ When a client receives a DB_REP_PERMANENT message, the
+ client will flush its log to stable storage before returning
+ (unless the client environment has been configured with the
+ DB_TXN_NOSYNC option). If the client is unable to flush a
+ complete transactional record to disk for any reason (for
+ example, there is a missing log record before the flagged
+ message), the call to the DB_ENV->rep_process_message() method on the client
+ will return DB_REP_NOTPERM and return the LSN of this record
+ to the application in the ret_lsnp parameter.
+ The application's client or master message handling loops should take proper
+ action to ensure the correct transactional guarantees in this case. When
+ missing records arrive and allow subsequent processing of
+ previously stored permanent records, the call to the
+ DB_ENV->rep_process_message() method on the client will return DB_REP_ISPERM
+ and return the largest LSN of the permanent records that were
+ flushed to disk. Client applications can use these LSNs to
+ know definitively if any particular LSN is permanently stored
+ or not.
+
+ An application relying on a client's ability to become a
+ master and guarantee that no data has been lost will need to
+ write the send function to
+ return an error whenever it cannot guarantee the site that
+ will win the next election has the record. Applications not
+ requiring this level of transactional guarantees need not have
+ the send function return
+ failure (unless the master's database environment has been
+ configured with DB_TXN_NOSYNC), as any information critical
+ to database integrity has already been flushed to the local
+ log before send was
+ called.
+
+ To sum up, the only reason for the send function
+ to return failure is when the
+ master database environment has been configured to not
+ synchronously flush the log on transaction commit (that is,
+ DB_TXN_NOSYNC was configured on the master), the
+ DB_REP_PERMANENT flag is specified for the message, and the
+ send function was unable
+ to determine that some number of clients have received the
+ current message (and all messages preceding the current
+ message). How many clients need to receive the message before
+ the send function can return
+ success is an application choice (and may not depend as much
+ on a specific number of clients reporting success as one or
+ more geographically distributed clients).
+
+ If, however, the application does require on-disk durability
+ on the master, the master should be configured to
+ synchronously flush the log on commit. If clients are not
+ configured to synchronously flush the log, that is, if a
+ client is running with DB_TXN_NOSYNC configured, then it is
+ up to the application to reconfigure that client appropriately
+ when it becomes a master. That is, the application must
+ explicitly call DB_ENV->set_flags() to disable asynchronous log
+ flushing as part of re-configuring the client as the new
+ master.
+
+ Of course, it is important to ensure that the replicated
+ master and client environments are truly independent of each
+ other. For example, it does not help matters that a client has
+ acknowledged receipt of a message if both master and clients
+ are on the same power supply, as the failure of the power
+ supply will still potentially lose information.
+
+ Configuring a Base API application to achieve the proper mix
+ of performance and transactional guarantees can be complex. In
+ brief, there are a few controls an application can set to
+ configure the guarantees it makes: specification of
+ DB_TXN_NOSYNC for the master environment, specification of
+ DB_TXN_NOSYNC for the client environment, the priorities of
+ different sites participating in an election, and the behavior
+ of the application's send
+ function.
+
+ First, it is rarely useful to write and synchronously flush
+ the log when a transaction commits on a replication client. It
+ may be useful where systems share resources and multiple
+ systems commonly fail at the same time. By default, all
+ Berkeley DB database environments, whether master or client,
+ synchronously flush the log on transaction commit or prepare.
+ Generally, replication masters and clients turn log flush off
+ for transaction commit using the DB_TXN_NOSYNC flag.
+
+ Consider two systems connected by a network interface. One
+ acts as the master, the other as a read-only client. The
+ client takes over as master if the master crashes and the
+ master rejoins the replication group after such a failure.
+ Both master and client are configured to not synchronously
+ flush the log on transaction commit (that is, DB_TXN_NOSYNC
+ was configured on both systems). The application's send function never returns failure
+ to the Berkeley DB library, simply forwarding messages to the
+ client (perhaps over a broadcast mechanism), and always
+ returning success. On the client, any DB_REP_NOTPERM returns
+ from the client's DB_ENV->rep_process_message() method are ignored, as well.
+ This system configuration has excellent performance, but may
+ lose data in some failure modes.
+
+ If both the master and the client crash at once, it is
+ possible to lose committed transactions, that is,
+ transactional durability is not being maintained. Reliability
+ can be increased by providing separate power supplies for the
+ systems and placing them in separate physical
+ locations.
+
+ If the connection between the two machines fails (or just
+ some number of messages are lost), and subsequently the master
+ crashes, it is possible to lose committed transactions. Again,
+ transactional durability is not being maintained. Reliability
+ can be improved in a couple of ways:
+
- Use a reliable network protocol (for example, TCP/IP instead of UDP).
-
- Increase the number of clients and network paths to make it
- less likely that a message will be lost. In this case, it is
- important to also make sure a client that did receive the
- message wins any subsequent election. If a client that did not
- receive the message wins a subsequent election, data can still
- be lost.
- Further, systems may want to guarantee message delivery to the client(s)
-(for example, to prevent a network connection from simply discarding
-messages). Some systems may want to ensure clients never return
-out-of-date information, that is, once a transaction commit returns
-success on the master, no client will return old information to a
-read-only query. Some of the following changes to a Base API application
-may be used to address these issues:
+ Further, systems may want to guarantee message delivery to
+ the client(s) (for example, to prevent a network connection
+ from simply discarding messages). Some systems may want to
+ ensure clients never return out-of-date information, that is,
+ once a transaction commit returns success on the master, no
+ client will return old information to a read-only query. Some
+ of the following changes to a Base API application may be used
+ to address these issues:
+
- Write the application's send
- function to not return to Berkeley DB until one or more clients
- have acknowledged receipt of the message. The number of
- clients chosen will be dependent on the application: you will
- want to consider likely network partitions (ensure that a
- client at each physical site receives the message) and
- geographical diversity (ensure that a client on each coast
- receives the message).
-
+ Write the application's send function
+ to not return to Berkeley DB until one or more clients have
+ acknowledged receipt of the message. The number of
+ clients chosen will be dependent on the application:
+ you will want to consider likely network partitions
+ (ensure that a client at each physical site receives
+ the message) and geographical diversity (ensure that a
+ client on each coast receives the message).
+
- Write the client's message processing loop to not acknowledge
- receipt of the message until a call to the DB_ENV->rep_process_message() method
- has returned success. Messages resulting in a return of
- DB_REP_NOTPERM from the DB_ENV->rep_process_message() method mean the message
- could not be flushed to the client's disk. If the client does
- not acknowledge receipt of such messages to the master until a
- subsequent call to the DB_ENV->rep_process_message() method returns
- DB_REP_ISPERM and the LSN returned is at least as large as
- this message's LSN, then the master's send function will not return
- success to the Berkeley DB library. This means the thread
- committing the transaction on the master will not be allowed to
- proceed based on the transaction having committed until the
- selected set of clients have received the message and consider
- it complete.
-
- Alternatively, the client's message processing loop could
- acknowledge the message to the master, but with an error code
- indicating that the application's send function should not return to
- the Berkeley DB library until a subsequent acknowledgement from
- the same client indicates success.
-
- The application send callback function invoked by Berkeley DB
- contains an LSN of the record being sent (if appropriate for
- that record). When DB_ENV->rep_process_message() method returns indicators that
- a permanent record has been written then it also returns the
- maximum LSN of the permanent record written.
- There is one final pair of failure scenarios to consider. First, it is
-not possible to abort transactions after the application's send
-function has been called, as the master may have already written the
-commit log records to disk, and so abort is no longer an option.
-Second, a related problem is that even though the master will attempt
-to flush the local log if the send function returns failure,
-that flush may fail (for example, when the local disk is full). Again,
-the transaction cannot be aborted as one or more clients may have
-committed the transaction even if send returns failure. Rare
-applications may not be able to tolerate these unlikely failure modes.
-In that case the application may want to:
+ There is one final pair of failure scenarios to consider.
+ First, it is not possible to abort transactions after the
+ application's send function
+ has been called, as the master may have already written the
+ commit log records to disk, and so abort is no longer an
+ option. Second, a related problem is that even though the
+ master will attempt to flush the local log if the send function returns failure, that
+ flush may fail (for example, when the local disk is full).
+ Again, the transaction cannot be aborted as one or more
+ clients may have committed the transaction even if send returns failure. Rare
+ applications may not be able to tolerate these unlikely
+ failure modes. In that case the application may want
+ to:
+
- Configure the master to do always local synchronous commits
- (turning off the DB_TXN_NOSYNC configuration). This will
- decrease performance significantly, of course (one of the
- reasons to use replication is to avoid local disk writes.) In
- this configuration, failure to write the local log will cause
- the transaction to abort in all cases.
-
+ Configure the master to do always local synchronous
+ commits (turning off the DB_TXN_NOSYNC
+ configuration). This will decrease performance
+ significantly, of course (one of the reasons to use
+ replication is to avoid local disk writes.) In this
+ configuration, failure to write the local log will
+ cause the transaction to abort in all cases.
+
- Do not return from the application's send function under any conditions,
- until the selected set of clients has acknowledged the message.
- Until the send function
- returns to the Berkeley DB library, the thread committing the
- transaction on the master will wait, and so no application will
- be able to act on the knowledge that the transaction has
- committed.
-
+ Do not return from the application's send function under any
+ conditions, until the selected set of clients has
+ acknowledged the message. Until the send function returns to
+ the Berkeley DB library, the thread committing the
+ transaction on the master will wait, and so no
+ application will be able to act on the knowledge that
+ the transaction has committed.
+