Chapter 4. Using Replication with the SQL API

Table of Contents

Replication Overview
Replication Masters
Elections
Durability Guarantees
Permanent Message Handling
Two-Site Replication Groups
Replication PRAGMAs
PRAGMA replication
PRAGMA replication_ack_policy
PRAGMA replication_ack_timeout
PRAGMA replication_get_master
PRAGMA replication_initial_master
PRAGMA replication_local_site
PRAGMA replication_num_sites
PRAGMA replication_perm_failed
PRAGMA replication_priority
PRAGMA replication_remote_site
PRAGMA replication_remove_site
PRAGMA replication_site_status
PRAGMA replication_verbose_output
PRAGMA replication_verbose_file
Displaying Replication Statistics
Replication Usage Examples
Example 1: Distributed Read at 3 Sites
Example 2: 2-Site Failover

The Berkeley DB SQL interface allows you to use Berkeley DB's replication feature. You configure and start replication using PRAGMAs that are specific to the task.

This chapter provides a high-level introduction to Berkeley DB replication. It then shows how to configure and use replication with the SQL API.

For a more detailed description of Berkeley DB replication, see the Getting Started with Replicated Berkeley DB Applications guide.

Replication Overview

Berkeley DB's replication feature allows you to automatically distribute your database write operations to one or more read-only replicas. For this reason, BDB's replication implementation is said to follow a single-master, multiple-replica strategy.

A single replication master and all of its replicas are referred to as a replication group. Each replication group can have one and only one master site.

When discussing Berkeley DB replication, we sometimes refer to replication sites. This is because most production applications place each of their replication participants on separate physical machines. In fact, each replication participant must be assigned a hostname/port pair that is unique within the replication group.

Note that under the hood, the unit of replication is the environment. That is, data is replicated from one Berkeley DB environment to one or more other Berkeley DB environments. However, when used with the BDB SQL interface, you can think of this as replicating between Berkeley DB databases, because the BDB SQL interface results in a single database file for each environment.

Replication Masters

Every replication group has one and only one master. The master site is where you perform write operations. These operations are then automatically replicated to the other sites in the replication group. Because the other replica sites in the replication group are read-only, it is an error for you to attempt to perform write operations on them.

The replication master is usually automatically selected by the replication group using elections. Replication elections simply determine which replication site has the most up-to-date copy of the data, and so is in the best position to serve as the master site.

Note that when you initially start up your BDB SQL replicated application, you must explicitly designate a specific site as the master. Over time, the master site can move from one environment to the next. For example, if the master site is shut down, becomes unavailable, or a network partition causes it to lose contact with the rest of the replication group, then the replication group will elect a new master if it can successfully hold an election. When the old master comes back online, it rejoins the replication group as a read-only replica site.

Also, if you are enabling replication for an existing database, then that database must be designated as the master. Doing this is required; otherwise the entire contents of the existing database might be deleted during the replication startup process.
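As a sketch of this startup sequence, the following statements use the PRAGMAs described later in this chapter to designate the local site as the initial master and start replication. The hostname/port pair shown is a placeholder; see the individual PRAGMA descriptions for the exact accepted syntax.

```sql
-- Run on the site that holds the (possibly pre-existing) database,
-- before starting replication at any other site.
-- "master-host:7000" is a placeholder hostname/port pair.
PRAGMA replication_local_site="master-host:7000";

-- Designate this site as the initial master. This is required on first
-- startup, and required when enabling replication for an existing
-- database so its contents are not deleted during replication startup.
PRAGMA replication_initial_master=ON;

-- Start replication in this environment.
PRAGMA replication=ON;
```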

Elections

A replication group selects the master site by holding an election. In simplistic terms, each participant in the replication group votes on who it believes has the most up-to-date version of the data that the replication group is managing. The site that receives the most votes becomes the master site, and all data write activity must occur there.

In order to hold an election, the replication group must have a quorum. In order to achieve a quorum, a simple majority of the sites must be available to select the master. That is, n/2 + 1 sites must be available, where n is the total number of replication group participants. For example, a group of five sites requires at least three available sites to hold an election. By requiring a simple majority, the replication group avoids the possibility of simultaneously running with two master sites due to a network partition.

If a replication group cannot select a master, then it can only be used in read-only mode.
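Once the group has elected a master, each site can discover the outcome using the status PRAGMAs described later in this chapter. A brief sketch:

```sql
-- Ask this environment which site it currently believes is the master.
PRAGMA replication_get_master;

-- Report this site's own view of its replication state (for example,
-- whether it is currently acting as the master or as a replica).
PRAGMA replication_site_status;
```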

Durability Guarantees

Durability is a term that means data modifications have met some pre-defined set of guarantees that the modifications will remain persistent across application run times. Usually, this means that there is some assurance that the data modification has been written to stable storage (that is, written to a hard drive).

For replicated BDB SQL applications, the durability guarantee is enhanced because data modifications are also replicated to those environments that are participating in the replication group. This ensures higher data durability than non-replicated applications by placing data in multiple environments that usually reside on separate physical machines.

Permanent Message Handling

Permanent messages are created by replication masters as a part of a transactional commit operation. When a replica receives a message that is marked as permanent, it knows that the message affects transactional integrity. Receipt of a permanent message means that the replica must send a message acknowledgment back to the master server because the master might be waiting for the acknowledgment before it considers the transaction commit to be complete.

Whether the master is actually waiting for message acknowledgement depends on the acknowledgement policy in effect for the replication group. Policies can range from NONE (the master will not wait for any acknowledgements before completing the transaction) to ALL (the master will wait for acknowledgements from all replicas before completing the transaction).

Acknowledgements are only sent back to the master once the replica has completed applying the message to its local environment. Therefore, the stronger your acknowledgement policy, the stronger your durability guarantee. On the other hand, the stronger your acknowledgement policy, the slower your application's write throughput will be.

In addition to setting an acknowledgement policy, you can also set an acknowledgement timeout. This time limit is set in microseconds and it represents the length of time the master will wait to satisfy its acknowledgement policy for each transaction commit. If the acknowledgement policy is not satisfied within this time limit, the transaction is still committed locally on the master, but it is not yet considered durable across the replication group. Your code should take whatever actions are appropriate for that transaction. If enough other sites are available to meet the acknowledgement policy, the transaction will become durable after more time has passed.

You set acknowledgement policies and acknowledgement timeouts using PRAGMAs. See PRAGMA replication_ack_policy and PRAGMA replication_ack_timeout. In addition, you can examine how frequently your transactions do not achieve durability within the acknowledgement timeout by using PRAGMA replication_perm_failed.
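As an illustration, a group might trade some write throughput for durability by requiring acknowledgements from a majority of replicas with a bounded wait. The "quorum" policy name below is an assumption (one of the policies between NONE and ALL); check the exact value syntax against the PRAGMA descriptions later in this chapter. The timeout value is in microseconds, per the discussion above.

```sql
-- Wait for acknowledgements from a majority of replicas before a
-- commit is considered durable across the group.
PRAGMA replication_ack_policy=quorum;

-- Wait at most 500 milliseconds (500,000 microseconds) per commit
-- for the acknowledgement policy to be satisfied.
PRAGMA replication_ack_timeout=500000;

-- Later, check how often commits failed to achieve durability
-- within the acknowledgement timeout.
PRAGMA replication_perm_failed;
```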

Two-Site Replication Groups

In a replication group that consists of exactly two sites, both sites must be available in order to achieve a quorum. Without a quorum, a new master site cannot be elected. This means that if the master site is unable to participate in the replication group, then the remaining read-only replica cannot become the master site.

In other words, if your replication group consists of exactly two sites and you lose the master site, the remaining site must operate in read-only mode until the master site becomes available again.
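One way to recover in this situation, sketched in more detail in Example 2: 2-Site Failover, is to explicitly promote the surviving site once you are certain the old master is truly down. The statements below are a sketch under that assumption; promoting a site while the old master is still running risks ending up with two masters.

```sql
-- On the surviving site, after confirming the old master is down,
-- explicitly designate this site as the new master and restart
-- replication.
PRAGMA replication_initial_master=ON;
PRAGMA replication=ON;
```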