author: Ben Pfaff <blp@ovn.org> 2017-12-31 21:15:58 -0800
committer: Ben Pfaff <blp@ovn.org> 2018-03-24 12:04:53 -0700
commit: 1b1d2e6daa563cc91f974ffdc082fb3a8b424801 (patch)
tree: 9cc5df01b7af35962d5f40d0ffd8882fb277e047 /Documentation/ref
parent: 53178986d7fc86bcfc2f297b547a97ee71a21bb7 (diff)
ovsdb: Introduce experimental support for clustered databases.
This commit adds support for OVSDB clustering via Raft. Please read ovsdb(7) for information on how to set up a clustered database. It is simple and boils down to running "ovsdb-tool create-cluster" on one server and "ovsdb-tool join-cluster" on each of the others and then starting ovsdb-server in the usual way on all of them.

Once you have a clustered database, you configure ovn-controller and ovn-northd to use it by pointing them to all of the servers, e.g. where previously you might have said "tcp:1.2.3.4" was the database server, now you say that it is "tcp:1.2.3.4,tcp:5.6.7.8,tcp:9.10.11.12".

This also adds support for database clustering to ovs-sandbox.

Acked-by: Justin Pettit <jpettit@ovn.org>
Tested-by: aginwala <aginwala@asu.edu>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Diffstat (limited to 'Documentation/ref')
-rw-r--r-- Documentation/ref/ovsdb.5.rst 207
-rw-r--r-- Documentation/ref/ovsdb.7.rst 229
2 files changed, 398 insertions, 38 deletions
diff --git a/Documentation/ref/ovsdb.5.rst b/Documentation/ref/ovsdb.5.rst
index f3e50976b..da0ad7a49 100644
--- a/Documentation/ref/ovsdb.5.rst
+++ b/Documentation/ref/ovsdb.5.rst
@@ -30,9 +30,11 @@ ovsdb
Description
===========
-OVSDB, the Open vSwitch Database, is a database system whose network
-protocol is specified by RFC 7047. The RFC does not specify an on-disk
-storage format. This manpage documents the format used by Open vSwitch.
+OVSDB, the Open vSwitch Database, is a database system whose network protocol
+is specified by RFC 7047. The RFC does not specify an on-disk storage format.
+The OVSDB implementation in Open vSwitch implements two storage formats: one
+for standalone (and active-backup) databases, and the other for clustered
+databases. This manpage documents both of these formats.
Most users do not need to be concerned with this specification. Instead,
to manipulate OVSDB files, refer to `ovsdb-tool(1)`. For an
@@ -47,14 +49,16 @@ infer it.
OVSDB files do not include the values of ephemeral columns.
-Database files are text files encoded in UTF-8 with LF (U+000A) line ends,
-organized as append-only series of records. Each record consists of 2
-lines of text.
+Standalone and clustered database files share the common structure described
+here. They are text files encoded in UTF-8 with LF (U+000A) line ends,
+organized as append-only series of records. Each record consists of 2 lines of
+text.
-The first line in each record has the format ``OVSDB JSON`` *length* *hash*,
-where *length* is a positive decimal integer and *hash* is a SHA-1 checksum
-expressed as 40 hexadecimal digits. Words in the first line must be separated
-by exactly one space.
+The first line in each record has the format ``OVSDB <magic> <length> <hash>``,
+where <magic> is ``JSON`` for standalone databases or ``CLUSTER`` for clustered
+databases, <length> is a positive decimal integer, and <hash> is a SHA-1
+checksum expressed as 40 hexadecimal digits. Words in the first line must be
+separated by exactly one space.
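As an illustration, the header rule above, together with the length and SHA-1 rule that the unchanged text states next for the second line, can be checked with a few lines of Python. This is a minimal sketch under those stated rules, not the parser used by Open vSwitch; the function name ``check_record`` is hypothetical.

```python
import hashlib
import string


def check_record(header: bytes, payload: bytes) -> str:
    """Validate one two-line record; return its magic, "JSON" or "CLUSTER".

    Hypothetical helper for illustration, not part of Open vSwitch.
    Both arguments include their trailing LF.
    """
    if not header.endswith(b"\n"):
        raise ValueError("header line must end with LF")
    # Splitting on a single space rejects doubled spaces, because they
    # produce empty words and change the word count.
    words = header[:-1].decode("ascii").split(" ")
    if len(words) != 4 or words[0] != "OVSDB":
        raise ValueError("header must be: OVSDB <magic> <length> <hash>")
    _, magic, length, digest = words
    if magic not in ("JSON", "CLUSTER"):
        raise ValueError("unknown magic")
    if not length.isdigit() or int(length) == 0:
        raise ValueError("length must be a positive decimal integer")
    if len(digest) != 40 or not set(digest) <= set(string.hexdigits):
        raise ValueError("hash must be 40 hexadecimal digits")
    # The second line is measured and hashed with its LF included.
    if len(payload) != int(length):
        raise ValueError("second line has the wrong length")
    if hashlib.sha1(payload).hexdigest() != digest.lower():
        raise ValueError("SHA-1 mismatch")
    return magic
```

The function raises ``ValueError`` for any violation, so a caller reading a log can stop at the first bad record.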
The second line must be exactly *length* bytes long (including the LF) and its
SHA-1 checksum (including the LF) must match *hash* exactly. The line's
@@ -102,8 +106,7 @@ looking through a database log with ``ovsdb-tool show-log``:
operations, OVSDB concatenates them into a single ``_comment`` member,
separated by a new-line.
- OVSDB only writes a ``_comment`` member if it would be
- a nonempty string.
+ OVSDB only writes a ``_comment`` member if it would be a nonempty string.
Each of these records also has one or more additional members, each of which
maps from the name of a database table to a <table-txn>:
@@ -123,3 +126,183 @@ maps from the name of a database table to a <table-txn>:
default values for their types defined in RFC 7047 section 5.2.1; for
modified rows, the OVSDB implementation omits columns whose values are
unchanged.
+
+Clustered Format
+----------------
+
+The clustered format has the following additional notation:
+
+<uint64>
+ A JSON integer that represents a 64-bit unsigned integer. The OVS JSON
+ implementation only supports integers in the range -2**63 through 2**63-1,
+ so 64-bit unsigned integer values from 2**63 through 2**64-1 are expressed
+ as negative numbers.
+
+<address>
+ A JSON string that represents a network address to support clustering, in
+ the ``<protocol>:<ip>:<port>`` syntax described in ``ovsdb-tool(1)``.
+
+<servers>
+ A JSON object whose names are <raw-uuid>s that identify servers and
+ whose values are <address>es that specify those servers' addresses.
+
+<cluster-txn>
+ A JSON array with two elements:
+
+ 1. The first element is either a <database-schema> or ``null``. A
+ <database-schema> element is always present in the first record of a
+ clustered database to indicate the database's initial schema. If it is
+ not ``null`` in a later record, it indicates a change of schema for the
+ database.
+
+ 2. The second element is either a transaction record in the format
+ described under "Standalone Format" above, or ``null``.
+
+ When a schema is present, the transaction record is relative to an empty
+ database. That is, a schema change effectively resets the database to
+ empty and the transaction record represents the full database contents.
+ This allows readers to be ignorant of the full semantics of schema change.
+
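The <uint64> notation above can be sketched as a pair of conversions. This assumes that "expressed as negative numbers" means the usual two's-complement reinterpretation of the 64-bit value, which the text does not spell out; both function names are hypothetical.

```python
def uint64_to_json_int(value: int) -> int:
    """Map a 64-bit unsigned value into the signed range stored in JSON.

    Assumption: values from 2**63 through 2**64-1 wrap to negative numbers
    as in two's complement.
    """
    if not 0 <= value < 2**64:
        raise ValueError("not a 64-bit unsigned integer")
    return value - 2**64 if value >= 2**63 else value


def json_int_to_uint64(value: int) -> int:
    """Recover the 64-bit unsigned value from its stored signed form."""
    if not -2**63 <= value < 2**63:
        raise ValueError("outside the range -2**63 through 2**63-1")
    return value + 2**64 if value < 0 else value
```

Under this assumption the largest unsigned value, 2**64-1, is stored as -1, and every value survives a round trip through both functions.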
+The first record in a clustered database contains the following members,
+all of which are required:
+
+``"server_id": <raw-uuid>``
+ The server's own UUID, which must be unique within the cluster.
+
+``"local_address": <address>``
+ The address on which the server listens for connections from other
+ servers in the cluster.
+
+``"name": <id>``
+ The database schema name. It is only important when a server is in the
+ process of joining a cluster: a server will only join a cluster if the
+ name matches. (If the database schema name were unique, then we would
+ not also need a cluster ID.)
+
+``"cluster_id": <raw-uuid>``
+ The cluster's UUID. The all-zeros UUID is not a valid cluster ID.
+
+``"prev_term": <uint64>`` and ``"prev_index": <uint64>``
+ The Raft term and index just before the beginning of the log.
+
+``"prev_servers": <servers>``
+ The set of one or more servers in the cluster at index "prev_index" and
+ term "prev_term". It might not include this server, if it was not the
+ initial server in the cluster.
+
+``"prev_data": <json-value>`` and ``"prev_eid": <raw-uuid>``
+ A snapshot of the data in the database at index "prev_index" and term
+ "prev_term", and the entry ID for that data. The snapshot must contain a
+ schema.
+
+The second and subsequent records, if present, in a clustered database
+represent changes to the database, to the cluster state, or both. There are
+several types of these records. The most important types of records directly
+represent persistent state described in the Raft specification:
+
+Entry
+ A Raft log entry.
+
+Term
+ The start of a new term.
+
+Vote
+ The server's vote for a leader in the current term.
+
+The following additional types of records aid debugging and troubleshooting,
+but they do not affect correctness.
+
+Leader
+ Identifies a newly elected leader for the current term.
+
+Commit Index
+ An update to the server's ``commit_index``.
+
+Note
+ A human-readable description of some event.
+
+The table below identifies the members that each type of record contains.
+"yes" indicates that a member is required, "?" that it is optional, blank that
+it is forbidden, and [1] that ``data`` and ``eid`` must be either both present
+or both absent.
+
+============ ===== ==== ==== ====== ============ ====
+member       Entry Term Vote Leader Commit Index Note
+============ ===== ==== ==== ====== ============ ====
+comment      ?     ?    ?    ?      ?            ?
+term         yes   yes  yes  yes
+index        yes
+servers      ?
+data         [1]
+eid          [1]
+vote                    yes
+leader                       yes
+commit_index                        yes
+note                                             yes
+============ ===== ==== ==== ====== ============ ====
+
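One way to read the table is as a set of required and permitted members per record type. The sketch below classifies a decoded record on that basis. It is an illustration only, not how ``ovsdb-server`` parses its log, and it assumes that the four ``yes`` marks in the ``term`` row belong to the Entry, Term, Vote, and Leader columns. The name ``classify_record`` is hypothetical.

```python
# Members that only one record type carries, per the table above.
_DISTINGUISHING = [
    ("note", "Note"),
    ("commit_index", "Commit Index"),
    ("leader", "Leader"),
    ("vote", "Vote"),
    ("index", "Entry"),
]

# "yes" cells in the table.
_REQUIRED = {
    "Entry": {"term", "index"},
    "Term": {"term"},
    "Vote": {"term", "vote"},
    "Leader": {"term", "leader"},
    "Commit Index": {"commit_index"},
    "Note": {"note"},
}

# "?" and "[1]" cells in the table.
_OPTIONAL = {
    "Entry": {"comment", "servers", "data", "eid"},
    "Term": {"comment"},
    "Vote": {"comment"},
    "Leader": {"comment"},
    "Commit Index": {"comment"},
    "Note": {"comment"},
}


def classify_record(record: dict) -> str:
    """Return the record type, or raise ValueError if the members do not fit."""
    # A record with none of the distinguishing members can only be a Term.
    kind = next((k for m, k in _DISTINGUISHING if m in record), "Term")
    members = set(record)
    if not _REQUIRED[kind] <= members:
        raise ValueError(f"{kind} record lacks a required member")
    if not members <= _REQUIRED[kind] | _OPTIONAL[kind]:
        raise ValueError(f"{kind} record has a forbidden member")
    # Footnote [1]: data and eid are both present or both absent.
    if ("data" in members) != ("eid" in members):
        raise ValueError("data and eid must appear together")
    return kind
```

Blank cells in the table become forbidden members here, so a record that mixes, say, ``vote`` and ``index`` is rejected instead of being guessed at.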
+The members are:
+
+``"comment": <string>``
+ A human-readable string giving an administrator more information about
+ the reason a record was emitted.
+
+``"term": <uint64>``
+ The term in which the activity occurred.
+
+``"index": <uint64>``
+ The index of a log entry.
+
+``"servers": <servers>``
+ Server configuration in a log entry.
+
+``"data": <json-value>``
+ The data in a log entry.
+
+``"eid": <raw-uuid>``
+ Entry ID in a log entry.
+
+``"vote": <raw-uuid>``
+ The server ID for which this server voted.
+
+``"leader": <raw-uuid>``
+ The server ID of the newly elected leader. Emitted by both leaders and
+ followers when a leader is elected.
+
+``"commit_index": <uint64>``
+ Updated ``commit_index`` value.
+
+``"note": <string>``
+ One of a few special strings indicating important events. The currently
+ defined strings are:
+
+ ``"transfer leadership"``
+ This server transferred leadership to a different server (with details
+ included in ``comment``).
+
+ ``"left"``
+ This server finished leaving the cluster. (This lets subsequent
+ readers know that the server is not part of the cluster and should not
+ attempt to connect to it.)
+
+Joining a Cluster
+~~~~~~~~~~~~~~~~~
+
+In addition to the general format for a clustered database, there is also a
+special case for a database file created by ``ovsdb-tool join-cluster``. Such
+a file contains exactly one record, which conveys the information passed to
+the ``join-cluster`` command. It has the following members:
+
+``"server_id": <raw-uuid>`` and ``"local_address": <address>`` and ``"name": <id>``
+ These have the same semantics described above in the general description
+ of the format.
+
+``"cluster_id": <raw-uuid>``
+ This is provided only if the user gave the ``--cid`` option to
+ ``join-cluster``. It has the same semantics described above.
+
+``"remote_addresses": [<address>*]``
+ One or more remote servers to contact for joining the cluster.
+
+When the server successfully joins the cluster, the database file is replaced
+by one described in `Clustered Format`_.
diff --git a/Documentation/ref/ovsdb.7.rst b/Documentation/ref/ovsdb.7.rst
index 6adef7382..dc5745f8c 100644
--- a/Documentation/ref/ovsdb.7.rst
+++ b/Documentation/ref/ovsdb.7.rst
@@ -123,9 +123,13 @@ schema checksum from a schema or database file, respectively.
Service Models
==============
-OVSDB supports two service models for databases: **standalone**, and
-**active-backup**. The service models provide different compromises
-among consistency and availability.
+OVSDB supports three service models for databases: **standalone**,
+**active-backup**, and **clustered**. The service models provide different
+compromises among consistency, availability, and partition tolerance. They
+also differ in the number of servers required and in terms of performance. The
+standalone and active-backup database service models share one on-disk format,
+and clustered databases use a different format, but the OVSDB programs work
+with both formats. ``ovsdb(5)`` documents these file formats.
RFC 7047, which specifies the OVSDB protocol, does not mandate or specify
any particular service model.
@@ -147,6 +151,11 @@ To set up a standalone database, use ``ovsdb-tool create`` to
create a database file, then run ``ovsdb-server`` to start the
database service.
+To configure a client, such as ``ovs-vswitchd`` or ``ovs-vsctl``, to use a
+standalone database, configure the server to listen on a "connection method"
+that the client can reach, then point the client to that connection method.
+See `Connection Methods`_ below for information about connection methods.
+
Active-Backup Database Service Model
------------------------------------
@@ -189,10 +198,149 @@ for server pairs.
Compared to a standalone server, the active-backup service model
somewhat increases availability, at a risk of split-brain. It adds
-generally insignificant performance overhead.
+generally insignificant performance overhead. On the other hand, the
+clustered service model, discussed below, requires at least 3 servers
+and has greater performance overhead, but it avoids the need for
+external management software and eliminates the possibility of
+split-brain.
Open vSwitch 2.6 introduced support for the active-backup service model.
+Clustered Database Service Model
+--------------------------------
+
+A **clustered** database runs across 3 or 5 or more database servers (the
+**cluster**) on different hosts. Servers in a cluster automatically
+synchronize writes within the cluster. A 3-server cluster can remain available
+in the face of at most 1 server failure; a 5-server cluster tolerates up to 2
+failures. Clusters larger than 5 servers will also work, with every 2 added
+servers allowing the cluster to tolerate 1 more failure, but write performance
+decreases. The number of servers should be odd: a 4- or 6-server cluster
+cannot tolerate more failures than a 3- or 5-server cluster, respectively.
+
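The failure counts in the paragraph above follow from Raft's requirement that a majority of servers stay reachable. A minimal sketch of that arithmetic, with hypothetical function names:

```python
def majority(servers: int) -> int:
    """Smallest number of servers that forms a majority of the cluster."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """Servers that may fail while a majority remains available."""
    return servers - majority(servers)
```

For 3, 4, 5, and 6 servers this gives 1, 1, 2, and 2 tolerated failures, which is why an even-sized cluster adds cost without adding resilience.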
+To set up a clustered database, first initialize it on a single node by running
+``ovsdb-tool create-cluster``, then start ``ovsdb-server``. Depending on its
+arguments, the ``create-cluster`` command can create an empty database or copy
+a standalone database's contents into the new database.
+
+To configure a client, such as ``ovn-controller`` or ``ovn-sbctl``, to use a
+clustered database, first configure all of the servers to listen on a
+connection method that the client can reach, then point the client to all of
+the servers' connection methods, comma-separated. See `Connection Methods`_,
+below, for more detail.
+
+Open vSwitch 2.9 introduced support for the clustered service model.
+
+How to Maintain a Clustered Database
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To add a server to a cluster, run ``ovsdb-tool join-cluster`` on the new server
+and start ``ovsdb-server``. To remove a running server from a cluster, use
+``ovs-appctl`` to invoke the ``cluster/leave`` command. When a server fails
+and cannot be recovered, e.g. because its hard disk crashed, or to otherwise
+remove a server that is down from a cluster, use ``ovs-appctl`` to invoke
+``cluster/kick`` to make the remaining servers kick it out of the cluster.
+
+The above methods for adding and removing servers only work for healthy
+clusters, that is, for clusters with no more failures than their maximum
+tolerance. For example, in a 3-server cluster, the failure of 2 servers
+prevents servers joining or leaving the cluster (as well as database access).
+To prevent data loss or inconsistency, the preferred solution to this problem
+is to bring up enough of the failed servers to make the cluster healthy again,
+then if necessary remove any remaining failed servers and add new ones. If
+this cannot be done, though, use ``ovs-appctl`` to invoke ``cluster/leave
+--force`` on a running server. This command forces the server to which it is
+directed to leave its cluster and form a new single-node cluster that contains
+only itself. The data in the new cluster may be inconsistent with the former
+cluster: transactions not yet replicated to the server will be lost, and
+transactions not yet applied to the cluster may be committed. Afterward, any
+servers in its former cluster will regard the server as having failed.
+
+The servers in a cluster synchronize data over a cluster management protocol
+that is specific to Open vSwitch; it is not the same as the OVSDB protocol
+specified in RFC 7047. For this purpose, a server in a cluster is tied to a
+particular IP address and TCP port, which is specified in the ``ovsdb-tool``
+command that creates or joins the cluster. The TCP port used for clustering
+must be different from that used for OVSDB clients. To change the port or
+address of a server in a cluster, first remove it from the cluster, then add it
+back with the new address.
+
+To upgrade the ``ovsdb-server`` processes in a cluster from one version of Open
+vSwitch to another, upgrading them one at a time will keep the cluster healthy
+during the upgrade process. (This is different from upgrading a database
+schema, which is covered later under `Upgrading or Downgrading a Database`_.)
+
+Clustered OVSDB does not support the OVSDB "ephemeral columns" feature.
+``ovsdb-tool`` and ``ovsdb-client`` change ephemeral columns into persistent
+ones when they work with schemas for clustered databases. Future versions of
+OVSDB might add support for this feature.
+
+Understanding Cluster Consistency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To ensure consistency, clustered OVSDB uses the Raft algorithm described in
+Diego Ongaro's Ph.D. thesis, "Consensus: Bridging Theory and Practice". In an
+operational Raft cluster, at any given time a single server is the "leader" and
+the other nodes are "followers". Only the leader processes transactions, but a
+transaction is only committed when a majority of the servers confirm to the
+leader that they have written it to persistent storage.
+
+In most database systems, read and write access to the database happens through
+transactions. In such a system, Raft allows a cluster to present a strongly
+consistent transactional interface. OVSDB uses conventional transactions for
+writes, but clients often effectively do reads a different way, by asking the
+server to "monitor" a database or a subset of one on the client's behalf.
+Whenever monitored data changes, the server automatically tells the client what
+changed, which allows the client to maintain an accurate snapshot of the
+database in its memory. Of course, at any given time, the snapshot may be
+somewhat dated since some of it could have changed without the change
+notification yet being received and processed by the client.
+
+Given this unconventional usage model, OVSDB also adopts an unconventional
+clustering model. Each server in a cluster acts independently for the purpose
+of monitors and read-only transactions, without verifying that data is
+up-to-date with the leader. Servers forward transactions that write to the
+database to the leader for execution, ensuring consistency. This has the
+following consequences:
+
+* Transactions that involve writes, against any server in the cluster, are
+ linearizable if clients take care to use correct prerequisites, which is the
+ same condition required for linearizability in a standalone OVSDB.
+ (Strictly speaking, this is "at-least-once" consistency, because OVSDB has no
+ session mechanism to drop a duplicate transaction when a connection drops
+ after the server commits the transaction but before the client receives the
+ result.)
+
+* Read-only transactions can yield results based on a stale version of the
+ database, if they are executed against a follower. Transactions on the
+ leader always yield fresh results. (With monitors, as explained above, a
+ client can always see stale data even without clustering, so clustering does
+ not change the consistency model for monitors.)
+
+* Monitor-based (or read-heavy) workloads scale well across a cluster, because
+ clustering OVSDB adds no additional work or communication for reads and
+ monitors.
+
+* A write-heavy client should connect to the leader, to avoid the overhead of
+ followers forwarding transactions to the leader.
+
+* When a client conducts a mix of read and write transactions across more than
+ one server in a cluster, it can see inconsistent results because a read
+ transaction might read stale data whose updates have not yet propagated from
+ the leader. By default, ``ovn-sbctl`` and similar utilities connect to the
+ cluster leader to avoid this issue.
+
+ The same might occur for transactions against a single follower except that
+ the OVSDB server ensures that the results of a write forwarded to the leader
+ by a given server are visible at that server before it replies to the
+ requesting client.
+
+* If a client uses a database on one server in a cluster, then another server
+ in the cluster (perhaps because the first server failed), the client could
+ observe stale data. Clustered OVSDB clients, however, can use a column in
+ the ``_Server`` database to detect that data on a server is older than data
+ that the client previously read. The OVSDB client library in Open vSwitch
+ uses this feature to avoid servers with stale data.
+
Database Replication
====================
@@ -245,6 +393,18 @@ unix:<file>
On Windows, connect to a local named pipe that is represented by a file
created in the path <file> to mimic the behavior of a Unix domain socket.
+<method1>,<method2>,...,<methodN>
+ For a clustered database service to be highly available, a client must be
+ able to connect to any of the servers in the cluster. To do so, specify
+ connection methods for each of the servers separated by commas (and
+ optional spaces).
+
+ In theory, if machines go up and down and IP addresses change in the right
+ way, a client could talk to the wrong instance of a database. To avoid
+ this possibility, add ``cid:<uuid>`` to the list of methods, where <uuid>
+ is the cluster ID of the desired database cluster, as printed by
+ ``ovsdb-tool get-cid``. This feature is optional.
+
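The comma-separated form described above is plain string assembly. The sketch below builds such a remote from a list of per-server connection methods and an optional cluster ID. The helper name ``cluster_remote``, the port number in the usage note, and the choice to place ``cid:<uuid>`` last are assumptions made for illustration; the text only says that the ``cid`` entry is added to the list.

```python
import uuid


def cluster_remote(methods, cluster_id=None) -> str:
    """Join per-server connection methods into one clustered remote string.

    Hypothetical helper: the position of the cid entry is an assumption.
    """
    parts = [m.strip() for m in methods]
    if not parts or not all(parts):
        raise ValueError("need at least one non-empty connection method")
    if cluster_id is not None:
        # uuid.UUID rejects malformed IDs and normalizes the text form.
        parts.append(f"cid:{uuid.UUID(str(cluster_id))}")
    return ",".join(parts)
```

For example, ``cluster_remote(["tcp:1.2.3.4:6641", "tcp:5.6.7.8:6641", "tcp:9.10.11.12:6641"])`` yields a single string that a client can use in place of one server's connection method.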
OVSDB supports the following passive connection methods:
pssl:<port>[:<ip>]
@@ -314,27 +474,42 @@ A more common backup strategy is to periodically take and store a snapshot.
For the standalone and active-backup service models, making a copy of the
database file, e.g. using ``cp``, effectively makes a snapshot, and because
OVSDB database files are append-only, it works even if the database is being
-modified when the snapshot takes place.
+modified when the snapshot takes place. This approach does not work for
+clustered databases.
-Another way to make a backup is to use ``ovsdb-client backup``, which
-connects to a running database server and outputs an atomic snapshot of its
-schema and content, in the same format used for on-disk databases.
+Another way to make a backup, which works with all OVSDB service models, is to
+use ``ovsdb-client backup``, which connects to a running database server and
+outputs an atomic snapshot of its schema and content, in the same format used
+for standalone and active-backup databases.
Multiple options are also available when the time comes to restore a database
-from a backup. One option is to stop the database server or servers, overwrite
-the database file with the backup (e.g. with ``cp``), and then restart the
-servers. Another way is to use ``ovsdb-client restore``, which connects to a
-running database server and replaces the data in one of its databases by a
-provided snapshot. The advantage of ``ovsdb-client restore`` is that it causes
-zero downtime for the database and its server. It has the downside that UUIDs
-of rows in the restored database will differ from those in the snapshot,
-because the OVSDB protocol does not allow clients to specify row UUIDs.
+from a backup. For the standalone and active-backup service models, one option
+is to stop the database server or servers, overwrite the database file with the
+backup (e.g. with ``cp``), and then restart the servers. Another way, which
+works with any service model, is to use ``ovsdb-client restore``, which
+connects to a running database server and replaces the data in one of its
+databases by a provided snapshot. The advantage of ``ovsdb-client restore`` is
+that it causes zero downtime for the database and its server. It has the
+downside that UUIDs of rows in the restored database will differ from those in
+the snapshot, because the OVSDB protocol does not allow clients to specify row
+UUIDs.
None of these approaches saves and restores data in columns that the schema
designates as ephemeral. This is by design: the designer of a schema only
marks a column as ephemeral if it is acceptable for its data to be lost
when a database server restarts.
+Clustering and backup serve different purposes. Clustering increases
+availability, but it does not protect against data loss if, for example, a
+malicious or malfunctioning OVSDB client deletes or tampers with data.
+
+Changing Database Service Model
+-------------------------------
+
+Use ``ovsdb-tool create-cluster`` to create a clustered database from the
+contents of a standalone database. Use ``ovsdb-client backup`` to create a
+standalone database from the contents of a clustered database.
+
Upgrading or Downgrading a Database
-----------------------------------
@@ -367,8 +542,8 @@ active-backup database, first stop the database server or servers, then use
``ovsdb-tool convert`` to convert it to the new schema, and then restart the
database server.
-OVSDB also supports online database schema conversion.
-To convert a database online, use ``ovsdb-client convert``.
+OVSDB also supports online database schema conversion for any of its database
+service models. To convert a database online, use ``ovsdb-client convert``.
The conversion is atomic, consistent, isolated, and durable. ``ovsdb-server``
disconnects any clients connected when the conversion takes place (except
clients that use the ``set_db_change_aware`` Open vSwitch extension RPC). Upon
@@ -405,9 +580,9 @@ First, ``ovsdb-tool compact`` can compact a standalone or active-backup
database that is not currently being served by ``ovsdb-server`` (or otherwise
locked for writing by another process). To compact any database that is
currently being served by ``ovsdb-server``, use ``ovs-appctl`` to send the
-``ovsdb-server/compact`` command. Each server in an active-backup database
-maintains its database file independently, so to compact all of them, issue
-this command separately on each server.
+``ovsdb-server/compact`` command. Each server in an active-backup or clustered
+database maintains its database file independently, so to compact all of them,
+issue this command separately on each server.
Viewing History
---------------
@@ -421,8 +596,10 @@ client. The comments can be helpful for quickly understanding a transaction;
for example, ``ovs-vsctl`` adds its command line to the transactions that it
makes.
-For active-backup databases, the sequence of transactions in each server's log
-will differ, even at points when they reflect the same data.
+The ``show-log`` command works with both OVSDB file formats, but the details of
+the output format differ. For active-backup and clustered databases, the
+sequence of transactions in each server's log will differ, even at points when
+they reflect the same data.
Truncating History
------------------
@@ -449,9 +626,9 @@ cryptography, it is acceptable for this purpose because it is not used to
defend against malicious attackers.
The first record in a standalone or active-backup database file specifies the
-schema. ``ovsdb-server`` will refuse to work with a database whose first
-record is corrupted. Delete and recreate such a database, or restore it from a
-backup.
+schema. ``ovsdb-server`` will refuse to work with a database where this record
+is corrupted, or with a clustered database file with corruption in the first
+few records. Delete and recreate such a database, or restore it from a backup.
When ``ovsdb-server`` adds records to a database file in which it detected
corruption, it first truncates the file just after the last good record.