author | Ben Pfaff <blp@ovn.org> | 2017-12-31 21:15:58 -0800 |
---|---|---|
committer | Ben Pfaff <blp@ovn.org> | 2018-03-24 12:04:53 -0700 |
commit | 1b1d2e6daa563cc91f974ffdc082fb3a8b424801 (patch) | |
tree | 9cc5df01b7af35962d5f40d0ffd8882fb277e047 /Documentation/ref | |
parent | 53178986d7fc86bcfc2f297b547a97ee71a21bb7 (diff) | |
ovsdb: Introduce experimental support for clustered databases.
This commit adds support for OVSDB clustering via Raft. Please read
ovsdb(7) for information on how to set up a clustered database. It is
simple and boils down to running "ovsdb-tool create-cluster" on one server
and "ovsdb-tool join-cluster" on each of the others and then starting
ovsdb-server in the usual way on all of them.
Once you have a clustered database, you configure ovn-controller and
ovn-northd to use it by pointing them to all of the servers, e.g. where
previously you might have said "tcp:1.2.3.4" was the database server,
now you say that it is "tcp:1.2.3.4,tcp:5.6.7.8,tcp:9.10.11.12".
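
As an illustration of the comma-separated form above, here is a minimal Python sketch of how a client-side helper might split such a string into individual connection methods. The function name `split_remotes` is hypothetical and is not part of Open vSwitch. The handling of optional spaces and of an optional `cid:<uuid>` entry follows the ovsdb(7) text added by this commit; everything else is an assumption made for the example.

```python
def split_remotes(remote):
    """Split "tcp:1.2.3.4,tcp:5.6.7.8" into a list of connection methods.

    Hypothetical helper, not an Open vSwitch API.  Returns a pair
    (methods, cluster_id), where cluster_id is None unless the string
    contains a "cid:<uuid>" entry.
    """
    methods = []
    cluster_id = None
    for item in remote.split(","):
        item = item.strip()          # commas may be followed by spaces
        if not item:
            continue                 # ignore empty entries such as "a,,b"
        if item.startswith("cid:"):
            cluster_id = item[len("cid:"):]
        else:
            methods.append(item)
    return methods, cluster_id


if __name__ == "__main__":
    print(split_remotes("tcp:1.2.3.4,tcp:5.6.7.8,tcp:9.10.11.12"))
```

The sketch only separates the entries; it does not validate addresses or attempt a connection.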
This also adds support for database clustering to ovs-sandbox.
Acked-by: Justin Pettit <jpettit@ovn.org>
Tested-by: aginwala <aginwala@asu.edu>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Diffstat (limited to 'Documentation/ref')
-rw-r--r-- | Documentation/ref/ovsdb.5.rst | 207 |
-rw-r--r-- | Documentation/ref/ovsdb.7.rst | 229 |
2 files changed, 398 insertions, 38 deletions
diff --git a/Documentation/ref/ovsdb.5.rst b/Documentation/ref/ovsdb.5.rst index f3e50976b..da0ad7a49 100644 --- a/Documentation/ref/ovsdb.5.rst +++ b/Documentation/ref/ovsdb.5.rst @@ -30,9 +30,11 @@ ovsdb Description =========== -OVSDB, the Open vSwitch Database, is a database system whose network -protocol is specified by RFC 7047. The RFC does not specify an on-disk -storage format. This manpage documents the format used by Open vSwitch. +OVSDB, the Open vSwitch Database, is a database system whose network protocol +is specified by RFC 7047. The RFC does not specify an on-disk storage format. +The OVSDB implementation in Open vSwitch implements two storage formats: one +for standalone (and active-backup) databases, and the other for clustered +databases. This manpage documents both of these formats. Most users do not need to be concerned with this specification. Instead, to manipulate OVSDB files, refer to `ovsdb-tool(1)`. For an @@ -47,14 +49,16 @@ infer it. OVSDB files do not include the values of ephemeral columns. -Database files are text files encoded in UTF-8 with LF (U+000A) line ends, -organized as append-only series of records. Each record consists of 2 -lines of text. +Standalone and clustered database files share the common structure described +here. They are text files encoded in UTF-8 with LF (U+000A) line ends, +organized as append-only series of records. Each record consists of 2 lines of +text. -The first line in each record has the format ``OVSDB JSON`` *length* *hash*, -where *length* is a positive decimal integer and *hash* is a SHA-1 checksum -expressed as 40 hexadecimal digits. Words in the first line must be separated -by exactly one space. +The first line in each record has the format ``OVSDB <magic> <length> <hash>``, +where <magic> is ``JSON`` for standalone databases or ``CLUSTER`` for clustered +databases, <length> is a positive decimal integer, and <hash> is a SHA-1 +checksum expressed as 40 hexadecimal digits. 
Words in the first line must be +separated by exactly one space. The second line must be exactly *length* bytes long (including the LF) and its SHA-1 checksum (including the LF) must match *hash* exactly. The line's @@ -102,8 +106,7 @@ looking through a database log with ``ovsdb-tool show-log``: operations, OVSDB concatenates them into a single ``_comment`` member, separated by a new-line. - OVSDB only writes a ``_comment`` member if it would be - a nonempty string. + OVSDB only writes a ``_comment`` member if it would be a nonempty string. Each of these records also has one or more additional members, each of which maps from the name of a database table to a <table-txn>: @@ -123,3 +126,183 @@ maps from the name of a database table to a <table-txn>: default values for their types defined in RFC 7047 section 5.2.1; for modified rows, the OVSDB implementation omits columns whose values are unchanged. + +Clustered Format +---------------- + +The clustered format has the following additional notation: + +<uint64> + A JSON integer that represents a 64-bit unsigned integer. The OVS JSON + implementation only supports integers in the range -2**63 through 2**63-1, + so 64-bit unsigned integer values from 2**63 through 2**64-1 are expressed + as negative numbers. + +<address> + A JSON string that represents a network address to support clustering, in + the ``<protocol>:<ip>:<port>`` syntax described in ``ovsdb-tool(1)``. + +<servers> + A JSON object whose names are <raw-uuid>s that identify servers and + whose values are <address>es that specify those servers' addresses. + +<cluster-txn> + A JSON array with two elements: + + 1. The first element is either a <database-schema> or ``null``. A + <database-schema> element is always present in the first record of a + clustered database to indicate the database's initial schema. If it is + not ``null`` in a later record, it indicates a change of schema for the + database. + + 2. 
The second element is either a transaction record in the format + described under ``Standalone Format'' above, or ``null``. + + When a schema is present, the transaction record is relative to an empty + database. That is, a schema change effectively resets the database to + empty and the transaction record represents the full database contents. + This allows readers to be ignorant of the full semantics of schema change. + +The first record in a clustered database contains the following members, +all of which are required: + +``"server_id": <raw-uuid>`` + The server's own UUID, which must be unique within the cluster. + +``"local_address": <address>`` + The address on which the server listens for connections from other + servers in the cluster. + +``name": <id>`` + The database schema name. It is only important when a server is in the + process of joining a cluster: a server will only join a cluster if the + name matches. (If the database schema name were unique, then we would + not also need a cluster ID.) + +``"cluster_id": <raw-uuid>`` + The cluster's UUID. The all-zeros UUID is not a valid cluster ID. + +``"prev_term": <uint64>`` and ``"prev_index": <uint64>`` + The Raft term and index just before the beginning of the log. + +``"prev_servers": <servers>`` + The set of one or more servers in the cluster at index "prev_index" and + term "prev_term". It might not include this server, if it was not the + initial server in the cluster. + +``"prev_data": <json-value>`` and ``"prev_eid": <raw-uuid>`` + A snapshot of the data in the database at index "prev_index" and term + "prev_term", and the entry ID for that data. The snapshot must contain a + schema. + +The second and subsequent records, if present, in a clustered database +represent changes to the database, to the cluster state, or both. There are +several types of these records. 
The most important types of records directly +represent persistent state described in the Raft specification: + +Entry + A Raft log entry. + +Term + The start of a new term. + +Vote + The server's vote for a leader in the current term. + +The following additional types of records aid debugging and troubleshooting, +but they do not affect correctness. + +Leader + Identifies a newly elected leader for the current term. + +Commit Index + An update to the server's ``commit_index``. + +Note + A human-readable description of some event. + +The table below identifies the members that each type of record contains. +"yes" indicates that a member is required, "?" that it is optional, blank that +it is forbidden, and [1] that ``data`` and ``eid`` must be either both present +or both absent. + +============ ===== ==== ==== ====== ============ ==== +member Entry Term Vote Leader Commit Index Note +============ ===== ==== ==== ====== ============ ==== +comment ? ? ? ? ? ? +term yes yes yes yes +index yes +servers ? +data [1] +eid [1] +vote yes +leader yes +commit_index yes +note yes +============ ===== ==== ==== ====== ============ ==== + +The members are: + +``"comment": <string>`` + A human-readable string giving an administrator more information about + the reason a record was emitted. + +``"term": <uint64>`` + The term in which the activity occurred. + +``"index": <uint64>`` + The index of a log entry. + +``"servers": <servers>`` + Server configuration in a log entry. + +``"data": <json-value>`` + The data in a log entry. + +``"eid": <raw-uuid>`` + Entry ID in a log entry. + +``"vote": <raw-uuid>`` + The server ID for which this server voted. + +``"leader": <raw-uuid>`` + The server ID of the server. Emitted by both leaders and followers when a + leader is elected. + +``"commit_index": <uint64>`` + Updated ``commit_index`` value. + +``"note": <string>`` + One of a few special strings indicating important events. 
The currently + defined strings are: + + ``"transfer leadership"`` + This server transferred leadership to a different server (with details + included in ``comment``). + + ``"left"`` + This server finished leaving the cluster. (This lets subsequent + readers know that the server is not part of the cluster and should not + attempt to connect to it.) + +Joining a Cluster +~~~~~~~~~~~~~~~~~ + +In addition to general format for a clustered database, there is also a special +case for a database file created by ``ovsdb-tool join-cluster``. Such a file +contains exactly one record, which conveys the information passed to the +``join-cluster`` command. It has the following members: + +``"server_id": <raw-uuid>`` and ``"local_address": <address>`` and ``"name": <id>`` + These have the same semantics described above in the general description + of the format. + +``"cluster_id": <raw-uuid>`` + This is provided only if the user gave the ``--cid`` option to + ``join-cluster``. It has the same semantics described above. + +``"remote_addresses"; [<address>*]`` + One or more remote servers to contact for joining the cluster. + +When the server successfully joins the cluster, the database file is replaced +by one described in `Clustered Format`_. diff --git a/Documentation/ref/ovsdb.7.rst b/Documentation/ref/ovsdb.7.rst index 6adef7382..dc5745f8c 100644 --- a/Documentation/ref/ovsdb.7.rst +++ b/Documentation/ref/ovsdb.7.rst @@ -123,9 +123,13 @@ schema checksum from a schema or database file, respectively. Service Models ============== -OVSDB supports two service models for databases: **standalone**, and -**active-backup**. The service models provide different compromises -among consistency and availability. +OVSDB supports three service models for databases: **standalone**, +**active-backup**, and **clustered**. The service models provide different +compromises among consistency, availability, and partition tolerance. 
They +also differ in the number of servers required and in terms of performance. The +standalone and active-backup database service models share one on-disk format, +and clustered databases use a different format, but the OVSDB programs work +with both formats. ``ovsdb(5)`` documents these file formats. RFC 7047, which specifies the OVSDB protocol, does not mandate or specify any particular service model. @@ -147,6 +151,11 @@ To set up a standalone database, use ``ovsdb-tool create`` to create a database file, then run ``ovsdb-server`` to start the database service. +To configure a client, such as ``ovs-vswitchd`` or ``ovs-vsctl``, to use a +standalone database, configure the server to listen on a "connection method" +that the client can reach, then point the client to that connection method. +See `Connection Methods`_ below for information about connection methods. + Active-Backup Database Service Model ------------------------------------ @@ -189,10 +198,149 @@ for server pairs. Compared to a standalone server, the active-backup service model somewhat increases availability, at a risk of split-brain. It adds -generally insignificant performance overhead. +generally insignificant performance overhead. On the other hand, the +clustered service model, discussed below, requires at least 3 servers +and has greater performance overhead, but it avoids the need for +external management software and eliminates the possibility of +split-brain. Open vSwitch 2.6 introduced support for the active-backup service model. +Clustered Database Service Model +-------------------------------- + +A **clustered** database runs across 3 or 5 or more database servers (the +**cluster**) on different hosts. Servers in a cluster automatically +synchronize writes within the cluster. A 3-server cluster can remain available +in the face of at most 1 server failure; a 5-server cluster tolerates up to 2 +failures. 
Clusters larger than 5 servers will also work, with every 2 added +servers allowing the cluster to tolerate 1 more failure, but write performance +decreases. The number of servers should be odd: a 4- or 6-server cluster +cannot tolerate more failures than a 3- or 5-server cluster, respectively. + +To set up a clustered database, first initialize it on a single node by running +``ovsdb-tool create-cluster``, then start ``ovsdb-server``. Depending on its +arguments, the ``create-cluster`` command can create an empty database or copy +a standalone database's contents into the new database. + +To configure a client, such as ``ovn-controller`` or ``ovn-sbctl``, to use a +clustered database, first configure all of the servers to listen on a +connection method that the client can reach, then point the client to all of +the servers' connection methods, comma-separated. See `Connection Methods`_, +below, for more detail. + +Open vSwitch 2.9 introduced support for the clustered service model. + +How to Maintain a Clustered Database +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To add a server to a cluster, run ``ovsdb-tool join-cluster`` on the new server +and start ``ovsdb-server``. To remove a running server from a cluster, use +``ovs-appctl`` to invoke the ``cluster/leave`` command. When a server fails +and cannot be recovered, e.g. because its hard disk crashed, or to otherwise +remove a server that is down from a cluster, use ``ovs-appctl`` to invoke +``cluster/kick`` to make the remaining servers kick it out of the cluster. + +The above methods for adding and removing servers only work for healthy +clusters, that is, for clusters with no more failures than their maximum +tolerance. For example, in a 3-server cluster, the failure of 2 servers +prevents servers joining or leaving the cluster (as well as database access). 
+To prevent data loss or inconsistency, the preferred solution to this problem +is to bring up enough of the failed servers to make the cluster healthy again, +then if necessary remove any remaining failed servers and add new ones. If +this cannot be done, though, use ``ovs-appctl`` to invoke ``cluster/leave +--force`` on a running server. This command forces the server to which it is +directed to leave its cluster and form a new single-node cluster that contains +only itself. The data in the new cluster may be inconsistent with the former +cluster: transactions not yet replicated to the server will be lost, and +transactions not yet applied to the cluster may be committed. Afterward, any +servers in its former cluster will regard the server to have failed. + +The servers in a cluster synchronize data over a cluster management protocol +that is specific to Open vSwitch; it is not the same as the OVSDB protocol +specified in RFC 7047. For this purpose, a server in a cluster is tied to a +particular IP address and TCP port, which is specified in the ``ovsdb-tool`` +command that creates or joins the cluster. The TCP port used for clustering +must be different from that used for OVSDB clients. To change the port or +address of a server in a cluster, first remove it from the cluster, then add it +back with the new address. + +To upgrade the ``ovsdb-server`` processes in a cluster from one version of Open +vSwitch to another, upgrading them one at a time will keep the cluster healthy +during the upgrade process. (This is different from upgrading a database +schema, which is covered later under `Upgrading or Downgrading a Database`_.) + +Clustered OVSDB does not support the OVSDB "ephemeral columns" feature. +``ovsdb-tool`` and ``ovsdb-client`` change ephemeral columns into persistent +ones when they work with schemas for clustered databases. Future versions of +OVSDB might add support for this feature. 
+ +Understanding Cluster Consistency +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To ensure consistency, clustered OVSDB uses the Raft algorithm described in +Diego Ongaro's Ph.D. thesis, "Consensus: Bridging Theory and Practice". In an +operational Raft cluster, at any given time a single server is the "leader" and +the other nodes are "followers". Only the leader processes transactions, but a +transaction is only committed when a majority of the servers confirm to the +leader that they have written it to persistent storage. + +In most database systems, read and write access to the database happens through +transactions. In such a system, Raft allows a cluster to present a strongly +consistent transactional interface. OVSDB uses conventional transactions for +writes, but clients often effectively do reads a different way, by asking the +server to "monitor" a database or a subset of one on the client's behalf. +Whenever monitored data changes, the server automatically tells the client what +changed, which allows the client to maintain an accurate snapshot of the +database in its memory. Of course, at any given time, the snapshot may be +somewhat dated since some of it could have changed without the change +notification yet being received and processed by the client. + +Given this unconventional usage model, OVSDB also adopts an unconventional +clustering model. Each server in a cluster acts independently for the purpose +of monitors and read-only transactions, without verifying that data is +up-to-date with the leader. Servers forward transactions that write to the +database to the leader for execution, ensuring consistency. This has the +following consequences: + +* Transactions that involve writes, against any server in the cluster, are + linearizable if clients take care to use correct prerequisites, which is the + same condition required for linearizability in a standalone OVSDB. 
+ (Actually, "at-least-once" consistency, because OVSDB does not have a session + mechanism to drop duplicate transactions if a connection drops after the + server commits it but before the client receives the result.) + +* Read-only transactions can yield results based on a stale version of the + database, if they are executed against a follower. Transactions on the + leader always yield fresh results. (With monitors, as explained above, a + client can always see stale data even without clustering, so clustering does + not change the consistency model for monitors.) + +* Monitor-based (or read-heavy) workloads scale well across a cluster, because + clustering OVSDB adds no additional work or communication for reads and + monitors. + +* A write-heavy client should connect to the leader, to avoid the overhead of + followers forwarding transactions to the leader. + +* When a client conducts a mix of read and write transactions across more than + one server in a cluster, it can see inconsistent results because a read + transaction might read stale data whose updates have not yet propagated from + the leader. By default, ``ovn-sbctl`` and similar utilities connect to the + cluster leader to avoid this issue. + + The same might occur for transactions against a single follower except that + the OVSDB server ensures that the results of a write forwarded to the leader + by a given server are visible at that server before it replies to the + requesting client. + +* If a client uses a database on one server in a cluster, then another server + in the cluster (perhaps because the first server failed), the client could + observe stale data. Clustered OVSDB clients, however, can use a column in + the ``_Server`` database to detect that data on a server is older than data + that the client previously read. The OVSDB client library in Open vSwitch + uses this feature to avoid servers with stale data. 
+ Database Replication ==================== @@ -245,6 +393,18 @@ unix:<file> On Windows, connect to a local named pipe that is represented by a file created in the path <file> to mimic the behavior of a Unix domain socket. +<method1>,<method2>,...,<methodN> + For a clustered database service to be highly available, a client must be + able to connect to any of the servers in the cluster. To do so, specify + connection methods for each of the servers separated by commas (and + optional spaces). + + In theory, if machines go up and down and IP addresses change in the right + way, a client could talk to the wrong instance of a database. To avoid + this possibility, add ``cid:<uuid>`` to the list of methods, where <uuid> + is the cluster ID of the desired database cluster, as printed by + ``ovsdb-tool get-cid``. This feature is optional. + OVSDB supports the following passive connection methods: pssl:<port>[:<ip>] @@ -314,27 +474,42 @@ A more common backup strategy is to periodically take and store a snapshot. For the standalone and active-backup service models, making a copy of the database file, e.g. using ``cp``, effectively makes a snapshot, and because OVSDB database files are append-only, it works even if the database is being -modified when the snapshot takes place. +modified when the snapshot takes place. This approach does not work for +clustered databases. -Another way to make a backup is to use ``ovsdb-client backup``, which -connects to a running database server and outputs an atomic snapshot of its -schema and content, in the same format used for on-disk databases. +Another way to make a backup, which works with all OVSDB service models, is to +use ``ovsdb-client backup``, which connects to a running database server and +outputs an atomic snapshot of its schema and content, in the same format used +for standalone and active-backup databases. Multiple options are also available when the time comes to restore a database -from a backup. 
One option is to stop the database server or servers, overwrite -the database file with the backup (e.g. with ``cp``), and then restart the -servers. Another way is to use ``ovsdb-client restore``, which connects to a -running database server and replaces the data in one of its databases by a -provided snapshot. The advantage of ``ovsdb-client restore`` is that it causes -zero downtime for the database and its server. It has the downside that UUIDs -of rows in the restored database will differ from those in the snapshot, -because the OVSDB protocol does not allow clients to specify row UUIDs. +from a backup. For the standalone and active-backup service models, one option +is to stop the database server or servers, overwrite the database file with the +backup (e.g. with ``cp``), and then restart the servers. Another way, which +works with any service model, is to use ``ovsdb-client restore``, which +connects to a running database server and replaces the data in one of its +databases by a provided snapshot. The advantage of ``ovsdb-client restore`` is +that it causes zero downtime for the database and its server. It has the +downside that UUIDs of rows in the restored database will differ from those in +the snapshot, because the OVSDB protocol does not allow clients to specify row +UUIDs. None of these approaches saves and restores data in columns that the schema designates as ephemeral. This is by design: the designer of a schema only marks a column as ephemeral if it is acceptable for its data to be lost when a database server restarts. +Clustering and backup serve different purposes. Clustering increases +availability, but it does not protect against data loss if, for example, a +malicious or malfunctioning OVSDB client deletes or tampers with data. + +Changing Database Service Model +------------------------------- + +Use ``ovsdb-tool create-cluster`` to create a clustered database from the +contents of a standalone database. 
Use ``ovsdb-tool backup`` to create a +standalone database from the contents of a clustered database. + Upgrading or Downgrading a Database ----------------------------------- @@ -367,8 +542,8 @@ active-backup database, first stop the database server or servers, then use ``ovsdb-tool convert`` to convert it to the new schema, and then restart the database server. -OVSDB also supports online database schema conversion. -To convert a database online, use ``ovsdb-client convert``. +OVSDB also supports online database schema conversion for any of its database +service models. To convert a database online, use ``ovsdb-client convert``. The conversion is atomic, consistent, isolated, and durable. ``ovsdb-server`` disconnects any clients connected when the conversion takes place (except clients that use the ``set_db_change_aware`` Open vSwitch extension RPC). Upon @@ -405,9 +580,9 @@ First, ``ovsdb-tool compact`` can compact a standalone or active-backup database that is not currently being served by ``ovsdb-server`` (or otherwise locked for writing by another process). To compact any database that is currently being served by ``ovsdb-server``, use ``ovs-appctl`` to send the -``ovsdb-server/compact`` command. Each server in an active-backup database -maintains its database file independently, so to compact all of them, issue -this command separately on each server. +``ovsdb-server/compact`` command. Each server in an active-backup or clustered +database maintains its database file independently, so to compact all of them, +issue this command separately on each server. Viewing History --------------- @@ -421,8 +596,10 @@ client. The comments can be helpful for quickly understanding a transaction; for example, ``ovs-vsctl`` adds its command line to the transactions that it makes. -For active-backup databases, the sequence of transactions in each server's log -will differ, even at points when they reflect the same data. 
+The ``show-log`` command works with both OVSDB file formats, but the details of +the output format differ. For active-backup and clustered databases, the +sequence of transactions in each server's log will differ, even at points when +they reflect the same data. Truncating History ------------------ @@ -449,9 +626,9 @@ cryptography, it is acceptable for this purpose because it is not used to defend against malicious attackers. The first record in a standalone or active-backup database file specifies the -schema. ``ovsdb-server`` will refuse to work with a database whose first -record is corrupted. Delete and recreate such a database, or restore it from a -backup. +schema. ``ovsdb-server`` will refuse to work with a database where this record +is corrupted, or with a clustered database file with corruption in the first +few records. Delete and recreate such a database, or restore it from a backup. When ``ovsdb-server`` adds records to a database file in which it detected corruption, it first truncates the file just after the last good record. |
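
The record layout that the ovsdb(5) changes above describe (a header line ``OVSDB <magic> <length> <hash>`` followed by one line of JSON) can be checked with a short Python sketch. This is an illustration under stated assumptions, not the Open vSwitch implementation: the names `parse_record`, `make_record`, and `to_uint64` are hypothetical, the hash is compared as lowercase hexadecimal, and `to_uint64` assumes that the negative encoding of large `<uint64>` values is plain two's-complement wraparound, which the text implies but does not spell out.

```python
import hashlib
import json


def to_uint64(value):
    """Map a JSON integer in [-2**63, 2**63-1] to an unsigned 64-bit value.

    Assumption: negative numbers encode 2**63..2**64-1 by wraparound.
    """
    return value % 2**64


def parse_record(header, body):
    """Validate one two-line record (both given as bytes) and decode it.

    `body` must include its trailing LF, because both the length and
    the SHA-1 checksum in the header cover the LF.
    """
    words = header.rstrip(b"\n").split(b" ")   # exactly one space apart
    if len(words) != 4 or words[0] != b"OVSDB":
        raise ValueError("malformed header")
    magic, length, digest = words[1:]
    if magic not in (b"JSON", b"CLUSTER"):
        raise ValueError("unknown magic")
    if not length.isdigit() or int(length) <= 0:
        raise ValueError("length must be a positive decimal integer")
    if len(body) != int(length):
        raise ValueError("length mismatch")
    if hashlib.sha1(body).hexdigest() != digest.decode("ascii").lower():
        raise ValueError("SHA-1 mismatch")
    return json.loads(body)


def make_record(magic, obj):
    """Build a (header, body) pair in the same layout, for experiments."""
    body = json.dumps(obj).encode("utf-8") + b"\n"
    digest = hashlib.sha1(body).hexdigest()
    header = "OVSDB {} {} {}\n".format(magic, len(body), digest)
    return header.encode("ascii"), body


if __name__ == "__main__":
    header, body = make_record("CLUSTER", {"term": 1, "commit_index": 7})
    print(parse_record(header, body))
```

A reader built this way rejects a record whose body was altered after the header was written, which is the property that lets ``ovsdb-server`` detect corruption and truncate the file after the last good record.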