Diffstat (limited to 'ovn')
-rw-r--r-- | ovn/OVN-GW-HA.rst          | 426
-rw-r--r-- | ovn/automake.mk            |   3
-rw-r--r-- | ovn/controller/pinctrl.c   |   3
-rw-r--r-- | ovn/ovn-architecture.7.xml |   4
4 files changed, 4 insertions, 432 deletions

diff --git a/ovn/OVN-GW-HA.rst b/ovn/OVN-GW-HA.rst
deleted file mode 100644
index 5b21b6469..000000000
--- a/ovn/OVN-GW-HA.rst
+++ /dev/null
@@ -1,426 +0,0 @@
..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      =======  Heading 0 (reserved for the title in a document)
      -------  Heading 1
      ~~~~~~~  Heading 2
      +++++++  Heading 3
      '''''''  Heading 4

      Avoid deeper levels because they do not render well.

==================================
OVN Gateway High Availability Plan
==================================

::

              OVN Gateway

    +---------------------------+
    |                           |
    |      External Network     |
    |                           |
    +-------------^-------------+
                  |
                  |
            +-----------+
            |           |
            |  Gateway  |
            |           |
            +-----------+
                  ^
                  |
                  |
    +-------------v-------------+
    |                           |
    |    OVN Virtual Network    |
    |                           |
    +---------------------------+

The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd) and the legacy physical network. In
a naive implementation, the gateway is a single x86 server or hardware VTEP.
For most deployments a single system has enough forwarding capacity to service
the entire virtualized network; however, it introduces a single point of
failure. If this system dies, the entire OVN deployment becomes unavailable.
To mitigate this risk, an HA solution is critical -- by spreading
responsibility across multiple systems, no single server failure can take down
the network.

An HA solution is both critical to the manageability of the system and
extremely difficult to get right. The purpose of this document is to propose
a plan for OVN Gateway High Availability which takes into account our past
experience building similar systems. It should be considered a fluid,
changing proposal, not a set-in-stone decree.

Basic Architecture
------------------

In an OVN deployment, the set of hypervisors and network elements operating
under the guidance of ovn-northd are in what's called "logical space". These
servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
the underlying physical network. When these systems need to communicate with
legacy networks, traffic must be routed through a gateway which translates
OVN-controlled tunnel traffic into raw physical network traffic.

Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it. This makes it a critical single point of failure -- if the
gateway dies, communication with the WAN ceases for all systems in logical
space.

To mitigate this risk, multiple gateways should be run in a "High Availability
Cluster" or "HA Cluster". The HA cluster is responsible for performing the
duties of a gateway while being able to recover gracefully from individual
member failures.

::

                OVN Gateway HA Cluster

             +---------------------------+
             |                           |
             |      External Network     |
             |                           |
             +-------------^-------------+
                           |
                           |
    +----------------------v----------------------+
    |                                              |
    |          High Availability Cluster           |
    |                                              |
    | +-----------+  +-----------+  +-----------+ |
    | |           |  |           |  |           | |
    | |  Gateway  |  |  Gateway  |  |  Gateway  | |
    | |           |  |           |  |           | |
    | +-----------+  +-----------+  +-----------+ |
    +----------------------^----------------------+
                           |
                           |
             +-------------v-------------+
             |                           |
             |    OVN Virtual Network    |
             |                           |
             +---------------------------+

L2 vs L3 High Availability
~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to achieve this goal, there are two broad approaches one can take.
The HA cluster can appear to the network like a giant Layer 2 Ethernet switch,
or like a giant IP router. These approaches are called L2HA and L3HA,
respectively. L2HA allows Ethernet broadcast domains to extend into logical
space, a significant advantage, but this comes at a cost. The need to avoid
transient L2 loops during failover significantly complicates their design. On
the other hand, L3HA works for most use cases, is simpler, and fails more
gracefully. For these reasons, it is suggested that OVN support an L3HA
model, leaving L2HA for future work (or third-party VTEP providers). Both
models are discussed further below.

L3HA
----

In this section, we'll work through a basic, simple L3HA implementation, on
top of which we'll gradually build more sophisticated features, explaining
their motivations and implementations as we go.

Naive active-backup
~~~~~~~~~~~~~~~~~~~

Let's assume that there is a collection of logical routers which a tenant has
asked for. Our task is to schedule these logical routers on one of N
gateways, and to gracefully redistribute the routers from gateways which have
failed. The absolute simplest way to achieve this is what we'll call
"naive active-backup".

::

           Naive Active Backup HA Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |   A B C        |   |                |
    |                |   |                |
    +----+-+-+-+----++   +-+--------------+
         ^ ^ ^ ^    |      |
         | | | |    |      |
         | | | |  +-+------+---+
         + + + +  | ovn-northd |
          Traffic +------------+

In naive active-backup, one of the gateways is chosen (arbitrarily) as a
leader. All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it. ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.

This approach basically works in most cases and should likely be the starting
point for OVN -- it's strictly better than no HA solution and is a good
foundation for more sophisticated solutions. That said, it's not without its
limitations. Specifically, this approach doesn't coordinate with the physical
network to minimize disruption during failures, it tightly couples failover to
ovn-northd (we'll discuss why this is bad in a bit), and it wastes resources
by leaving backup gateways completely unutilized.
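
As a rough illustration, the rescheduling step amounts to something like the
sketch below. This is a Python sketch, not OVN code: the ``is_alive()``
callback stands in for the OpenFlow echo-request monitoring described above,
and all of the names are hypothetical::

    def fail_over(placement, gateways, is_alive):
        """Reschedule every logical router onto one healthy gateway.

        'placement' maps router -> gateway, 'gateways' is ordered by
        preference, and 'is_alive' wraps whatever health check the
        controller uses (e.g. OpenFlow echo requests)."""
        live = [gw for gw in gateways if is_alive(gw)]
        if not live:
            return placement          # nothing healthy; keep the old placement
        leader = live[0]              # first healthy gateway becomes the leader
        return {router: leader for router in placement}

Everything here hinges on a single central controller noticing the failure and
recomputing state, which is exactly the coupling the following sections work
to remove.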

Router Failover
+++++++++++++++

When ovn-northd notices the leader has died and decides to migrate routers to
a backup gateway, the physical network has to be notified to direct traffic to
the new gateway. Otherwise, traffic could be blackholed for longer than
necessary, making failovers worse than they need to be.

For now, let's assume that OVN requires all gateways to be on the same IP
subnet on the physical network. If this isn't the case, gateways would need
to participate in routing protocols to orchestrate failovers, something which
is difficult and out of scope of this document.

Since all gateways are on the same IP subnet, we simply need to worry about
updating the MAC learning tables of the Ethernet switches on that subnet.
Presumably, they all have entries for each logical router pointing to the old
leader. If these entries aren't updated, all traffic will be sent to the (now
defunct) old leader instead of the new one.

In order to mitigate this issue, it's recommended that the new gateway send a
Reverse ARP (RARP) onto the physical network for each logical router it now
controls. A Reverse ARP is a benign protocol used by many hypervisors to
update L2 forwarding tables when virtual machines migrate. In this case, the
Ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address. This causes the
RARP to travel to every L2 switch in the broadcast domain, updating forwarding
tables accordingly. This strategy is recommended in all failover mechanisms
discussed in this document -- when a router newly boots on a new leader, it
should RARP its MAC address.
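
For illustration only, such an announcement can be generated with a few lines
of code. The sketch below is Python for Linux (it needs root), it is not part
of OVN, and the interface name and MAC address are placeholders; a real
gateway implementation would build the equivalent frame internally::

    import socket
    import struct

    def rarp_announce(ifname, mac):
        """Broadcast a RARP "request reverse" sourced from 'mac' so physical
        switches update their MAC learning tables to point at this sender."""
        mac = bytes.fromhex(mac.replace(":", ""))
        ether = b"\xff" * 6 + mac + struct.pack("!H", 0x8035)  # dst, src, RARP
        # htype=Ethernet, ptype=IPv4, hlen=6, plen=4, opcode=3 (request reverse)
        rarp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
        rarp += mac + b"\x00" * 4 + mac + b"\x00" * 4          # sha, spa, tha, tpa
        sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
        sock.bind((ifname, 0))
        sock.send(ether + rarp)
        sock.close()

    rarp_announce("eth0", "00:00:00:00:10:01")   # placeholder interface and MAC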

Controller Independent Active-backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

       Controller Independent Active-Backup Implementation

    +----------------+   +----------------+
    |     Leader     |   |     Backup     |
    |                |   |                |
    |   A B C        |   |                |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The fundamental problem with naive active-backup is that it tightly couples
the failover solution to ovn-northd. This can significantly increase downtime
in the event of a failover, as the (often already busy) ovn-northd controller
has to recompute state for the new leader. Worse, if ovn-northd goes down, we
can't perform gateway failover at all. This violates the principle that
control plane outages should have no impact on dataplane functionality.

In a controller independent active-backup configuration, ovn-northd is
responsible for initial configuration while the HA cluster is responsible for
monitoring the leader and failing over to a backup if necessary. ovn-northd
sets HA policy, but doesn't actively participate when failovers occur.

Of course, in this model, ovn-northd is not without some responsibility. Its
role is to pre-plan what should happen in the event of a failure, leaving it
to the individual switches to execute this plan. It does this by assigning
each gateway a unique leadership priority. Once assigned, it communicates this
priority to each node it controls. Nodes use the leadership priority to
determine which gateway in the cluster is the active leader by using a simple
metric: the leader is the healthy gateway with the highest priority. If that
gateway goes down, leadership falls to the next highest priority, and
conversely, if a new gateway comes up with a higher priority, it takes over
leadership.

Thus, in this model, leadership of the HA cluster is determined simply by the
status of its members. Therefore, if we can communicate the status of each
gateway to each transport node, they can individually figure out which is the
leader and direct traffic accordingly.

Tunnel Monitoring
+++++++++++++++++

Since in this model leadership is determined exclusively by the health status
of member gateways, a key problem is how to communicate this information to
the relevant transport nodes. Luckily, we can do this fairly cheaply using
tunnel monitoring protocols like BFD.

The basic idea is pretty straightforward. Each transport node maintains a
tunnel to every gateway in the HA cluster (not just the leader). These
tunnels are monitored using the BFD protocol to see which are alive. Given
this information, hypervisors can trivially compute the highest priority live
gateway, and thus the leader.

In practice, this leadership computation can be performed trivially using the
bundle or group action. Rather than using OpenFlow to simply output to the
leader, all gateways could be listed in an active-backup bundle action,
ordered by their priority. The bundle action will automatically take the
tunnel monitoring status into account and output the packet to the highest
priority live gateway.
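
Put another way, every node independently evaluates the same tiny selection
rule over its own view of tunnel liveness. As a sketch (Python, with
hypothetical ``priority`` and ``tunnel_up`` tables, not OVN code), the rule is
just::

    def choose_leader(gateways, priority, tunnel_up):
        """Return the highest-priority gateway whose tunnel is up, or None.

        'priority' maps gateway -> leadership priority assigned up front;
        'tunnel_up' maps gateway -> BFD status as seen from this node."""
        live = [gw for gw in gateways if tunnel_up[gw]]
        return max(live, key=lambda gw: priority[gw], default=None)

Because every node applies the same rule to (eventually) the same liveness
information, they converge on the same leader without consulting a central
controller.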
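
For reference, the bundle action mentioned above can be exercised directly
with ovs-ofctl. The flow below only illustrates the action's syntax -- the
bridge name, priority, and tunnel OpenFlow ports 4, 8, and 12 are made up, and
this is not literally how ovn-controller programs a hypervisor. With the
``active_backup`` algorithm, the action outputs to the first live port in the
listed order, so listing the gateway tunnels in leadership-priority order
gives exactly the behavior described above::

    $ ovs-ofctl add-flow br-int \
        "priority=10,actions=bundle(eth_src,0,active_backup,ofport,slaves:4,8,12)"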

Inter-Gateway Monitoring
++++++++++++++++++++++++

One somewhat subtle aspect of this model is that failovers are not globally
atomic. When a failover occurs, it will take some time for all hypervisors to
notice and adjust accordingly. Similarly, if a new high priority gateway comes
up, it may take some time for all hypervisors to switch over to the new
leader. In order to avoid confusing the physical network, under these
circumstances it's important for the backup gateways to drop traffic they've
received erroneously. In order to do this, each gateway must know whether or
not it is, in fact, active. This can be achieved by creating a mesh of tunnels
between gateways. Each gateway monitors the other gateways in its cluster to
determine which are alive, and therefore whether or not it happens to be the
leader. If leading, the gateway forwards traffic normally; otherwise it drops
all traffic.

Gateway Leadership Resignation
++++++++++++++++++++++++++++++

Sometimes a gateway may be healthy but still not be suitable to lead the HA
cluster. This could happen for several reasons, including:

* The physical network is unreachable.

* BFD (or ping) has detected that the next hop router is unreachable.

* The gateway recently booted and isn't fully configured.

In this case, the gateway should resign leadership by holding its tunnels down
using the ``other_config:cpath_down`` flag. This indicates to participating
hypervisors and gateways that this gateway should be treated as if it's down,
even though its tunnels are still healthy.

Router Specific Active-Backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

              Router Specific Active-Backup

    +----------------+   +----------------+
    |                |   |                |
    |   A C          |   |   B D E        |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes the
backup gateways. In an ideal scenario, all traffic would split evenly among
the live set of gateways. Getting all the way there is somewhat tricky, but
as a step in that direction, one could use the "Router Specific Active-Backup"
algorithm. This algorithm looks a lot like active-backup on a per logical
router basis, with one twist. It chooses a different active gateway for each
logical router. Thus, in situations where there are several logical routers,
all with somewhat balanced load, this algorithm performs better.

Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup. On a per logical router basis,
the algorithm is the same: leadership is determined by the liveness of the
gateways. The key difference here is that the gateways must have a different
leadership priority for each logical router. These leadership priorities can
be computed by ovn-northd just as they had been in the controller independent
active-backup model.

Once we have these per logical router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors. The
hypervisors in particular need simply have an active-backup bundle action (or
group action) per logical router listing the gateways in priority order for
*that router*, rather than having a single bundle action shared for all the
routers.

Additionally, the gateways need to be updated to take individual router
priorities into account. Specifically, each gateway should drop traffic of
backup routers it's running, and forward traffic of active routers, instead of
simply dropping or forwarding everything. This should likely be done by having
ovn-controller recompute OpenFlow for the gateway, though other options exist.

The final complication is that ovn-northd's logic must be updated to choose
these per logical router leadership priorities in a more sophisticated manner.
It doesn't matter much exactly what algorithm it chooses to do this, beyond
that it should provide good balancing in the common case, i.e. each logical
router's priorities should be different enough that routers balance to
different gateways even when failures occur.

Preemption
++++++++++

In an active-backup setup, one issue that users will run into is that of
gateway leader preemption. If a new gateway is added to a cluster, or for some
reason an existing gateway is rebooted, we could end up in a situation where
the newly activated gateway has a higher priority than any other in the HA
cluster. In this case, as soon as that gateway appears, it will preempt
leadership from the currently active leader, causing an unnecessary failover.
Since failover can be quite expensive, this preemption may be undesirable.

The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities. For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come up.
Furthermore, if a gateway goes down for a significant period of time, its old
leadership priorities should be revoked and new ones should be assigned as if
it were a brand new gateway. Note that this should only happen if a gateway
has been down for a while (several minutes); otherwise a flapping gateway
could have wide-ranging, unpredictable consequences.

Note that preemption avoidance should be optional depending on the deployment.
One necessarily sacrifices optimal load balancing to satisfy these
requirements, as new gateways will get no traffic on boot. Thus, this feature
represents a trade-off which must be made on a per-installation basis.
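
To make that bookkeeping concrete, one possible shape for the priority
assignment is sketched below (Python, hypothetical data structures; this is
not ovn-northd's actual algorithm). It rotates the preference order per
logical router so load spreads across the cluster, and appends gateways
flagged as new to the end of every order so they cannot preempt an established
leader when they come up::

    def assign_priorities(routers, gateways, new_gateways=()):
        """Return {router: {gateway: priority}}; higher numbers lead.

        Established gateways are rotated per router for balance; gateways in
        'new_gateways' always sort last so they never preempt a leader."""
        established = [gw for gw in gateways if gw not in new_gateways]
        result = {}
        for i, router in enumerate(routers):
            k = i % len(established) if established else 0
            order = established[k:] + established[:k] + list(new_gateways)
            result[router] = {gw: len(order) - n for n, gw in enumerate(order)}
        return result

    # Example: router "r0" prefers "gw1", router "r1" prefers "gw2", and the
    # freshly added "gw3" is last in line for both, so it cannot preempt.
    assign_priorities(["r0", "r1"], ["gw1", "gw2", "gw3"], new_gateways=("gw3",))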

Fully Active-Active HA
~~~~~~~~~~~~~~~~~~~~~~

::

                 Fully Active-Active HA

    +----------------+   +----------------+
    |                |   |                |
    |  A B C D E     |   |  A B C D E     |
    |                |   |                |
    +----------------+   +----------------+
         ^ ^ ^ ^
         | | | |
         | | | |
         + + + +
          Traffic

The final step in L3HA is to have true active-active HA. In this scenario,
each router has an instance on each gateway, and a mechanism similar to ECMP
is used to distribute traffic evenly among all instances. This mechanism would
require gateways to participate in routing protocols with the physical network
to attract traffic and to signal failures. It is out of scope of this
document, but may eventually be necessary.

L2HA
----

L2HA is very difficult to get right. Unlike L3HA, where the consequences of
problems are minor, in L2HA if two gateways are both transiently active, an L2
loop triggers and a broadcast storm results. In practice, to get around this,
gateways end up implementing an overly conservative "when in doubt drop all
traffic" policy, or they implement something like MLAG.

With MLAG, multiple gateways work together to pretend to be a single L2 switch
with a large LACP bond. In principle, it's the right solution to the problem,
as it solves the broadcast storm problem and has been deployed successfully in
other contexts. That said, it's difficult to get right and is not recommended.

diff --git a/ovn/automake.mk b/ovn/automake.mk
index 7465f8ed2..1257ef49d 100644
--- a/ovn/automake.mk
+++ b/ovn/automake.mk
@@ -71,8 +71,7 @@ EXTRA_DIST += ovn/ovn-architecture.7.xml
 DISTCLEANFILES += ovn/ovn-architecture.7
 
 EXTRA_DIST += \
-	ovn/TODO.rst \
-	ovn/OVN-GW-HA.rst
+	ovn/TODO.rst
 
 # Version checking for ovn-nb.ovsschema.
 ALL_LOCAL += ovn/ovn-nb.ovsschema.stamp
diff --git a/ovn/controller/pinctrl.c b/ovn/controller/pinctrl.c
index db9e44161..673d65cb3 100644
--- a/ovn/controller/pinctrl.c
+++ b/ovn/controller/pinctrl.c
@@ -731,8 +731,7 @@ pinctrl_recv(const struct ofp_header *oh, enum ofptype type)
     if (type == OFPTYPE_ECHO_REQUEST) {
         queue_msg(make_echo_reply(oh));
     } else if (type == OFPTYPE_GET_CONFIG_REPLY) {
-        /* Enable asynchronous messages (see "Asynchronous Messages" in
-         * DESIGN.rst for more information). */
+        /* Enable asynchronous messages */
         struct ofputil_switch_config config;
 
         ofputil_decode_get_config_reply(oh, &config);
diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml
index 95cba984d..d96e4b141 100644
--- a/ovn/ovn-architecture.7.xml
+++ b/ovn/ovn-architecture.7.xml
@@ -341,8 +341,8 @@
       controller (over a Unix domain socket) instead of a remote controller.
       It's possible, however, for some other bridge in the same system to have
       an in-band remote controller, and in that case this suppresses the flows
-      that in-band control would ordinarily set up.  See <code>In-Band
-      Control</code> in <code>DESIGN.rst</code> for more information.
+      that in-band control would ordinarily set up.  Refer to the documentation
+      for more information.
     </dd>
   </dl>