..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

      Convention for heading levels in Open vSwitch documentation:

      =======  Heading 0 (reserved for the title in a document)
      -------  Heading 1
      ~~~~~~~  Heading 2
      +++++++  Heading 3
      '''''''  Heading 4

      Avoid deeper levels because they do not render well.

======================
ovs-vswitchd Internals
======================

This document describes some of the internals of the ovs-vswitchd process.  It
is not complete.  It tends to be updated on demand, so if you have questions
about the vswitchd implementation, ask them and perhaps we'll add some
appropriate documentation here.

Most of the ovs-vswitchd implementation is in ``vswitchd/bridge.c``, so code
references below should be assumed to refer to that file except as otherwise
specified.

Bonding
-------

Bonding allows two or more interfaces (the "slaves") to share network traffic.
From a high-level point of view, bonded interfaces act like a single port, but
they have the bandwidth of multiple network devices, e.g. two 1 Gbps physical
interfaces act like a single 2 Gbps interface.  Bonds also increase robustness:
the bonded port does not go down as long as at least one of its slaves is up.

In vswitchd, a bond always has at least two slaves (and may have more).  If a
configuration error, etc. would cause a bond to have only one slave, the port
becomes an ordinary port, not a bonded port, and none of the special features
of bonded ports described in this section apply.

There are many forms of bonding, of which ovs-vswitchd implements only a few.
The most complex bond ovs-vswitchd implements is called "source load balancing"
or SLB bonding.  SLB bonding divides traffic among the slaves based on the
Ethernet source address.  This is useful only if the traffic over the bond has
multiple Ethernet source addresses, for example if network traffic from
multiple VMs is multiplexed over the bond.

Enabling and Disabling Slaves
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a bond is created, a slave is initially enabled or disabled based on
whether carrier is detected on the NIC (see ``iface_create()``).  After that, a
slave is disabled if its carrier goes down for a period of time longer than the
downdelay, and it is enabled if carrier comes up for longer than the updelay
(see ``bond_link_status_update()``).  There is one exception where the updelay
is skipped: if no slaves at all are currently enabled, then the first slave on
which carrier comes up is enabled immediately.
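
The enable/disable rules above can be sketched as a small state update.  This
is only an illustration, not the code in ``bond_link_status_update()``: the
``Slave`` class, the ``update_slave()`` function, and the millisecond
timestamps are hypothetical names chosen for the example, and the sketch treats
a delay as elapsed once at least that much time has passed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slave:
    """Hypothetical model of one bond slave (not the vswitchd structure)."""
    name: str
    enabled: bool = False
    carrier: bool = False
    # Time (ms) at which carrier started to disagree with `enabled`.
    pending_since: Optional[int] = None


def update_slave(slave, carrier, now_ms, updelay_ms, downdelay_ms, any_enabled):
    """Apply one carrier observation and return the slave's enabled state."""
    slave.carrier = carrier
    if carrier == slave.enabled:
        # Carrier agrees with the current state: cancel any pending change.
        slave.pending_since = None
        return slave.enabled
    if slave.pending_since is None:
        slave.pending_since = now_ms
    if carrier and not any_enabled:
        # Exception from the text: no slave is enabled, so skip the updelay.
        delay = 0
    else:
        delay = updelay_ms if carrier else downdelay_ms
    if now_ms - slave.pending_since >= delay:
        slave.enabled = carrier
        slave.pending_since = None
    return slave.enabled
```

Calling ``update_slave()`` repeatedly with the same carrier state models the
carrier staying up (or down) over time; a change of carrier before the delay
expires cancels the pending transition.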

The updelay should be set to a time longer than the STP forwarding delay of the
physical switch to which the bond port is connected (if STP is enabled on that
switch).  Otherwise, the slave will be enabled, and load may be shifted to it,
before the physical switch starts forwarding packets on that port, which can
cause some data to be "blackholed" for a time.  The exception that skips the
updelay when no slaves are enabled does not cause any problem in this regard
because, in that state, all output packets are blackholed anyway.

When a slave becomes disabled, the vswitch immediately chooses a new output
port for traffic that was destined for that slave (see
``bond_enable_slave()``).  It also sends a "gratuitous learning packet",
specifically a RARP, on the bond port (on the newly chosen slave) for each MAC
address that the vswitch has learned on a port other than the bond (see
``bond_send_learning_packets()``), to teach the physical switch that the new
slave should be used in place of the one that is now disabled.  (This behavior
probably makes sense only for a vswitch that has only one port (the bond)
connected to a physical switch; vswitchd should probably provide a way to
disable or configure it in other scenarios.)

Bond Packet Input
~~~~~~~~~~~~~~~~~

Bonding accepts unicast packets on any bond slave.  This can occasionally cause
packet duplication for the first few packets sent to a given MAC, if the
physical switch attached to the bond is flooding packets to that MAC because it
has not yet learned the correct slave for that MAC.

Bonding only accepts multicast (and broadcast) packets on a single bond slave
(the "active slave") at any given time.  Multicast packets received on other
slaves are dropped.  Otherwise, every multicast packet would be duplicated,
once for every bond slave, because the physical switch attached to the bond
will flood those packets.

Bonding also drops received packets when the vswitch has learned that the
packet's MAC is on a port other than the bond port itself.  This is because it
is likely that the vswitch itself sent the packet out the bond port on a
different slave and is now receiving the packet back.  This occurs when the
packet is multicast or the physical switch has not yet learned the MAC and is
flooding it.  However, the vswitch makes an exception to this rule for
broadcast ARP replies, which indicate that the MAC has moved to another switch,
probably due to VM migration.  (ARP replies are normally unicast, so this
exception does not match normal ARP replies.  It will match the learning
packets sent on bond fail-over.)

The active slave is simply the first slave to be enabled after the bond is
created (see ``bond_choose_active_iface()``).  If the active slave is disabled,
then a new active slave is chosen among the slaves that remain enabled.
Currently, due to the way that configuration works, this tends to be the
remaining slave whose interface name is first alphabetically, but this is by no
means guaranteed.
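
The input rules in this section can be summarized as a single admission check.
The following is a minimal sketch under stated assumptions, not the vswitchd
implementation: ``Packet``, ``is_multicast()``, and ``accept_on_bond()`` are
hypothetical names, and the order in which the two drop rules are applied is an
assumption of the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    """Hypothetical packet summary used only for this illustration."""
    src_mac: str
    dst_mac: str
    is_broadcast_arp_reply: bool = False


def is_multicast(mac: str) -> bool:
    # The low bit of the first octet marks a group address; the broadcast
    # address ff:ff:ff:ff:ff:ff is included.
    return int(mac.split(":")[0], 16) & 1 == 1


def accept_on_bond(pkt: Packet, in_slave: str, active_slave: str,
                   learned_port: Optional[str], bond_port: str) -> bool:
    """Return True if a packet received on `in_slave` should be accepted.

    `learned_port` is the port on which the vswitch learned pkt.src_mac,
    or None if the MAC has not been learned.
    """
    # Multicast and broadcast are accepted only on the active slave.
    if is_multicast(pkt.dst_mac) and in_slave != active_slave:
        return False
    # A source MAC learned on another port suggests the packet is our own
    # traffic coming back, unless it is a broadcast ARP reply.
    if learned_port is not None and learned_port != bond_port:
        return pkt.is_broadcast_arp_reply
    return True
```

Unicast packets from unlearned MACs pass on any slave, which is why the
duplication described at the start of this section can occur.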

Bond Packet Output
~~~~~~~~~~~~~~~~~~

When a packet is sent out a bond port, the bond slave actually used is selected
based on the packet's source MAC and VLAN tag (see ``choose_output_iface()``).
In particular, the source MAC and VLAN tag are hashed into one of 256 values,
and that value is looked up in a hash table (the "bond hash") kept in the
``bond_hash`` member of struct port.  The hash table entry identifies a bond
slave.  If no bond slave has yet been chosen for that hash table entry,
vswitchd chooses one arbitrarily.
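
As an illustration of the lookup just described, the sketch below hashes a
source MAC and VLAN into one of 256 buckets and assigns a slave on first use.
The hash function (CRC32), the ``Bond`` class, and the policy used for the
"arbitrary" choice are assumptions of the example; the text above does not
specify them.

```python
import zlib

BOND_MASK = 0xFF  # 256 hash buckets, as described in the text


def bond_hash(src_mac: bytes, vlan: int) -> int:
    # CRC32 is an assumption; the document only says the MAC and VLAN are
    # hashed into one of 256 values.
    return zlib.crc32(src_mac + vlan.to_bytes(2, "big")) & BOND_MASK


class Bond:
    """Hypothetical stand-in for the bond hash kept in struct port."""

    def __init__(self, slaves):
        self.slaves = list(slaves)
        self.table = [None] * (BOND_MASK + 1)  # bucket -> slave name

    def choose_output(self, src_mac: bytes, vlan: int) -> str:
        idx = bond_hash(src_mac, vlan)
        if self.table[idx] is None:
            # "Arbitrary" choice: here, spread buckets by index.
            self.table[idx] = self.slaves[idx % len(self.slaves)]
        return self.table[idx]
```

Because the assignment is stored in the table, all later packets with the same
source MAC and VLAN leave through the same slave until the entry is changed,
for example by rebalancing.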

Every 10 seconds, vswitchd rebalances the bond slaves (see
``bond_rebalance_port()``).  To rebalance, vswitchd examines the statistics for
the number of bytes transmitted by each slave over approximately the past
minute, with data sent more recently weighted more heavily than data sent less
recently.  It considers each of the slaves in order from most-loaded to
least-loaded.  If highly loaded slave H is significantly more heavily loaded
than the least-loaded slave L, and slave H carries at least two hashes, then
vswitchd shifts one of H's hashes to L.  However, vswitchd will only shift a
hash from H to L if it will decrease the ratio of the load between H and L by
at least 0.1.

Currently, "significantly more loaded" means that H must carry at least 1 Mbps
more traffic, and that traffic must be at least 3% greater than L's.
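
The thresholds above can be made concrete with a short sketch.  It is not the
code in ``bond_rebalance_port()``: the function names are hypothetical, loads
are assumed to be in bits per second, "3% greater" is read as H's load being at
least 1.03 times L's, the ratio is taken as larger load over smaller load, and
picking the hash that yields the lowest resulting ratio is an assumption of
the example.

```python
def significantly_more_loaded(h_load: float, l_load: float) -> bool:
    """H must exceed L by at least 1 Mbps and by at least 3%."""
    return h_load - l_load >= 1_000_000 and h_load >= 1.03 * l_load


def load_ratio(a: float, b: float) -> float:
    # Ratio of the larger load to the smaller one (assumed definition).
    hi, lo = max(a, b), min(a, b)
    return float("inf") if lo == 0 else hi / lo


def pick_hash_to_shift(h_hashes: dict, l_load: float,
                       min_improvement: float = 0.1):
    """Return the hash index to move from H to L, or None.

    `h_hashes` maps each hash index carried by H to its load.
    """
    h_load = sum(h_hashes.values())
    if len(h_hashes) < 2 or not significantly_more_loaded(h_load, l_load):
        return None
    old_ratio = load_ratio(h_load, l_load)
    best = None
    for idx, load in h_hashes.items():
        new_ratio = load_ratio(h_load - load, l_load + load)
        # Only shifts that reduce the ratio by at least 0.1 qualify.
        if old_ratio - new_ratio >= min_improvement:
            if best is None or new_ratio < best[1]:
                best = (idx, new_ratio)
    return best[0] if best else None
```

For example, with H carrying hashes of 6 Mbps and 4 Mbps and L carrying
2 Mbps, moving the 4 Mbps hash leaves both slaves at 6 Mbps, so that hash is
the one selected.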

Bond Balance Modes
~~~~~~~~~~~~~~~~~~

Each bond balancing mode has different considerations, described below.

LACP Bonding
++++++++++++

LACP bonding requires the remote switch to implement LACP, but it is otherwise
very simple in that, after LACP negotiation is complete, there is no need for
special handling of received packets.

Several of the physical switches that support LACP block all traffic for ports
that are configured to use LACP, until LACP is negotiated with the host.  When
configuring an LACP bond on an OVS host (e.g. XenServer), this means that
network connectivity is interrupted between the time the ports on the physical
switch are configured and the time the bond on the OVS host is configured.  The
interruption may be relatively long if different people are responsible for
managing the switches and the OVS host.

Such a network connectivity failure can be avoided if LACP is configured on
the OVS host before the physical switch is configured, and the OVS host falls
back to another bond mode (active-backup) until the physical switch LACP
configuration is complete.  The ``lacp-fallback-ab`` option provides this
behavior in Open vSwitch.

Active Backup Bonding
+++++++++++++++++++++

Active Backup bonds send all traffic out one "active" slave until that slave
becomes unavailable.  Since they are significantly less complicated than SLB
bonds, they are preferred when LACP is not an option.  Additionally, they are
the only bond mode which supports attaching each slave to a different upstream
switch.

SLB Bonding
+++++++++++

SLB bonding allows a limited form of load balancing without the remote switch's
knowledge or cooperation.  The basics of SLB are simple.  SLB assigns each
source MAC+VLAN pair to a link and transmits all packets from that MAC+VLAN
through that link.  Learning in the remote switch causes it to send packets to
that MAC+VLAN through the same link.

SLB bonding has the following complications:

0. When the remote switch has not learned the MAC for the destination of a
   unicast packet and hence floods the packet to all of the links on the SLB
   bond, Open vSwitch will forward duplicate packets, one per link, to each
   other switch port.

   Open vSwitch does not solve this problem.

1. When the remote switch receives a multicast or broadcast packet from a port
   not on the SLB bond, it will forward it to all of the links in the SLB bond.
   This would cause packet duplication if not handled specially.

   Open vSwitch avoids packet duplication by accepting multicast and broadcast
   packets on only the active slave, and dropping multicast and broadcast
   packets on all other slaves.

2. When Open vSwitch forwards a multicast or broadcast packet to a link in the
   SLB bond other than the active slave, the remote switch will forward it to
   all of the other links in the SLB bond, including the active slave.  Without
   special handling, this would mean that Open vSwitch would forward a second
   copy of the packet to each switch port (other than the bond), including the
   port that originated the packet.

   Open vSwitch deals with this case by dropping packets received on any SLB
   bonded link that have a source MAC+VLAN that has been learned on any other
   port.  (This means that SLB as implemented in Open vSwitch relies critically
   on MAC learning.  Notably, SLB is incompatible with the "flood_vlans"
   feature.)

3. Suppose that a MAC+VLAN moves to an SLB bond from another port (e.g. when a
   VM is migrated from this hypervisor to a different one).  Without additional
   special handling, rule #2 would prevent Open vSwitch from noticing the move
   until the MAC learning entry expires, up to 60 seconds later.

   Open vSwitch avoids a 60-second delay by listening for gratuitous ARPs,
   which VMs commonly emit upon migration.  As an exception to rule #2, a
   gratuitous ARP received on an SLB bond is not dropped and updates the MAC
   learning table in the usual way.  (If a move does not trigger a gratuitous
   ARP, or if the gratuitous ARP is lost in the network, then a 60-second delay
   still occurs.)

4. Suppose that a MAC+VLAN moves from an SLB bond to another port (e.g. when a
   VM is migrated from a different hypervisor to this one), that the MAC+VLAN
   emits a gratuitous ARP, and that Open vSwitch forwards that gratuitous ARP
   to a link in the SLB bond other than the active slave.  The remote switch
   will forward the gratuitous ARP to all of the other links in the SLB bond,
   including the active slave.  Without additional special handling, this would
   mean that Open vSwitch would learn that the MAC+VLAN was located on the SLB
   bond, as a consequence of rule #3.

   Open vSwitch avoids this problem by "locking" the MAC learning table entry
   for a MAC+VLAN from which a gratuitous ARP was received on a non-SLB bond
   port.  For 5 seconds, a locked MAC learning table entry will not be updated
   based on a gratuitous ARP received on an SLB bond.
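
The interaction between rules #3 and #4 can be sketched as follows.  This is a
simplified model, not the Open vSwitch MAC learning code: ``MacTable`` and
``on_gratuitous_arp()`` are hypothetical names, timestamps are in
milliseconds, and only gratuitous ARP handling is modeled.

```python
GRAT_ARP_LOCK_MS = 5000  # the 5-second lock described in rule #4


class MacTable:
    """Hypothetical MAC learning table keyed by (MAC, VLAN)."""

    def __init__(self):
        # (mac, vlan) -> (port, lock_until_ms)
        self.entries = {}

    def on_gratuitous_arp(self, mac, vlan, port, port_is_slb_bond, now_ms):
        """Process a gratuitous ARP; return True if the table was updated."""
        key = (mac, vlan)
        entry = self.entries.get(key)
        if port_is_slb_bond:
            # Rule #4: ignore the ARP while the entry is locked.
            if entry is not None and now_ms < entry[1]:
                return False
            # Rule #3: otherwise learn the move onto the bond as usual.
            self.entries[key] = (port, 0)
        else:
            # Received on a non-SLB bond port: learn and lock the entry.
            self.entries[key] = (port, now_ms + GRAT_ARP_LOCK_MS)
        return True
```

In this model, a gratuitous ARP that is reflected back through the bond within
5 seconds of arriving on a local port leaves the table unchanged, while one
that arrives on the bond after the lock expires is learned normally.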