summaryrefslogtreecommitdiff
path: root/doc/source/admin/config-ovs-offload.rst
blob: ad48c921a1468908eda01579bb1db7254785386c (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
.. _config-ovs-offload:

================================
Open vSwitch hardware offloading
================================

The purpose of this page is to describe how to enable Open vSwitch hardware
offloading functionality available in OpenStack (using OpenStack Networking).
This functionality was first introduced in the OpenStack Pike release.
This page intends to serve as a guide for how to configure OpenStack Networking
and OpenStack Compute to enable Open vSwitch hardware offloading.

The basics
~~~~~~~~~~

Open vSwitch is a production quality, multilayer virtual switch licensed under
the open source Apache 2.0 license.  It is designed to enable massive network
automation through programmatic extension, while still supporting standard
management interfaces and protocols. Open vSwitch (OVS) allows Virtual Machines
(VM) to communicate with each other and with the outside world.
The OVS software based solution is CPU intensive, affecting system performance
and preventing fully utilizing available bandwidth.

.. list-table::
   :header-rows: 1
   :widths: 30 90

   * - Term
     - Definition
   * - PF
     - Physical Function. The physical Ethernet controller that supports
       SR-IOV.
   * - VF
     - Virtual Function. The virtual PCIe device created from a physical
       Ethernet controller.
   * - Representor Port
     - Virtual network interface similar to SR-IOV port that represents
       Nova instance.
   * - First Compute Node
     - OpenStack Compute Node that can host Compute instances (Virtual Machines).
   * - Second Compute Node
     - OpenStack Compute Node that can host Compute instances (Virtual Machines).


Supported Ethernet controllers
------------------------------

The following manufacturers are known to work:

- Mellanox ConnectX-4 NIC (VLAN Offload)
- Mellanox ConnectX-4 Lx/ConnectX-5 NICs (VLAN/VXLAN Offload)
- Broadcom NetXtreme-S series NICs
- Broadcom NetXtreme-E series NICs

For information on **Mellanox Ethernet Cards**, see
`Mellanox: Ethernet Cards - Overview
<http://www.mellanox.com/page/ethernet_cards_overview>`_.

Prerequisites
-------------

- Linux Kernel >= 4.13
- Open vSwitch >= 2.8
- iproute >= 4.12
- Mellanox or Broadcom NIC

    .. note:: Mellanox NIC FW that supports Open vSwitch hardware offloading:

       ConnectX-5    >= 16.21.0338

       ConnectX-4    >= 12.18.2000

       ConnectX-4 Lx >= 14.21.0338

Using Open vSwitch hardware offloading
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to enable Open vSwitch hardware offloading, the following steps are required:

#. Enable SR-IOV
#. Configure NIC to switchdev mode (relevant Nodes)
#. Enable Open vSwitch hardware offloading

.. note::

   Throughout this guide, ``enp3s0f0`` is used as the PF and ``eth3`` is used
   as the representor port. These ports may vary in different environments.

.. note::

   Throughout this guide, we use ``systemctl`` to restart OpenStack services.
   This is correct for ``systemd`` OS. Other methods to restart services should be
   used in other environments.

Create Compute virtual functions
----------------------------------

Create the VFs for the network interface that will be used for SR-IOV. We use
``enp3s0f0`` as PF, which is also used as the interface for the VLAN provider
network and has access to the private networks of all nodes.

.. note::

   The following steps detail how to create VFs using Mellanox ConnectX-4 and
   SR-IOV Ethernet cards on an Intel system. Steps may be different for the
   hardware of your choice.

#. Ensure SR-IOV and VT-d are enabled on the system.
   Enable IOMMU in Linux by adding ``intel_iommu=on`` to kernel parameters,
   for example, using GRUB.

#. On each Compute node, create the VFs:

   .. code-block:: bash

      # echo '4' > /sys/class/net/enp3s0f0/device/sriov_numvfs

   .. note::

      A network interface can be used both for PCI passthrough, using the PF,
      and SR-IOV, using the VFs. If the PF is used, the VF number stored in
      the ``sriov_numvfs`` file is lost. If the PF is attached again to the
      operating system, the number of VFs assigned to this interface will be
      zero. To keep the number of VFs always assigned to this interface,
      update a relevant file according to your OS.
      See some examples below:

      In Ubuntu, modifying the ``/etc/network/interfaces`` file:

      .. code-block:: ini

         auto enp3s0f0
         iface enp3s0f0 inet dhcp
         pre-up echo '4' > /sys/class/net/enp3s0f0/device/sriov_numvfs


      In Red Hat, modifying the ``/sbin/ifup-local`` file:

      .. code-block:: bash

         #!/bin/sh
         if [[ "$1" == "enp3s0f0" ]]
         then
             echo '4' > /sys/class/net/enp3s0f0/device/sriov_numvfs
         fi


   .. warning::

      Alternatively, you can create VFs by passing the ``max_vfs`` to the
      kernel module of your network interface. However, the ``max_vfs``
      parameter has been deprecated, so the PCI /sys interface is the preferred
      method.

   You can determine the maximum number of VFs a PF can support:

   .. code-block:: bash

      # cat /sys/class/net/enp3s0f0/device/sriov_totalvfs
      8

#. Verify that the VFs have been created and are in ``up`` state:

   .. note::

      The PCI bus number of the PF (03:00.0) and VFs (03:00.2 .. 03:00.5)
      will be used later.

   .. code-block::bash

      # lspci | grep Ethernet
      03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
      03:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
      03:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
      03:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
      03:00.4 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
      03:00.5 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]


   .. code-block:: bash

      # ip link show enp3s0f0
      8: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
         link/ether a0:36:9f:8f:3f:b8 brd ff:ff:ff:ff:ff:ff
         vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
         vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
         vf 2 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
         vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto

   If the interfaces are down, set them to ``up`` before launching a guest,
   otherwise the instance will fail to spawn:

   .. code-block:: bash

      # ip link set enp3s0f0 up


Configure Open vSwitch hardware offloading
------------------------------------------

#. Change the e-switch mode from legacy to switchdev on the PF device.
   This will also create the VF representor network devices in the host OS.

   .. code-block:: bash

      # echo 0000:03:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind

   This tells the driver to unbind VF 03:00.2

   .. note::

     This should be done for all relevant VFs
     (in this example 0000:03:00.2 .. 0000:03:00.5)

#. Enable Open vSwitch hardware offloading,
   set PF to switchdev mode and bind VFs back.

   .. code-block:: bash

     # sudo devlink dev eswitch set pci/0000:03:00.0 mode switchdev
     # sudo ethtool -K enp3s0f0 hw-tc-offload on
     # echo 0000:03:00.2 > /sys/bus/pci/drivers/mlx5_core/bind

   .. note::

     This should be done for all relevant VFs
     (in this example 0000:03:00.2 .. 0000:03:00.5)

#. Restart Open vSwitch

   .. code-block:: bash

      # sudo systemctl enable openvswitch.service
      # sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
      # sudo systemctl restart openvswitch.service

   .. note::

      The given aging of OVS is given in milliseconds and can be controlled with:

   .. code-block:: bash

      # ovs-vsctl set Open_vSwitch . other_config:max-idle=30000


Configure Nodes (VLAN Configuration)
-------------------------------------

#. Update ``/etc/neutron/plugins/ml2/ml2_conf.ini`` on Controller nodes

   .. code-block:: ini

      [ml2]
      tenant_network_types = vlan
      type_drivers = vlan
      mechanism_drivers = openvswitch

   .. end

#. Update ``/etc/neutron/neutron.conf`` on Controller nodes

   .. code-block:: ini

      [DEFAULT]
      core_plugin = ml2

   .. end

#. Update ``/etc/nova/nova.conf`` on Controller nodes

   .. code-block:: ini

      [filter_scheduler]
      enabled_filters = PciPassthroughFilter

   .. end

#. Update ``/etc/nova/nova.conf`` on Compute nodes

   .. code-block:: ini

      [pci]
      #VLAN Configuration passthrough_whitelist example
      passthrough_whitelist ={"'"address"'":"'"*:'"03:00"'.*"'","'"physical_network"'":"'"physnet2"'"}

   .. end


Configure Nodes (VXLAN Configuration)
-------------------------------------


#. Update ``/etc/neutron/plugins/ml2/ml2_conf.ini`` on Controller nodes

   .. code-block:: ini

      [ml2]
      tenant_network_types = vxlan
      type_drivers = vxlan
      mechanism_drivers = openvswitch

   .. end

#. Update ``/etc/neutron/neutron.conf`` on Controller nodes

   .. code-block:: ini

      [DEFAULT]
      core_plugin = ml2

   .. end

#. Update ``/etc/nova/nova.conf`` on Controller nodes

   .. code-block:: ini

      [filter_scheduler]
      enabled_filters = PciPassthroughFilter

   .. end

#. Update ``/etc/nova/nova.conf`` on Compute nodes

   .. note::

      VXLAN configuration requires physical_network to be null.

   .. code-block:: ini

      [pci]
      #VLAN Configuration passthrough_whitelist example
      passthrough_whitelist ={"'"address"'":"'"*:'"03:00"'.*"'","'"physical_network"'":null}

   .. end

#. Restart nova and neutron services

   .. code-block:: bash

     # sudo systemctl restart openstack-nova-compute.service
     # sudo systemctl restart openstack-nova-scheduler.service
     # sudo systemctl restart neutron-server.service


Validate Open vSwitch hardware offloading
-----------------------------------------

   .. note::

     In this example we will bring up two instances on different Compute nodes and
     send ICMP echo packets between them. Then we will check TCP packets on
     a representor port and we will see that only the first packet will be
     shown there. All the rest will be offloaded.

#. Create a port ``direct`` on ``private`` network

   .. code-block:: bash

      # openstack port create --network private --vnic-type=direct --binding-profile '{"capabilities": ["switchdev"]}' direct_port1


#. Create an instance using the direct port on 'First Compute Node'

   .. code-block:: bash

      # openstack server create --flavor m1.small --image cloud_image --nic port-id=direct_port1 vm1


#. Repeat steps above and create a second instance on 'Second Compute Node'

   .. code-block:: bash

      # openstack port create --network private --vnic-type=direct --binding-profile '{"capabilities": ["switchdev"]}' direct_port2
      # openstack server create --flavor m1.small --image mellanox_fedora --nic port-id=direct_port2 vm2

   .. note::

      You can use  --availability-zone nova:compute_node_1 option
      to set the desired Compute Node


#. Connect to instance1 and send ICMP Echo Request packets to instance2

   .. code-block:: bash

      # vncviewer localhost:5900
      vm_1# ping vm2

#. Connect to 'Second Compute Node' and find representor port of the instance

   .. note::

      Find a representor port first, in our case it's eth3

   .. code-block:: console

      compute_node2# ip link show enp3s0f0
      6: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
         link/ether ec:0d:9a:46:9e:84 brd ff:ff:ff:ff:ff:ff
         vf 0 MAC 00:00:00:00:00:00, spoof checking off, link-state enable, trust off, query_rss off
         vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state enable, trust off, query_rss off
         vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state enable, trust off, query_rss off
         vf 3 MAC fa:16:3e:b9:b8:ce, vlan 57, spoof checking on, link-state enable, trust off, query_rss off

      compute_node2# ls -l /sys/class/net/
      lrwxrwxrwx 1 root root 0 Sep 11 10:54 eth0 -> ../../devices/virtual/net/eth0
      lrwxrwxrwx 1 root root 0 Sep 11 10:54 eth1 -> ../../devices/virtual/net/eth1
      lrwxrwxrwx 1 root root 0 Sep 11 10:54 eth2 -> ../../devices/virtual/net/eth2
      lrwxrwxrwx 1 root root 0 Sep 11 10:54 eth3 -> ../../devices/virtual/net/eth3

      compute_node2# sudo ovs-dpctl show
      system@ovs-system:
        lookups: hit:1684 missed:1465 lost:0
        flows: 0
        masks: hit:8420 total:1 hit/pkt:2.67
        port 0: ovs-system (internal)
        port 1: br-enp3s0f0 (internal)
        port 2: br-int (internal)
        port 3: br-ex (internal)
        port 4: enp3s0f0
        port 5: tapfdc744bb-61 (internal)
        port 6: qr-a7b1e843-4f (internal)
        port 7: qg-79a77e6d-8f (internal)
        port 8: qr-f55e4c5f-f3 (internal)
        port 9: eth3

   .. end

#. Check traffic on the representor port. Verify that only the first ICMP packet appears.

   .. code-block:: console

      compute_node2# tcpdump -nnn -i eth3

      tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
      listening on eth3, link-type EN10MB (Ethernet), capture size 262144 bytes
      17:12:41.220447 ARP, Request who-has 172.0.0.10 tell 172.0.0.13, length 46
      17:12:41.220684 ARP, Reply 172.0.0.10 is-at fa:16:3e:f2:8b:23, length 42
      17:12:41.260487 IP 172.0.0.13 > 172.0.0.10: ICMP echo request, id 1263, seq 1, length 64
      17:12:41.260778 IP 172.0.0.10 > 172.0.0.13: ICMP echo reply, id 1263, seq 1, length 64
      17:12:46.268951 ARP, Request who-has 172.0.0.13 tell 172.0.0.10, length 42
      17:12:46.271771 ARP, Reply 172.0.0.13 is-at fa:16:3e:1a:10:05, length 46
      17:12:55.354737 IP6 fe80::f816:3eff:fe29:8118 > ff02::1: ICMP6, router advertisement, length 64
      17:12:56.106705 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 62:21:f0:89:40:73, length 300

   .. end