diff options
Diffstat (limited to 'Documentation/intro/install/dpdk-advanced.rst')
-rw-r--r-- | Documentation/intro/install/dpdk-advanced.rst | 938 |
1 files changed, 938 insertions, 0 deletions
diff --git a/Documentation/intro/install/dpdk-advanced.rst b/Documentation/intro/install/dpdk-advanced.rst new file mode 100644 index 000000000..44d1cd78c --- /dev/null +++ b/Documentation/intro/install/dpdk-advanced.rst @@ -0,0 +1,938 @@ +.. + Licensed under the Apache License, Version 2.0 (the "License"); you may + not use this file except in compliance with the License. You may obtain + a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + License for the specific language governing permissions and limitations + under the License. + + Convention for heading levels in Open vSwitch documentation: + + ======= Heading 0 (reserved for the title in a document) + ------- Heading 1 + ~~~~~~~ Heading 2 + +++++++ Heading 3 + ''''''' Heading 4 + + Avoid deeper levels because they do not render well. + +================================= +Open vSwitch with DPDK (Advanced) +================================= + +The Advanced Install Guide explains how to improve OVS performance when using +DPDK datapath. This guide provides information on tuning, system configuration, +troubleshooting, static code analysis and testcases. + +Building as a Shared Library +---------------------------- + +DPDK can be built as a static or a shared library and shall be linked by +applications using DPDK datapath. When building OVS with DPDK, you can link +Open vSwitch against the shared DPDK library. + +.. note:: + Minor performance loss is seen with OVS when using shared DPDK library as + compared to static library. + +To build Open vSwitch using DPDK as a shared library, first refer to +:doc:`/intro/install/dpdk` for download instructions for DPDK and OVS. + +Once DPDK and OVS have been downloaded, you must configure the DPDK library +accordingly. Simply set ``CONFIG_RTE_BUILD_SHARED_LIB=y`` in +``config/common_base``, then build and install DPDK. Once done, DPDK can be +built as usual. For example:: + + $ export DPDK_TARGET=x86_64-native-linuxapp-gcc + $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET + $ make install T=$DPDK_TARGET DESTDIR=install + +Once DPDK is built, export the DPDK shared library location and setup OVS as +detailed in :doc:`/intro/install/dpdk`:: + + $ export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib + +System Configuration +-------------------- + +To achieve optimal OVS performance, the system can be configured and that +includes BIOS tweaks, Grub cmdline additions, better understanding of NUMA +nodes and apt selection of PCIe slots for NIC placement. + +Recommended BIOS Settings +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. list-table:: Recommended BIOS Settings + :header-rows: 1 + + * - Setting + - Value + * - C3 Power State + - Disabled + * - C6 Power State + - Disabled + * - MLC Streamer + - Enabled + * - MLC Spacial Prefetcher + - Enabled + * - DCU Data Prefetcher + - Enabled + * - DCA + - Enabled + * - CPU Power and Performance + - Performance + * - Memeory RAS and Performance Config -> NUMA optimized + - Enabled + +PCIe Slot Selection +~~~~~~~~~~~~~~~~~~~ + +The fastpath performance can be affected by factors related to the placement of +the NIC, such as channel speeds between PCIe slot and CPU or the proximity of +PCIe slot to the CPU cores running the DPDK application. Listed below are the +steps to identify right PCIe slot. + +#. Retrieve host details using ``dmidecode``. For example:: + + $ dmidecode -t baseboard | grep "Product Name" + +#. Download the technical specification for product listed, e.g: S2600WT2 + +#. Check the Product Architecture Overview on the Riser slot placement, CPU + sharing info and also PCIe channel speeds + + For example: On S2600WT, CPU1 and CPU2 share Riser Slot 1 with Channel speed + between CPU1 and Riser Slot1 at 32GB/s, CPU2 and Riser Slot1 at 16GB/s. + Running DPDK app on CPU1 cores and NIC inserted in to Riser card Slots will + optimize OVS performance in this case. + +#. Check the Riser Card #1 - Root Port mapping information, on the available + slots and individual bus speeds. In S2600WT slot 1, slot 2 has high bus + speeds and are potential slots for NIC placement. + +Advanced Hugepage Setup +~~~~~~~~~~~~~~~~~~~~~~~ + +Allocate and mount 1 GB hugepages. + +- For persistent allocation of huge pages, add the following options to the + kernel bootline:: + + default_hugepagesz=1GB hugepagesz=1G hugepages=N + + For platforms supporting multiple huge page sizes, add multiple options:: + + default_hugepagesz=<size> hugepagesz=<size> hugepages=N + + where: + + ``N`` + number of huge pages requested + ``size`` + huge page size with an optional suffix ``[kKmMgG]`` + +- For run-time allocation of huge pages:: + + $ echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages + + where: + + ``N`` + number of huge pages requested + ``X`` + NUMA Node + + .. note:: + For run-time allocation of 1G huge pages, Contiguous Memory Allocator + (``CONFIG_CMA``) has to be supported by kernel, check your Linux distro. + +Now mount the huge pages, if not already done so:: + + $ mount -t hugetlbfs -o pagesize=1G none /dev/hugepages + +Enable HyperThreading +~~~~~~~~~~~~~~~~~~~~~ + +With HyperThreading, or SMT, enabled, a physical core appears as two logical +cores. SMT can be utilized to spawn worker threads on logical cores of the same +physical core there by saving additional cores. + +With DPDK, when pinning pmd threads to logical cores, care must be taken to set +the correct bits of the ``pmd-cpu-mask`` to ensure that the pmd threads are +pinned to SMT siblings. + +Take a sample system configuration, with 2 sockets, 2 * 10 core processors, HT +enabled. This gives us a total of 40 logical cores. To identify the physical +core shared by two logical cores, run:: + + $ cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list + +where ``N`` is the logical core number. + +In this example, it would show that cores ``1`` and ``21`` share the same +physical core. As cores are counted from 0, the ``pmd-cpu-mask`` can be used +to enable these two pmd threads running on these two logical cores (one +physical core) is:: + + $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x200002 + +Isolate Cores +~~~~~~~~~~~~~ + +The ``isolcpus`` option can be used to isolate cores from the Linux scheduler. +The isolated cores can then be used to dedicatedly run HPC applications or +threads. This helps in better application performance due to zero context +switching and minimal cache thrashing. To run platform logic on core 0 and +isolate cores between 1 and 19 from scheduler, add ``isolcpus=1-19`` to GRUB +cmdline. + +.. note:: + It has been verified that core isolation has minimal advantage due to mature + Linux scheduler in some circumstances. + +NUMA/Cluster-on-Die +~~~~~~~~~~~~~~~~~~~ + +Ideally inter-NUMA datapaths should be avoided where possible as packets will +go across QPI and there may be a slight performance penalty when compared with +intra NUMA datapaths. On Intel Xeon Processor E5 v3, Cluster On Die is +introduced on models that have 10 cores or more. This makes it possible to +logically split a socket into two NUMA regions and again it is preferred where +possible to keep critical datapaths within the one cluster. + +It is good practice to ensure that threads that are in the datapath are pinned +to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs responsible for +forwarding. If DPDK is built with ``CONFIG_RTE_LIBRTE_VHOST_NUMA=y``, vHost +User ports automatically detect the NUMA socket of the QEMU vCPUs and will be +serviced by a PMD from the same node provided a core on this node is enabled in +the ``pmd-cpu-mask``. ``libnuma`` packages are required for this feature. + +Compiler Optimizations +~~~~~~~~~~~~~~~~~~~~~~ + +The default compiler optimization level is ``-O2``. Changing this to more +aggressive compiler optimization such as ``-O3 -march=native`` with +gcc (verified on 5.3.1) can produce performance gains though not siginificant. +``-march=native`` will produce optimized code on local machine and should be +used when software compilation is done on Testbed. + +Performance Tuning +------------------ + +Affinity +~~~~~~~~ + +For superior performance, DPDK pmd threads and Qemu vCPU threads needs to be +affinitized accordingly. + +- PMD thread Affinity + + A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces + assigned to it. A pmd thread shall poll the ports for incoming packets, + switch the packets and send to tx port. pmd thread is CPU bound, and needs + to be affinitized to isolated cores for optimum performance. + + By setting a bit in the mask, a pmd thread is created and pinned to the + corresponding CPU core. e.g. to run a pmd thread on core 2:: + + $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4 + + .. note:: + pmd thread on a NUMA node is only created if there is at least one DPDK + interface from that NUMA node added to OVS. + +- QEMU vCPU thread Affinity + + A VM performing simple packet forwarding or running complex packet pipelines + has to ensure that the vCPU threads performing the work has as much CPU + occupancy as possible. + + For example, on a multicore VM, multiple QEMU vCPU threads shall be spawned. + When the DPDK ``testpmd`` application that does packet forwarding is invoked, + the ``taskset`` command should be used to affinitize the vCPU threads to the + dedicated isolated cores on the host system. + +Multiple Poll-Mode Driver Threads +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +With pmd multi-threading support, OVS creates one pmd thread for each NUMA node +by default. However, in cases where there are multiple ports/rxq's producing +traffic, performance can be improved by creating multiple pmd threads running +on separate cores. These pmd threads can share the workload by each being +responsible for different ports/rxq's. Assignment of ports/rxq's to pmd threads +is done automatically. + +A set bit in the mask means a pmd thread is created and pinned to the +corresponding CPU core. For example, to run pmd threads on core 1 and 2:: + + $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x6 + +When using dpdk and dpdkvhostuser ports in a bi-directional VM loopback as +shown below, spreading the workload over 2 or 4 pmd threads shows significant +improvements as there will be more total CPU occupancy available:: + + NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1 + +DPDK Physical Port Rx Queues +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + $ ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer> + +The command above sets the number of rx queues for DPDK physical interface. +The rx queues are assigned to pmd threads on the same NUMA node in a +round-robin fashion. + +DPDK Physical Port Queue Sizes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + $ ovs-vsctl set Interface dpdk0 options:n_rxq_desc=<integer> + $ ovs-vsctl set Interface dpdk0 options:n_txq_desc=<integer> + +The command above sets the number of rx/tx descriptors that the NIC associated +with dpdk0 will be initialised with. + +Different ``n_rxq_desc`` and ``n_txq_desc`` configurations yield different +benefits in terms of throughput and latency for different scenarios. +Generally, smaller queue sizes can have a positive impact for latency at the +expense of throughput. The opposite is often true for larger queue sizes. +Note: increasing the number of rx descriptors eg. to 4096 may have a negative +impact on performance due to the fact that non-vectorised DPDK rx functions may +be used. This is dependant on the driver in use, but is true for the commonly +used i40e and ixgbe DPDK drivers. + +Exact Match Cache +~~~~~~~~~~~~~~~~~ + +Each pmd thread contains one Exact Match Cache (EMC). After initial flow setup +in the datapath, the EMC contains a single table and provides the lowest level +(fastest) switching for DPDK ports. If there is a miss in the EMC then the next +level where switching will occur is the datapath classifier. Missing in the +EMC and looking up in the datapath classifier incurs a significant performance +penalty. If lookup misses occur in the EMC because it is too small to handle +the number of flows, its size can be increased. The EMC size can be modified by +editing the define ``EM_FLOW_HASH_SHIFT`` in ``lib/dpif-netdev.c``. + +As mentioned above, an EMC is per pmd thread. An alternative way of increasing +the aggregate amount of possible flow entries in EMC and avoiding datapath +classifier lookups is to have multiple pmd threads running. + +Rx Mergeable Buffers +~~~~~~~~~~~~~~~~~~~~ + +Rx mergeable buffers is a virtio feature that allows chaining of multiple +virtio descriptors to handle large packet sizes. Large packets are handled by +reserving and chaining multiple free descriptors together. Mergeable buffer +support is negotiated between the virtio driver and virtio device and is +supported by the DPDK vhost library. This behavior is supported and enabled by +default, however in the case where the user knows that rx mergeable buffers are +not needed i.e. jumbo frames are not needed, it can be forced off by adding +``mrg_rxbuf=off`` to the QEMU command line options. By not reserving multiple +chains of descriptors it will make more individual virtio descriptors available +for rx to the guest using dpdkvhost ports and this can improve performance. + +OVS Testcases +------------- + +PHY-VM-PHY (vHost Loopback) +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:doc:`/intro/install/dpdk` details steps for PHY-VM-PHY loopback testcase and +packet forwarding using DPDK testpmd application in the Guest VM. For users +wishing to do packet forwarding using kernel stack below, you need to run the +below commands on the guest:: + + $ ifconfig eth1 1.1.1.2/24 + $ ifconfig eth2 1.1.2.2/24 + $ systemctl stop firewalld.service + $ systemctl stop iptables.service + $ sysctl -w net.ipv4.ip_forward=1 + $ sysctl -w net.ipv4.conf.all.rp_filter=0 + $ sysctl -w net.ipv4.conf.eth1.rp_filter=0 + $ sysctl -w net.ipv4.conf.eth2.rp_filter=0 + $ route add -net 1.1.2.0/24 eth2 + $ route add -net 1.1.1.0/24 eth1 + $ arp -s 1.1.2.99 DE:AD:BE:EF:CA:FE + $ arp -s 1.1.1.99 DE:AD:BE:EF:CA:EE + +PHY-VM-PHY (IVSHMEM) +~~~~~~~~~~~~~~~~~~~~ + +IVSHMEM can also be validated using the PHY-VM-PHY configuration. To begin, +follow the steps described in the :doc:`/intro/install/dpdk` to create and +initialize the database, start ovs-vswitchd and add ``dpdk``-type devices to +bridge ``br0``. Once complete, follow the below steps: + +1. Add DPDK ring port to the bridge:: + + $ ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr + +2. Build modified QEMU + + QEMU must be patched to enable IVSHMEM support:: + + $ cd /usr/src/ + $ wget http://wiki.qemu.org/download/qemu-2.2.1.tar.bz2 + $ tar -jxvf qemu-2.2.1.tar.bz2 + $ cd /usr/src/qemu-2.2.1 + $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/patches/ivshmem-qemu-2.2.1.patch + $ patch -p1 < ivshmem-qemu-2.2.1.patch + $ ./configure --target-list=x86_64-softmmu --enable-debug --extra-cflags='-g' + $ make -j 4 + +3. Generate QEMU commandline:: + + $ mkdir -p /usr/src/cmdline_generator + $ cd /usr/src/cmdline_generator + $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/cmdline_generator.c + $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/Makefile + $ export RTE_SDK=/usr/src/dpdk-16.11 + $ export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc + $ make + $ ./build/cmdline_generator -m -p dpdkr0 XXX + $ cmdline=`cat OVSMEMPOOL` + +4. Start guest VM:: + + $ export VM_NAME=ivshmem-vm + $ export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2 + $ export QEMU_BIN=/usr/src/qemu-2.2.1/x86_64-softmmu/qemu-system-x86_64 + $ taskset 0x20 $QEMU_BIN -cpu host -smp 2,cores=2 -hda $QCOW2_IMAGE \ + -m 4096 --enable-kvm -name $VM_NAME -nographic -vnc :2 \ + -pidfile /tmp/vm1.pid $cmdline + +5. Build and run the sample ``dpdkr`` app in VM:: + + $ echo 1024 > /proc/sys/vm/nr_hugepages + $ mount -t hugetlbfs nodev /dev/hugepages (if not already mounted) + + # Build the DPDK ring application in the VM + $ export RTE_SDK=/root/dpdk-16.11 + $ export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc + $ make + + # Run dpdkring application + $ ./build/dpdkr -c 1 -n 4 -- -n 0 + # where "-n 0" refers to ring '0' i.e dpdkr0 + +PHY-VM-PHY (vHost Multiqueue) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +vHost Multique functionality can also be validated using the PHY-VM-PHY +configuration. To begin, follow the steps described in +:doc:`/intro/install/dpdk` to create and initialize the database, start +ovs-vswitchd and add ``dpdk``-type devices to bridge ``br0``. Once complete, +follow the below steps: + +1. Configure PMD and RXQs. + + For example, set the number of dpdk port rx queues to at least 2 The number + of rx queues at vhost-user interface gets automatically configured after + virtio device connection and doesn't need manual configuration:: + + $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xC + $ ovs-vsctl set Interface dpdk0 options:n_rxq=2 + $ ovs-vsctl set Interface dpdk1 options:n_rxq=2 + +2. Instantiate Guest VM using QEMU cmdline + + We must configure with appropriate software versions to ensure this feature + is supported. + + .. list-table:: Recommended BIOS Settings + :header-rows: 1 + + * - Setting + - Value + * - QEMU version + - 2.5.0 + * - QEMU thread affinity + - 2 cores (taskset 0x30) + * - Memory + - 4 GB + * - Cores + - 2 + * - Distro + - Fedora 22 + * - Multiqueue + - Enabled + + To do this, instantiate the guest as follows:: + + $ export VM_NAME=vhost-vm + $ export GUEST_MEM=4096M + $ export QCOW2_IMAGE=/root/Fedora22_x86_64.qcow2 + $ export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch + $ taskset 0x30 qemu-system-x86_64 -cpu host -smp 2,cores=2 -m 4096M \ + -drive file=$QCOW2_IMAGE --enable-kvm -name $VM_NAME \ + -nographic -numa node,memdev=mem -mem-prealloc \ + -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \ + -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser0 \ + -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=2 \ + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mq=on,vectors=6 \ + -chardev socket,id=char2,path=$VHOST_SOCK_DIR/dpdkvhostuser1 \ + -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=2 \ + -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=6 + + .. note:: + Queue value above should match the queues configured in OVS, The vector + value should be set to "number of queues x 2 + 2" + +3. Configure the guest interface + + Assuming there are 2 interfaces in the guest named eth0, eth1 check the + channel configuration and set the number of combined channels to 2 for + virtio devices:: + + $ ethtool -l eth0 + $ ethtool -L eth0 combined 2 + $ ethtool -L eth1 combined 2 + + More information can be found in vHost walkthrough section. + +4. Configure kernel packet forwarding + + Configure IP and enable interfaces:: + + $ ifconfig eth0 5.5.5.1/24 up + $ ifconfig eth1 90.90.90.1/24 up + + Configure IP forwarding and add route entries:: + + $ sysctl -w net.ipv4.ip_forward=1 + $ sysctl -w net.ipv4.conf.all.rp_filter=0 + $ sysctl -w net.ipv4.conf.eth0.rp_filter=0 + $ sysctl -w net.ipv4.conf.eth1.rp_filter=0 + $ ip route add 2.1.1.0/24 dev eth1 + $ route add default gw 2.1.1.2 eth1 + $ route add default gw 90.90.90.90 eth1 + $ arp -s 90.90.90.90 DE:AD:BE:EF:CA:FE + $ arp -s 2.1.1.2 DE:AD:BE:EF:CA:FA + + Check traffic on multiple queues:: + + $ cat /proc/interrupts | grep virtio + +vHost Walkthrough +----------------- + +Two types of vHost User ports are available in OVS: + +- vhost-user (``dpdkvhostuser``) + +- vhost-user-client (``dpdkvhostuserclient``) + +vHost User uses a client-server model. The server creates/manages/destroys the +vHost User sockets, and the client connects to the server. Depending on which +port type you use, ``dpdkvhostuser`` or ``dpdkvhostuserclient``, a different +configuration of the client-server model is used. + +For vhost-user ports, Open vSwitch acts as the server and QEMU the client. For +vhost-user-client ports, Open vSwitch acts as the client and QEMU the server. + +vhost-user +~~~~~~~~~~ + +1. Install the prerequisites: + + - QEMU version >= 2.2 + +2. Add vhost-user ports to the switch. + + Unlike DPDK ring ports, DPDK vhost-user ports can have arbitrary names, + except that forward and backward slashes are prohibited in the names. + + For vhost-user, the name of the port type is ``dpdkvhostuser``:: + + $ ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 \ + type=dpdkvhostuser + + This action creates a socket located at + ``/usr/local/var/run/openvswitch/vhost-user-1``, which you must provide to + your VM on the QEMU command line. More instructions on this can be found in + the next section "Adding vhost-user ports to VM" + + .. note:: + If you wish for the vhost-user sockets to be created in a sub-directory of + ``/usr/local/var/run/openvswitch``, you may specify this directory in the + ovsdb like so:: + + $ ovs-vsctl --no-wait \ + set Open_vSwitch . other_config:vhost-sock-dir=subdir` + +3. Add vhost-user ports to VM + + 1. Configure sockets + + Pass the following parameters to QEMU to attach a vhost-user device:: + + -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 + -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 + + where ``vhost-user-1`` is the name of the vhost-user port added to the + switch. + + Repeat the above parameters for multiple devices, changing the chardev + ``path`` and ``id`` as necessary. Note that a separate and different + chardev ``path`` needs to be specified for each vhost-user device. For + example you have a second vhost-user port named ``vhost-user-2``, you + append your QEMU command line with an additional set of parameters:: + + -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2 + -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce + -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2 + + 2. Configure hugepages + + QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access + a virtio-net device's virtual rings and packet buffers mapping the VM's + physical memory on hugetlbfs. To enable vhost-user ports to map the VM's + memory into their process address space, pass the following parameters + to QEMU:: + + -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on + -numa node,memdev=mem -mem-prealloc + + 3. Enable multiqueue support (optional) + + QEMU needs to be configured to use multiqueue:: + + -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2 + -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q + -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v + + where: + + ``$q`` + The number of queues + ``$v`` + The number of vectors, which is ``$q`` * 2 + 2 + + The vhost-user interface will be automatically reconfigured with + required number of rx and tx queues after connection of virtio device. + Manual configuration of ``n_rxq`` is not supported because OVS will work + properly only if ``n_rxq`` will match number of queues configured in + QEMU. + + A least 2 PMDs should be configured for the vswitch when using + multiqueue. Using a single PMD will cause traffic to be enqueued to the + same vhost queue rather than being distributed among different vhost + queues for a vhost-user interface. + + If traffic destined for a VM configured with multiqueue arrives to the + vswitch via a physical DPDK port, then the number of rxqs should also be + set to at least 2 for that physical DPDK port. This is required to + increase the probability that a different PMD will handle the multiqueue + transmission to the guest using a different vhost queue. + + If one wishes to use multiple queues for an interface in the guest, the + driver in the guest operating system must be configured to do so. It is + recommended that the number of queues configured be equal to ``$q``. + + For example, this can be done for the Linux kernel virtio-net driver + with:: + + $ ethtool -L <DEV> combined <$q> + + where: + + ``-L`` + Changes the numbers of channels of the specified network device + ``combined`` + Changes the number of multi-purpose channels. + +Configure the VM using libvirt +++++++++++++++++++++++++++++++ + +You can also build and configure the VM using libvirt rather than QEMU by +itself. + +1. Change the user/group, access control policty and restart libvirtd. + + - In ``/etc/libvirt/qemu.conf`` add/edit the following lines:: + + user = "root" + group = "root" + + - Disable SELinux or set to permissive mode:: + + $ setenforce 0 + + - Restart the libvirtd process, For example, on Fedora:: + + $ systemctl restart libvirtd.service + +2. Instantiate the VM + + - Copy the XML configuration described in :doc:`/intro/install/dpdk` + + - Start the VM:: + + $ virsh create demovm.xml + + - Connect to the guest console:: + + $ virsh console demovm + +3. Configure the VM + + The demovm xml configuration is aimed at achieving out of box performance on + VM. + + - The vcpus are pinned to the cores of the CPU socket 0 using ``vcpupin``. + + - Configure NUMA cell and memory shared using ``memAccess='shared'``. + + - Disable ``mrg_rxbuf='off'`` + +Refer to the `libvirt documentation <http://libvirt.org/formatdomain.html>`__ +for more information. + +vhost-user-client +~~~~~~~~~~~~~~~~~ + +1. Install the prerequisites: + + - QEMU version >= 2.7 + +2. Add vhost-user-client ports to the switch. + + Unlike vhost-user ports, the name given to port does not govern the name of + the socket device. ``vhost-server-path`` reflects the full path of the + socket that has been or will be created by QEMU for the given vHost User + client port. + + For vhost-user-client, the name of the port type is + ``dpdkvhostuserclient``:: + + $ VHOST_USER_SOCKET_PATH=/path/to/socker + $ ovs-vsctl add-port br0 vhost-client-1 \ + -- set Interface vhost-client-1 type=dpdkvhostuserclient \ + options:vhost-server-path=$VHOST_USER_SOCKET_PATH + +3. Add vhost-user-client ports to VM + + 1. Configure sockets + + Pass the following parameters to QEMU to attach a vhost-user device:: + + -chardev socket,id=char1,path=$VHOST_USER_SOCKET_PATH,server + -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 + + where ``vhost-user-1`` is the name of the vhost-user port added to the + switch. + + If the corresponding dpdkvhostuserclient port has not yet been configured + in OVS with ``vhost-server-path=/path/to/socket``, QEMU will print a log + similar to the following:: + + QEMU waiting for connection on: disconnected:unix:/path/to/socket,server + + QEMU will wait until the port is created sucessfully in OVS to boot the VM. + + One benefit of using this mode is the ability for vHost ports to + 'reconnect' in event of the switch crashing or being brought down. Once + it is brought back up, the vHost ports will reconnect automatically and + normal service will resume. + +DPDK Backend Inside VM +~~~~~~~~~~~~~~~~~~~~~~ + +Additional configuration is required if you want to run ovs-vswitchd with DPDK +backend inside a QEMU virtual machine. Ovs-vswitchd creates separate DPDK TX +queues for each CPU core available. This operation fails inside QEMU virtual +machine because, by default, VirtIO NIC provided to the guest is configured to +support only single TX queue and single RX queue. To change this behavior, you +need to turn on ``mq`` (multiqueue) property of all ``virtio-net-pci`` devices +emulated by QEMU and used by DPDK. You may do it manually (by changing QEMU +command line) or, if you use Libvirt, by adding the following string to +``<interface>`` sections of all network devices used by DPDK:: + + <driver name='vhost' queues='N'/> + +Where: + +``N`` + determines how many queues can be used by the guest. + +This requires QEMU >= 2.2. + +QoS +--- + +Assuming you have a vhost-user port transmitting traffic consisting of packets +of size 64 bytes, the following command would limit the egress transmission +rate of the port to ~1,000,000 packets per second:: + + $ ovs-vsctl set port vhost-user0 qos=@newqos -- \ + --id=@newqos create qos type=egress-policer other-config:cir=46000000 \ + other-config:cbs=2048` + +To examine the QoS configuration of the port, run:: + + $ ovs-appctl -t ovs-vswitchd qos/show vhost-user0 + +To clear the QoS configuration from the port and ovsdb, run:: + + $ ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos + +Refer to vswitch.xml for more details on egress-policer. + +Rate Limiting +-------------- + +Here is an example on Ingress Policing usage. Assuming you have a vhost-user +port receiving traffic consisting of packets of size 64 bytes, the following +command would limit the reception rate of the port to ~1,000,000 packets per +second:: + + $ ovs-vsctl set interface vhost-user0 ingress_policing_rate=368000 \ + ingress_policing_burst=1000` + +To examine the ingress policer configuration of the port:: + + $ ovs-vsctl list interface vhost-user0 + +To clear the ingress policer configuration from the port:: + + $ ovs-vsctl set interface vhost-user0 ingress_policing_rate=0 + +Refer to vswitch.xml for more details on ingress-policer. + +Flow Control +------------ + +Flow control can be enabled only on DPDK physical ports. To enable flow +control support at tx side while adding a port, run:: + + $ ovs-vsctl add-port br0 dpdk0 -- \ + set Interface dpdk0 type=dpdk options:tx-flow-ctrl=true + +Similarly, to enable rx flow control, run:: + + $ ovs-vsctl add-port br0 dpdk0 -- \ + set Interface dpdk0 type=dpdk options:rx-flow-ctrl=true + +To enable flow control auto-negotiation, run:: + + $ ovs-vsctl add-port br0 dpdk0 -- \ + set Interface dpdk0 type=dpdk options:flow-ctrl-autoneg=true + +To turn ON the tx flow control at run time(After the port is being added to +OVS):: + + $ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=true + +The flow control parameters can be turned off by setting ``false`` to the +respective parameter. To disable the flow control at tx side, run:: + + $ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false + +pdump +----- + +Pdump allows you to listen on DPDK ports and view the traffic that is passing +on them. To use this utility, one must have libpcap installed on the system. +Furthermore, DPDK must be built with ``CONFIG_RTE_LIBRTE_PDUMP=y`` and +``CONFIG_RTE_LIBRTE_PMD_PCAP=y``. + +.. warning:: + A performance decrease is expected when using a monitoring application like + the DPDK pdump app. + +To use pdump, simply launch OVS as usual. Then, navigate to the ``app/pdump`` +directory in DPDK, ``make`` the application and run like so:: + + $ sudo ./build/app/dpdk-pdump -- \ + --pdump port=0,queue=0,rx-dev=/tmp/pkts.pcap \ + --server-socket-path=/usr/local/var/run/openvswitch + +The above command captures traffic received on queue 0 of port 0 and stores it +in ``/tmp/pkts.pcap``. Other combinations of port numbers, queues numbers and +pcap locations are of course also available to use. For example, to capture all +packets that traverse port 0 in a single pcap file:: + + $ sudo ./build/app/dpdk-pdump -- \ + --pdump 'port=0,queue=*,rx-dev=/tmp/pkts.pcap,tx-dev=/tmp/pkts.pcap' \ + --server-socket-path=/usr/local/var/run/openvswitch + +``server-socket-path`` must be set to the value of ovs_rundir() which typically +resolves to ``/usr/local/var/run/openvswitch``. + +Many tools are available to view the contents of the pcap file. Once example is +tcpdump. Issue the following command to view the contents of ``pkts.pcap``:: + + $ tcpdump -r pkts.pcap + +More information on the pdump app and its usage can be found in the `DPDK docs +<http://dpdk.org/doc/guides/sample_app_ug/pdump.html>`__. + +Jumbo Frames +------------ + +By default, DPDK ports are configured with standard Ethernet MTU (1500B). To +enable Jumbo Frames support for a DPDK port, change the Interface's +``mtu_request`` attribute to a sufficiently large value. For example, to add a +DPDK Phy port with MTU of 9000:: + + $ ovs-vsctl add-port br0 dpdk0 \ + -- set Interface dpdk0 type=dpdk \ + -- set Interface dpdk0 mtu_request=9000` + +Similarly, to change the MTU of an existing port to 6200:: + + $ ovs-vsctl set Interface dpdk0 mtu_request=6200 + +Some additional configuration is needed to take advantage of jumbo frames with +vHost ports: + +1. *mergeable buffers* must be enabled for vHost ports, as demonstrated in the + QEMU command line snippet below:: + + -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \ + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on + +2. Where virtio devices are bound to the Linux kernel driver in a guest + environment (i.e. interfaces are not bound to an in-guest DPDK driver), the + MTU of those logical network interfaces must also be increased to a + sufficiently large value. This avoids segmentation of Jumbo Frames received + in the guest. Note that 'MTU' refers to the length of the IP packet only, + and not that of the entire frame. + + To calculate the exact MTU of a standard IPv4 frame, subtract the L2 header + and CRC lengths (i.e. 18B) from the max supported frame size. So, to set + the MTU for a 9018B Jumbo Frame:: + + $ ifconfig eth1 mtu 9000 + +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are +increased, such that a full Jumbo Frame of a specific size may be accommodated +within a single mbuf segment. + +Jumbo frame support has been validated against 9728B frames, which is the +largest frame size supported by Fortville NIC using the DPDK i40e driver, but +larger frames and other DPDK NIC drivers may be supported. These cases are +common for use cases involving East-West traffic only. + +vsperf +------ + +The vsperf project aims to develop a vSwitch test framework that can be used to +validate the suitability of different vSwitch implementations in a telco +deployment environment. More information can be found on the `OPNFV wiki +<https://wiki.opnfv.org/display/vsperf/VSperf+Home>`__. + +Bug Reporting +------------- + +Report problems to bugs@openvswitch.org. |