path: root/vswitchd
Commit message | Author | Age | Files | Lines
* vswitch.xml: Add description of SRv6 tunnel and related options. | Nobuhiro MIKI | 2023-03-30 | 1 | -7/+32
  The description of SRv6 was missing in vswitch.xml, which is used to generate the man page, so this patch adds it.
  Fixes: 03fc1ad78521 ("userspace: Add SRv6 tunnel support.")
  Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpdk: Allow retaining CAP_SYS_RAWIO privileges. | Aaron Conole | 2023-03-22 | 2 | -1/+19
  Open vSwitch generally tries to let the underlying operating system manage the low-level details of hardware, for example DMA mapping, bus arbitration, etc. However, when using DPDK, the underlying operating system yields control of many of these details to userspace for management.
  In the case of some DPDK port drivers, configuring rte_flow or even allocating resources may require access to iopl/ioperm calls, which are guarded by the CAP_SYS_RAWIO privilege on Linux systems. These calls are dangerous, and can allow a process to completely compromise a system. However, they are needed in the case of some userspace driver code which manages the hardware (for example, the mlx implementation of backend support for rte_flow).
  Here, we create an opt-in flag passed to the command line to allow this access. We need to do this before ever accessing the database, because we want to drop all privileges asap, and cannot wait for a connection to the database to be established and functional before dropping. There may be distribution-specific ways to do capability management as well (using, for example, systemd), but they are not as universal to the vswitchd as a flag.
  Reviewed-by: Simon Horman <simon.horman@corigine.com>
  Signed-off-by: Aaron Conole <aconole@redhat.com>
  Acked-by: Flavio Leitner <fbl@sysclose.org>
  Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* vswitch: Add missing documentation for "ct_flush" capability. | Ales Musil | 2023-03-15 | 1 | -0/+6
  Fixes: 08146bf7d9b4 ("openflow: Add extension to flush CT by generic match.")
  Signed-off-by: Ales Musil <amusil@redhat.com>
  Reviewed-by: Simon Horman <simon.horman@corigine.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ofproto-dpif-upcall: Wait for valid hw flow stats before applying min-revalidate-pps. | Eelco Chaudron | 2023-03-15 | 2 | -0/+16
  Depending on the driver implementation, it can take from 0.2 seconds up to 2 seconds before offloaded flow statistics are updated. This is true for both TC and rte_flow-based offloading. This is causing a problem with min-revalidate-pps, as old statistic values are used during this period.
  This fix will wait for at least 2 seconds, by default, before assuming no packets were received during this period.
  Reviewed-by: Simon Horman <simon.horman@corigine.com>
  Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ipfix: Make template and stats interval configurable. | Adrian Moreno | 2023-02-27 | 3 | -2/+49
  Add options to the IPFIX table to configure the interval to send statistics and template information.
  Reviewed-by: Simon Horman <simon.horman@corigine.com>
  Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-linux: Add jitter parameter to the netem qos options. | Miika Petäjäniemi | 2023-02-21 | 1 | -0/+4
  Adds jitter option to enable emulating latency fluctuation with netem.
  Submitted-at: https://github.com/openvswitch/ovs/pull/407
  Signed-off-by: Miika Petäjäniemi <miika.petajaniemi@solita.fi>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpctl: Add support to count upcall packets. | wangchuanlei | 2023-01-31 | 1 | -1/+3
  Add support to count upcall packets per port, both successful and failed, which is a better way to see how many packets were upcalled on each interface.
  Acked-by: Eelco Chaudron <echaudro@redhat.com>
  Signed-off-by: wangchuanlei <wangchuanlei@inspur.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* revalidator: Allow min-revalidator-pps to be 0. | Han Zhou | 2023-01-27 | 1 | -1/+2
  Today the minimum value for this setting is 1. This patch allows it to be 0, meaning that pps is not checked at all and revalidation is always done. This is particularly useful for environments where some applications with long-lived connections may have very low traffic for certain periods but periodically send high-rate bursts. It is desirable to keep the datapath flows instead of periodically deleting them, to avoid bursts of packet misses to userspace.
  When set to 0, there may be more datapath flows to be revalidated, resulting in higher CPU cost for the revalidator threads. This is the downside, but in certain cases this is still more desirable than packet misses to userspace.
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Acked-by: Eelco Chaudron <echaudro@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev: Set PMD load based sleep start/inc to 1 us. | Kevin Traynor | 2023-01-23 | 1 | -4/+0
  Now that the timer slack for the PMD threads is reduced, we can also reduce the start/increment for PMD load based sleeping to match it.
  This will further reduce initial sleep times, making it more resilient to interfaces that might be sensitive to large sleep times.
  Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
  Reviewed-by: David Marchand <david.marchand@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev: Add PMD load based sleeping. | Kevin Traynor | 2023-01-12 | 1 | -0/+26
  Sleep for an incremental amount of time if none of the Rx queues assigned to a PMD have at least half a batch of packets (i.e. 16 pkts) on a polling iteration of the PMD.
  Upon detecting the threshold of >= 16 pkts on an Rxq, reset the sleep time to zero (i.e. no sleep).
  Sleep time will be increased on each iteration where the low load conditions remain, up to a total of the max sleep time which is set by the user, e.g.:
    ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
  The default pmd-maxsleep value is 0, which means that no sleeps will occur and the default behaviour is unchanged from previously.
  Also add new stats to pmd-perf-show to get visibility of operation, e.g.:
    ...
     - sleep iterations: 153994 ( 76.8 % of iterations)
       Sleep time (us): 9159399 ( 59 us/iteration avg.)
    ...
  Reviewed-by: Robin Jarry <rjarry@redhat.com>
  Reviewed-by: David Marchand <david.marchand@redhat.com>
  Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
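  The sleep policy described in the commit above can be sketched roughly as follows. This is an illustrative Python model, not the dpif-netdev code: the function name `next_sleep_us` and the `inc_us` parameter are assumptions made for the example; only the 16-packet threshold, the reset to zero, and the cap at pmd-maxsleep come from the commit message.

```python
HALF_BATCH = 16  # "half a batch of packets (i.e. 16 pkts)" per the commit message


def next_sleep_us(current_us, max_rxq_pkts, max_sleep_us, inc_us=1):
    """Return the sleep time (in us) to use after one polling iteration.

    current_us   -- sleep time used on the previous iteration
    max_rxq_pkts -- largest number of packets received from any single Rxq
    max_sleep_us -- user limit (pmd-maxsleep); 0 disables sleeping
    inc_us       -- hypothetical increment per low-load iteration
    """
    if max_sleep_us == 0 or max_rxq_pkts >= HALF_BATCH:
        # Feature disabled, or at least one Rxq is busy: do not sleep.
        return 0
    # Low load persists: grow the sleep time, capped at the user limit.
    return min(current_us + inc_us, max_sleep_us)
```

  With pmd-maxsleep=500, repeated low-load iterations grow the value step by step until it reaches 500, and a single iteration with 16 or more packets on any Rxq drops it back to 0.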
* vswitch.ovsschema: Set bfd_status to ephemeral. | Daniel Ding | 2022-12-06 | 1 | -3/+4
  When openvswitch is restarted, the old bfd status is kept in the database until ovs-vswitchd is running. And if ovs-vswitchd is under high workload, updating the bfd status is deferred, which is not what we expect.
  Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev: Assume default link speed to be 10 Gbps instead of 100 Mbps. | Ilya Maximets | 2022-11-30 | 2 | -3/+3
  100 Mbps was a fair assumption 13 years ago. These days 10 Gbps seems like a good value in case no information is available otherwise.
  The change mainly affects QoS, which is currently limited to 100 Mbps if the user didn't specify 'max-rate' and the card doesn't report the speed or OVS doesn't have a predefined enumeration for the speed reported by the NIC.
  Calculation of the path cost for STP/RSTP is also affected if OVS is unable to determine the link speed.
  Lower link speed adapters are typically good at reporting their speed, so chances for overshoot should be low. But newer high-speed adapters, for which there is no speed enumeration or if there are some other issues, will not suffer that much.
  Acked-by: Mike Pattrick <mkp@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* vswitchd: Publish per iface received multicast packets. | David Marchand | 2022-11-24 | 1 | -0/+1
  The count of received multicast packets has been computed internally, but not exposed to ovsdb. Fix this.
  Signed-off-by: David Marchand <david.marchand@redhat.com>
  Acked-by: Mike Pattrick <mkp@redhat.com>
  Acked-by: Michael Santana <msantana@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* vswitch.xml: Fix the name of rstp-path-cost option. | Ilya Maximets | 2022-11-02 | 1 | -1/+1
  For some reason it is documented as 'rstp-port-path-cost', while the code and some other bits of documentation use 'rstp-path-cost'.
  Fixes: 9efd308e957c ("Rapid Spanning Tree Protocol (IEEE 802.1D).")
  Reviewed-by: David Marchand <david.marchand@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* docs: Remove remaining references to OVS kmod and XenServer. | Ilya Maximets | 2022-08-15 | 1 | -49/+16
  The README file still mentions a kernel module, and some parts of the documentation still have XenServer references, e.g. 'xs-*' database configuration options. Removing them.
  Fixes: 422e90437854 ("make: Remove the Linux datapath.")
  Fixes: 83c9518e7c67 ("xenserver: Remove xenserver.")
  Acked-by: Aaron Conole <aconole@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* xenserver: Remove xenserver. | Greg Rose | 2022-08-15 | 4 | -151/+5
  Remove the current xenserver implementation - it is obsolete and since 3.0 we do not support kernel module builds [1].
  1. https://mail.openvswitch.org/pipermail/ovs-dev/2022-July/395789.html
  [i.maximets] Can be added back if people willing to maintain it are found.
  Signed-off-by: Greg Rose <gvrose8192@gmail.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ofproto/bond: Add knob 'all-members-active'. | Christophe Fontaine | 2022-07-15 | 2 | -0/+17
  This config param allows the delivery of broadcast and multicast packets to the secondary interface of non-lacp bonds, equivalent to the option 'all_slaves_active' for Linux kernel bonds.
  Reported-at: https://bugzilla.redhat.com/1720935
  Signed-off-by: Christophe Fontaine <cfontain@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* odp-execute: Add command to switch action implementation. | Emma Finn | 2022-07-15 | 1 | -0/+1
  This commit adds a new command to allow the user to switch the active action implementation at runtime.
  Usage:
    $ ovs-appctl odp-execute/action-impl-set scalar
  This commit also adds a new command to retrieve the list of available action implementations. This can be used to check what implementations of actions are available and what implementation is active during runtime.
  Usage:
    $ ovs-appctl odp-execute/action-impl-show
  Added separate test-case for ovs-actions show/set commands:
    odp-execute - actions implementation
  Signed-off-by: Emma Finn <emma.finn@intel.com>
  Signed-off-by: Kumar Amber <kumar.amber@intel.com>
  Signed-off-by: Sunil Pai G <sunil.pai.g@intel.com>
  Co-authored-by: Kumar Amber <kumar.amber@intel.com>
  Co-authored-by: Sunil Pai G <sunil.pai.g@intel.com>
  Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
  Acked-by: Eelco Chaudron <echaudro@redhat.com>
  Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* odp-execute: Add function pointers to odp-execute for different action implementations. | Emma Finn | 2022-07-15 | 1 | -0/+3
  This commit introduces the initial infrastructure required to allow different implementations for OvS actions. The patch introduces action function pointers which allow the user to switch between the different action implementations available. This will allow for more performance and flexibility, so the user can choose the action implementation that best suits their use case.
  Signed-off-by: Emma Finn <emma.finn@intel.com>
  Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
  Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
  Acked-by: Eelco Chaudron <echaudro@redhat.com>
  Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* man: Fix various typos across manual pages. | Frode Nordahl | 2022-07-14 | 1 | -2/+2
  As reported by Debian lintian.
  Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-dpdk: Add shared mempool config. | Kevin Traynor | 2022-07-14 | 1 | -0/+37
  Mempools may currently be shared between DPDK ports based on port MTU and NUMA. With some hint from the user we can increase the sharing on MTU and hence reduce memory consumption in many cases.
  For example, a port with MTU 9000 uses a mempool with an mbuf size based on 9000 MTU. A port with MTU 1500 uses a different mempool with an mbuf size based on 1500 MTU.
  In this case, assuming same NUMA, both these ports could share the 9000 MTU mempool.
  The user must give a hint, as the order of creation of ports and setting of MTUs may vary and we need to ensure that upgrades from older OVS versions do not require more memory.
  This scheme can also prevent multiple mempools being created for cases where a port is added picking up a default MTU and an appropriate mempool, but later has its MTU changed to a different value requiring a different mempool.
  Example usage:
    $ ovs-vsctl --no-wait set Open_vSwitch . \
        other_config:shared-mempool-config=9000,1500:1,6000:1
  Port added on NUMA 0:
    * MTU 1500, use mempool based on 9000 MTU
    * MTU 5000, use mempool based on 9000 MTU
    * MTU 9000, use mempool based on 9000 MTU
    * MTU 9300, use mempool based on 9300 MTU (existing behaviour)
  Port added on NUMA 1:
    * MTU 1500, use mempool based on 1500 MTU
    * MTU 5000, use mempool based on 6000 MTU
    * MTU 9000, use mempool based on 9000 MTU
    * MTU 9300, use mempool based on 9300 MTU (existing behaviour)
  Default behaviour is unchanged and mempools are still only created when needed.
  Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
  Reviewed-by: David Marchand <david.marchand@redhat.com>
  Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
  Signed-off-by: Ian Stokes <ian.stokes@intel.com>
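  One way to read the shared-mempool-config example above is sketched below in Python. The selection rule used here (take the smallest configured MTU that is at least the port MTU, where an entry without ":NUMA" applies to every NUMA node) is an assumption inferred from the listed results, not a statement of how netdev-dpdk implements it; the function names are hypothetical.

```python
def parse_shared_mempool_config(value):
    """Parse 'MTU[:NUMA],...' into a list of (mtu, numa or None) tuples."""
    entries = []
    for item in value.split(","):
        mtu, _, numa = item.partition(":")
        entries.append((int(mtu), int(numa) if numa else None))
    return entries


def mempool_mtu(entries, port_mtu, port_numa):
    """Return the MTU the port's mempool would be based on.

    Assumed rule: pick the smallest configured MTU that fits the port MTU
    and applies to the port's NUMA node; otherwise fall back to the port's
    own MTU (the "existing behaviour" in the commit message).
    """
    candidates = [
        mtu
        for mtu, numa in entries
        if (numa is None or numa == port_numa) and mtu >= port_mtu
    ]
    return min(candidates, default=port_mtu)


config = parse_shared_mempool_config("9000,1500:1,6000:1")
for numa in (0, 1):
    for mtu in (1500, 5000, 9000, 9300):
        print(f"NUMA {numa}, MTU {mtu}: mempool based on "
              f"{mempool_mtu(config, mtu, numa)} MTU")
```

  Under that assumed rule the loop reproduces the eight cases listed in the commit message for NUMA 0 and NUMA 1.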
* vswitchd.xml: Fix whitespace. | Kevin Traynor | 2022-05-04 | 1 | -2/+2
  My xml editor keeps autofixing these, which means I have to be careful during 'git add' for unrelated changes. Might as well just fix them.
  Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
  Reviewed-by: David Marchand <david.marchand@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovs-monitor-ipsec: Allow custom options per tunnel. | Andreas Karis | 2022-05-04 | 1 | -1/+3
  Tunnels in LibreSwan and OpenSwan allow for many options to be set on a per tunnel basis. Pass through any options starting with ipsec_ to the connection in the configuration file. Administrators are responsible for picking valid key/value pairs.
  Signed-off-by: Andreas Karis <ak.karis@gmail.com>
  Acked-by: Mike Pattrick <mkp@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* hmap: use short version of safe loops if possible. | Adrian Moreno | 2022-03-30 | 1 | -28/+28
  Using the SHORT version of the *_SAFE loops makes the code cleaner and less error-prone. So, use the SHORT version and remove the extra variable when possible for hmap and all its derived types.
  In order to be able to use both long and short versions without changing the name of the macro for all the clients, overload the existing name and select the appropriate version depending on the number of arguments.
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Acked-by: Eelco Chaudron <echaudro@redhat.com>
  Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* list: use short version of safe loops if possible. | Adrian Moreno | 2022-03-30 | 1 | -8/+8
  Using the SHORT version of the *_SAFE loops makes the code cleaner and less error-prone. So, use the SHORT version and remove the extra variable when possible.
  In order to be able to use both long and short versions without changing the name of the macro for all the clients, overload the existing name and select the appropriate version depending on the number of arguments.
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Acked-by: Eelco Chaudron <echaudro@redhat.com>
  Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* vswitchd.xml: Add missing tx-steering PMD option. | Maxime Coquelin | 2022-01-31 | 1 | -0/+23
  This patch documents PMD's other_config:tx-steering option.
  Acked-by: Kevin Traynor <ktraynor@redhat.com>
  Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-offload: Add multi-thread API. | Gaetan Rivet | 2022-01-19 | 1 | -0/+16
  Expose functions reporting user configuration of offloading threads, as well as utility functions for multithreading.
  This will only expose the configuration knob to the user, while no datapath will implement the multiple thread request.
  This will allow implementations to use this API for offload thread management in relevant layers before enabling the actual dataplane implementation.
  The offload thread ID is lazily allocated and can as such be in a different order than the offload thread start sequence.
  The RCU thread will sometimes access hardware-offload objects from a provider for reclamation purposes. In such a case, it will get a default offload thread ID of 0. Care must be taken that using this thread ID is safe concurrently with the offload threads.
  Signed-off-by: Gaetan Rivet <grive@u256.net>
  Reviewed-by: Eli Britstein <elibr@nvidia.com>
  Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* openvswitch: Define the OVS_STATIC_TRACE() macro. | Eelco Chaudron | 2022-01-18 | 1 | -0/+3
  This patch defines the OVS_STATIC_TRACE() macro, and as an example, adds two of them in the bridge run loop.
  Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
  Acked-by: Paolo Valerio <pvalerio@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* bridge: Fix incorrect configuration of netdev's dpif type. | Ilya Maximets | 2021-12-17 | 1 | -2/+0
  netdev_set_dpif_type() can only be used with a normalized dpif type as an argument, which is a constant static string derived from a type of a dpif_class or a constant string "system". Usage of the same constant string allows the netdev-offload module to compare types by simply comparing pointers.
  OTOH, 'br->ofproto->type' is a dynamic string that:
    a. Can be NULL.
    b. Even if not NULL and equal, can be a different dynamically allocated string.
  Both these qualities break assumptions made by all other modules related to HW offload, breaking the functionality.
  Fix that by moving netdev_set_dpif_type() to dpif.c and calling it with a correct constant string as an argument. The call is moved from bridge.c to dpif.c, because we need to have access to the dpif class, but bridge.c should not.
  Not trying to set the dpif_type inside netdev_ports_insert(), because it's used now outside the offloading context. So, it's cleaner to move the netdev_set_dpif_type() call outside of the netdev-offload module.
  Additionally removed the redundant call from netdev_ports_insert() and refactored the function, since it doesn't need an extra argument anymore.
  Fixes: 4f19a78a61c5 ("netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs.")
  Reported-by: Roi Dayan <roid@nvidia.com>
  Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-December/390117.html
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
  Reviewed-by: Lin Huang <linhuang@ruijie.com.cn>
  Acked-by: Roi Dayan <roid@nvidia.com>
* netdev-vport: Fix userspace tunnel ioctl(SIOCGIFINDEX) info logs. | Lin Huang | 2021-12-08 | 1 | -0/+2
  A userspace tunnel doesn't have a valid device in the kernel. So the get_ifindex() function (ioctl) always gets an error when adding a port, deleting a port or updating a port status.
  The info log is:
    "2021-08-29T09:17:39.830Z|00059|netdev_linux|INFO|ioctl(SIOCGIFINDEX) on vxlan_sys_4789 device failed: No such device"
  If there are a lot of userspace tunnel ports on a bridge, the iface_refresh_netdev_status() function will spend a lot of time.
  So skip the ioctl(SIOCGIFINDEX) operation for userspace tunnel ports and just return -ENODEV.
  Signed-off-by: Lin Huang <linhuang@ruijie.com.cn>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-idl: Add memory report function. | Ilya Maximets | 2021-11-04 | 1 | -0/+2
  Added a new function to return memory usage statistics for database objects inside IDL. The statistics are similar to what ovsdb-server reports. Not counting the _Server database as it should be small, so it isn't worth adding extra code to the ovsdb-cs module. Can be added later if needed.
  ovs-vswitchd is a user in OVS, but this API will be mostly useful for OVN daemons.
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
  Acked-by: Han Zhou <hzhou@ovn.org>
  Acked-by: Dumitru Ceara <dceara@redhat.com>
* ovsdb-data: Optimize union of sets. | Ilya Maximets | 2021-09-24 | 1 | -3/+3
  Current algorithm of ovsdb_datum_union looks like this:
    for-each atom in b:
        if not bin_search(a, atom):
            push(a, clone(atom))
    quicksort(a)
  So, the complexity looks like this:
    Nb * log2(Na)   +   Nb   +   (Na + Nb) * log2(Na + Nb)
    Comparisons        clones    Comparisons for quicksort
    for search
  ovsdb_datum_union() is heavily used in database transactions when a new element is added to a set. For example, if a new logical switch port is added to a logical switch in OVN. This is a very common use case where a CMS adds one new port to an existing switch that already has, let's say, 100 ports. For this case ovsdb-server will have to perform:
    1 * log2(100)   +   1 clone   +   101 * log2(101)
    Comparisons                       Comparisons for
    for search                        quicksort
        ~7                1               ~707
  Roughly 714 comparisons of atoms and 1 clone.
  Since binary search can give us the position where the new atom should go (it's the 'low' index after the search completion) for free, the logic can be re-worked like this:
    copied = 0
    for-each atom in b:
        desired_position = bin_search(a, atom)
        push(result, a[ copied : desired_position - 1 ])
        copied = desired_position
        push(result, clone(atom))
    push(result, a[ copied : Na ])
    swap(a, result)
  Complexity of this schema:
    Nb * log2(Na)   +   Nb   +   Na
    Comparisons        clones    memory copy on push
    for search
  'swap' is just a swap of a few pointers. 'push' is not a 'clone', but a simple memory copy of 'union ovsdb_atom'.
  In general, this schema substitutes the complexity of a quicksort with the complexity of a memory copy of Na atom structures, where we're not even copying strings that these atoms are pointing to.
  Complexity in the example above goes down from 714 comparisons to 7 comparisons and memcpy of 100 * sizeof (union ovsdb_atom) bytes.
  General complexity of a memory copy should always be lower than complexity of a quicksort, especially because these copies are usually performed in bulk, so this new schema should work faster for any input.
  All in all, this change allows executing several times more transactions per second for transactions that add new entries to sets.
  Alternatively, union can be implemented as a linear merge of two sorted arrays, but this will result in O(Na) comparisons, which is more than Nb * log2(Na) in the common case, since Na is usually far bigger than Nb. Linear merge will also mean per-atom memory copies instead of copying in bulk.
  The 'replace' functionality of ovsdb_datum_union() had no users, so it is just removed. But it can easily be added back if needed in the future.
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
  Acked-by: Han Zhou <hzhou@ovn.org>
  Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
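  The re-worked union from the commit above can be modelled in a few lines of Python. This is only a sketch of the idea (binary search for the insert position, bulk copy of the untouched run of 'a'), not the C code in ovsdb-data: it assumes both inputs are sorted and duplicate-free, uses plain comparable values in place of 'union ovsdb_atom', and adds an explicit "already present" check; `union_sorted` is a name made up for the example.

```python
from bisect import bisect_left


def union_sorted(a, b):
    """Return the sorted union of two sorted, duplicate-free lists."""
    result = []
    copied = 0  # number of elements of 'a' already copied into 'result'
    for atom in b:
        # Binary search, starting from what was already copied, for the
        # position where 'atom' belongs in 'a'.
        pos = bisect_left(a, atom, copied)
        # Bulk-copy the run of 'a' that sorts before 'atom'.
        result.extend(a[copied:pos])
        copied = pos
        if pos < len(a) and a[pos] == atom:
            continue  # 'atom' is already in 'a'; it is copied by a later run
        result.append(atom)
    # Copy whatever is left of 'a' after the last insert position.
    result.extend(a[copied:])
    return result
```

  Adding one element to a 100-element set this way costs one binary search (about 7 comparisons) plus a copy of the existing elements, with no re-sort of the result.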
* dpdk: Stop configuring socket-limit with the value of socket-mem. | Rosemarie O'Riorden | 2021-07-26 | 1 | -7/+1
  This change removes the automatic memory limit on start-up of OVS with DPDK. As DPDK supports dynamic memory allocation, there is no need to limit the amount of memory available, if not requested.
  Currently, if socket-limit is not configured, it is set to the value of socket-mem. With this change, the user can decide to set it or have no memory limit.
  Removed logs that announce this change and fixed documentation.
  Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850
  Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpdk: Remove default values for socket-mem and limit. | Rosemarie O'Riorden | 2021-07-26 | 1 | -9/+9
  This change removes the default values for EAL args socket-mem and socket-limit. As DPDK supports dynamic memory allocation, there is no need to allocate a certain amount of memory on start-up, nor limit the amount of memory available, if not requested.
  Currently, socket-mem has a default value of 1024 when it is not configured by the user, and socket-limit takes on the value of socket-mem, 1024, by default. With this change, socket-mem is not configured by default, meaning that socket-limit is not either. Neither, either or both options can be set.
  Removed extra logs that announce this change and fixed documentation.
  Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850
  Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netlink: Introduce per-cpu upcall dispatch. | Mark Gray | 2021-07-16 | 2 | -9/+15
  The Open vSwitch kernel module uses the upcall mechanism to send packets from kernel space to user space when it misses in the kernel space flow table. The upcall sends packets via a Netlink socket. Currently, a Netlink socket is created for every vport. In this way, there is a 1:1 mapping between a vport and a Netlink socket. When a packet is received by a vport, if it needs to be sent to user space, it is sent via the corresponding Netlink socket.
  This mechanism, with various iterations of the corresponding user space code, has seen some limitations and issues:
    * On systems with a large number of vports, there is correspondingly a large number of Netlink sockets which can limit scaling. (https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
    * Packet reordering on upcalls. (https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
    * A thundering herd issue. (https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
  This patch introduces an alternative, feature-negotiated, upcall mode using a per-cpu dispatch rather than a per-vport dispatch. In this mode, the Netlink socket to be used for the upcall is selected based on the CPU of the thread that is executing the upcall. In this way, it resolves the issues above as:
    a) The number of Netlink sockets scales with the number of CPUs rather than the number of vports.
    b) Ordering per-flow is maintained as packets are distributed to CPUs based on mechanisms such as RSS and flows are distributed to a single user space thread.
    c) Packets from a flow can only wake up one user space thread.
  Reported-at: https://bugzilla.redhat.com/1844576
  Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
  Acked-by: Flavio Leitner <fbl@sysclose.org>
  Acked-by: Aaron Conole <aconole@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev: Allow pin rxq and non-isolate PMD. | Kevin Traynor | 2021-07-16 | 1 | -0/+19
  Pinning an rxq to a PMD with pmd-rxq-affinity may be done for various reasons such as reserving a full PMD for an rxq, or to ensure that multiple rxqs from a port are handled on different PMDs.
  Previously pmd-rxq-affinity always isolated the PMD so no other rxqs could be assigned to it by OVS. There may be cases where there are unused cycles on those PMDs and the user would like other rxqs to also be able to be assigned to them by OVS.
  Add an option to pin the rxq and non-isolate the PMD. The default behaviour is unchanged, which is pin and isolate the PMD.
  In order to pin and non-isolate:
    ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
  Note this is available only with group assignment type, as pinning conflicts with the operation of the other rxq assignment algorithms.
  Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
  Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
  Acked-by: David Marchand <david.marchand@redhat.com>
  Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev: Add group rxq scheduling assignment type. | Kevin Traynor | 2021-07-16 | 1 | -1/+4
  Add an rxq scheduling option that allows rxqs to be grouped on a pmd based purely on their load.
  The current default 'cycles' assignment sorts rxqs by measured processing load and then assigns them to a list of round robin PMDs. This helps to keep the rxqs that require most processing on different cores but as it selects the PMDs in round robin order, it equally distributes rxqs to PMDs.
  'cycles' assignment has the advantage that it keeps the most loaded rxqs off the same core while still spreading the rxqs across a broad range of PMDs to mitigate against changes to traffic pattern.
  'cycles' assignment has the disadvantage that in order to make the trade off between optimising for current traffic load and mitigating against future changes, it tries to assign an equal number of rxqs per PMD in a round robin manner and this can lead to a less than optimal balance of the processing load.
  Now that PMD auto load balance can help mitigate future changes in traffic patterns, a 'group' assignment can be used to assign rxqs based on their measured cycles and the estimated running total of the PMDs.
  In this case, there is no restriction about keeping an equal number of rxqs per PMD as it is purely load based.
  This means that one PMD may have a group of low load rxqs assigned to it while another PMD has one high load rxq assigned to it, as that is the best balance of their measured loads across the PMDs.
  Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
  Acked-by: Sunil Pai G <sunil.pai.g@intel.com>
  Acked-by: David Marchand <david.marchand@redhat.com>
  Signed-off-by: Ian Stokes <ian.stokes@intel.com>
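  The difference between the two assignment types described above can be illustrated with a small Python model. It is a deliberately simplified sketch: 'cycles' is modelled as plain round robin over rxqs sorted by load, 'group' as "give the next rxq to the PMD with the lowest running total", and NUMA, pinning and isolation are ignored. The function names and the example loads are made up for illustration and are not taken from dpif-netdev.

```python
def by_load_desc(rxq_loads):
    """Return (rxq, load) pairs sorted from most to least loaded."""
    return sorted(rxq_loads.items(), key=lambda kv: kv[1], reverse=True)


def assign_cycles(rxq_loads, n_pmds):
    """Round robin over load-sorted rxqs: equal rxq count per PMD."""
    pmds = [[] for _ in range(n_pmds)]
    for i, (rxq, _load) in enumerate(by_load_desc(rxq_loads)):
        pmds[i % n_pmds].append(rxq)
    return pmds


def assign_group(rxq_loads, n_pmds):
    """Purely load based: each rxq goes to the least loaded PMD so far."""
    pmds = [[] for _ in range(n_pmds)]
    totals = [0] * n_pmds  # estimated running total of load per PMD
    for rxq, load in by_load_desc(rxq_loads):
        target = totals.index(min(totals))
        pmds[target].append(rxq)
        totals[target] += load
    return pmds


loads = {"a": 80, "b": 10, "c": 10, "d": 10}  # hypothetical % of a core
print(assign_cycles(loads, 2))  # [['a', 'c'], ['b', 'd']] -> 90 vs 20
print(assign_group(loads, 2))   # [['a'], ['b', 'c', 'd']] -> 80 vs 30
```

  In this example 'group' leaves the heavily loaded rxq alone on one PMD and groups the three light rxqs on the other, which is the situation the last paragraph of the commit message describes.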
* ofproto-dpif: APIs and CLI option to add/delete static fdb entry. | Vasu Dasari | 2021-07-16 | 1 | -0/+6
  Currently there is an option to add/flush/show ARP/ND neighbor. This covers L3 side. For L2 side, there is only fdb show command. This commit gives an option to add/del an fdb entry via ovs-appctl.
  CLI command looks like:
    To add:
      ovs-appctl fdb/add <bridge> <port> <vlan> <Mac>
      ovs-appctl fdb/add br0 p1 0 50:54:00:00:00:05
    To del:
      ovs-appctl fdb/del <bridge> <vlan> <Mac>
      ovs-appctl fdb/del br0 0 50:54:00:00:00:05
  Added two new APIs to provide convenient interface to add and delete static-macs.
    bool xlate_add_static_mac_entry(const struct ofproto_dpif *, ofp_port_t in_port, struct eth_addr dl_src, int vlan);
    bool xlate_delete_static_mac_entry(const struct ofproto_dpif *, struct eth_addr dl_src, int vlan);
  1. Static entry should not age. To indicate that entry being programmed is a static entry, 'expires' field in 'struct mac_entry' will be set to a MAC_ENTRY_AGE_STATIC_ENTRY. A check for this value is made while deleting mac entry as part of regular aging process.
  2. Another change to the mac-update logic, when a packet with same dl_src as that of a static-mac entry arrives on any port, the logic will not modify the expires field.
  3. While flushing fdb entries, made sure static ones are not evicted.
  4. Updated "ovs-appctl fdb/stats-show br0" to display number of static entries in switch.
  Added following tests:
    ofproto-dpif - static-mac add/del/flush
    ofproto-dpif - static-mac mac moves
  Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048894.html
  Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1597752
  Signed-off-by: Vasu Dasari <vdasari@gmail.com>
  Tested-by: Eelco Chaudron <echaudro@redhat.com>
  Acked-by: Eelco Chaudron <echaudro@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
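  The four numbered behaviours above (static entries do not age, are not updated by learning, survive a flush, and are counted separately) can be modelled with a toy MAC table. This Python sketch is not the OVS mac-learning code: the class, its method names, the `STATIC` sentinel standing in for MAC_ENTRY_AGE_STATIC_ENTRY, and the 300-second idle time are all assumptions made for the example.

```python
STATIC = None  # hypothetical stand-in for MAC_ENTRY_AGE_STATIC_ENTRY


class MacTable:
    def __init__(self, idle_time=300):
        self.idle_time = idle_time  # assumed aging time, in seconds
        self.entries = {}           # (vlan, mac) -> [port, expires]

    def add_static(self, port, vlan, mac):
        self.entries[(vlan, mac)] = [port, STATIC]

    def delete_static(self, vlan, mac):
        entry = self.entries.get((vlan, mac))
        if entry is None or entry[1] is not STATIC:
            return False
        del self.entries[(vlan, mac)]
        return True

    def learn(self, port, vlan, mac, now):
        entry = self.entries.get((vlan, mac))
        if entry is not None and entry[1] is STATIC:
            return  # item 2: traffic never moves or refreshes a static entry
        self.entries[(vlan, mac)] = [port, now + self.idle_time]

    def expire(self, now):
        # Item 1: only dynamic entries are subject to aging.
        self.entries = {k: v for k, v in self.entries.items()
                        if v[1] is STATIC or v[1] > now}

    def flush(self):
        # Item 3: a flush evicts dynamic entries only.
        self.entries = {k: v for k, v in self.entries.items()
                        if v[1] is STATIC}

    def n_static(self):
        # Item 4: number of static entries, as a stats command might report.
        return sum(1 for v in self.entries.values() if v[1] is STATIC)
```

  A static entry added for port p1 stays on p1 even if the same source MAC later shows up on p2, and it is still present after both `expire()` and `flush()`.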
* dpdk: Logs to announce removal of defaults for socket-mem and limit.Rosemarie O'Riorden2021-07-161-2/+6
| | | | | | | | | | | Deprecate current OVS provided defaults for DPDK socket-mem and socket-limit that are planned to be removed in OVS 2.17. At that point DPDK defaults will be used instead. Warnings have been added to alert users in advance. Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* conntrack: Document all-zero IP SNAT behavior and add a test case.Eelco Chaudron2021-07-081-0/+9
| | | | | | | | | | | | | | | | | | | | Currently, conntrack in the kernel has an undocumented feature referred to as all-zero IP address SNAT. Basically, when a source port collision is detected during the commit, the source port will be translated to an ephemeral port. If there is no collision, no SNAT is performed. This patchset documents this behavior and adds a self-test to verify that it does not change. In addition, a datapath feature flag is added for the all-zero IP SNAT case. This will help applications on top of OVS, like OVN, to determine whether this feature can be used. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Acked-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org> Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* bridge: Use correct (legacy) role names in database.Ben Pfaff2021-07-071-2/+2
| | | | | | | | | | The vswitchd database schema requires role names to be "master" or "slave", but this code tried to use "primary" and "secondary". Signed-off-by: Ben Pfaff <blp@ovn.org> Reported-at: https://github.com/openvswitch/ovs-issues/issues/218 Tested-at: https://github.com/openvswitch/ovs-issues/issues/218#issuecomment-875374045 Fixes: 807152a4ddfb ("Use primary/secondary, not master/slave, as names for OpenFlow roles.")
* bridge: fix type mismatchYunjian Wang2021-07-021-5/+5
| | | | | | | | | | | | | | Currently, ofproto_set_flow_limit() does not check the 'limit' value. It may be negative, which leads to a large unsigned value. The 'limit' should never be negative, so it is better to just use smap_get_uint() to get it right. Also fix ofproto_set_max_idle(), ofproto_set_min_revalidate_pps(), ofproto_set_max_revalidator() and ofproto_set_bundle_idle_timeout() in the same way. Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
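The pitfall described in this commit is standard C integer conversion: a negative int converted to unsigned wraps around to a very large value. The sketch below shows the wraparound and one way to avoid it by parsing the setting as unsigned from the start. `naive_limit()` and `parse_uint_or_default()` are hypothetical helpers written for this illustration; the latter is not the OVS smap_get_uint() and may differ from it in detail.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* The pitfall: a negative int converted to unsigned wraps around. */
unsigned int
naive_limit(int limit)
{
    return (unsigned int) limit;    /* -1 becomes UINT_MAX. */
}

/* Hypothetical helper: parse 's' as an unsigned decimal integer and
 * fall back to 'def' on anything else (NULL, sign, junk, overflow). */
unsigned int
parse_uint_or_default(const char *s, unsigned int def)
{
    char *end;
    unsigned long value;

    /* Require a leading digit so that "-1", "+1" and " 1" are rejected
     * instead of being silently accepted by strtoul(). */
    if (!s || !isdigit((unsigned char) *s)) {
        return def;
    }

    errno = 0;
    value = strtoul(s, &end, 10);
    if (errno || *end != '\0' || value > UINT_MAX) {
        return def;
    }
    return (unsigned int) value;
}
```

For example, `naive_limit(-1)` yields `UINT_MAX`, whereas `parse_uint_or_default("-1", 200000)` falls back to the supplied default of 200000 (an arbitrary value chosen for this example).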
* bridge: Only an inactivity_probe of 0 should turn off inactivity probes.Ben Pfaff2021-07-021-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | The documentation for inactivity_probe says this: inactivity_probe: optional integer Maximum number of milliseconds of idle time on connection to controller before sending an inactivity probe message. If Open vSwitch does not communicate with the controller for the specified number of seconds, it will send a probe. If a response is not received for the same additional amount of time, Open vSwitch assumes the connection has been broken and attempts to reconnect. Default is implementation-specific. A value of 0 disables inactivity probes. This means that a value of 0 should disable inactivity probes and any other value should be in milliseconds. The code in bridge.c was actually interpreting it as any value between 0 and 999 disabling inactivity probes. That was surprising when I accidentally configured it to 5 or to 10, not remembering that it was in milliseconds, and disabled them entirely. This fixes the problem. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ilya Maximets <i.maximets@ovn.org>
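A behavior like the one described here typically arises from integer division when converting milliseconds to seconds: every value from 1 to 999 truncates to 0, which then reads as "disabled". The sketch below reproduces that effect and shows one possible fix. Both functions are hypothetical and are not the code in bridge.c; in particular, rounding small nonzero values up to one second is an assumption of this sketch.

```c
/* Assumed pre-fix behavior: plain truncating division, so 1..999 ms
 * all become 0 seconds, which means "probes disabled". */
int
probe_interval_truncating(int inactivity_probe_ms)
{
    return inactivity_probe_ms / 1000;
}

/* Sketch of a fix: only an explicit 0 disables probes; any other
 * value is converted to seconds with a floor of 1 second. */
int
probe_interval_fixed(int inactivity_probe_ms)
{
    int seconds;

    if (inactivity_probe_ms == 0) {
        return 0;               /* Explicitly disabled. */
    }
    seconds = inactivity_probe_ms / 1000;
    return seconds < 1 ? 1 : seconds;
}
```

With these definitions, a setting of 5 gives 0 (disabled) in the truncating version but 1 second in the fixed version, while 0 still disables probes and 5000 gives 5 seconds in both.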
* add port-based ingress policing based packet-per-second rate-limitingYong Xu2021-07-013-6/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | OVS has support for using policing to enforce a rate limit in kilobits per second. This is configured using OVSDB, e.g.: $ ovs-vsctl set interface tap0 ingress_policing_rate=1000 $ ovs-vsctl set interface tap0 ingress_policing_burst=100 This patch adds a related feature, allowing policing to enforce a rate limit in kilo-packets per second. This is also configured using OVSDB. $ ovs-vsctl set interface tap0 ingress_policing_kpkts_rate=1000 $ ovs-vsctl set interface tap0 ingress_policing_kpkts_burst=100 The kilo-bit and kilo-packet rate limits may be used separately or in combination. Add separate action for BPS and PPS in netlink message. Revise code and change action result to pipe to allow traffic pipe into second action. This patch implements the feature for: * OVSDB (northbound API) * TC policer when used both with and without TC offload (kernel API) Signed-off-by: Yong Xu <yong.xu@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
* Documentation: Fix DPDK qos example.William Tu2021-03-011-2/+4
| | | | | | | | | | | | | | Fix the example use case based on the description. EIR and CIR are measured in bytes/sec and assume a 64-byte IP packet size without the 14-byte Ethernet header. So fix the 1000pps example by: (64 - 14) * 1000 = 50,000 If the frame includes 4-byte FCS header, then it's (64 - 14 - 4) * 1000 = 46,000 Fixes: e61bdffc2a98 ("netdev-dpdk: Add new DPDK RFC 4115 egress policer") Signed-off-by: William Tu <u9012063@gmail.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
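The arithmetic in this commit message can be written as a one-line helper. `policer_bytes_per_sec()` is a hypothetical function for illustration only and is not part of netdev-dpdk; it simply subtracts the bytes that are not counted from the frame length and multiplies by the packet rate.

```c
#include <assert.h>

/* Hypothetical helper: byte rate for a policer that counts only the
 * bytes left after removing 'uncounted_bytes' from each frame. */
unsigned long
policer_bytes_per_sec(unsigned int frame_len, unsigned int uncounted_bytes,
                      unsigned long pkts_per_sec)
{
    assert(frame_len >= uncounted_bytes);
    return (unsigned long) (frame_len - uncounted_bytes) * pkts_per_sec;
}
```

For 64-byte frames at 1000 pps, passing 14 uncounted bytes (Ethernet header) gives 50,000 bytes/sec, and passing 18 (Ethernet header plus the 4-byte FCS) gives 46,000 bytes/sec, matching the two figures above.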
* vswitchd.xml: Fix supported IPsec tunnels.Mark Gray2021-02-011-2/+2
| | | | | | | | 'ovs-monitor-ipsec' does not support 'ip6gre' tunnels. Fixes: 22c5eafb6efa ("ipsec: reintroduce IPsec support for tunneling") Signed-off-by: Mark Gray <mark.d.gray@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev: Add parameters to configure PMD auto load balance.Christophe Fontaine2021-01-151-2/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Two important parts of how PMD auto load balance operates are how loaded a core needs to be and how much improvement is estimated before a PMD auto load balance can trigger. Previously they were hardcoded to 95% loaded and 25% variance improvement. These default values may not be suitable for all use cases and we may want to use a more (or less) aggressive rebalance, either on the pmd load threshold or on the minimum variance improvement threshold. The defaults are not changed, but "pmd-auto-lb-load-threshold" and "pmd-auto-lb-improvement-threshold" parameters are added to override the defaults. $ ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-load-threshold="70" $ ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-improvement-threshold="20" Signed-off-by: Christophe Fontaine <cfontain@redhat.com> Co-Authored-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: David Marchand <david.marchand@redhat.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* tunnel: Bareudp Tunnel Support.Martin Varghese2020-12-221-6/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are various L3 encapsulation standards using UDP being discussed to leverage the UDP based load balancing capability of different networks. MPLSoUDP (https://tools.ietf.org/html/rfc7510) is one among them. The Bareudp tunnel provides a generic L3 encapsulation support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside a UDP tunnel. An example to create bareudp device to tunnel MPLS traffic is given below: $ ovs-vsctl add-port br_mpls udp_port -- set interface udp_port \ type=bareudp options:remote_ip=2.1.1.3 options:local_ip=2.1.1.2 \ options:payload_type=0x8847 options:dst_port=6635 The bareudp device supports special handling for MPLS & IP as they can have multiple ethertypes. MPLS protocol can have ethertypes ETH_P_MPLS_UC (unicast) & ETH_P_MPLS_MC (multicast). IP protocol can have ethertypes ETH_P_IP (v4) & ETH_P_IPV6 (v6). The bareudp device to tunnel L3 traffic with multiple ethertypes (MPLS & IP) can be created by passing the L3 protocol name as string in the field payload_type. An example to create bareudp device to tunnel MPLS unicast & multicast traffic is given below: $ ovs-vsctl add-port br_mpls udp_port -- set interface udp_port \ type=bareudp options:remote_ip=2.1.1.3 options:local_ip=2.1.1.2 \ options:payload_type=mpls options:dst_port=6635 Signed-off-by: Martin Varghese <martin.varghese@nokia.com> Acked-By: Greg Rose <gvrose8192@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-dpdk: Add option to configure VF MAC address.Gaetan Rivet2020-11-161-0/+18
| | | | | | | | | | | | | | | | | | | | In some cloud topologies, using DPDK VF representors in guest requires configuring a VF before it is assigned to the guest. A first basic option for such configuration is setting the VF MAC address. Add a key 'dpdk-vf-mac' to the 'options' column of the Interface table. This option can be used as such: $ ovs-vsctl add-port br0 dpdk-rep0 -- set Interface dpdk-rep0 type=dpdk \ options:dpdk-vf-mac=00:11:22:33:44:55 Suggested-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eli Britstein <elibr@nvidia.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Gaetan Rivet <grive@u256.net> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Eliminate use of term "slave" in bond, LACP, and bundle contexts.Ben Pfaff2020-10-213-55/+59
| | | | | | | | | | | | | The new term is "member". Most of these changes should not change user-visible behavior. One place where they do is in "ovs-ofctl dump-flows", which will now output "members:..." inside "bundle" actions instead of "slaves:...". I don't expect this to cause real problems in most systems. The old syntax is still supported on input for backward compatibility. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>