path: root/datapath
Commit message | Author | Age | Files | Lines
* make: Remove the Linux datapath. | Greg Rose | 2022-07-15 | 156 | -45748/+0

    Update the necessary make and configure files to remove the Linux
    datapath and then remove the datapath.

    Move datapath/linux/compat/include/linux/openvswitch.h to
    include/linux/openvswitch.h because it is needed to generate header
    files used by the userspace switch.

    Also remove references to the Linux datapath from auxiliary files and
    utilities since it is no longer supported.

    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Reviewed-by: David Marchand <david.marchand@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* openvswitch.h: Align uAPI definition with the kernel. | Ilya Maximets | 2022-03-25 | 2 | -10/+19

    Upstream commit:
        commit 1926407a4ab0e59d5a27bed7b82029b356d80fa0
        Author: Ilya Maximets <i.maximets@ovn.org>
        Date:   Wed Mar 9 23:20:33 2022 +0100

        net: openvswitch: fix uAPI incompatibility with existing user space

        Few years ago OVS user space made a strange choice in the commit [1]
        to define types only valid for the user space inside the copy of a
        kernel uAPI header. '#ifndef __KERNEL__' and another attribute was
        added later.

        This leads to the inevitable clash between user space and kernel
        types when the kernel uAPI is extended. The issue was unveiled with
        the addition of a new type for IPv6 extension header in kernel uAPI.

        When kernel provides the OVS_KEY_ATTR_IPV6_EXTHDRS attribute to the
        older user space application, application tries to parse it as
        OVS_KEY_ATTR_PACKET_TYPE and discards the whole netlink message as
        malformed. Since OVS_KEY_ATTR_IPV6_EXTHDRS is supplied along with
        every IPv6 packet that goes to the user space, IPv6 support is fully
        broken.

        Fixing that by bringing these user space attributes to the kernel
        uAPI to avoid the clash. Strictly speaking this is not the problem
        of the kernel uAPI, but changing it is the only way to avoid
        breakage of the older user space applications at this point.

        These 2 types are explicitly rejected now since they should not be
        passed to the kernel. Additionally, OVS_KEY_ATTR_TUNNEL_INFO moved
        out from the '#ifdef __KERNEL__' as there is no good reason to hide
        it from the userspace. And it's also explicitly rejected now,
        because it's for in-kernel use only.

        Comments with warnings were added to avoid the problem coming back.

        (1 << type) converted to (1ULL << type) to avoid integer overflow on
        OVS_KEY_ATTR_IPV6_EXTHDRS, since it equals 32 now.

        [1] beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")

        Fixes: 28a3f0601727 ("net: openvswitch: IPv6: Add IPv6 extension header support")
        Link: https://lore.kernel.org/netdev/3adf00c7-fe65-3ef4-b6d7-6d8a0cad8a5f@nvidia.com
        Link: https://github.com/openvswitch/ovs/commit/beb75a40fdc295bfd6521b0068b4cd12f6de507c
        Reported-by: Roi Dayan <roid@nvidia.com>
        Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
        Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
        Acked-by: Aaron Conole <aconole@redhat.com>
        Link: https://lore.kernel.org/r/20220309222033.3018976-1-i.maximets@ovn.org
        Signed-off-by: Jakub Kicinski <kuba@kernel.org>

    Not adding OVS_KEY_ATTR_IPV6_EXTHDRS in this commit as this is not
    necessary. Will be added along with the actual userspace implementation.
    This change should help avoiding incompatibility issues in the future.

    Acked-by: Aaron Conole <aconole@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
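
The shift-width remark in the commit above can be illustrated with a small sketch. It is not the kernel code: the helper names and the EXAMPLE_ATTR_IPV6_EXTHDRS constant are hypothetical, chosen only to match the "equals 32" statement. With a 32-bit int, shifting the plain constant 1 left by 32 is undefined behaviour, while widening the constant to unsigned long long first keeps every attribute number from 0 to 63 valid.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical attribute number, chosen to match the "equals 32" remark. */
enum { EXAMPLE_ATTR_IPV6_EXTHDRS = 32 };

/* Set the bit for `type` in a 64-bit attribute mask.  The constant is
 * widened to unsigned long long before the shift, so any type in
 * [0, 63] is representable.  (1 << 32) on a 32-bit int would not be. */
static inline uint64_t attr_mask_set(uint64_t mask, unsigned int type)
{
    return mask | (1ULL << type);
}

/* Test whether the bit for `type` is present in the mask. */
static inline bool attr_mask_test(uint64_t mask, unsigned int type)
{
    return (mask & (1ULL << type)) != 0;
}
```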
* Encap & Decap actions for MPLS packet type. | Martin Varghese | 2022-01-17 | 1 | -0/+31

    The encap & decap actions are extended to support MPLS packet type.
    Encap & decap actions add and remove the MPLS header at the start of
    the packet.

    The existing PUSH MPLS & POP MPLS actions insert & remove the MPLS
    header between the ethernet header and the IP header. Though this
    behaviour is fine for L3 VPN, where an IP packet is encapsulated inside
    an MPLS tunnel, it does not satisfy the L2 VPN requirements. In L2 VPN
    the ethernet packets must be encapsulated inside the MPLS tunnel.

    In this change the encap & decap actions are extended to support MPLS
    packet type. The encap & decap actions add and remove the MPLS header
    at the start of the packet as depicted below.

    Encapsulation:

    Actions - encap(mpls),encap(ethernet)

    Incoming packet -> | ETH | IP | Payload |

    1 Actions - encap(mpls) [Datapath action - ADD_MPLS:0x8847]

      Outgoing packet -> | MPLS | ETH | Payload|

    2 Actions - encap(ethernet) [Datapath action - push_eth]

      Outgoing packet -> | ETH | MPLS | ETH | Payload|

    Decapsulation:

    Incoming packet -> | ETH | MPLS | ETH | IP | Payload |

    Actions - decap(),decap(packet_type(ns=0,type=0))

    1 Actions - decap() [Datapath action - pop_eth]

      Outgoing packet -> | MPLS | ETH | IP | Payload|

    2 Actions - decap(packet_type(ns=0,type=0))
      [Datapath action - POP_MPLS:0x6558]

      Outgoing packet -> | ETH | IP | Payload|

    Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
    Acked-by: Eelco Chaudron <echaudro@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* compat: handle NF_REPEAT error on nf_conntrack_in. | Vladislav Odintsov | 2021-12-09 | 1 | -1/+8

    In patch [1] rpl_nf_conntrack_in was backported as a static inline
    function without the do..while loop handling the NF_REPEAT error. In
    patch [2] the backported rpl_nf_conntrack_in function was removed from
    compat/include/net/netfilter/nf_conntrack_core.h as unused.

    As a result the do..while loop around nf_conntrack_in was lost, and
    this caused problems on old RHEL kernels: a TCP SYN was lost on a
    connection with the same 5-tuple that was still within the last
    nf_conntrack_tcp_timeout_time_wait. The connection could only be
    initiated on a TCP SYN retry after one second.

    1: https://github.com/openvswitch/ovs/commit/4fdec8986a203b0dc9d9c183c932826967572e0f
    2: https://github.com/openvswitch/ovs/commit/e9b33ad780f3bc712a5de6be9e1e0803fadcd249

    Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-September/387623.html
    Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-October/388424.html
    Signed-off-by: Vladislav Odintsov <odivlad@gmail.com>
    Reviewed-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
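
The do..while pattern that the commit above restores can be sketched as follows. This is a minimal illustration, not the compat code itself: the verdict constants, the callback type and the fake conntrack function are all hypothetical stand-ins for NF_ACCEPT, NF_REPEAT and nf_conntrack_in.

```c
/* Hypothetical stand-ins for the netfilter verdicts; the real NF_ACCEPT
 * and NF_REPEAT constants are defined by the kernel headers. */
enum ex_verdict { EX_ACCEPT = 1, EX_REPEAT = 4 };

/* Hypothetical callback type standing in for nf_conntrack_in(). */
typedef int (*ex_conntrack_fn)(void *ctx);

/* Call the conntrack hook again for as long as it asks for a repeat,
 * and hand the first non-repeat verdict back to the caller. */
static int ex_conntrack_in_retry(ex_conntrack_fn fn, void *ctx)
{
    int ret;

    do {
        ret = fn(ctx);
    } while (ret == EX_REPEAT);

    return ret;
}

/* Fake hook for demonstration: requests a repeat a fixed number of
 * times, then accepts.  It also counts how often it was called. */
struct ex_fake_ct {
    int repeats_left;
    int calls;
};

static int ex_fake_conntrack_in(void *ctx)
{
    struct ex_fake_ct *ct = ctx;

    ct->calls++;
    if (ct->repeats_left > 0) {
        ct->repeats_left--;
        return EX_REPEAT;
    }
    return EX_ACCEPT;
}
```

Without the loop, the first EX_REPEAT verdict would be returned to the caller as if it were final, which corresponds to the lost-SYN symptom described in the commit message.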
* dpctl: dpif: Allow viewing and configuring dp cache sizes. | Eelco Chaudron | 2021-11-08 | 1 | -1/+1

    This patch adds a general way of viewing/configuring datapath cache
    sizes. With an implementation for the netlink interface.

    The ovs-dpctl/ovs-appctl show commands will display the current cache
    sizes configured:

      $ ovs-dpctl show
      system@ovs-system:
        lookups: hit:25 missed:63 lost:0
        flows: 0
        masks: hit:282 total:0 hit/pkt:3.20
        cache: hit:4 hit-rate:4.54%
        caches:
          masks-cache: size:256
        port 0: ovs-system (internal)
        port 1: br-int (internal)
        port 2: genev_sys_6081 (geneve: packet_type=ptap)
        port 3: br-ex (internal)
        port 4: eth2
        port 5: sw0p1 (internal)
        port 6: sw0p3 (internal)

    A specific cache can be configured as follows:

      $ ovs-appctl dpctl/cache-set-size DP CACHE SIZE
      $ ovs-dpctl cache-set-size DP CACHE SIZE

    For example to disable the cache do:

      $ ovs-dpctl cache-set-size system@ovs-system masks-cache 0
      Setting cache size successful, new size 0.

    Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
    Acked-by: Paolo Valerio <pvalerio@redhat.com>
    Acked-by: Flavio Leitner <fbl@sysclose.org>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpctl: dpif: Add kernel datapath cache hit output. | Eelco Chaudron | 2021-11-08 | 1 | -1/+1

    This patch adds cache usage statistics to the output:

      $ ovs-dpctl show
      system@ovs-system:
        lookups: hit:24 missed:71 lost:0
        flows: 0
        masks: hit:334 total:0 hit/pkt:3.52
        cache: hit:4 hit-rate:4.21%
        port 0: ovs-system (internal)
        port 1: genev_sys_6081 (geneve: packet_type=ptap)
        port 2: br-int (internal)
        port 3: br-ex (internal)
        port 4: eth2
        port 5: sw1p1 (internal)
        port 6: sw0p4 (internal)

    Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
    Acked-by: Flavio Leitner <fbl@sysclose.org>
    Acked-by: Paolo Valerio <pvalerio@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: handle DNAT tuple collision. | Dumitru Ceara | 2021-10-12 | 1 | -9/+13

    Upstream commit:
        commit 8aa7b526dc0b5dbf40c1b834d76a667ad672a410
        Author: Dumitru Ceara <dceara@redhat.com>
        Date:   Wed Oct 7 17:48:03 2020 +0200

        openvswitch: handle DNAT tuple collision

        With multiple DNAT rules it's possible that after destination
        translation the resulting tuples collide.

        For example, two openvswitch flows:
        nw_dst=10.0.0.10,tp_dst=10, actions=ct(commit,table=2,nat(dst=20.0.0.1:20))
        nw_dst=10.0.0.20,tp_dst=10, actions=ct(commit,table=2,nat(dst=20.0.0.1:20))

        Assuming two TCP clients initiating the following connections:
        10.0.0.10:5000->10.0.0.10:10
        10.0.0.10:5000->10.0.0.20:10

        Both tuples would translate to 10.0.0.10:5000->20.0.0.1:20 causing
        nf_conntrack_confirm() to fail because of tuple collision.

        Netfilter handles this case by allocating a null binding for SNAT at
        egress by default. Perform the same operation in openvswitch for
        DNAT if no explicit SNAT is requested by the user and allocate a
        null binding for SNAT for packets in the "original" direction.

        Reported-at: https://bugzilla.redhat.com/1877128
        Suggested-by: Florian Westphal <fw@strlen.de>
        Fixes: 05752523e565 ("openvswitch: Interface with NAT.")
        Signed-off-by: Dumitru Ceara <dceara@redhat.com>
        Signed-off-by: Jakub Kicinski <kuba@kernel.org>

    Fixes: f8f97cdce9ad ("datapath: Interface with NAT.")
    Signed-off-by: Dumitru Ceara <dceara@redhat.com>
    Acked-by: Paolo Valerio <pvalerio@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netlink: Introduce per-cpu upcall dispatch. | Mark Gray | 2021-07-16 | 1 | -0/+7

    The Open vSwitch kernel module uses the upcall mechanism to send
    packets from kernel space to user space when it misses in the kernel
    space flow table. The upcall sends packets via a Netlink socket.

    Currently, a Netlink socket is created for every vport. In this way,
    there is a 1:1 mapping between a vport and a Netlink socket. When a
    packet is received by a vport, if it needs to be sent to user space, it
    is sent via the corresponding Netlink socket.

    This mechanism, with various iterations of the corresponding user space
    code, has seen some limitations and issues:

    * On systems with a large number of vports, there is correspondingly a
      large number of Netlink sockets which can limit scaling.
      (https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
    * Packet reordering on upcalls.
      (https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
    * A thundering herd issue.
      (https://bugzilla.redhat.com/show_bug.cgi?id=1834444)

    This patch introduces an alternative, feature-negotiated, upcall mode
    using a per-cpu dispatch rather than a per-vport dispatch.

    In this mode, the Netlink socket to be used for the upcall is selected
    based on the CPU of the thread that is executing the upcall. In this
    way, it resolves the issues above as:

    a) The number of Netlink sockets scales with the number of CPUs rather
       than the number of vports.
    b) Ordering per-flow is maintained as packets are distributed to CPUs
       based on mechanisms such as RSS and flows are distributed to a
       single user space thread.
    c) Packets from a flow can only wake up one user space thread.

    Reported-at: https://bugzilla.redhat.com/1844576
    Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
    Acked-by: Flavio Leitner <fbl@sysclose.org>
    Acked-by: Aaron Conole <aconole@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
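
As a rough sketch of the per-cpu selection described above, the socket identifier can be looked up in a table indexed by the CPU that runs the upcall. The struct, the function name and the wrap-around by modulo are assumptions made for illustration; the commit message does not spell out how the table is indexed or what happens when the table is shorter than the CPU count.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table of Netlink port IDs, one per handler socket. */
struct ex_upcall_pids {
    size_t n_pids;
    const uint32_t *pids;
};

/* Pick the Netlink port ID for an upcall based on the CPU executing it.
 * Assumption: CPU numbers beyond the table wrap around via modulo, and
 * an empty table yields 0, meaning "no socket". */
static uint32_t ex_upcall_pid_for_cpu(const struct ex_upcall_pids *table,
                                      unsigned int cpu)
{
    if (table->n_pids == 0)
        return 0;
    return table->pids[cpu % table->n_pids];
}
```

Because the lookup depends only on the CPU, the number of sockets follows the CPU count rather than the vport count, which is point a) in the list above.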
* tunnel: Bareudp Tunnel Support. | Martin Varghese | 2020-12-22 | 1 | -0/+9

    There are various L3 encapsulation standards using UDP being discussed
    to leverage the UDP based load balancing capability of different
    networks. MPLSoUDP (https://tools.ietf.org/html/rfc7510) is one among
    them.

    The Bareudp tunnel provides a generic L3 encapsulation support for
    tunnelling different L3 protocols like MPLS, IP, NSH etc. inside a UDP
    tunnel.

    An example to create bareudp device to tunnel MPLS traffic is given
    below:

      $ ovs-vsctl add-port br_mpls udp_port -- set interface udp_port \
          type=bareudp options:remote_ip=2.1.1.3 options:local_ip=2.1.1.2 \
          options:payload_type=0x8847 options:dst_port=6635

    The bareudp device supports special handling for MPLS & IP as they can
    have multiple ethertypes. MPLS protocol can have ethertypes
    ETH_P_MPLS_UC (unicast) & ETH_P_MPLS_MC (multicast). IP protocol can
    have ethertypes ETH_P_IP (v4) & ETH_P_IPV6 (v6).

    The bareudp device to tunnel L3 traffic with multiple ethertypes
    (MPLS & IP) can be created by passing the L3 protocol name as string in
    the field payload_type. An example to create bareudp device to tunnel
    MPLS unicast & multicast traffic is given below:

      $ ovs-vsctl add-port br_mpls udp_port -- set interface udp_port \
          type=bareudp options:remote_ip=2.1.1.3 options:local_ip=2.1.1.2 \
          options:payload_type=mpls options:dst_port=6635

    Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
    Acked-By: Greg Rose <gvrose8192@gmail.com>
    Tested-by: Greg Rose <gvrose8192@gmail.com>
    Acked-by: Eelco Chaudron <echaudro@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: ovs_ct_exit to be done under ovs_lock | Tonghao Zhang | 2020-11-27 | 2 | -2/+5

    Upstream commit:
        commit 27de77cec985233bdf6546437b9761853265c505
        Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Date:   Fri Apr 17 02:57:31 2020 +0800

        net: openvswitch: ovs_ct_exit to be done under ovs_lock

        syzbot wrote:
        | =============================
        | WARNING: suspicious RCU usage
        | 5.7.0-rc1+ #45 Not tainted
        | -----------------------------
        | net/openvswitch/conntrack.c:1898 RCU-list traversed in non-reader section!!
        |
        | other info that might help us debug this:
        | rcu_scheduler_active = 2, debug_locks = 1
        | ...
        |
        | stack backtrace:
        | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        | Workqueue: netns cleanup_net
        | Call Trace:
        | ...
        | ovs_ct_exit
        | ovs_exit_net
        | ops_exit_list.isra.7
        | cleanup_net
        | process_one_work
        | worker_thread

        To avoid that warning, invoke the ovs_ct_exit under ovs_lock and add
        lockdep_ovsl_is_held as optional lockdep expression.

        Link: https://lore.kernel.org/lkml/000000000000e642a905a0cbee6e@google.com
        Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit")
        Cc: Pravin B Shelar <pshelar@ovn.org>
        Cc: Yi-Hung Wei <yihung.wei@gmail.com>
        Reported-by: syzbot+7ef50afd3a211f879112@syzkaller.appspotmail.com
        Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Acked-by: Pravin B Shelar <pshelar@ovn.org>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Cc: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Fixes: cb2a5486a3a3 ("datapath: conntrack: Support conntrack zone limit")
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* compat: rcu: Add support for consolidated-RCU reader checking | Joel Fernandes (Google) | 2020-11-27 | 1 | -2/+21

    Upstream commit:
        commit 28875945ba98d1b47a8a706812b6494d165bb0a0
        Author: Joel Fernandes (Google) <joel@joelfernandes.org>
        Date:   Tue Jul 16 18:12:22 2019 -0400

        rcu: Add support for consolidated-RCU reader checking

        This commit adds RCU-reader checks to list_for_each_entry_rcu() and
        hlist_for_each_entry_rcu(). These checks are optional, and are
        indicated by a lockdep expression passed to a new optional argument
        to these two macros. If this optional lockdep expression is omitted,
        these two macros act as before, checking for an RCU read-side
        critical section.

        Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
        [ paulmck: Update to eliminate return within macro and update comment. ]
        Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>

    Backport portion of upstream commit for hlist_for_each_entry_rcu()
    macro so that it can be used in following bug fix.

    Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* compat: Fix compile warning. | Greg Rose | 2020-11-16 | 1 | -1/+4

    In ../compat/nf_conntrack_reasm.c nf_frags_cache_name is declared if
    OVS_NF_DEFRAG6_BACKPORT is defined. However, later in the patch it is
    only used if HAVE_INET_FRAGS_WITH_FRAGS_WORK is defined and
    HAVE_INET_FRAGS_RND is not defined. This will cause a compile warning
    about unused variables.

    Fix it up by using the same defines that enable its use to decide if it
    should be declared and avoid the compiler warning.

    Fixes: 4a90b277baca ("compat: Fixup ipv6 fragmentation on 4.9.135+ kernels")
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* compat: Fix build issue on RHEL 7.7. | Greg Rose | 2020-11-16 | 2 | -12/+2

    RHEL 7.2 introduced a KABI fixup in struct sk_buff for the name change
    of l4_rxhash to l4_hash. Then patch 9ba57fc7cccc ("datapath: Add hash
    info to upcall") introduced a compile error by using l4_hash and not
    fixing up the HAVE_L4_RXHASH configuration flag.

    Remove all references to HAVE_L4_RXHASH and always use l4_hash to
    resolve the issue. This will break compilation on RHEL 7.0 and RHEL 7.1
    but dropping support for these older kernels shouldn't be a problem.

    Fixes: 9ba57fc7cccc ("datapath: Add hash info to upcall")
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* compat: Remove stale code. | Greg Rose | 2020-11-16 | 2 | -7/+1

    Remove stale and unused code left over after support for kernels older
    than 3.10 was removed.

    Fixes: 8063e0958780 ("datapath: Drop support for kernel older than 3.10")
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: use hlist_for_each_entry_rcu instead of hlist_for_each_entry | Tonghao Zhang | 2020-10-17 | 1 | -4/+4

    Upstream commit:
        commit 64948427a63f49dd0ce403388d232f22cc1971a8
        Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Date:   Thu Mar 26 04:27:24 2020 +0800

        net: openvswitch: use hlist_for_each_entry_rcu instead of hlist_for_each_entry

        The struct sw_flow is protected by RCU, when traversing them, use
        hlist_for_each_entry_rcu.

        Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Tested-by: Greg Rose <gvrose8192@gmail.com>
        Reviewed-by: Greg Rose <gvrose8192@gmail.com>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Compat fixup - OVS doesn't support lockdep_ovsl_is_held() yet

    Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: Distribute switch variables for initialization | Kees Cook | 2020-10-17 | 1 | -8/+10

    Upstream commit:
        commit 16a556eeb7ed2dc3709fe2c5be76accdfa4901ab
        Author: Kees Cook <keescook@chromium.org>
        Date:   Wed Feb 19 22:23:09 2020 -0800

        openvswitch: Distribute switch variables for initialization

        Variables declared in a switch statement before any case statements
        cannot be automatically initialized with compiler instrumentation
        (as they are not part of any execution flow). With GCC's proposed
        automatic stack variable initialization feature, this triggers a
        warning (and they don't get initialized). Clang's automatic stack
        variable initialization (via CONFIG_INIT_STACK_ALL=y) doesn't throw
        a warning, but it also doesn't initialize such variables[1]. Note
        that these warnings (or silent skipping) happen before the
        dead-store elimination optimization phase, so even when the
        automatic initializations are later elided in favor of direct
        initializations, the warnings remain.

        To avoid these problems, move such variables into the "case" where
        they're used or lift them up into the main function body.

        net/openvswitch/flow_netlink.c: In function ‘validate_set’:
        net/openvswitch/flow_netlink.c:2711:29: warning: statement will never be executed [-Wswitch-unreachable]
         2711 |    const struct ovs_key_ipv4 *ipv4_key;
              |                             ^~~~~~~~

        [1] https://bugs.llvm.org/show_bug.cgi?id=44916

        Signed-off-by: Kees Cook <keescook@chromium.org>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
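
The pattern recommended above, moving a variable out of the region between `switch (...) {` and the first `case` label and into the case that uses it, can be sketched like this. The enum and function are hypothetical and unrelated to the real validate_set() code; they only show the shape of the change.

```c
#include <stdint.h>

/* Hypothetical key kinds, used only for this illustration. */
enum ex_key_kind { EX_KEY_IPV4, EX_KEY_PORT };

/* Each variable is declared inside the braces of the case that uses it.
 * A declaration placed directly after "switch (kind) {" and before the
 * first case label would never be reached by any execution path, which
 * is what -Wswitch-unreachable complains about. */
static uint32_t ex_key_value(enum ex_key_kind kind, const void *data)
{
    switch (kind) {
    case EX_KEY_IPV4: {
        const uint32_t *addr = data;   /* scoped to this case */
        return *addr;
    }
    case EX_KEY_PORT: {
        const uint16_t *port = data;   /* scoped to this case */
        return *port;
    }
    }
    return 0;
}
```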
* datapath: use skb_list_walk_safe helper for gso segments | Jason A. Donenfeld | 2020-10-17 | 2 | -7/+11

    Upstream commit:
        commit 2cec4448db38758832c2edad439f99584bb8fa0d
        Author: Jason A. Donenfeld <Jason@zx2c4.com>
        Date:   Mon Jan 13 18:42:29 2020 -0500

        net: openvswitch: use skb_list_walk_safe helper for gso segments

        This is a straight-forward conversion case for the new function,
        keeping the flow of the existing code as intact as possible.

        Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: support asymmetric conntrack | aaron conole | 2020-10-17 | 1 | -0/+11

    Upstream commit:
        commit 5d50aa83e2c8e91ced2cca77c198b468ca9210f4
        author: aaron conole <aconole@redhat.com>
        date:   tue dec 3 16:34:13 2019 -0500

        openvswitch: support asymmetric conntrack

        the openvswitch module shares a common conntrack and nat
        infrastructure exposed via netfilter. it's possible that a packet
        needs both snat and dnat manipulation, due to e.g. tuple collision.
        netfilter can support this because it runs through the nat table
        twice - once on ingress and again after egress. the openvswitch
        module doesn't have such capability.

        like netfilter hook infrastructure, we should run through nat twice
        to keep the symmetry.

        fixes: 05752523e565 ("openvswitch: interface with nat.")
        signed-off-by: aaron conole <aconole@redhat.com>
        signed-off-by: david s. miller <davem@davemloft.net>

    Fixes: c5f6c06b58d6 ("datapath: Interface with NAT.")
    Acked-by: Aaron Conole <aconole@redhat.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: remove another BUG_ON() | Paolo Abeni | 2020-10-17 | 1 | -1/+5

    Upstream commit:
        commit 8a574f86652a4540a2433946ba826ccb87f398cc
        Author: Paolo Abeni <pabeni@redhat.com>
        Date:   Sun Dec 1 18:41:25 2019 +0100

        openvswitch: remove another BUG_ON()

        If we can't build the flow del notification, we can simply delete
        the flow, no need to crash the kernel. Still keep a WARN_ON to
        preserve debuggability.

        Note: the BUG_ON() predates the Fixes tag, but this change can be
        applied only after the mentioned commit.

        v1 -> v2:
         - do not leak an skb on error

        Fixes: aed067783e50 ("openvswitch: Minimize ovs_flow_cmd_del critical section.")
        Signed-off-by: Paolo Abeni <pabeni@redhat.com>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: drop unneeded BUG_ON() in ovs_flow_cmd_build_info() | Paolo Abeni | 2020-10-17 | 1 | -1/+4

    Upstream commit:
        commit 8ffeb03fbba3b599690b361467bfd2373e8c450f
        Author: Paolo Abeni <pabeni@redhat.com>
        Date:   Sun Dec 1 18:41:24 2019 +0100

        openvswitch: drop unneeded BUG_ON() in ovs_flow_cmd_build_info()

        All the callers of ovs_flow_cmd_build_info() already deal with
        error return code correctly, so we can handle the error condition
        in a more graceful way. Still dump a warning to preserve
        debuggability.

        v1 -> v2:
         - clarify the commit message
         - clean the skb and report the error (DaveM)

        Fixes: ccb1352e76cf ("net: Add Open vSwitch kernel components.")
        Signed-off-by: Paolo Abeni <pabeni@redhat.com>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: fix flow command message size | Paolo Abeni | 2020-10-17 | 1 | -1/+5

    Upstream commit:
        commit 4e81c0b3fa93d07653e2415fa71656b080a112fd
        Author: Paolo Abeni <pabeni@redhat.com>
        Date:   Tue Nov 26 12:55:50 2019 +0100

        openvswitch: fix flow command message size

        When user-space sets the OVS_UFID_F_OMIT_* flags, and the relevant
        flow has no UFID, we can exceed the computed size, as
        ovs_nla_put_identifier() will always dump an OVS_FLOW_ATTR_KEY
        attribute. Take the above in account when computing the flow
        command message size.

        Fixes: 74ed7ab9264c ("openvswitch: Add support for unique flow IDs.")
        Reported-by: Qi Jun Ding <qding@redhat.com>
        Signed-off-by: Paolo Abeni <pabeni@redhat.com>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: don't call pad_packet if not necessary | Tonghao Zhang | 2020-10-17 | 1 | -14/+8

    Upstream commit:
        commit 61ca533c0e94104c35fcb7858a23ec9a05d78143
        Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Date:   Thu Nov 14 23:51:08 2019 +0800

        net: openvswitch: don't call pad_packet if not necessary

        The nla_put_u16/nla_put_u32 makes sure that *attrlen is aligned.
        The call tree is:

        nla_put_u16/nla_put_u32
          -> nla_put          attrlen = sizeof(u16) or sizeof(u32)
          -> __nla_put        attrlen
          -> __nla_reserve    attrlen
          -> skb_put(skb, nla_total_size(attrlen))

        nla_total_size returns the total length of attribute including
        padding.

        Cc: Joe Stringer <joe@ovn.org>
        Cc: William Tu <u9012063@gmail.com>
        Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Acked-by: Pravin B Shelar <pshelar@ovn.org>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
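
The padding argument above can be checked with a small userspace re-implementation of the Netlink attribute size arithmetic. The `ex_` names are local stand-ins written for this sketch; they mirror the usual definitions (4-byte alignment, a 4-byte attribute header made of two 16-bit fields) but are not the kernel's own macros.

```c
#include <stddef.h>
#include <stdint.h>

/* Netlink attributes are aligned to 4 bytes. */
#define EX_NLA_ALIGNTO ((size_t)4)

/* Attribute header: 16-bit length plus 16-bit type, already aligned. */
#define EX_NLA_HDRLEN ((size_t)4)

/* Round len up to the next multiple of the alignment. */
static inline size_t ex_nla_align(size_t len)
{
    return (len + EX_NLA_ALIGNTO - 1) & ~(EX_NLA_ALIGNTO - 1);
}

/* Header plus payload, without trailing padding. */
static inline size_t ex_nla_attr_size(size_t payload)
{
    return EX_NLA_HDRLEN + payload;
}

/* Header plus payload plus padding: what gets reserved in the buffer. */
static inline size_t ex_nla_total_size(size_t payload)
{
    return ex_nla_align(ex_nla_attr_size(payload));
}
```

For a 16-bit payload the attribute occupies 6 bytes and is padded to 8; for a 32-bit payload it is exactly 8. In both cases the buffer stays 4-byte aligned after the put, so an extra padding call adds nothing.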
* datapath: select vport upcall portid directly | Tonghao Zhang | 2020-10-17 | 1 | -2/+3

    Upstream commit:
        commit 90ce9f23a886bdef7a4b7a9bd52c7a50a6a81635
        Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Date:   Thu Nov 7 00:34:28 2019 +0800

        net: openvswitch: select vport upcall portid directly

        The commit 69c51582ff786 ("dpif-netlink: don't allocate per thread
        netlink sockets"), in Open vSwitch ovs-vswitchd, has changed the
        number of allocated sockets to just one per port by moving the
        socket array from a per handler structure to a per datapath one. In
        the kernel datapath, a vport will have only one socket in most
        case, if so select it directly in fast-path.

        Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Acked-by: Pravin B Shelar <pshelar@ovn.org>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
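
A minimal sketch of the fast path described above: when a vport has exactly one socket, return it without computing an index. The struct and function names are hypothetical, and the multi-socket fallback uses a plain modulo on a caller-supplied hash, which is an assumption made for this example rather than the datapath's actual selection logic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vport list of Netlink port IDs. */
struct ex_vport_portids {
    uint32_t n_ids;
    const uint32_t *ids;
};

/* Select the port ID for an upcall.  The single-socket case, which the
 * commit message calls the common one, is answered directly. */
static uint32_t ex_find_upcall_portid(const struct ex_vport_portids *p,
                                      uint32_t flow_hash)
{
    if (p->n_ids == 0)
        return 0;                       /* no socket configured */
    if (p->n_ids == 1)
        return p->ids[0];               /* fast path: nothing to choose */
    return p->ids[flow_hash % p->n_ids]; /* assumed fallback: hash spread */
}
```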
* datapath: simplify the ovs_dp_cmd_new | Tonghao Zhang | 2020-10-17 | 1 | -22/+38

    Upstream commit:
        commit eec62eadd1d757b0743ccbde55973814f3ad396e
        Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Date:   Fri Nov 1 22:23:54 2019 +0800

        net: openvswitch: simplify the ovs_dp_cmd_new

        use the specified functions to init resource.

        Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Tested-by: Greg Rose <gvrose8192@gmail.com>
        Acked-by: Pravin B Shelar <pshelar@ovn.org>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: fix possible memleak on destroy flow-table | Tonghao Zhang | 2020-10-17 | 2 | -88/+106

    Upstream commit:
        commit 50b0e61b32ee890a75b4377d5fbe770a86d6a4c1
        Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Date:   Fri Nov 1 22:23:52 2019 +0800

        net: openvswitch: fix possible memleak on destroy flow-table

        When we destroy the flow tables which may contain the flow_mask, so
        release the flow mask struct.

        Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Tested-by: Greg Rose <gvrose8192@gmail.com>
        Acked-by: Pravin B Shelar <pshelar@ovn.org>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Added additional compat layer fixup for WRITE_ONCE()

    Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: add likely in flow_lookup | Tonghao Zhang | 2020-10-17 | 1 | -2/+2

    Upstream commit:
        commit 0a3e01371db17d753dd92ec4d0fc6247412d3b01
        Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Date:   Fri Nov 1 22:23:51 2019 +0800

        net: openvswitch: add likely in flow_lookup

        The most case *index < ma->max, and flow-mask is not NULL.
        We add un/likely for performance.

        Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Tested-by: Greg Rose <gvrose8192@gmail.com>
        Acked-by: William Tu <u9012063@gmail.com>
        Acked-by: Pravin B Shelar <pshelar@ovn.org>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: simplify the flow_hash | Tonghao Zhang | 2020-10-17 | 1 | -5/+2

    Upstream commit:
        commit 515b65a4b99197ae062a795ab4de919e6d04be04
        Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Date:   Fri Nov 1 22:23:50 2019 +0800

        net: openvswitch: simplify the flow_hash

        Simplify the code and remove the unnecessary BUILD_BUG_ON.

        Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Tested-by: Greg Rose <gvrose8192@gmail.com>
        Acked-by: William Tu <u9012063@gmail.com>
        Acked-by: Pravin B Shelar <pshelar@ovn.org>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: optimize flow-mask looking up | Tonghao Zhang | 2020-10-17 | 1 | -50/+53

    Upstream commit:
        commit 57f7d7b9164426c496300d254fd5167fbbf205ea
        Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Date:   Fri Nov 1 22:23:49 2019 +0800

        net: openvswitch: optimize flow-mask looking up

        The full looking up on flow table traverses all mask array.
        If mask-array is too large, the number of invalid flow-mask
        increase, performance will be drop.

        One bad case, for example: M means flow-mask is valid and NULL
        of flow-mask means deleted.

        +-------------------------------------------+
        | M | NULL | ...                  | NULL | M|
        +-------------------------------------------+

        In that case, without this patch, openvswitch will traverses all
        mask array, because there will be one flow-mask in the tail. This
        patch changes the way of flow-mask inserting and deleting, and the
        mask array will be keep as below: there is not a NULL hole. In the
        fast path, we can "break" "for" (not "continue") in flow_lookup
        when we get a NULL flow-mask.

                 "break"
                    v
        +-------------------------------------------+
        | M | M |  NULL |...           | NULL | NULL|
        +-------------------------------------------+

        This patch don't optimize slow or control path, still using ma->max
        to traverse. Slow path:
        * tbl_mask_array_realloc
        * ovs_flow_tbl_lookup_exact
        * flow_mask_find

        Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
        Tested-by: Greg Rose <gvrose8192@gmail.com>
        Acked-by: Pravin B Shelar <pshelar@ovn.org>
        Signed-off-by: David S. Miller <davem@davemloft.net>

    Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
    Signed-off-by: Greg Rose <gvrose8192@gmail.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
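
The hole-free mask array described above can be sketched as follows. This is an illustration under assumptions, not the datapath implementation: the struct, the fixed capacity and the function names are hypothetical, and deletion fills the gap by moving the last valid entry into it, which is one straightforward way to keep the valid entries contiguous.

```c
#include <stddef.h>

/* Hypothetical fixed capacity for the example. */
#define EX_MASK_MAX 8

/* Valid masks occupy slots [0, count); every slot after that is NULL. */
struct ex_mask_array {
    size_t count;
    const void *masks[EX_MASK_MAX];
};

/* Insert at the first free slot, which is always index `count`. */
static int ex_mask_insert(struct ex_mask_array *ma, const void *mask)
{
    if (ma->count == EX_MASK_MAX)
        return -1;
    ma->masks[ma->count++] = mask;
    return 0;
}

/* Delete by moving the last valid entry into the hole, so that no NULL
 * ever sits between two valid masks. */
static void ex_mask_delete(struct ex_mask_array *ma, const void *mask)
{
    for (size_t i = 0; i < ma->count; i++) {
        if (ma->masks[i] == mask) {
            ma->masks[i] = ma->masks[ma->count - 1];
            ma->masks[--ma->count] = NULL;
            return;
        }
    }
}

/* Fast-path style walk: stop ("break") at the first NULL instead of
 * skipping it ("continue").  Returns how many slots were examined. */
static size_t ex_masks_visited(const struct ex_mask_array *ma)
{
    size_t visited = 0;

    for (size_t i = 0; i < EX_MASK_MAX; i++) {
        if (!ma->masks[i])
            break;
        visited++;
    }
    return visited;
}
```

With the compaction invariant in place, the walk touches only as many slots as there are valid masks, regardless of the array's capacity.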
* datapath: don't unlock mutex when changing the user_features failsTonghao Zhang2020-10-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Upstream commit: commit 4c76bf696a608ea5cc555fe97ec59a9033236604 Author: Tonghao Zhang <xiangxia.m.yue@gmail.com> Date: Fri Nov 1 22:23:53 2019 +0800 net: openvswitch: don't unlock mutex when changing the user_features fails Unlocking of a not locked mutex is not allowed. Other kernel thread may be in critical section while we unlock it because of setting user_feature fail. Fixes: 95a7233c4 ("net: openvswitch: Set OvS recirc_id from tc chain index") Cc: Paul Blakey <paulb@mellanox.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: fix GFP flags in rtnl_net_notifyid()Guillaume Nault2020-10-171-9/+11
| Upstream commit:
|     commit d4e4fdf9e4a27c87edb79b1478955075be141f67
|     Author: Guillaume Nault <gnault@redhat.com>
|     Date:   Wed Oct 23 18:39:04 2019 +0200
|
|     netns: fix GFP flags in rtnl_net_notifyid()
|
|     In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
|     rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
|     but there are a few paths calling rtnl_net_notifyid() from atomic
|     context or from RCU critical sections. The latter also precludes the
|     use of gfp_any() as it wouldn't detect the RCU case. Also, the
|     nlmsg_new() call is wrong too, as it uses GFP_KERNEL unconditionally.
|
|     Therefore, we need to pass the GFP flags as parameter and propagate it
|     through function calls until the proper flags can be determined.
|
|     In most cases, GFP_KERNEL is fine. The exceptions are:
|       * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
|         indirectly call rtnl_net_notifyid() from RCU critical section,
|
|       * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
|         parameter.
|
|     Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
|     by nlmsg_new(). The function is allowed to sleep, so better make the
|     flags consistent with the ones used in the following
|     ovs_vport_cmd_fill_info() call.
|
|     Found by code inspection.
|
|     Fixes: 9a9634545c70 ("netns: notify netns id events")
|     Signed-off-by: Guillaume Nault <gnault@redhat.com>
|     Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
|     Acked-by: Pravin B Shelar <pshelar@ovn.org>
|     Signed-off-by: David S. Miller <davem@davemloft.net>
|
| Backport the datapath.c portion of this fix.
|
| Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
| Signed-off-by: Greg Rose <gvrose8192@gmail.com>
| Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: Set OvS recirc_id from tc chain indexPaul Blakey2020-10-174-5/+55
| Upstream commit:
|     commit 95a7233c452a58a4c2310c456c73997853b2ec46
|     Author: Paul Blakey <paulb@mellanox.com>
|     Date:   Wed Sep 4 16:56:37 2019 +0300
|
|     net: openvswitch: Set OvS recirc_id from tc chain index
|
|     Offloaded OvS datapath rules are translated one to one to tc rules,
|     for example the following simplified OvS rule:
|
|     recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)
|
|     Will be translated to the following tc rule:
|
|     $ tc filter add dev dev1 ingress \
|                 prio 1 chain 0 proto ip \
|                     flower tcp ct_state -trk \
|                     action ct pipe \
|                     action goto chain 2
|
|     Received packets will first travel through tc, and if they aren't
|     stolen by it, like in the above rule, they will continue to the OvS
|     datapath. Since we already did some actions (action ct in this case)
|     which might modify the packets, and updated action stats, we would
|     like to continue the processing with the correct recirc_id in OvS
|     (here recirc_id(2)) where we left off.
|
|     To support this, introduce a new skb extension for tc, which
|     will be used for translating tc chain to ovs recirc_id to
|     handle these miss cases. Last tc chain index will be set
|     by tc goto chain action and read by OvS datapath.
|
|     Signed-off-by: Paul Blakey <paulb@mellanox.com>
|     Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
|     Acked-by: Jiri Pirko <jiri@mellanox.com>
|     Acked-by: Pravin B Shelar <pshelar@ovn.org>
|     Signed-off-by: David S. Miller <davem@davemloft.net>
|
| Backport the local datapath changes from this patch and add compat
| layer fixup for the DECLARE_STATIC_KEY_FALSE macro.
|
| Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
| Signed-off-by: Greg Rose <gvrose8192@gmail.com>
| Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: Print error when ovs_execute_actions() failsYifeng Sun2020-10-171-2/+5
| Upstream commit:
|     commit aa733660dbd8d9192b8c528ae0f4b84f3fef74e4
|     Author: Yifeng Sun <pkusunyifeng@gmail.com>
|     Date:   Sun Aug 4 19:56:11 2019 -0700
|
|     openvswitch: Print error when ovs_execute_actions() fails
|
|     Currently in function ovs_dp_process_packet(), return values of
|     ovs_execute_actions() are silently discarded. This patch prints out
|     a debug message when an error happens so as to provide helpful hints
|     for debugging.
|
|     Acked-by: Pravin B Shelar <pshelar@ovn.org>
|     Signed-off-by: David S. Miller <davem@davemloft.net>
|
| Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
| Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
| Signed-off-by: Greg Rose <gvrose8192@gmail.com>
| Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: do not update max_headroom if new headroom is equal to old headroomTaehee Yoo2020-10-171-11/+27
| | | | | | | | | | | | | | | | | | | | | | | | Upstream commit: commit 6b660c4177aaebdc73df7a3378f0e8b110aa4b51 Author: Taehee Yoo <ap420073@gmail.com> Date: Sat Jul 6 01:08:09 2019 +0900 net: openvswitch: do not update max_headroom if new headroom is equal to old headroom When a vport is deleted, the maximum headroom size would be changed. If the vport which has the largest headroom is deleted, the new max_headroom would be set. But, if the new headroom size is equal to the old headroom size, updating routine is unnecessary. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: drop unneeded likely() call around IS_ERR()Enrico Weigelt2020-10-171-1/+1
| | | | | | | | | | | | | | | | | | | Upstream commit: commit b90f5aa4d6268e81dd1fd51e5ef89d2892bf040d Author: Enrico Weigelt <info@metux.net> Date: Wed Jun 5 23:06:40 2019 +0200 net: openvswitch: drop unneeded likely() call around IS_ERR() IS_ERR() already calls unlikely(), so this extra likely() call around the !IS_ERR() is not needed. Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
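Why the extra likely() is redundant can be seen in a userspace model of the kernel helpers. This is only a sketch: the definitions below imitate the shape of the kernel's IS_ERR()/ERR_PTR() helpers, are not copied from any header, and assume a GCC- or Clang-compatible compiler for __builtin_expect. `lookup_result_ok` is a hypothetical caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Error codes are encoded as pointers in the top MAX_ERRNO addresses. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* The branch hint already lives inside IS_ERR(). */
static inline bool IS_ERR(const void *ptr)
{
    return unlikely((uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO);
}

/* Hypothetical caller: no likely() wrapper is needed around !IS_ERR(),
 * because the compiler already knows the error case is the cold one. */
int lookup_result_ok(const void *result)
{
    if (!IS_ERR(result))
        return 1;
    return 0;
}
```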
* datapath: return an error instead of doing BUG_ON()Eelco Chaudron2020-10-171-2/+5
| | | | | | | | | | | | | | | | | | | | | | Upstream commit: commit a734d1f4c2fc962ef4daa179e216df84a8ec5f84 Author: Eelco Chaudron <echaudro@redhat.com> Date: Thu May 2 16:12:38 2019 -0400 net: openvswitch: return an error instead of doing BUG_ON() For all other error cases in queue_userspace_packet() the error is returned, so it makes sense to do the same for these two error cases. Reported-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Eliminate "whitelist" and "blacklist" terms.Ben Pfaff2020-10-163-2/+2
| | | | | | | | There is one remaining use under datapath. That change should happen upstream in Linux first according to our usual policy. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
* datapath: Fix exposing OVS_TUNNEL_KEY_ATTR_GTPU_OPTS to kernel module.Ilya Maximets2020-10-081-0/+3
| | | | | | | | | | | | | | | | Kernel module doesn't know about GTPU and it should return correct out-of-range error in case this tunnel attribute passed there for any reason. Current out-of-tree module will pass the range check and will try to access ovs_tunnel_key_lens[] array by index OVS_TUNNEL_KEY_ATTR_GTPU_OPTS. Even though it might not produce issues in current code, this is not a good thing to do since ovs_tunnel_key_lens[] array is not explicitly initialized for OVS_TUNNEL_KEY_ATTR_GTPU_OPTS and we will likely have misleading error about incorrect attribute length in the end. Fixes: 3c6d05a02e0f ("userspace: Add GTP-U support.") Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: Remove duplicated includesYunjian Wang2020-07-142-2/+0
| | | | | | | | Remove duplicated includes. Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Signed-off-by: William Tu <u9012063@gmail.com>
* userspace: Avoid dp_hash recirculation for balance-tcp bond mode.Vishal Deep Ajmera2020-06-221-0/+1
| Problem:
|
| In OVS, flows with output over a bond interface of type “balance-tcp”
| get translated by the ofproto layer into "HASH" and "RECIRC" datapath
| actions. After recirculation, the packet is forwarded to the bond
| member port based on 8 bits of the datapath hash value computed through
| dp_hash. This causes performance degradation in the following ways:
|
| 1. The recirculation of the packet implies another lookup of the
|    packet’s flow key in the exact match cache (EMC) and potentially
|    Megaflow classifier (DPCLS). This is the biggest cost factor.
|
| 2. The recirculated packets have a new “RSS” hash and compete with the
|    original packets for the scarce number of EMC slots. This implies
|    more EMC misses and potentially EMC thrashing causing costly DPCLS
|    lookups.
|
| 3. The 256 extra megaflow entries per bond for dp_hash bond selection
|    put additional load on the revalidation threads.
|
| Owing to this performance degradation, deployments stick to
| “balance-slb” bond mode even though it does not do active-active load
| balancing for VXLAN- and GRE-tunnelled traffic because all tunnel
| packets have the same source MAC address.
|
| Proposed optimization:
|
| This proposal introduces a new load-balancing output action instead of
| recirculation.
|
| Maintain one table per bond (could just be an array of uint16's) and
| program it the same way internal flows are created today for each
| possible hash value (256 entries) from the ofproto layer. Use this
| table to load-balance flows as part of output action processing.
|
| Currently xlate_normal() -> output_normal() ->
| bond_update_post_recirc_rules() -> bond_may_recirc() and
| compose_output_action__() generate 'dp_hash(hash_l4(0))' and
| 'recirc(<RecircID>)' actions. In this case the RecircID identifies the
| bond. For the recirculated packets the ofproto layer installs megaflow
| entries that match on RecircID and masked dp_hash and send them to the
| corresponding output port.
|
| Instead, we will now generate the action as
|     'lb_output(<bond id>)'
|
| This combines hash computation (only if needed, else re-use RSS hash)
| and inline load-balancing over the bond. This action is used *only* for
| balance-tcp bonds in userspace datapath (the OVS kernel datapath
| remains unchanged).
|
| Example:
| Current scheme:
|
| With 8 UDP flows (with random UDP src port):
|
|   flow-dump from pmd on cpu core: 2
|   recirc_id(0),in_port(7),<...> actions:hash(hash_l4(0)),recirc(0x1)
|
|   recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),<...> actions:2
|   recirc_id(0x1),dp_hash(0xb236c260/0xff),<...> actions:1
|   recirc_id(0x1),dp_hash(0x7d89eb18/0xff),<...> actions:1
|   recirc_id(0x1),dp_hash(0xa78d75df/0xff),<...> actions:2
|   recirc_id(0x1),dp_hash(0xb58d846f/0xff),<...> actions:2
|   recirc_id(0x1),dp_hash(0x24534406/0xff),<...> actions:1
|   recirc_id(0x1),dp_hash(0x3cf32550/0xff),<...> actions:1
|
| New scheme:
| We can do with a single flow entry (for any number of new flows):
|
|   in_port(7),<...> actions:lb_output(1)
|
| A new CLI has been added to dump the datapath bond cache as given below.
|
|  # ovs-appctl dpif-netdev/bond-show [dp]
|
|    Bond cache:
|      bond-id 1 :
|        bucket 0 - slave 2
|        bucket 1 - slave 1
|        bucket 2 - slave 2
|        bucket 3 - slave 1
|
| Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
| Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
| Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
| Tested-by: Matteo Croce <mcroce@redhat.com>
| Tested-by: Adrian Moreno <amorenoz@redhat.com>
| Acked-by: Eelco Chaudron <echaudro@redhat.com>
| Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
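The per-bond table described in the commit above can be sketched as a 256-entry array indexed by the low 8 bits of the packet hash. This is an illustration only, not the dpif-netdev implementation: `struct bond_cache`, `bond_cache_fill` and `bond_select_member` are hypothetical names, and the round-robin bucket assignment is an assumption (the real assignment comes from the ofproto layer).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOND_BUCKETS 256u   /* one bucket per possible 8-bit hash value */

/* Hypothetical per-bond table: bucket index -> member port number. */
struct bond_cache {
    uint16_t bucket_to_member[BOND_BUCKETS];
};

/* Assumed policy for the sketch: spread buckets round-robin over members. */
void bond_cache_fill(struct bond_cache *bc, const uint16_t *members,
                     size_t n_members)
{
    assert(n_members > 0);
    for (size_t i = 0; i < BOND_BUCKETS; i++)
        bc->bucket_to_member[i] = members[i % n_members];
}

/* Inline load balancing: one masked table lookup replaces the
 * recirculation and the second flow lookup. */
uint16_t bond_select_member(const struct bond_cache *bc, uint32_t hash)
{
    return bc->bucket_to_member[hash & (BOND_BUCKETS - 1)];
}
```

The point of the design is that the output decision becomes an array index on the hot path, so a single datapath flow can serve any number of new connections over the bond.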
* datapath: Add hash info to upcall.Han Zhou2020-05-283-1/+75
| This patch backports the upstream patches below, and adds __skb_set_hash
| to compat for older kernels.
|
|     commit b5ab1f1be6180a2e975eede18731804b5164a05d
|     Author: Jakub Kicinski <kuba@kernel.org>
|     Date:   Mon Mar 2 21:05:18 2020 -0800
|
|     openvswitch: add missing attribute validation for hash
|
|     Add missing attribute validation for OVS_PACKET_ATTR_HASH
|     to the netlink policy.
|
|     Fixes: bd1903b7c459 ("net: openvswitch: add hash info to upcall")
|     Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|     Reviewed-by: Greg Rose <gvrose8192@gmail.com>
|     Signed-off-by: David S. Miller <davem@davemloft.net>
|
|     commit bd1903b7c4596ba6f7677d0dfefd05ba5876707d
|     Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
|     Date:   Wed Nov 13 23:04:49 2019 +0800
|
|     net: openvswitch: add hash info to upcall
|
|     When using the kernel datapath, the upcall doesn't include the
|     related skb hash info. That introduces some problems, because the
|     hash of the skb is important in the kernel stack. For example, the
|     VXLAN module uses it to select the UDP src port. The tx queue
|     selection may also use the hash in the stack.
|
|     The hash is computed in different ways. It is random for a TCP
|     socket, and it may be computed in hardware or in the software
|     stack. Recalculating the hash is not easy.
|
|     Hash of TCP socket is computed:
|       tcp_v4_connect
|         -> sk_set_txhash (is random)
|
|       __tcp_transmit_skb
|         -> skb_set_hash_from_sk
|
|     There will be one upcall, without the skb hash information, to
|     ovs-vswitchd for the first packet of a TCP session. The remaining
|     packets will be processed in the Open vSwitch module with the hash
|     kept. If this TCP session is forwarded to the VXLAN module, the UDP
|     src port of the first TCP packet differs from that of the remaining
|     packets.
|
|     TCP packets may come from the host or dockers, to Open vSwitch.
|     To fix it, we store the hash info in the upcall, and restore the
|     hash when packets are sent back.
|
|     +---------------+          +-------------------------+
|     |   Docker/VMs  |          |     ovs-vswitchd        |
|     +----+----------+          +-+--------------------+--+
|          |                       ^                    |
|          |                       |                    |
|          |                       |  upcall            v restore packet hash (not recalculate)
|          |                     +-+--------------------+--+
|          |  tap netdev         |                         |   vxlan module
|          +--------------->     +-->  Open vSwitch ko     +-->
|            or internal type    |                         |
|                                +-------------------------+
|
|     Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/364062.html
|     Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
|     Acked-by: Pravin B Shelar <pshelar@ovn.org>
|     Signed-off-by: David S. Miller <davem@davemloft.net>
|
| Tested-by: Aliasgar Ginwala <aginwala@ebay.com>
| Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
| Signed-off-by: Han Zhou <hzhou@ovn.org>
| Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
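The effect of losing the hash across an upcall can be sketched in userspace as follows. Everything here is an assumption made for illustration: `struct pkt`, `struct upcall` and the helper names are hypothetical, the port range 32768..61000 is just an example, and `flow_src_port` only imitates the general idea of scaling a 32-bit flow hash into a source-port range, as a tunnel module might do.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a 32-bit flow hash into [min, max); requires max > min. */
uint16_t flow_src_port(uint32_t hash, uint16_t min, uint16_t max)
{
    assert(max > min);
    return (uint16_t)((((uint64_t)hash * (uint32_t)(max - min)) >> 32) + min);
}

struct pkt    { uint32_t hash; };        /* hypothetical packet metadata */
struct upcall { uint32_t saved_hash; };  /* hypothetical upcall message  */

/* Kernel -> userspace: carry the hash along with the packet. */
void upcall_store_hash(struct upcall *u, const struct pkt *p)
{
    u->saved_hash = p->hash;
}

/* Userspace -> kernel: restore the original hash, do not recalculate it. */
void execute_restore_hash(struct pkt *p, const struct upcall *u)
{
    p->hash = u->saved_hash;
}
```

Because the re-injected first packet carries the same hash as the packets that follow it in the datapath, a hash-derived value such as the tunnel source port stays the same for the whole session.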
* compat: Backport ipv6_stub changeGreg Rose2020-05-242-2/+27
| | | | | | | | | | | | | | | | | A patch backported to the Linux stable 4.14 tree and present in the latest stable 4.14.181 kernel breaks ipv6_stub usage. The commit is 8ab8786f78c3 ("net ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup"). Create the compat layer define to check for it and fixup usage in vxlan and geneve modules. Passes Travis here: https://travis-ci.org/github/gvrose8192/ovs-experimental/builds/689798733 Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* compat: Fix ipv6_dst_lookup build errorYi-Hung Wei2020-04-302-10/+15
| | | | | | | | | | | | | | | | | | | The geneve/vxlan compat code base invokes ipv6_dst_lookup() which is recently replaced by ipv6_dst_lookup_flow() in the stable kernel tree. This causes travis build failure: * https://travis-ci.org/github/openvswitch/ovs/builds/681084038 This patch updates the backport logic to invoke the right function. Related patch in git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git b9f3e457098e ("net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup") Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* compat: Fix broken partial backport of extack op parameterGreg Rose2020-04-157-15/+15
| A series of commits added support for the extended ack parameter to
| the newlink, changelink and validate ops in the rtnl_link_ops structure:
|
| a8b8a889e369d ("net: add netlink_ext_ack argument to rtnl_link_ops.validate")
| 7a3f4a185169b ("net: add netlink_ext_ack argument to rtnl_link_ops.newlink")
| ad744b223c521 ("net: add netlink_ext_ack argument to rtnl_link_ops.changelink")
|
| These commits were all added at the same time and have been present
| since the Linux kernel 4.13 release. In our compatibility layer we have
| a define HAVE_EXT_ACK_IN_RTNL_LINKOPS that indicates the presence of the
| extended ack parameter for these three link operations.
|
| At least one distro has only backported two of the three patches, for
| newlink and changelink, while not backporting patch a8b8a889e369d for
| the validate op. Our compatibility layer code in acinclude.m4 is able
| to find the presence of the extack within the rtnl_link_ops structure,
| so it defines HAVE_EXT_ACK_IN_RTNL_LINKOPS, but since the validate link
| op does not have the extack parameter the compilation fails on recent
| kernels for that particular distro. Other kernel distributions based
| upon this distro will presumably also encounter the compile errors.
|
| Introduce a new function in acinclude.m4 that will find function op
| definitions and then search for the required parameter. Then use this
| function to define HAVE_RTNLOP_VALIDATE_WITH_EXTACK so that we can
| detect and enable correct compilation on kernels which have not
| backported the entire set of patches. This function is generic to any
| function op - it need not be in a structure.
|
| In places where HAVE_EXT_ACK_IN_RTNL_LINKOPS wraps validate functions,
| replace it with the new HAVE_RTNLOP_VALIDATE_WITH_EXTACK define.
|
| Passes Travis here:
| https://travis-ci.org/github/gvrose8192/ovs-experimental/builds/674599698
|
| Passes a kernel check-kmod test on several systems, including
| sles12 sp4 4.12.14-95.48-default kernel, without any regressions.
|
| VMWare-BZ: #2544032
| Signed-off-by: Greg Rose <gvrose8192@gmail.com>
| Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
| Signed-off-by: William Tu <u9012063@gmail.com>
* userspace: Add GTP-U support.William Tu2020-03-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | GTP, GPRS Tunneling Protocol, is a group of IP-based communications protocols used to carry general packet radio service (GPRS) within GSM, UMTS and LTE networks. GTP protocol has two parts: Signalling (GTP-Control, GTP-C) and User data (GTP-User, GTP-U). GTP-C is used for setting up GTP-U protocol, which is an IP-in-UDP tunneling protocol. Usually GTP is used in connecting between base station for radio, Serving Gateway (S-GW), and PDN Gateway (P-GW). This patch implements GTP-U protocol for userspace datapath, supporting only required header fields and G-PDU message type. See spec in: https://tools.ietf.org/html/draft-hmm-dmm-5g-uplane-analysis-00 Tested-at: https://travis-ci.org/github/williamtu/ovs-travis/builds/666518784 Signed-off-by: Feng Yang <yangfengee04@gmail.com> Co-authored-by: Feng Yang <yangfengee04@gmail.com> Signed-off-by: Yi Yang <yangyi01@inspur.com> Co-authored-by: Yi Yang <yangyi01@inspur.com> Signed-off-by: William Tu <u9012063@gmail.com> Acked-by: Ben Pfaff <blp@ovn.org>
* compat: Fix nf_ip_hook parameters for RHEL 8Greg Rose2020-03-241-1/+1
| | | | | | | | | | A RHEL release version check was only checking for RHEL releases greater than 7.0 so that ended up including a compat fixup that is not needed for 8.0. Fix up the version check. Signed-off-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* datapath: conntrack: mark expected switch fall-throughGustavo A. R. Silva2020-03-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | Upstream commit: commit 279badc2a85be83e0187b8c566e3b476b76a87a2 Author: Gustavo A. R. Silva <garsilva@embeddedor.com> Date: Thu Oct 19 12:55:03 2017 -0500 openvswitch: conntrack: mark expected switch fall-through In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Notice that in this particular case I placed a "fall through" comment on its own line, which is what GCC is expecting to find. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
* compat: Use nla_parse deprecated functionsJohannes Berg2020-03-065-13/+24
| Upstream commit:
|     commit 8cb081746c031fb164089322e2336a0bf5b3070c
|     Author: Johannes Berg <johannes.berg@intel.com>
|     Date:   Fri Apr 26 14:07:28 2019 +0200
|
|     netlink: make validation more configurable for future strictness
|
|     We currently have two levels of strict validation:
|
|      1) liberal (default)
|          - undefined (type >= max) & NLA_UNSPEC attributes accepted
|          - attribute length >= expected accepted
|          - garbage at end of message accepted
|      2) strict (opt-in)
|          - NLA_UNSPEC attributes accepted
|          - attribute length >= expected accepted
|
|     Split out parsing strictness into four different options:
|      * TRAILING     - check that there's no trailing data after parsing
|                       attributes (in message or nested)
|      * MAXTYPE      - reject attrs > max known type
|      * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
|      * STRICT_ATTRS - strictly validate attribute size
|
|     The default for future things should be *everything*.
|     The current *_strict() is a combination of TRAILING and MAXTYPE,
|     and is renamed to _deprecated_strict().
|     The current regular parsing has none of this, and is renamed to
|     *_parse_deprecated().
|
|     Additionally it allows us to selectively set one of the new flags
|     even on old policies. Notably, the UNSPEC flag could be useful in
|     this case, since it can be arranged (by filling in the policy) to
|     not be an incompatible userspace ABI change, but would then going
|     forward prevent forgetting attribute entries. Similar can apply
|     to the POLICY flag.
|
|     We end up with the following renames:
|      * nla_parse           -> nla_parse_deprecated
|      * nla_parse_strict    -> nla_parse_deprecated_strict
|      * nlmsg_parse         -> nlmsg_parse_deprecated
|      * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
|      * nla_parse_nested    -> nla_parse_nested_deprecated
|      * nla_validate_nested -> nla_validate_nested_deprecated
|
|     Using spatch, of course:
|         @@
|         expression TB, MAX, HEAD, LEN, POL, EXT;
|         @@
|         -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
|         +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
|
|         @@
|         expression NLH, HDRLEN, TB, MAX, POL, EXT;
|         @@
|         -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
|         +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
|
|         @@
|         expression NLH, HDRLEN, TB, MAX, POL, EXT;
|         @@
|         -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
|         +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
|
|         @@
|         expression TB, MAX, NLA, POL, EXT;
|         @@
|         -nla_parse_nested(TB, MAX, NLA, POL, EXT)
|         +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
|
|         @@
|         expression START, MAX, POL, EXT;
|         @@
|         -nla_validate_nested(START, MAX, POL, EXT)
|         +nla_validate_nested_deprecated(START, MAX, POL, EXT)
|
|         @@
|         expression NLH, HDRLEN, MAX, POL, EXT;
|         @@
|         -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
|         +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
|
|     For this patch, don't actually add the strict, non-renamed versions
|     yet so that it breaks compile if I get it wrong.
|
|     Also, while at it, make nla_validate and nla_parse go down to a
|     common __nla_validate_parse() function to avoid code duplication.
|
|     Ultimately, this allows us to have very strict validation for every
|     new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
|     next patch, while existing things will continue to work as is.
|
|     In effect then, this adds fully strict validation for any new command.
|
|     Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|     Signed-off-by: David S. Miller <davem@davemloft.net>
|
| Backport portions of this commit applicable to openvswitch and added
| necessary compatibility layer changes to support older kernels.
|
| Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
| Signed-off-by: Greg Rose <gvrose8192@gmail.com>
| Signed-off-by: Ben Pfaff <blp@ovn.org>
* datapath: Kbuild: Add kcompat.h header to front of NOSTDINCGreg Rose2020-03-061-1/+1
| Since this commit in the Linux upstream kernel:
| 'commit 9b9a3f20cbe0 ("kbuild: split final module linking out into Makefile.modfinal")'
| The openvswitch kernel module fails to build against the upstream
| Linux kernel. The cause of the build failure is that the include of
| the KBUILD_EXTMOD variable was dropped in Makefile.modfinal when it was
| split out from Makefile.modpost. Our Kbuild was setting the ccflags-y
| variable to include our kcompat.h header as the first header file.
| The Linux kernel maintainer has said that it is incorrect to rely on
| the ccflags-y variable for the modfinal phase of the build so that is
| why KBUILD_EXTMOD is not included.
|
| We fix this by breaking a different Linux kernel make rule. We add
| '-include $(builddir)/kcompat.h' to the front of the NOSTDINC variable
| setting in our Kbuild makefile.
|
| As noted already in the comment for the NOSTDINC setting:
|
|  # These include directories have to go before -I$(KSRC)/include.
|  # NOSTDINC_FLAGS just happens to be a variable that goes in the
|  # right place, even though it's conceptually incorrect.
|
| So we continue the misuse of the NOSTDINC variable to fix this issue
| as well.
|
| The assumption of the Linux kernel maintainers is that any local,
| out-of-tree build include files can be added to the end of the command
| line. In our case that is wrong of course, but there is nothing we
| can do about it that I know of other than using some utility like
| unifdef to strip out offending chunks of our compatibility layer code
| before invocation of Makefile.modfinal. That is a big change that
| would take a lot of work to implement.
|
| We could ask the Linux kernel maintainers to provide some way for
| out-of-tree kernel modules to include their own header files first in
| a proper manner. I consider that to be a very low probability of
| success but something we could ask about.
|
| For now we cheat and take the easy way out.
|
| Reported-by: David Ahern <dsahern@gmail.com>
| Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
| Signed-off-by: Greg Rose <gvrose8192@gmail.com>
| Signed-off-by: Ben Pfaff <blp@ovn.org>
* datapath: Use sizeof_field macroPankaj Bharadiya2020-03-069-13/+17
| Upstream commit:
|     commit c593642c8be046915ca3a4a300243a68077cd207
|     Author: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
|     Date:   Mon Dec 9 10:31:43 2019 -0800
|
|     treewide: Use sizeof_field() macro
|
|     Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
|     at places where these are defined. Later patches will remove the unused
|     definition of FIELD_SIZEOF().
|
|     This patch is generated using following script:
|
|     EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"
|
|     git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
|     do
|
|             if [[ "$file" =~ $EXCLUDE_FILES ]]; then
|                     continue
|             fi
|             sed -i  -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
|     done
|
|     Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
|     Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
|     Co-developed-by: Kees Cook <keescook@chromium.org>
|     Signed-off-by: Kees Cook <keescook@chromium.org>
|     Acked-by: David Miller <davem@davemloft.net> # for net
|
| Also added a compatibility layer macro for older kernels that still
| use FIELD_SIZEOF
|
| Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
| Signed-off-by: Greg Rose <gvrose8192@gmail.com>
| Signed-off-by: Ben Pfaff <blp@ovn.org>
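The compatibility macro mentioned in the commit above is not shown in this log. A minimal sketch of what such a fallback could look like, assuming it is guarded so that kernels which already provide sizeof_field() keep their own definition, is given below; `struct example_key` is a hypothetical structure used only to exercise the macro.

```c
#include <stddef.h>
#include <stdint.h>

/* Fallback for environments that lack sizeof_field(): take the size of
 * a member without needing an instance of the structure. The null
 * pointer is never dereferenced, since sizeof does not evaluate it. */
#ifndef sizeof_field
#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
#endif

/* Hypothetical structure, for illustration only. */
struct example_key {
    uint32_t recirc_id;
    uint8_t  eth_addr[6];
    uint16_t tp_src;
};

/* Example use: size of one member, computed at compile time. */
enum { EXAMPLE_ETH_ADDR_LEN = sizeof_field(struct example_key, eth_addr) };
```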
* compat: Remove flex_array codeGreg Rose2020-03-061-18/+10
| | | | | | | | | Flex array support is removed since kernel 5.1. Convert to use kvmalloc_array instead. Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>