summaryrefslogtreecommitdiff
path: root/NEWS
Commit message (Collapse)AuthorAgeFilesLines
* python: Replace pyOpenSSL with ssl.Timothy Redaelli2021-11-031-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | Currently, pyOpenSSL is half-deprecated upstream and so it's removed on some distributions (for example on CentOS Stream 9, https://issues.redhat.com/browse/CS-336), but since OVS only supports Python 3 it's possible to replace pyOpenSSL with "import ssl" included in base Python 3. Stream recv and send had to be splitted as _recv and _send, since SSLError is a subclass of socket.error and so it was not possible to except for SSLWantReadError and SSLWantWriteError in recv and send of SSLStream. TCPstream._open cannot be used in SSLStream, since Python ssl module requires the SSL socket to be created before connecting it, so SSLStream._open needs to create the socket, create SSL socket and then connect the SSL socket. Reported-by: Timothy Redaelli <tredaelli@redhat.com> Reported-at: https://bugzilla.redhat.com/1988429 Signed-off-by: Timothy Redaelli <tredaelli@redhat.com> Acked-by: Terry Wilson <twilson@redhat.com> Tested-by: Terry Wilson <twilson@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-offload-dpdk: Don't ignore frags as they are handled.Eli Britstein2021-09-161-0/+2
| | | | | | | Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Emma Finn <emma.finn@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Set release date for 2.16.0.Ilya Maximets2021-08-171-1/+1
| | | | | | Acked-by: Numan Siddique <numans@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpdk: Stop configuring socket-limit with the value of socket-mem.Rosemarie O'Riorden2021-07-261-0/+4
| | | | | | | | | | | | | | | | This change removes the automatic memory limit on start-up of OVS with DPDK. As DPDK supports dynamic memory allocation, there is no need to limit the amount of memory available, if not requested. Currently, if socket-limit is not configured, it is set to the value of socket-mem. With this change, the user can decide to set it or have no memory limit. Removed logs that announce this change and fixed documentation. Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850 Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpdk: Remove default values for socket-mem and limit.Rosemarie O'Riorden2021-07-261-0/+4
| | | | | | | | | | | | | | | | | | | This change removes the default values for EAL args socket-mem and socket-limit. As DPDK supports dynamic memory allocation, there is no need to allocate a certain amount of memory on start-up, nor limit the amount of memory available, if not requested. Currently, socket-mem has a default value of 1024 when it is not configured by the user, and socket-limit takes on the value of socket-mem, 1024, by default. With this change, socket-mem is not configured by default, meaning that socket-limit is not either. Neither, either or both options can be set. Removed extra logs that announce this change and fixed documentation. Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1949850 Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Prepare for post-2.16.0 (2.16.90).Ilya Maximets2021-07-161-0/+4
| | | | | Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Prepare for 2.16.0.Ilya Maximets2021-07-161-1/+1
| | | | | Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netlink: Introduce per-cpu upcall dispatch.Mark Gray2021-07-161-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Open vSwitch kernel module uses the upcall mechanism to send packets from kernel space to user space when it misses in the kernel space flow table. The upcall sends packets via a Netlink socket. Currently, a Netlink socket is created for every vport. In this way, there is a 1:1 mapping between a vport and a Netlink socket. When a packet is received by a vport, if it needs to be sent to user space, it is sent via the corresponding Netlink socket. This mechanism, with various iterations of the corresponding user space code, has seen some limitations and issues: * On systems with a large number of vports, there is correspondingly a large number of Netlink sockets which can limit scaling. (https://bugzilla.redhat.com/show_bug.cgi?id=1526306) * Packet reordering on upcalls. (https://bugzilla.redhat.com/show_bug.cgi?id=1844576) * A thundering herd issue. (https://bugzilla.redhat.com/show_bug.cgi?id=1834444) This patch introduces an alternative, feature-negotiated, upcall mode using a per-cpu dispatch rather than a per-vport dispatch. In this mode, the Netlink socket to be used for the upcall is selected based on the CPU of the thread that is executing the upcall. In this way, it resolves the issues above as: a) The number of Netlink sockets scales with the number of CPUs rather than the number of vports. b) Ordering per-flow is maintained as packets are distributed to CPUs based on mechanisms such as RSS and flows are distributed to a single user space thread. c) Packets from a flow can only wake up one user space thread. Reported-at: https://bugzilla.redhat.com/1844576 Signed-off-by: Mark Gray <mark.d.gray@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev: Allow pin rxq and non-isolate PMD.Kevin Traynor2021-07-161-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | Pinning an rxq to a PMD with pmd-rxq-affinity may be done for various reasons such as reserving a full PMD for an rxq, or to ensure that multiple rxqs from a port are handled on different PMDs. Previously pmd-rxq-affinity always isolated the PMD so no other rxqs could be assigned to it by OVS. There may be cases where there is unused cycles on those pmds and the user would like other rxqs to also be able to be assigned to it by OVS. Add an option to pin the rxq and non-isolate the PMD. The default behaviour is unchanged, which is pin and isolate the PMD. In order to pin and non-isolate: ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false Note this is available only with group assignment type, as pinning conflicts with the operation of the other rxq assignment algorithms. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Sunil Pai G <sunil.pai.g@intel.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev: Add group rxq scheduling assignment type.Kevin Traynor2021-07-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an rxq scheduling option that allows rxqs to be grouped on a pmd based purely on their load. The current default 'cycles' assignment sorts rxqs by measured processing load and then assigns them to a list of round robin PMDs. This helps to keep the rxqs that require most processing on different cores but as it selects the PMDs in round robin order, it equally distributes rxqs to PMDs. 'cycles' assignment has the advantage in that it separates the most loaded rxqs from being on the same core but maintains the rxqs being spread across a broad range of PMDs to mitigate against changes to traffic pattern. 'cycles' assignment has the disadvantage that in order to make the trade off between optimising for current traffic load and mitigating against future changes, it tries to assign and equal amount of rxqs per PMD in a round robin manner and this can lead to a less than optimal balance of the processing load. Now that PMD auto load balance can help mitigate with future changes in traffic patterns, a 'group' assignment can be used to assign rxqs based on their measured cycles and the estimated running total of the PMDs. In this case, there is no restriction about keeping equal number of rxqs per PMD as it is purely load based. This means that one PMD may have a group of low load rxqs assigned to it while another PMD has one high load rxq assigned to it, as that is the best balance of their measured loads across the PMDs. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Sunil Pai G <sunil.pai.g@intel.com> Acked-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* ofproto-dpif: APIs and CLI option to add/delete static fdb entry.Vasu Dasari2021-07-161-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently there is an option to add/flush/show ARP/ND neighbor. This covers L3 side. For L2 side, there is only fdb show command. This commit gives an option to add/del an fdb entry via ovs-appctl. CLI command looks like: To add: ovs-appctl fdb/add <bridge> <port> <vlan> <Mac> ovs-appctl fdb/add br0 p1 0 50:54:00:00:00:05 To del: ovs-appctl fdb/del <bridge> <vlan> <Mac> ovs-appctl fdb/del br0 0 50:54:00:00:00:05 Added two new APIs to provide convenient interface to add and delete static-macs. bool xlate_add_static_mac_entry(const struct ofproto_dpif *, ofp_port_t in_port, struct eth_addr dl_src, int vlan); bool xlate_delete_static_mac_entry(const struct ofproto_dpif *, struct eth_addr dl_src, int vlan); 1. Static entry should not age. To indicate that entry being programmed is a static entry, 'expires' field in 'struct mac_entry' will be set to a MAC_ENTRY_AGE_STATIC_ENTRY. A check for this value is made while deleting mac entry as part of regular aging process. 2. Another change to the mac-update logic, when a packet with same dl_src as that of a static-mac entry arrives on any port, the logic will not modify the expires field. 3. While flushing fdb entries, made sure static ones are not evicted. 4. Updated "ovs-appctl fdb/stats-show br0" to display number of static entries in switch Added following tests: ofproto-dpif - static-mac add/del/flush ofproto-dpif - static-mac mac moves Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048894.html Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1597752 Signed-off-by: Vasu Dasari <vdasari@gmail.com> Tested-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpdk: Logs to announce removal of defaults for socket-mem and limit.Rosemarie O'Riorden2021-07-161-0/+2
| | | | | | | | | | | Deprecate current OVS provided defaults for DPDK socket-mem and socket-limit that are planned to be removed in OVS 2.17. At that point DPDK defaults will be used instead. Warnings have been added to alert users in advance. Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* python: Add cooperative_yield() API method to Idl.Terry Wilson2021-07-161-0/+5
| | | | | | | | | | | | When using eventlet monkey_patch()'d code, greenthreads can be blocked on connection for several seconds while the database contents are parsed. Eventlet recommends adding a sleep(0) call to cooperatively yield in cpu-bound code. asyncio code has asyncio.sleep(0). This patch adds an API method that defaults to doing nothing, but can be overridden to yield as needed. Signed-off-by: Terry Wilson <twilson@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev/mfex: Add more AVX512 traffic profilesHarry van Haaren2021-07-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds 3 new traffic profile implementations to the existing avx512 miniflow extract infrastructure. The profiles added are: - Ether()/IP()/TCP() - Ether()/Dot1Q()/IP()/UDP() - Ether()/Dot1Q()/IP()/TCP() The design of the avx512 code here is for scalability to add more traffic profiles, as well as enabling CPU ISA. Note that an implementation is primarily adding static const data, which the compiler then specializes away when the profile specific function is declared below. As a result, the code is relatively maintainable, and scalable for new traffic profiles as well as new ISA, and does not lower performance compared with manually written code for each profile/ISA. Note that confidence in the correctness of each implementation is achieved through autovalidation, unit tests with known packets, and fuzz tested packets. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpdk: Add additional CPU ISA detection stringsHarry van Haaren2021-07-161-0/+1
| | | | | | | | | | | | | | This commit enables OVS to at runtime check for more detailed AVX512 capabilities, specifically Byte and Word (BW) extensions, and Vector Bit Manipulation Instructions (VBMI). These instructions will be used in the CPU ISA optimized implementations of traffic profile aware miniflow extract. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev: Add configure to enable autovalidator at build time.Kumar Amber2021-07-161-0/+2
| | | | | | | | | | | | | | | This commit adds a new command to allow the user to enable autovalidatior by default at build time thus allowing for runnig unit test by default. $ ./configure --enable-mfex-default-autovalidator Signed-off-by: Kumar Amber <kumar.amber@intel.com> Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev: Add study function to select the best mfex functionKumar Amber2021-07-161-0/+3
| | | | | | | | | | | | | | | | | The study function runs all the available implementations of miniflow_extract and makes a choice whose hitmask has maximum hits and sets the mfex to that function. Study can be run at runtime using the following command: $ ovs-appctl dpif-netdev/miniflow-parser-set study Signed-off-by: Kumar Amber <kumar.amber@intel.com> Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev: Add auto validation function for miniflow extractKumar Amber2021-07-161-0/+2
| | | | | | | | | | | | | | | | | | | This patch introduced the auto-validation function which allows users to compare the batch of packets obtained from different miniflow implementations against the linear miniflow extract and return a hitmask. The autovaidator function can be triggered at runtime using the following command: $ ovs-appctl dpif-netdev/miniflow-parser-set autovalidator Signed-off-by: Kumar Amber <kumar.amber@intel.com> Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev: Add command line and function pointer for miniflow extractKumar Amber2021-07-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces the MFEX function pointers which allows the user to switch between different miniflow extract implementations which are provided by the OVS based on optimized ISA CPU. The user can query for the available minflow extract variants available for that CPU by following commands: $ovs-appctl dpif-netdev/miniflow-parser-get Similarly an user can set the miniflow implementation by the following command : $ ovs-appctl dpif-netdev/miniflow-parser-set name This allows for more performance and flexibility to the user to choose the miniflow implementation according to the needs. Signed-off-by: Kumar Amber <kumar.amber@intel.com> Co-authored-by: Harry van Haaren <harry.van.haaren@intel.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* docs: Add documentation for ovsdb relay mode.Ilya Maximets2021-07-151-0/+3
| | | | | | | | | Main documentation for the service model and tutorial with the use case and configuration examples. Acked-by: Mark D. Gray <mark.d.gray@redhat.com> Acked-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpcls-avx512: Enable avx512 vector popcount instruction.Harry van Haaren2021-07-091-0/+3
| | | | | | | | | | | | | | | | This commit enables the AVX512-VPOPCNTDQ Vector Popcount instruction. This instruction is not available on every CPU that supports the AVX512-F Foundation ISA, hence it is enabled only when the additional VPOPCNTDQ ISA check is passed. The vector popcount instruction is used instead of the AVX512 popcount emulation code present in the avx512 optimized DPCLS today. It provides higher performance in the SIMD miniflow processing as that requires the popcount to calculate the miniflow block indexes. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev/dpcls: Specialize more subtable signatures.Harry van Haaren2021-07-091-0/+2
| | | | | | | | | | This commit adds more subtables to be specialized. The traffic pattern here being matched is VXLAN traffic subtables, which commonly have (5,3), (9,1) and (9,4) subtable fingerprints. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev/dpcls-avx512: Enable 16 block processing.Harry van Haaren2021-07-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit implements larger subtable searches in avx512. A limitation of the previous implementation was that up to 8 blocks of miniflow data could be matched on (so a subtable with 8 blocks was handled in avx, but 9 blocks or more would fall back to scalar/generic). This limitation is removed in this patch, where up to 16 blocks of subtable can be matched on. From an implementation perspective, the key to enabling 16 blocks over 8 blocks was to do bitmask calculation up front, and then use the pre-calculated bitmasks for 2x passes of the "blocks gather" routine. The bitmasks need to be shifted for k-mask usage in the upper (8-15) block range, but it is relatively trivial. This also helps in case expanding to 24 blocks is desired in future. The implementation of the 2nd iteration to handle > 8 blocks is behind a conditional branch which checks the total number of bits. This helps the specialized versions of the function that have a miniflow fingerprint of less-than-or-equal 8 blocks, as the code can be statically stripped out of those functions. Specialized functions that do require more than 8 blocks will have the branch removed and unconditionally execute the 2nd blocks gather routine. Lastly, the _any() flavour will have the conditional branch, and the branch predictor may mispredict a bit, but per burst will likely get most packets correct (particularly towards the middle and end of a burst). The code has been run with unit tests under autovalidation and passes all cases, and unit test coverage has been checked to ensure the 16 block code paths are executing. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev: Add a partial HWOL PMD statistic.Cian Ferriter2021-07-091-0/+2
| | | | | | | | | | It is possible for packets traversing the userspace datapath to match a flow before hitting on EMC by using a mark ID provided by a NIC. Add a PMD statistic for this hit. Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev: Add command to get dpif implementations.Harry van Haaren2021-07-091-0/+1
| | | | | | | | | | | | | | | | This commit adds a new command to retrieve the list of available DPIF implementations. This can be used by to check what implementations of the DPIF are available in any given OVS binary. It also returns which implementations are in use by the OVS PMD threads. Usage: $ ovs-appctl dpif-netdev/dpif-impl-get Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Co-authored-by: Cian Ferriter <cian.ferriter@intel.com> Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-avx512: Add ISA implementation of dpif.Harry van Haaren2021-07-091-0/+2
| | | | | | | | | | | | | | | | | | This commit adds the AVX512 implementation of DPIF functionality, specifically the dp_netdev_input_outer_avx512 function. This function only handles outer (no re-circulations), and is optimized to use the AVX512 ISA for packet batching and other DPIF work. Sparse is not able to handle the AVX512 intrinsics, causing compile time failures, so it is disabled for this file. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Co-authored-by: Cian Ferriter <cian.ferriter@intel.com> Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Co-authored-by: Kumar Amber <kumar.amber@intel.com> Signed-off-by: Kumar Amber <kumar.amber@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev: Refactor to multiple header files.Harry van Haaren2021-07-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split the very large file dpif-netdev.c and the datastructures it contains into multiple header files. Each header file is responsible for the datastructures of that component. This logical split allows better reuse and modularity of the code, and reduces the very large file dpif-netdev.c to be more managable. Due to dependencies between components, it is not possible to move component in smaller granularities than this patch. To explain the dependencies better, eg: DPCLS has no deps (from dpif-netdev.c file) FLOW depends on DPCLS (struct dpcls_rule) DFC depends on DPCLS (netdev_flow_key) and FLOW (netdev_flow_key) THREAD depends on DFC (struct dfc_cache) DFC_PROC depends on THREAD (struct pmd_thread) DPCLS lookup.h/c require only DPCLS DPCLS implementations require only dpif-netdev-lookup.h. - This change was made in 2.12 release with function pointers - This commit only refactors the name to "private-dpcls.h" netdev_flow_key_equal_mf() is renamed to emc_flow_key_equal_mf(). Rename functions specific to dpcls from netdev_* namespace to the dpcls_* namespace, as they are only used by dpcls code. 'inline' is added to the dp_netdev_flow_hash() when it is moved definition to fix a compiler error. One valid checkpatch issue with the use of the EMC_FOR_EACH_POS_WITH_HASH() macro was fixed. Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Co-authored-by: Cian Ferriter <cian.ferriter@intel.com> Signed-off-by: Cian Ferriter <cian.ferriter@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* conntrack: Handle SNAT with all-zero IP address.Paolo Valerio2021-07-081-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces for the userspace datapath the handling of rules like the following: ct(commit,nat(src=0.0.0.0),...) Kernel datapath already handle this case that is particularly handy in scenarios like the following: Given A: 10.1.1.1, B: 192.168.2.100, C: 10.1.1.2 A opens a connection toward B on port 80 selecting as source port 10000. B's IP gets dnat'ed to C's IP (10.1.1.1:10000 -> 192.168.2.100:80). This will result in: tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=10000,dport=80), reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10000), protoinfo=(state=ESTABLISHED) A now tries to establish another connection with C using source port 10000, this time using C's IP address (10.1.1.1:10000 -> 10.1.1.2:80). This second connection, if processed by conntrack with no SNAT/DNAT involved, collides with the reverse tuple of the first connection, so the entry for this valid connection doesn't get created. With this commit, and adding a SNAT rule with 0.0.0.0 for 10.1.1.1:10000 -> 10.1.1.2:80 will allow to create the conn entry: tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=10000,dport=80), reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10001), protoinfo=(state=ESTABLISHED) tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=10000,dport=80), reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10000), protoinfo=(state=ESTABLISHED) The issue exists even in the opposite case (with A trying to connect to C using B's IP after establishing a direct connection from A to C). This commit refactors the relevant function in a way that both of the previously mentioned cases are handled as well. Suggested-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Gaetan Rivet <grive@u256.net> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* conntrack: Document all-zero IP SNAT behavior and add a test case.Eelco Chaudron2021-07-081-0/+3
| | | | | | | | | | | | | | | | | | | | Currently, conntrack in the kernel has an undocumented feature referred to as all-zero IP address SNAT. Basically, when a source port collision is detected during the commit, the source port will be translated to an ephemeral port. If there is no collision, no SNAT is performed. This patchset documents this behavior and adds a self-test to verify it's not changing. In addition, a datapath feature flag is added for the all-zero IP SNAT case. This will help applications on top of OVS, like OVN, to determine this feature can be used. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Acked-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Alin-Gabriel Serdean <aserdean@ovn.org> Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovs-numa: Support non-contiguous numa nodes and offline CPU cores.David Wilder2021-07-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | This change removes the assumption that numa nodes and cores are numbered contiguously in linux. This change is required to support some Power systems. A check has been added to verify that cores are online, offline cores result in non-contiguously numbered cores. DPDK EAL option generation is updated to work with non-contiguous numa nodes. These options can be seen in the ovs-vswitchd.log. For example: a system containing only numa nodes 0 and 8 will generate the following: EAL ARGS: ovs-vswitchd --socket-mem 1024,0,0,0,0,0,0,0,1024 \ --socket-limit 1024,0,0,0,0,0,0,0,1024 -l 0 Tests for pmd and dpif-netdev have been updated to validate non-contiguous numbered nodes. Signed-off-by: David Wilder <dwilder@us.ibm.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* stream-ssl: Remove unsafe 1024 bit dh paramsJaime Caamaño Ruiz2021-07-071-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using 1024 bit params for DH is considered unsafe [1]. Additionally, from [2]: "Modern servers that do not support export ciphersuites are advised to either use SSL_CTX_set_tmp_dh() or alternatively, use the callback but ignore keylength and is_export and simply supply at least 2048-bit parameters in the callback." Additionally, using 1024 bit dh params may block clients running on recent openssl version from connecting given the stricter default security requirements of those new openssl versions. The error message for these clients looks like: error:141A318A:SSL routines:tls_process_ske_dhe:dh key too small:ssl/statem/statem_clnt.c:2150 As a workaround, this error can be suppressed tweaking the cipher list (--ssl-ciphers) to either 'HIGH:!aNULL:!MD5:@SECLEVEL=1' to reduce security requirements or 'HIGH:!aNULL:!MD5:!DH' to avoid using fixed param DH based ciphers. The first option is recommended though as it likely a fixed param DH cipher is the best possible option in that situation. [1] https://weakdh.org/ [2] https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_tmp_dh_callback.html Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
* NEWS: Add note about PPS support for ingress policingLouis Peens2021-07-021-0/+4
| | | | | | | | | Packet-per-second based rate limiting on ingress recently landed, add a NEWS item for this. Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
* netdev-dpdk-offload: Add vxlan pattern matching function.Eli Britstein2021-06-241-0/+2
| | | | | | | | | | | | | For VXLAN offload, matches should be done on outer header for tunnel properties as well as inner packet matches. Add a function for parsing VXLAN tunnel matches. Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Tested-by: Emma Finn <emma.finn@intel.com> Tested-by: Marko Kovacevic <marko.kovacevic@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-offload-dpdk: Support tunnel pop action.Eli Britstein2021-06-241-0/+2
| | | | | | | | | | | Support tunnel pop action. Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Tested-by: Emma Finn <emma.finn@intel.com> Tested-by: Marko Kovacevic <marko.kovacevic@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev: Expand the meter capacity.Tonghao Zhang2021-06-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For now, ovs-vswitchd use the array of the dp_meter struct to store meter's data, and at most, there are only 65536 (defined by MAX_METERS) meters that can be used. But in some case, for example, in the edge gateway, we should use 200,000+, at least, meters for IP address bandwidth limitation. Every one IP address will use two meters for its rx and tx path[1]. In other way, ovs-vswitchd should support meter-offload (rte_mtr_xxx api introduced by dpdk.), but there are more than 65536 meters in the hardware, such as Mellanox ConnectX-6. This patch use cmap to manage the meter, instead of the array. * Insertion performance, ovs-ofctl add-meter 1000+ meters, the cmap takes about 4000ms, as same as previous implementation. * Lookup performance in datapath, we add 1000+ meters which rate limit are 10Gbps (the NIC cards are 10Gbps, so netdev-datapath will not drop the packets.), and a flow which only forward packets from p0 to p1, with meter action[2]. On other machine, pktgen-dpdk will generate 64B packets to p0. The forwarding performance always is 1324 Kpps on my server which CPU is Intel E5-2650, 2.00GHz. [1]. $ in_port=p0,ip,ip_dst=1.1.1.x action=meter:n,output:p1 $ in_port=p1,ip,ip_src=1.1.1.x action=meter:m,output:p0 [2]. $ in_port=p0 action=meter:100,output:p1 Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpdk: Add debug appctl to get malloc statistics.Eli Britstein2021-06-181-0/+1
| | | | | | | | | | | New appctl 'dpdk/get-malloc-stats' implemented to get result of 'rte_malloc_dump_stats()' function. Could be used for debugging. Signed-off-by: Eli Britstein <elibr@nvidia.com> Reviewed-by: Salem Sol <salems@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* docs: Add a topic about record/replay with ovsdb-server.Ilya Maximets2021-06-071-0/+4
| | | | | | | Also added a NEWS entry. Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Dumitru Ceara <dceara@redhat.com>
* ovsdb-tool: add --election-timer=ms option to 'create-cluster'Dan Williams2021-05-271-0/+3
| | | | | | | | | | | | After creating the new clustered database write a raft entry that sets the desired election timer. This allows CMSes to set the election timer at cluster start and avoid an error-prone election timer modification process after the cluster is up. Reported-at: https://bugzilla.redhat.com/1831778 Signed-off-by: Dan Williams <dcbw@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
* dpdk: Use DPDK 20.11.1 release.Hariprasad Govindharajan2021-05-121-0/+3
| | | | | | | | | | | Modify ci linux build script to use the latest DPDK stable release. Modify Documentation to use the latest DPDK stable release 20.11.1 Update NEWS file to reflect the latest DPDK stable releases. FAQ is updated to reflect the latest DPDK for each branch. Signed-off-by: Hariprasad Govindharajan <hariprasad.govindharajan@intel.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
* dpif-netdev: Allow PMD auto load balance with cross-numa.Kevin Traynor2021-03-221-0/+3
| | | | | | | | | | | | | | | | | | Previously auto load balance did not trigger a reassignment when there was any cross-numa polling as an rxq could be polled from a different numa after reassign and it could impact estimates. In the case where there is only one numa with pmds available, the same numa will always poll before and after reassignment, so estimates are valid. Allow PMD auto load balance to trigger a reassignment in this case. Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: David Marchand <david.marchand@redhat.com> Tested-by: Sunil Pai G <sunil.pai.g@intel.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovs-ctl: Allow recording hostname separately.Frode Nordahl2021-03-011-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | ovs-ctl determines the system FQDN or hostname and records it in the `external-ids:hostname` field of the `Open-vSwitch` table on system startup if it is not already set. This value may be consumed by downstream software and having it unset or set to a incorrect value could lead to erratic behavior of a system. When a system is configured to use an Open vSwitch controlled datapath as its only network connection, the current ordering of events would always record a unreliable hostname. To tackle this problem this patch adds an optional argument that allows starting Open vSwitch without recording the hostname in the database as well as a new ctl command to record the hostname separately. This command can be called by the system startup scripts when the system is ready to collect and record this information. Reported-At: https://bugs.launchpad.net/bugs/1915829 Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Set release date for 2.15.0.Ilya Maximets2021-02-151-1/+1
| | | | | Acked-by: Ian Stokes <ian.stokes@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* db-ctl-base: Add {in} and {not-in} set relational operators.Ben Pfaff2021-02-021-0/+2
| | | | | | | | | | | I would have found these useful for the OVN tests. The {in} operator is the same as {<=}, but it's still useful to have the alternate syntax because most of the time we think of set inclusion separately from set subsets. The {not-in} operator is different from any existing operator though. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ilya Maximets <i.maximets@ovn.org>
* Prepare for post-2.15.0 (2.15.90).Ilya Maximets2021-01-151-0/+4
| | | | | | Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Prepare for 2.15.0.Ilya Maximets2021-01-151-1/+1
| | | | | | Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* doc: Deprecate building Linux kernel module from OVS source tree.Greg Rose2021-01-151-0/+5
| | | | | | | | | | | | | It is decided (1) to deprecate building the Linux kernel module from the Open vSwitch source tree. Update the NEWS and FAQ to provide notice. 1. https://mail.openvswitch.org/pipermail/ovs-dev/2020-December/378831.html Signed-off-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Use column diffs for ovsdb and raft log entries.Ilya Maximets2021-01-151-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, ovsdb-server stores complete value for the column in a database file and in a raft log in case this column changed. This means that transaction that adds, for example, one new acl to a port group creates a log entry with all UUIDs of all existing acls + one new. Same for ports in logical switches and routers and more other columns with sets in Northbound DB. There could be thousands of acls in one port group or thousands of ports in a single logical switch. And the typical use case is to add one new if we're starting a new service/VM/container or adding one new node in a kubernetes or OpenStack cluster. This generates huge amount of traffic within ovsdb raft cluster, grows overall memory consumption and hurts performance since all these UUIDs are parsed and formatted to/from json several times and stored on disks. And more values we have in a set - more space a single log entry will occupy and more time it will take to process by ovsdb-server cluster members. Simple test: 1. Start OVN sandbox with clustered DBs: # make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered' 2. Run a script that creates one port group and adds 4000 acls into it: # cat ../memory-test.sh pg_name=my_port_group export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off) ovn-nbctl pg-add $pg_name for i in $(seq 1 4000); do echo "Iteration: $i" ovn-nbctl --log acl-add $pg_name from-lport $i udp drop done ovn-nbctl acl-del $pg_name ovn-nbctl pg-del $pg_name ovs-appctl -t $(pwd)/sandbox/nb1 memory/show ovn-appctl -t ovn-nbctl exit --- 4. Check the current memory consumption of ovsdb-server processes and space occupied by database files: # ls sandbox/[ns]b*.db -alh # ps -eo vsz,rss,comm,cmd | egrep '=[ns]b[123].pid' Test results with current ovsdb log format: On-disk Nb DB size : ~369 MB RSS of Nb ovsdb-servers: ~2.7 GB Time to finish the test: ~2m In order to mitigate memory consumption issues and reduce computational load on ovsdb-servers let's store diff between old and new values instead. This will make size of each log entry that adds single acl to port group (or port to logical switch or anything else like that) very small and independent from the number of already existing acls (ports, etc.). Added a new marker '_is_diff' into a file transaction to specify that this transaction contains diffs instead of replacements for the existing data. One side effect is that this change will actually increase the size of file transaction that removes more than a half of entries from the set, because diff will be larger than the resulted new value. However, such operations are rare. Test results with change applied: On-disk Nb DB size : ~2.7 MB ---> reduced by 99% RSS of Nb ovsdb-servers: ~580 MB ---> reduced by 78% Time to finish the test: ~1m27s ---> reduced by 27% After this change new ovsdb-server is still able to read old databases, but old ovsdb-server will not be able to read new ones. Since new servers could join ovsdb cluster dynamically it's hard to implement any runtime mechanism to handle cases where different versions of ovsdb-server joins the cluster. However we still need to handle cluster upgrades. For this case added special command line argument to disable new functionality. Documentation updated with the recommended way to upgrade the ovsdb cluster. Acked-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev: Add parameters to configure PMD auto load balance.Christophe Fontaine2021-01-151-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Two important parts of how PMD auto load balance operates are how loaded a core needs to be and how much improvement is estimated before a PMD auto load balance can trigger. Previously they were hardcoded to 95% loaded and 25% variance improvement. These default values may not be suitable for all use cases and we may want to use a more (or less) aggressive rebalance, either on the pmd load threshold or on the minimum variance improvement threshold. The defaults are not changed, but "pmd-auto-lb-load-threshold" and "pmd-auto-lb-improvement-threshold" parameters are added to override the defaults. $ ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-load-threshold="70" $ ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-improvement-threshold="20" Signed-off-by: Christophe Fontaine <cfontain@redhat.com> Co-Authored-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: David Marchand <david.marchand@redhat.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovs-monitor-ipsec: Add option to not restart IKE daemon.Mark Gray2021-01-061-0/+2
| | | | | | | Signed-off-by: Mark Gray <mark.d.gray@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovs-monitor-ipsec: Allow exit of ipsec daemon maintaining state.Mark Gray2021-01-061-0/+3
| | | | | | | | | | | | | | When 'ovs-monitor-ipsec' exits, it clears all persistent state (i.e. active ipsec connections, /etc/ipsec.conf, certs/keys). In some use-cases, we may want to exit and maintain state so that ipsec connectivity is maintained. One example of this is during an upgrade. This will require the caller to clear this persistent state when appropriate (e.g. before 'ovs-monitor-ipsec') is restarted. Signed-off-by: Mark Gray <mark.d.gray@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>