Commit message (Author, Age, Files, Lines)
...
* system-offloads-traffic: Skip tests if nc is not present. (Simon Horman, 2023-01-30, 1 file, -0/+2)

    The following tests use the nc command and should be skipped if nc is
    not present:

    - "offloads - check interface meter offloading - offloads disabled"
    - "offloads - check interface meter offloading - offloads enabled"

    Fixes: 5660b89a309d ("dpif-netlink: Offloading meter to tc police action")
    Reported-by: David Marchand <david.marchand@redhat.com>
    Reviewed-by: Louis Peens <louis.peens@corigine.com>
    Signed-off-by: Simon Horman <simon.horman@corigine.com>
    Acked-by: Ilya Maximets <i.maximets@ovn.org>
    Reviewed-by: David Marchand <david.marchand@redhat.com>
* system-traffic: Remove unnecessary dependency on nc. (Simon Horman, 2023-01-30, 1 file, -1/+0)

    The "conntrack - ICMP related to original direction" test does not use
    nc and therefore does not need to be skipped if nc is not present.

    Fixes: d0e4206230b3 ("tests: ICMP related to original direction test.")
    Reported-by: David Marchand <david.marchand@redhat.com>
    Reviewed-by: Louis Peens <louis.peens@corigine.com>
    Signed-off-by: Simon Horman <simon.horman@corigine.com>
    Acked-by: Ilya Maximets <i.maximets@ovn.org>
    Reviewed-by: David Marchand <david.marchand@redhat.com>
* netdev-offload-tc: Fix misaligned access to ct label. (Ilya Maximets, 2023-01-27, 1 file, -10/+11)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UndefinedBehaviorSanitizer: lib/netdev-offload-tc.c:1356:50: runtime error: member access within misaligned address 0x60700001a89c for type 'const struct (unnamed struct at lib/netdev-offload-tc.c:1350:27)', which requires 8 byte alignment 0x60700001a89c: note: pointer points here 24 00 04 00 01 00 00 05 00 00 0d 00 0a 00 00 00 00 00 00 00 ... ^ 0 0xd5d183 in parse_put_flow_ct_action lib/netdev-offload-tc.c:1356:50 1 0xd5783f in netdev_tc_parse_nl_actions lib/netdev-offload-tc.c:2015:19 2 0xd4027c in netdev_tc_flow_put lib/netdev-offload-tc.c:2355:11 3 0x9666d7 in netdev_flow_put lib/netdev-offload.c:318:14 4 0xcd4c0a in parse_flow_put lib/dpif-netlink.c:2297:11 5 0xcd4c0a in try_send_to_netdev lib/dpif-netlink.c:2384:15 6 0xcd4c0a in dpif_netlink_operate lib/dpif-netlink.c:2455:23 7 0x87d40e in dpif_operate lib/dpif.c:1372:13 8 0x6d43e9 in handle_upcalls ofproto/ofproto-dpif-upcall.c:1674:5 9 0x6d43e9 in recv_upcalls ofproto/ofproto-dpif-upcall.c:905:9 10 0x6cf6ea in udpif_upcall_handler ofproto/ofproto-dpif-upcall.c:801:13 11 0xb6d7ea in ovsthread_wrapper lib/ovs-thread.c:423:12 12 0x7f5ccf017801 in start_thread 13 0x7f5ccefb744f in __GI___clone3 Fixes: 9221c721bec0 ("netdev-offload-tc: Add conntrack label and mark support") Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev-perf: Add metric averages when no iterations. (Kevin Traynor, 2023-01-27, 1 file, -23/+26)

    pmd-perf-show with pmd-perf-metrics=true displays a histogram with
    averages. However, averages were not displayed when there are no
    iterations.

    They will be all zero so it is not hiding useful information, but the
    stats look incomplete without them, especially when they are displayed
    for some PMD thread cores and not others.

    The histogram print is large and this is just an extra couple of lines,
    so might as well print them all the time to ensure that the user does
    not think there is something missing from the display.

    Before patch:

    Histograms
          cycles/it
      499         0
      716         0
     1025         0
     1469         0
    <snip>

    After patch:

    Histograms
          cycles/it
      499         0
      716         0
     1025         0
     1469         0
    <snip>
    ---------------
    cycles/it     0

    Acked-by: Mike Pattrick <mkp@redhat.com>
    Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev-perf: Remove not a number stat value. (Kevin Traynor, 2023-01-27, 1 file, -3/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some stats in pmd-perf-show don't check for divide by zero which results in not a number (-nan). This is a normal case for some of the stats when there are no Rx queues assigned to the PMD thread core. It is not obvious what -nan is to a user so add a check for divide by zero and set stat to 0 if present. Before patch: pmd thread numa_id 1 core_id 9: Iterations: 0 (-nan us/it) - Used TSC cycles: 0 ( 0.0 % of total cycles) - idle iterations: 0 ( -nan % of used cycles) - busy iterations: 0 ( -nan % of used cycles) After patch: pmd thread numa_id 1 core_id 9: Iterations: 0 (0.00 us/it) - Used TSC cycles: 0 ( 0.0 % of total cycles) - idle iterations: 0 ( 0.0 % of used cycles) - busy iterations: 0 ( 0.0 % of used cycles) Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
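The divide-by-zero guard described in this commit can be sketched as follows. This is a minimal Python illustration rather than the C code from the patch; `safe_ratio` and `format_pmd_stats` are hypothetical names, and the output string only loosely mirrors the pmd-perf-show layout.

```python
def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero.

    Without the guard, a float division such as 0.0 / 0.0 in C yields NaN,
    which is what showed up as "-nan" in the stats.
    """
    if denominator == 0:
        return 0.0
    return numerator / denominator


def format_pmd_stats(iterations: int, total_us: float,
                     idle_cycles: int, used_cycles: int) -> str:
    """Format two of the derived stats using the guarded division."""
    us_per_it = safe_ratio(total_us, iterations)
    idle_pct = safe_ratio(100.0 * idle_cycles, used_cycles)
    return (f"Iterations: {iterations} ({us_per_it:.2f} us/it), "
            f"idle: {idle_pct:.1f} %")
```

A PMD thread with no Rx queues assigned has zero iterations and zero used cycles, so every derived value falls back to 0 instead of NaN.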
* system-traffic.at: Skip the 'ICMP6 Related' test if nc is missing. (Ilya Maximets, 2023-01-27, 1 file, -0/+1)

    The test fails if 'nc' is not available; it should be skipped instead.

    Fixes: b020a416e24c ("System Tests: Enhance NAT tests.")
    Reviewed-by: David Marchand <david.marchand@redhat.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* utilities: Add revalidator measurement script and needed USDT probes. (Eelco Chaudron, 2023-01-27, 4 files, -0/+991)
| | | | | | | | | | | | | | | | | | | | This patch adds a Python script that can be used to analyze the revalidator runs by providing statistics (including some real time graphs). The USDT events can also be captured to a file and used for later offline analysis. The following blog explains the Open vSwitch revalidator implementation and how this tool can help you understand what is happening in your system. https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Adrian Moreno <amorenoz@redhat.com> Acked-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* tests/mfex: Silence Blowfish/CAST5 deprecation warnings. (Robin Jarry, 2023-01-27, 1 file, -0/+8)
| | | | | | | | | | | | | | | | | On Fedora 37 (at least), MFEX unit tests are failing because of deprecation warnings: $ python3 tests/mfex_fuzzy.py test_traffic.pcap 2000 /usr/lib/python3.11/site-packages/scapy/layers/ipsec.py:471: CryptographyDeprecationWarning: Blowfish has been deprecated cipher=algorithms.Blowfish, /usr/lib/python3.11/site-packages/scapy/layers/ipsec.py:485: CryptographyDeprecationWarning: CAST5 has been deprecated cipher=algorithms.CAST5, Signed-off-by: Robin Jarry <rjarry@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
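One way to silence only these two deprecation messages is a targeted warnings filter. The sketch below is an illustration and not the code from the patch: `FakeCryptographyDeprecationWarning` is a hypothetical stand-in for the real warning category (assumed here to come from the cryptography package) so that the example runs without scapy or cryptography installed.

```python
import warnings


class FakeCryptographyDeprecationWarning(UserWarning):
    """Hypothetical stand-in for the cryptography package's warning class."""


def emit_cipher_warnings() -> None:
    """Emit the two messages from the commit plus one unrelated warning."""
    category = FakeCryptographyDeprecationWarning
    warnings.warn("Blowfish has been deprecated", category)
    warnings.warn("CAST5 has been deprecated", category)
    warnings.warn("something else has been deprecated", category)


def collect_visible_warnings() -> list:
    """Return the warning messages that survive the targeted filter."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # The message argument is a regex matched against the start of the
        # warning text; the later filter takes precedence over "always".
        warnings.filterwarnings(
            "ignore",
            message=r"(Blowfish|CAST5)",
            category=FakeCryptographyDeprecationWarning,
        )
        emit_cipher_warnings()
    return [str(w.message) for w in caught]
```

Filtering by message as well as category keeps other deprecation warnings from the same library visible.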
* revalidator: Allow min-revalidator-pps to be 0. (Han Zhou, 2023-01-27, 3 files, -2/+7)
| | | | | | | | | | | | | | | | | | | | Today the minimum value for this setting is 1. This patch allows it to be 0, meaning not checking pps at all, and always do revalidation. This is particularly useful for environments where some of the applications with long-lived connections may have very low traffic for certain period but have high rate of burst periodically. It is desirable to keep the datapath flows instead of periodically deleting them to avoid burst of packet miss to userspace. When setting to 0, there may be more datapath flows to be revalidated, resulting in higher CPU cost of revalidator threads. This is the downside but in certain cases this is still more desirable than packet misses to user space. Signed-off-by: Han Zhou <hzhou@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
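The effect of allowing 0 can be pictured with a deliberately simplified model. The function below is a hypothetical Python sketch, not the actual revalidator logic, which takes more inputs into account; it only shows that 0 disables the packets-per-second check so that the flow is always revalidated rather than deleted.

```python
def should_revalidate(packets: int, duration_s: float,
                      min_revalidator_pps: int) -> bool:
    """Simplified, assumed model of the pps check for one datapath flow.

    True means "keep the flow and revalidate it"; False means the flow is
    a candidate for deletion because its traffic rate is too low.
    """
    if min_revalidator_pps == 0:
        # New behaviour: do not check pps at all, always revalidate.
        return True
    if duration_s <= 0:
        # No measurable interval yet; keep the flow in this sketch.
        return True
    return packets / duration_s >= min_revalidator_pps
```

With a long-lived but mostly idle connection (one packet per minute), a threshold of 5 would drop the flow, while a threshold of 0 keeps it at the cost of more revalidation work.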
* netdev-dpdk: Free mbufs in bulk. (Ilya Maximets, 2023-01-27, 1 file, -10/+3)
| | | | | | | | | | | | | rte_pktmbuf_free_bulk() function was introduced in 19.11 and became stable in 21.11. Use it to free arrays of mbufs instead of freeing packets one by one. In simple V2V testing with 64B packets, 2 PMD threads and bidirectional traffic this change improves performance by 3.5 - 4.5 %. Reviewed-by: David Marchand <david.marchand@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Don't convert unchanged columns during database conversion. (Ilya Maximets, 2023-01-27, 1 file, -11/+48)
| | | | | | | | | | | | | | | | | | | | | | | | Column conversion involves converting it to json and back. These are heavy operations and completely unnecessary if the column type didn't change. Most of the time schema changes only add new columns/tables without changing existing ones at all. Clone the column instead to save some time. This will also save time while destroying the original database since we will only need to reduce reference counters on unchanged datum objects that were cloned instead of actually freeing them. Additionally, moving the column lookup into a separate loop, so we don't perform an shash lookup for each column of each row. Testing with 440 MB OVN_Southbound database shows 70% speed up of the ovsdb_convert() function. Execution time reduced from 15 to 4.4 seconds, 3.5 of which is a post-conversion transaction replay. Overall time required for the online database conversion reduced from 37 to 25 seconds. Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
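The optimization can be sketched in a few lines. This is an illustrative Python model under stated assumptions, not the ovsdb C implementation: schemas are plain dicts mapping column names to type names, `CASTS` is a hypothetical stand-in for real type conversion, and sharing a Python reference stands in for cloning a reference-counted datum.

```python
import json

# Hypothetical converters standing in for real ovsdb type conversion.
CASTS = {"integer": int, "string": str}


def plan_columns(old_schema: dict, new_schema: dict) -> dict:
    """Decide once, outside the row loop, which columns kept their type."""
    return {
        column: (old_schema[column] == new_type, new_type)
        for column, new_type in new_schema.items()
        if column in old_schema
    }


def convert_rows(rows: list, old_schema: dict, new_schema: dict) -> list:
    """Convert rows, sharing unchanged columns instead of re-encoding them."""
    plan = plan_columns(old_schema, new_schema)
    converted = []
    for row in rows:
        new_row = {}
        for column, (unchanged, new_type) in plan.items():
            if unchanged:
                # Cheap path: reuse the existing value.
                new_row[column] = row[column]
            else:
                # Expensive path: round trip through JSON, then convert.
                value = json.loads(json.dumps(row[column]))
                new_row[column] = CASTS[new_type](value)
        converted.append(new_row)
    return converted
```

Because most schema upgrades only add columns or tables, nearly every column takes the cheap path, which is where the reported speed-up would come from.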
* ovsdb-types: Add functions to compare types for equality. (Ilya Maximets, 2023-01-27, 2 files, -0/+64)
| | | | | | | Will be used in the next commit to optimize database conversion. Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev: Set PMD load based sleep start/inc to 1 us. (Kevin Traynor, 2023-01-23, 4 files, -18/+10)
| | | | | | | | | | | | Now that the timer slack for the PMD threads is reduced we can also reduce the start/increment for PMD load based sleeping to match it. This will further reduce initial sleep times making it more resilient to interfaces that might be sensitive to large sleep times. Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev: Set timer slack for PMD threads. (David Marchand, 2023-01-23, 4 files, -5/+21)

    The default Linux timer slack groups timer expirations into 50 us
    intervals.

    With some traffic patterns this can mean that returning to process
    packets after a sleep takes too long and packets are dropped.

    Add a helper to util.c and use it to reduce the timer slack for PMD
    threads, so that sleeps with smaller resolutions can be done to prevent
    sleeping for too long.

    Fixes: de3bbdc479a9 ("dpif-netdev: Add PMD load based sleeping.")
    Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401121.html
    Reported-by: Ilya Maximets <i.maximets@ovn.org>
    Signed-off-by: David Marchand <david.marchand@redhat.com>
    Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
    Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-dpdk: Fix deadlock due to virtqueue stats retrieval. (David Marchand, 2023-01-19, 1 file, -20/+102)

    As Ilya reported, we have an ABBA deadlock between DPDK vq->access_lock
    and OVS dev->mutex when the OVS main thread refreshes statistics while a
    vring state change event is being processed for the same vhost port.

    To break from this situation, move vring state change notification
    handling from the vhost-events DPDK thread to a dedicated thread using a
    lockless queue.

    Besides, for the case when a bogus/malicious guest is sending continuous
    updates, add a counter of pending updates in the queue and warn if a
    threshold of 1000 entries is reached.

    Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401101.html
    Fixes: 3b29286db1c5 ("netdev-dpdk: Add per virtqueue statistics.")
    Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
    Signed-off-by: David Marchand <david.marchand@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
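The shape of this fix, handing state-change events to a dedicated thread so that the event thread never takes the device lock, can be sketched as below. This is an assumed Python model, not the OVS code: `VringStateNotifier` and its methods are hypothetical names, and Python's `queue.Queue` is lock-based internally rather than lockless, so it only stands in for the queue used by the patch.

```python
import queue
import threading

PENDING_WARN_THRESHOLD = 1000  # threshold named in the commit message


class VringStateNotifier:
    """Defer vring state updates to a dedicated worker thread."""

    def __init__(self, apply_update):
        self._queue = queue.Queue()
        self._apply_update = apply_update
        self.warned = False
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def notify(self, port: str, queue_id: int, enabled: bool) -> None:
        """Called from the event thread; only enqueues, takes no device lock."""
        if self._queue.qsize() >= PENDING_WARN_THRESHOLD:
            # A misbehaving guest may be flooding updates.
            self.warned = True
        self._queue.put((port, queue_id, enabled))

    def _run(self) -> None:
        """Worker loop: apply updates one by one until the stop marker."""
        while True:
            item = self._queue.get()
            if item is None:
                break
            self._apply_update(*item)

    def stop(self) -> None:
        """Enqueue the stop marker and wait for pending updates to drain."""
        self._queue.put(None)
        self._worker.join()
```

Since only the worker thread calls `apply_update`, whatever lock that callback takes is never acquired while the event thread holds its own lock, which removes the ABBA ordering.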
* ovsdb: Fix database statistics during the database replacement. (Ilya Maximets, 2023-01-18, 2 files, -0/+21)
| | | | | | | | | | | | | | | The counter for the number of atoms has to be re-set to the number from the new database, otherwise the value will be incorrect. For example, this is causing the atom counter doubling after online conversion of a clustered database. Miscounting may also lead to increased memory consumption by the transaction history or otherwise too aggressive transaction history sweep. Fixes: 317b1bfd7dd3 ("ovsdb: Don't let transaction history grow larger than the database.") Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Prepare for post-3.1.0 (3.1.90). (Ilya Maximets, 2023-01-16, 3 files, -1/+11)
| | | | | Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Prepare for 3.1.0. (Ilya Maximets, 2023-01-16, 5 files, -6/+7)
| | | | | Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovs-vsctl: Do not send 'set_db_change_aware'. (Han Zhou, 2023-01-16, 1 file, -0/+1)
| | | | | | | | | | | ovs-vsctl's connections are short-lived, so it doesn't care about db status changes. Reported-by: Tobias Hofmann <tohofman@cisco.com> Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2021-February/050914.html Acked-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-idl: Provide API to disable set_db_change_aware request. (Han Zhou, 2023-01-16, 4 files, -1/+34)

    Short-lived ovsdb clients, e.g. ovn-nbctl/ovn-sbctl reading some metrics
    from the OVN NB/SB server, don't really need to be aware of db changes,
    because they exit immediately after getting the initial response for the
    requested data. In such use cases, however, the clients still send a
    'set_db_change_aware' request, which results in server-side error logs
    when the server tries to send out the response for it, because by then
    the client that is supposed to receive the response has already closed
    the connection and exited. E.g.:

    2023-01-10T18:23:29.431Z|00007|jsonrpc|WARN|unix#3: receive error: Connection reset by peer
    2023-01-10T18:23:29.431Z|00008|reconnect|WARN|unix#3: connection dropped (Connection reset by peer)

    To avoid such problems, this patch provides an API to allow a client to
    choose to not send the 'set_db_change_aware' request.

    There was an earlier attempt to fix this [0], but it was not accepted
    back then as discussed in the email [1]. It was also discussed in the
    emails that an alternative approach is to use notification instead of
    request, but that would require protocol changes and taking backward
    compatibility into consideration. So this patch takes a different
    approach and tries to keep the change small.

    [0] http://patchwork.ozlabs.org/project/openvswitch/patch/1594380801-32134-1-git-send-email-dceara@redhat.com/
    [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2021-February/050919.html

    Reported-by: Girish Moodalbail <gmoodalbail@nvidia.com>
    Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2020-July/050343.html
    Reported-by: Tobias Hofmann <tohofman@cisco.com>
    Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2021-February/050914.html
    Acked-by: Dumitru Ceara <dceara@redhat.com>
    Signed-off-by: Han Zhou <hzhou@ovn.org>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* openflow: Add extension to flush CT by generic match. (Ales Musil, 2023-01-16, 16 files, -22/+562)
| | | | | | | | | | | Add extension that allows to flush connections from CT by specifying fields that the connections should be matched against. This allows to match only some fields of the connection e.g. source address for orig direction. Reported-at: https://bugzilla.redhat.com/2120546 Signed-off-by: Ales Musil <amusil@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ofp, dpif: Allow CT flush based on partial match. (Ales Musil, 2023-01-16, 10 files, -141/+615)
| | | | | | | | | | | | | | | | Currently, the CT can be flushed by dpctl only by specifying the whole 5-tuple. This is not very convenient when there are only some fields known to the user of CT flush. Add new struct ofp_ct_match which represents the generic filtering that can be done for CT flush. The match is done only on fields that are non-zero with exception to the icmp fields. This allows the filtering just within dpctl, however it is a preparation for OpenFlow extension. Reported-at: https://bugzilla.redhat.com/2120546 Signed-off-by: Ales Musil <amusil@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
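The "match only on non-zero fields" rule can be illustrated with a short sketch. This is a hypothetical Python model, not the struct ofp_ct_match code: connections and matches are plain dicts, the field names are made up for the example, and the special handling of the icmp fields mentioned in the commit is not modeled.

```python
def ct_entry_matches(entry: dict, match: dict) -> bool:
    """Return True if every non-zero match field equals the entry's field."""
    for field, wanted in match.items():
        if wanted in (0, "", None):
            # A zero or unset field acts as a wildcard.
            continue
        if entry.get(field) != wanted:
            return False
    return True


def flush_matching(entries: list, match: dict) -> list:
    """Return the connections that remain after flushing the matching ones."""
    return [entry for entry in entries if not ct_entry_matches(entry, match)]
```

Specifying only a source address therefore flushes every connection from that address, and an all-zero match flushes everything.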
* dpif-netdev: Add PMD load based sleeping. (Kevin Traynor, 2023-01-12, 7 files, -10/+213)

    Sleep for an incremental amount of time if none of the Rx queues
    assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
    on a polling iteration of the PMD.

    Upon detecting the threshold of >= 16 pkts on an Rxq, reset the sleep
    time to zero (i.e. no sleep).

    Sleep time will be increased on each iteration where the low load
    conditions remain, up to a total of the max sleep time which is set by
    the user, e.g.:

    ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500

    The default pmd-maxsleep value is 0, which means that no sleeps will
    occur and the default behaviour is unchanged from previously.

    Also add new stats to pmd-perf-show to get visibility of operation,
    e.g.:

    ...
    - sleep iterations: 153994 ( 76.8 % of iterations)
    Sleep time (us): 9159399 ( 59 us/iteration avg.)
    ...

    Reviewed-by: Robin Jarry <rjarry@redhat.com>
    Reviewed-by: David Marchand <david.marchand@redhat.com>
    Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
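The sleep policy described in this commit can be sketched as a small state update. This is an assumed Python model rather than the dpif-netdev code: `next_sleep_us` is a hypothetical helper, and `increment_us` is a parameter introduced for illustration because the message above does not state the step size.

```python
HALF_BATCH = 16  # half a batch of packets, per the commit message


def next_sleep_us(current_sleep_us: int, rxq_packet_counts: list,
                  max_sleep_us: int, increment_us: int) -> int:
    """Return the sleep time to use after one PMD polling iteration."""
    if max_sleep_us == 0:
        # pmd-maxsleep=0 (the default): never sleep.
        return 0
    if any(count >= HALF_BATCH for count in rxq_packet_counts):
        # At least one Rx queue is busy: reset to no sleep.
        return 0
    # Low load persists: sleep a bit longer, capped at the user's maximum.
    return min(current_sleep_us + increment_us, max_sleep_us)
```

Feeding the result back in on each iteration gives the ramp-up under low load and the immediate reset once any Rx queue returns 16 or more packets.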
* util: Add non quiesce xnanosleep. (Kevin Traynor, 2023-01-12, 2 files, -4/+18)

    xnanosleep forces the thread into quiesce state in anticipation that it
    will be sleeping for a considerable time and that the thread may need to
    quiesce before the sleep is finished.

    In some cases, a very short sleep may be requested and in that case the
    overhead of going into quiesce state may be unnecessary. To allow for
    those cases add a xnanosleep_no_quiesce() variant.

    Suggested-by: Ilya Maximets <i.maximets@ovn.org>
    Reviewed-by: David Marchand <david.marchand@redhat.com>
    Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Documentation: Remove link to obsolete sources. (David Marchand, 2023-01-12, 1 file, -15/+14)
| | | | | | | | | | This archive website disappeared. On the other hand, the link to an obsolete dpif-provider man page probably did not provide much info and we can simply mention the current file. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Documentation: Remove reference to RST online editor. (David Marchand, 2023-01-11, 1 file, -4/+0)
| | | | | | | | rst.ninjs.org is not available anymore, but there are alternatives listed in this doc. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Documentation: Fix link to Netperf. (David Marchand, 2023-01-11, 1 file, -4/+4)
| | | | | | | netperf.org was shut down in favor of some HP related resources. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Documentation: Fix link to AppVeyor. (David Marchand, 2023-01-11, 1 file, -3/+3)
| | | | | | | | | | | | | Sphinx linkcheck complains with: Warning, treated as error: .../Documentation/intro/install/windows.rst:1093:broken link: www.appveyor.com () Add a https scheme in link to AppVeyor website. Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Documentation: Fix link to iproute2 git repository. (David Marchand, 2023-01-11, 1 file, -1/+1)
| | | | | | | | | | iproute2 git repositories were split and moved around v4.15 [1]. It is time to fix the link in OVS documentation. 1: https://lore.kernel.org/netdev/20180129082052.0eb85e9b@xeon-e3/ Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-offload-dpdk: Fix transfer flows. (David Marchand, 2023-01-11, 1 file, -1/+1)
| | | | | | | | | | | Following DPDK commit bd2a4d4b2e3a ("ethdev: forbid direction attribute in transfer flow rules"), the ingress attribute presence is rejected for transfer flows. Fixes: a77c7796f23a ("dpdk: Update to use v22.11.1.") Acked-by: Eli Britstein <elibr@nvidia.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* tests: Add unit tests to rculist. (Adrian Moreno, 2023-01-11, 3 files, -0/+241)
| | | | | | | | Low test coverage on this area caused some errors to remain unnoticed. Add basic functional test of rculist. Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* cirrus: Update to use FreeBSD 12.4. (Ilya Maximets, 2023-01-09, 1 file, -1/+1)
| | | | | | | | 12.4 was released in December. That means that 12.3 will become unavailable in a near future. Updating. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* system-dpdk: Fix error message in ping vhost-user ports. (Eelco Chaudron, 2023-01-09, 1 file, -0/+3)
| | | | | | | | | | | | | | | | In some environments, ovs-vswitchd gets shutdown before the pkill of testpmd has been completed, which results in the following error messages: Removing port 'dpdkvhostuser0' while vhost device still attached. To restore connectivity after re-adding of port, VM on socket '' must be restarted. This patch will wait for the socket disconnect to be handled by the vhost-user before shutting down OVS. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Co-authored-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-dpdk: Drop coverage counter for vhost IRQs. (David Marchand, 2023-01-09, 1 file, -9/+0)

    The vhost library now provides fine-grained statistics for guest
    notifications:

    - notifications for buffer reclaim by the guest,
    - notifications for buffer availability to the guest.

    Example before this patch:

    $ ovs-appctl coverage/show | grep vhost_notification
    vhost_notification 0.0/sec 0.000/sec 2.0283/sec total: 7302

    $ ovs-vsctl get interface vhost4 statistics | sed -e 's#[{}]##g' -e 's#, #\n#g' | grep guest_notifications
    rx_q0_guest_notifications=66
    tx_q0_guest_notifications=7236

    Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
    Signed-off-by: David Marchand <david.marchand@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-dpdk: Add per virtqueue statistics. (David Marchand, 2023-01-09, 2 files, -132/+348)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The DPDK vhost-user library maintains more granular per queue stats which can replace what OVS was providing for vhost-user ports. The benefits for OVS: - OVS can skip parsing packet sizes on the rx side, - dev->stats_lock won't be taken in rx/tx code unless some packet is dropped, - vhost-user is aware of which packets are transmitted to the guest, so per *transmitted* packet size stats can be reported, - more internal stats from vhost-user may be exposed, without OVS needing to understand them, Note: the vhost-user library does not provide global stats for a port. The proposed implementation is to have the global stats (exposed via netdev_get_stats()) computed by querying and aggregating all per queue stats. Since per queue stats are exposed via another netdev ops (netdev_get_custom_stats()), this may lead to some race and small discrepancies. This issue might already affect other netdev classes. 
Example: $ ovs-vsctl get interface vhost4 statistics | sed -e 's#[{}]##g' -e 's#, #\n#g' | grep -v =0$ rx_1_to_64_packets=12 rx_256_to_511_packets=15 rx_65_to_127_packets=21 rx_broadcast_packets=15 rx_bytes=7497 rx_multicast_packets=33 rx_packets=48 rx_q0_good_bytes=242 rx_q0_good_packets=3 rx_q0_guest_notifications=3 rx_q0_multicast_packets=3 rx_q0_size_65_127_packets=2 rx_q0_undersize_packets=1 rx_q1_broadcast_packets=15 rx_q1_good_bytes=7255 rx_q1_good_packets=45 rx_q1_guest_notifications=45 rx_q1_multicast_packets=30 rx_q1_size_256_511_packets=15 rx_q1_size_65_127_packets=19 rx_q1_undersize_packets=11 tx_1_to_64_packets=36 tx_256_to_511_packets=45 tx_65_to_127_packets=63 tx_broadcast_packets=45 tx_bytes=22491 tx_multicast_packets=99 tx_packets=144 tx_q0_broadcast_packets=30 tx_q0_good_bytes=14994 tx_q0_good_packets=96 tx_q0_guest_notifications=96 tx_q0_multicast_packets=66 tx_q0_size_256_511_packets=30 tx_q0_size_65_127_packets=42 tx_q0_undersize_packets=24 tx_q1_broadcast_packets=15 tx_q1_good_bytes=7497 tx_q1_good_packets=48 tx_q1_guest_notifications=48 tx_q1_multicast_packets=33 tx_q1_size_256_511_packets=15 tx_q1_size_65_127_packets=21 tx_q1_undersize_packets=12 Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* tc: Add support for TCA_STATS_PKT64. (Mike Pattrick, 2023-01-06, 1 file, -41/+69)
| | | | | | | | | | | | | | | | | Currently tc offload flow packet counters will roll over every ~4 billion packets. This is because the packet counter in struct tc_stats provided by TCA_STATS_BASIC is a 32bit integer. Now we check for the optional TCA_STATS_PKT64 attribute which provides the full 64bit packet counter if the 32bit one has rolled over. Because the TCA_STATS_PKT64 attribute may appear multiple times in a netlink message, the method of parsing attributes was changed. Fixes: f98e418fbdb6 ("tc: Add tc flower functions") Reported-at: https://bugzilla.redhat.com/1776816 Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
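The rollover and the fallback order can be shown with a few lines of arithmetic. This is an illustrative Python sketch, not the netlink parsing code from the patch: the stats dict keyed by attribute name and the helper names are hypothetical.

```python
U32_MAX = 0xFFFFFFFF  # largest value a 32-bit packet counter can hold


def wrap_u32(true_packets: int) -> int:
    """Model what a 32-bit counter reports for a given true packet count."""
    return true_packets & U32_MAX


def packet_count(stats: dict) -> int:
    """Prefer the optional 64-bit counter, else use the 32-bit one."""
    if "TCA_STATS_PKT64" in stats:
        return stats["TCA_STATS_PKT64"]
    return stats["TCA_STATS_BASIC"]["packets"]
```

After 2**32 + 7 packets the 32-bit field reads 7, which is the rollover the commit describes; the 64-bit attribute, when present, carries the full value.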
* Documentation: Fix links in maintainers.rst. (Ilya Maximets, 2023-01-06, 2 files, -6/+20)
| | | | | | | | | | | | | | | | | | | | | | | GitHub and Sphinx are parsing links differently. Sphinx knows about the overall documentation structure and all the sections defined in other docs, while GitHub is using direct rst 2 html conversion and doesn't know any of that. Sphinx wants links to sections in other docs to be defined with a :doc: field, but GitHub can't parse that and requires having a direct link to the other rST document. The problem is that we have a top level MAINTAINERS.rst, that should be parseable by GitHub, included in the maintainers.rst in the main documentation section that is used by Sphinx to generate html, pdf and other docs. So, it's hard to make links work in both. Working around that limitation by using rST substitutions for the links. Cutting off the substitutions for actual links and adding :doc: links instead during the file inclusion for Sphinx. Reported-by: Igor Zhukov <ivzhukov@sbercloud.ru> Acked-by: Han Zhou <hzhou@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Documentation: Fix links in the DPDK guide on physical ports. (Ilya Maximets, 2023-01-06, 1 file, -7/+7)

    The text enclosed in '<...>' is supposed to be an actual link and not
    the name of the link. This generates incorrect links that lead nowhere.

    Also, a single underscore is supposed to be used for external links.

    Reviewed-by: David Marchand <david.marchand@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* treewide: Don't use non-portable '==' with test command. (Ilya Maximets, 2023-01-06, 4 files, -12/+12)
| | | | | | | | | | | | | | '==' is not defined by POSIX and not supported by some shells. This is causing test failures and potential other issues: ./tests/testsuite: 54: test: X2: unexpected operator ./tests/testsuite: 54: test: X157: unexpected operator ./tests/testsuite: 54: test: X116: unexpected operator Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-December/052157.html Reviewed-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif: Fix tunnel key set for IPv6 tunnels with SLOW_ACTION. (Eelco Chaudron, 2023-01-06, 2 files, -1/+49)

    The dpif_execute_helper_cb() function is supposed to add the
    OVS_ACTION_ATTR_SET(OVS_KEY_ATTR_TUNNEL()) action to the list of actions
    when passing it down to the kernel.

    This function was only checking whether the IPv4 destination address
    was set, rather than checking both the IPv4 and the IPv6 destination
    address. This patch fixes this and includes a datapath testcase.

    Fixes: 076caa2fb077 ("ofproto: Meter translation.")
    Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
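The difference between the old and the fixed check can be sketched as follows. This is a hypothetical Python illustration, not the dpif_execute_helper_cb() code: both function names are made up, and tunnel destinations are passed as address strings for readability.

```python
import ipaddress


def tunnel_dst_is_set_ipv4_only(ip_dst: str, ipv6_dst: str) -> bool:
    """Old behaviour as described: only the IPv4 destination is checked."""
    return int(ipaddress.IPv4Address(ip_dst)) != 0


def tunnel_dst_is_set(ip_dst: str, ipv6_dst: str) -> bool:
    """Fixed behaviour: either an IPv4 or an IPv6 destination counts."""
    return (int(ipaddress.IPv4Address(ip_dst)) != 0
            or int(ipaddress.IPv6Address(ipv6_dst)) != 0)
```

For an IPv6 tunnel the IPv4 destination is all zeroes, so the IPv4-only check concludes that no tunnel is configured and the tunnel key set action is skipped.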
* utilities: Add USDT script to monitor dpif netlink execute message queuing. (Eelco Chaudron, 2023-01-06, 3 files, -0/+666)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the dpif_nl_exec_monitor.py script that will used the existing dpif_netlink_operate__:op_flow_execute USDT probe to show all DPIF_OP_EXECUTE operations being queued for transmission over the netlink interface. Here is an example, truncated output: Display DPIF_OP_EXECUTE operations being queued for transmission... TIME CPU COMM PID NL_SIZE 3124.516679897 1 ovs-vswitchd 8219 180 nlmsghdr : len = 0, type = 36, flags = 1, seq = 0, pid = 0 genlmsghdr: cmd = 3, version = 1, reserver = 0 ovs_header: dp_ifindex = 21 > Decode OVS_PACKET_ATTR_* TLVs: nla_len 46, nla_type OVS_PACKET_ATTR_PACKET[1], data: 00 00 00... nla_len 20, nla_type OVS_PACKET_ATTR_KEY[2], data: 08 00 02 00... > Decode OVS_KEY_ATTR_* TLVs: nla_len 8, nla_type OVS_KEY_ATTR_PRIORITY[2], data: 00 00... nla_len 8, nla_type OVS_KEY_ATTR_SKB_MARK[15], data: 00 00... nla_len 88, nla_type OVS_PACKET_ATTR_ACTIONS[3], data: 4c 00 03... > Decode OVS_ACTION_ATTR_* TLVs: nla_len 76, nla_type OVS_ACTION_ATTR_SET[3], data: 48 00... > Decode OVS_TUNNEL_KEY_ATTR_* TLVs: nla_len 12, nla_type OVS_TUNNEL_KEY_ATTR_ID[0], data:... nla_len 20, nla_type OVS_TUNNEL_KEY_ATTR_IPV6_DST[13], ... nla_len 5, nla_type OVS_TUNNEL_KEY_ATTR_TTL[4], data: 40 nla_len 4, nla_type OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT[5]... nla_len 4, nla_type OVS_TUNNEL_KEY_ATTR_CSUM[6], data: nla_len 6, nla_type OVS_TUNNEL_KEY_ATTR_TP_DST[10],... nla_len 12, nla_type OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS[8],... 
nla_len 8, nla_type OVS_ACTION_ATTR_OUTPUT[1], data: 02 00 00 00 - Dumping OVS_PACKET_ATR_PACKET data: ###[ Ethernet ]### dst = 00:00:00:00:ec:01 src = 04:f4:bc:28:57:00 type = IPv4 ###[ IP ]### version = 4 ihl = 5 tos = 0x0 len = 50 id = 0 flags = frag = 0 ttl = 127 proto = icmp chksum = 0x2767 src = 10.0.0.1 dst = 10.0.0.100 \options \ ###[ ICMP ]### type = echo-request code = 0 chksum = 0xf7f3 id = 0x0 seq = 0xc Acked-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* rhel: Enable AF_XDP by default in Fedora builds. (Ilya Maximets, 2023-01-03, 1 file, -2/+2)

    All supported versions of Fedora do package libxdp and libbpf, so it
    makes sense to enable AF_XDP support.

    Control files for debian packaging are much less flexible, so it's hard
    to enable AF_XDP builds while not breaking builds for versions of Ubuntu
    and Debian that do not package libbpf or libxdp.

    Acked-by: Eelco Chaudron <echaudro@redhat.com>
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* acinclude.m4: Build with AF_XDP support by default if possible. (Ilya Maximets, 2023-01-03, 5 files, -36/+72)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With this change we will try to detect all the netdev-afxdp dependencies and enable AF_XDP support by default if they are present at the build time. Configuration script behaves in a following way: - ./configure --enable-afxdp Will check for AF_XDP dependencies and fail if they are not available. - ./configure --disable-afxdp Disables checking for AF_XDP. Build will not support AF_XDP even if all dependencies are installed. - Just ./configure or ./configure --enable-afxdp=auto Will check for AF_XDP dependencies. Will print a warning if they are not available, but will continue without AF_XDP support. If dependencies are available in a system, this option is equal to --enable-afxdp. '--disable-afxdp' added to the debian and fedora package builds to keep predictable behavior. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
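The three configure behaviours listed in this commit amount to a small decision table, sketched below. This is a Python model of the described outcome, not the acinclude.m4 macro: `resolve_afxdp`, its string arguments and the returned labels are hypothetical.

```python
def resolve_afxdp(option: str, deps_available: bool) -> tuple:
    """Model the configure outcome for --enable-afxdp / --disable-afxdp.

    option is "yes" (--enable-afxdp), "no" (--disable-afxdp) or
    "auto" (plain ./configure or --enable-afxdp=auto).
    Returns (afxdp_enabled, outcome_label).
    """
    if option == "no":
        # Dependencies are not even checked.
        return (False, "disabled")
    if deps_available:
        # "auto" behaves exactly like "yes" when everything is present.
        return (True, "enabled")
    if option == "yes":
        raise RuntimeError("AF_XDP requested but dependencies are missing")
    return (False, "warning: dependencies missing, building without AF_XDP")
```

Only the explicit "yes" case turns missing dependencies into a hard failure; "auto" degrades to a warning, which is why the package builds pass '--disable-afxdp' to stay predictable.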
* Documentation/afxdp: Use packaged libbpf/libxdp for the build.  Ilya Maximets  2023-01-03  1  -31/+8
| | | | | | | | | | Necessary bits were removed from the kernel's libbpf in the 6.0 release, so the instructions on how to build libbpf from kernel sources are now incorrect. Suggest using libbpf and libxdp packaged by distributions instead. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* github: Test AF_XDP build using libbpf instead of kernel sources.  Ilya Maximets  2023-01-03  2  -84/+3
| | | | | | | | | | | | | | | AF_XDP bits were removed from the kernel's libbpf in 6.0. libbpf and libxdp are now the primary way to build AF_XDP applications. Most modern distributions are already packaging some version of libbpf, so it's better to test building with it instead of building an old unsupported kernel tree. Ubuntu started packaging libxdp only in 22.10, so not using it for now. Kernel build infrastructure in CI scripts is not needed anymore. Removed. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-afxdp: Hide too large memset from sparse.  Ilya Maximets  2023-01-03  2  -4/+4
| | | | | | | | | | | Sparse complains about 64M umem initialization. Hide it from the checker instead of disabling a warning globally. SPARSE_FLAGS are kept in the CI script even though they are empty at the moment. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
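One common way to hide a statement from sparse is to guard it with the `__CHECKER__` macro, which sparse defines while analysing a file. The sketch below illustrates that idea; `umem_clear()` is a hypothetical helper and not necessarily what the commit itself does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: zeroes a (potentially very large) umem buffer.
 * Sparse defines __CHECKER__, so the memset is invisible to it, while a
 * regular compiler still sees and executes the call. */
void
umem_clear(void *buf, size_t size)
{
#ifndef __CHECKER__
    memset(buf, 0, size);
#else
    (void) buf;
    (void) size;
#endif
}
```

Unlike turning the warning off through global flags, the guard affects only this one call site.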
* netdev-afxdp: Allow building with libxdp and newer libbpf.  Ilya Maximets  2023-01-03  8  -27/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AF_XDP functions were deprecated in libbpf 0.7 and moved to libxdp. Functions bpf_get/set_link_xdp_id() were deprecated in libbpf 0.8 and replaced with bpf_xdp_query_id() and bpf_xdp_attach/detach(). Updating configuration and source code to accommodate the above changes and allow building OVS with AF_XDP support on newer systems: - Checking the version of libbpf by detecting availability of bpf_xdp_detach. - Checking availability of the libxdp in a system by looking for a library providing libxdp_strerror(), if libbpf is newer than 0.6. And checking for xsk.h header provided by libxdp-dev[el]. - Use xsk.h from libbpf if it is older than 0.7 and not linking with libxdp in this case as there are known incompatible versions of libxdp in distributions. - Check for the NEED_WAKEUP feature replaced with direct checking in the source code if XDP_USE_NEED_WAKEUP is defined. - Checking availability of bpf_xdp_query_id and bpf_xdp_detach and using them instead of deprecated APIs. Fall back to old functions if not found. - Dropped LIBBPF_LDADD variable as it makes library and function detection much harder without providing any actual benefits. AC_SEARCH_LIBS is used instead and it allows use of AC_CHECK_FUNCS. - Header includes moved around to files where they are actually used. - Removed libelf dependency as it is not really used. With these changes it should be possible to build OVS with either: - libbpf built from the kernel sources (5.19 or older). - libbpf < 0.7 provided in distributions. - libxdp and libbpf >= 0.7 provided in newer distributions. While it is technically possible to build with libbpf 0.7+ without libxdp at the moment we're not allowing that for a few reasons. First, required functions in libbpf are deprecated and can be removed in future releases. Second, support for all these combinations makes the detection code fairly complex. 
AFAIK, most of the distributions packaging libbpf 0.7+ do package libxdp as well. libxdp is added as a build dependency for the Fedora build since all supported versions of Fedora are packaging this library. Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
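The NEED_WAKEUP item above replaces a configure-time check with a test in the source for whether `XDP_USE_NEED_WAKEUP` is defined. A minimal sketch of that pattern follows, assuming the macro comes from `<linux/if_xdp.h>` when that header is available; `xsk_bind_flags()` is a hypothetical helper, not a function from OVS or libxdp.

```c
#include <stdbool.h>

/* Pull in the kernel UAPI header only if the toolchain can find it, so
 * the sketch also compiles on systems without AF_XDP headers. */
#if defined(__has_include)
#if __has_include(<linux/if_xdp.h>)
#include <linux/if_xdp.h>
#endif
#endif

/* Hypothetical helper: returns the bind flags to request for a socket.
 * The need-wakeup flag is used only when the headers define it. */
unsigned int
xsk_bind_flags(bool want_need_wakeup)
{
    unsigned int flags = 0;

#ifdef XDP_USE_NEED_WAKEUP
    if (want_need_wakeup) {
        flags |= XDP_USE_NEED_WAKEUP;
    }
#else
    (void) want_need_wakeup;   /* Feature unknown to these headers. */
#endif
    return flags;
}
```

Testing the macro directly keeps the feature decision next to the code that uses it and removes one check from the configure script.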
* netdev-afxdp: Disable -Wfree-nonheap-object on receive.  Ilya Maximets  2023-01-03  1  -0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GCC 11+ generates a warning: In file included from lib/netdev-linux-private.h:30, from lib/netdev-afxdp.c:19: In function 'dp_packet_delete', inlined from 'dp_packet_delete' at lib/dp-packet.h:246:1, inlined from 'dp_packet_batch_add__' at lib/dp-packet.h:775:9, inlined from 'dp_packet_batch_add' at lib/dp-packet.h:783:5, inlined from 'netdev_afxdp_rxq_recv' at lib/netdev-afxdp.c:898:9: lib/dp-packet.h:260:9: warning: 'free' called on pointer '*umem.xpool.array' with nonzero offset [8, 2558044588346441168] [-Wfree-nonheap-object] 260 | free(b); | ^~~~~~~ But it is a false positive since the code path is not possible. In this call chain the packet will always have source DPBUF_AFXDP and the free() will never be called. GCC doesn't see that, because initialization function dp_packet_use_afxdp() is part of a different translation unit. Disabling a warning in this particular place to avoid build failures. Older versions of clang do not have the -Wfree-nonheap-object, so we need to additionally guard the pragmas. Clang is using GCC pragmas and complains about unknown ones. Reported-at: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108187 Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
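Guarded diagnostic pragmas of the kind described above might look like the following sketch. It is not the code from the commit: `struct buf`, `buf_release()` and the `DIAG_PUSHED` macro are hypothetical, and the compiler-version guard is an assumption based on the description (GCC 11+ only, clang excluded).

```c
#include <stdbool.h>
#include <stdlib.h>

enum buf_source { BUF_MALLOC, BUF_POOL };

struct buf {
    enum buf_source source;   /* Where the memory came from. */
};

/* Silence the warning only on GCC 11+, and never on clang, which also
 * defines __GNUC__ but may not know this warning option. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 11
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfree-nonheap-object"
#define DIAG_PUSHED 1
#endif

/* Frees 'b' only if it was allocated with malloc().  Pool-backed buffers
 * are left alone.  Returns true if free() was called. */
bool
buf_release(struct buf *b)
{
    if (b->source == BUF_MALLOC) {
        free(b);
        return true;
    }
    return false;
}

#ifdef DIAG_PUSHED
#pragma GCC diagnostic pop
#undef DIAG_PUSHED
#endif
```

The push/pop pair limits the suppression to this one function, so the warning stays active for the rest of the translation unit.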
* ci: Fix overriding OPTS provided from the yml.  Ilya Maximets  2023-01-03  1  -1/+1
| | | | | | | | | | | For GCC builds we're overriding --disable-ssl or --enable-shared options set up in the GHA yml file. Fix that by adding to EXTRA_OPTS instead. Fixes: 2581b0ad1159 ("travis: Combine kernel builds.") Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpif-netdev: Calculate per numa variance.  Cheng Li  2022-12-21  2  -48/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, pmd_rebalance_dry_run() calculates the overall variance of all pmds regardless of their NUMA location. The overall result may hide an imbalance within an individual NUMA node. Consider the following case. Numa0 is free because VMs on numa0 are not sending pkts, while numa1 is busy. Within numa1, pmd workloads are not balanced. Obviously, moving 500 kpps of workload from pmd 126 to pmd 62 will make numa1 much more balanced. For numa1 the variance improvement will be almost 100%, because after the rebalance each pmd in numa1 holds the same workload (variance ~= 0). But the overall variance improvement is only about 20%, which may not trigger auto_lb.
```
numa_id  core_id  kpps
0        30       0
0        31       0
0        94       0
0        95       0
1        126      1500
1        127      1000
1        63       1000
1        62       500
```
As auto_lb doesn't balance workload across NUMA nodes, it makes more sense to calculate the variance improvement per NUMA node. Signed-off-by: Cheng Li <lic121@chinatelecom.cn> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Co-authored-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
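The numbers in the example above can be checked with a short calculation. The sketch below assumes population variance (sum of squared deviations divided by the number of pmds) and uses hypothetical helper names; it is not the code from dpif-netdev.

```c
#include <assert.h>
#include <stddef.h>

/* Population variance of per-pmd loads, in kpps^2. */
double
load_variance(const double *kpps, size_t n)
{
    double sum = 0.0, sq = 0.0, mean;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        sum += kpps[i];
    }
    mean = sum / n;
    for (size_t i = 0; i < n; i++) {
        double d = kpps[i] - mean;
        sq += d * d;
    }
    return sq / n;
}

/* Relative variance improvement in percent; 0 if nothing to improve. */
double
variance_improvement_pct(double before, double after)
{
    if (before == 0.0) {
        return 0.0;
    }
    return (before - after) * 100.0 / before;
}
```

With the table's data, numa1 alone goes from a variance of 125000 (loads 1500, 1000, 1000, 500) to 0 after moving 500 kpps from pmd 126 to pmd 62, a 100% improvement. Over all eight pmds the variance only drops from 312500 to 250000, which is 20%.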