author | Kevin Traynor <ktraynor@redhat.com> | 2023-01-11 09:35:01 +0000 |
---|---|---|
committer | Ilya Maximets <i.maximets@ovn.org> | 2023-01-12 18:56:05 +0100 |
commit | de3bbdc479a9a78135e1922e4e6011732515e7ef (patch) | |
tree | 0e637afa610a029ca0b68567361d56108934851b | |
parent | f4c884135139f0d9e309bcd58244191145c5abba (diff) | |
download | openvswitch-de3bbdc479a9a78135e1922e4e6011732515e7ef.tar.gz |
dpif-netdev: Add PMD load based sleeping.
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.
Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).
The sleep time is increased on each iteration in which the low load
condition persists, up to the max sleep time, which is set by the
user, e.g.:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500
The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged.
Also add new stats to pmd-perf-show to give visibility of the sleep
behaviour, e.g.:
...
- sleep iterations: 153994 ( 76.8 % of iterations)
Sleep time (us): 9159399 ( 59 us/iteration avg.)
...
Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
-rw-r--r-- | Documentation/topics/dpdk/pmd.rst | 54 |
-rw-r--r-- | NEWS | 3 |
-rw-r--r-- | lib/dpif-netdev-perf.c | 24 |
-rw-r--r-- | lib/dpif-netdev-perf.h | 5 |
-rw-r--r-- | lib/dpif-netdev.c | 65 |
-rw-r--r-- | tests/pmd.at | 46 |
-rw-r--r-- | vswitchd/vswitch.xml | 26 |
7 files changed, 213 insertions, 10 deletions
diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst index 9006fd40f..604ac3f6b 100644 --- a/Documentation/topics/dpdk/pmd.rst +++ b/Documentation/topics/dpdk/pmd.rst @@ -324,5 +324,59 @@ A user can use this option to set a minimum frequency of Rx queue to PMD reassignment due to PMD Auto Load Balance. For example, this could be set (in min) such that a reassignment is triggered at most every few hours. +PMD load based sleeping (Experimental) +-------------------------------------- + +PMD threads constantly poll Rx queues which are assigned to them. In order to +reduce the CPU cycles they use, they can sleep for small periods of time +when there is no load or very-low load on all the Rx queues they poll. + +This can be enabled by setting the max requested sleep time (in microseconds) +for a PMD thread:: + + $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500 + +Non-zero values will be rounded up to the nearest 10 microseconds to avoid +requesting very small sleep times. + +With a non-zero max value a PMD may request to sleep by an incrementing amount +of time up to the maximum time. If at any point the threshold of at least half +a batch of packets (i.e. 16) is received from an Rx queue that the PMD is +polling is met, the requested sleep time will be reset to 0. At that point no +sleeps will occur until the no/low load conditions return. + +Sleeping in a PMD thread will mean there is a period of time when the PMD +thread will not process packets. Sleep times requested are not guaranteed +and can differ significantly depending on system configuration. The actual +time not processing packets will be determined by the sleep and processor +wake-up times and should be tested with each system configuration. 
+ +Sleep time statistics for 10 secs can be seen with:: + + $ ovs-appctl dpif-netdev/pmd-stats-clear \ + && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show + +Example output, showing that during the last 10 seconds, 76.8% of iterations +had a sleep of some length. The total amount of sleep time was 9.15 seconds and +the average sleep time per iteration was 46 microseconds:: + + - sleep iterations: 153994 ( 76.8 % of iterations) + Sleep time (us): 9159399 ( 59 us/iteration avg.) + +Any potential power saving from PMD load based sleeping is dependent on the +system configuration (e.g. enabling processor C-states) and workloads. + +.. note:: + + If there is a sudden spike of packets while the PMD thread is sleeping and + the processor is in a low-power state it may result in some lost packets or + extra latency before the PMD thread returns to processing packets at full + rate. + +.. note:: + + By default Linux kernel groups timer expirations and this can add an + overhead of up to 50 microseconds to a requested timer expiration. + .. _ovs-vswitchd(8): http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html @@ -30,6 +30,9 @@ Post-v3.0.0 - Userspace datapath: * Add '-secs' argument to appctl 'dpif-netdev/pmd-rxq-show' to show the pmd usage of an Rx queue over a configurable time period. + * Add new experimental PMD load based sleeping feature. PMD threads can + request to sleep up to a user configured 'pmd-maxsleep' value under + low load conditions. v3.0.0 - 15 Aug 2022 diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index a2a7d8f0b..1a7bab04c 100644 --- a/lib/dpif-netdev-perf.c +++ b/lib/dpif-netdev-perf.c @@ -230,18 +230,26 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s, uint64_t tot_iter = histogram_samples(&s->pkts); uint64_t idle_iter = s->pkts.bin[0]; uint64_t busy_iter = tot_iter >= idle_iter ? 
tot_iter - idle_iter : 0; + uint64_t sleep_iter = stats[PMD_SLEEP_ITER]; + uint64_t tot_sleep_cycles = stats[PMD_CYCLES_SLEEP]; ds_put_format(str, " Iterations: %12"PRIu64" (%.2f us/it)\n" " - Used TSC cycles: %12"PRIu64" (%5.1f %% of total cycles)\n" " - idle iterations: %12"PRIu64" (%5.1f %% of used cycles)\n" - " - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n", - tot_iter, tot_cycles * us_per_cycle / tot_iter, + " - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n" + " - sleep iterations: %12"PRIu64" (%5.1f %% of iterations)\n" + " Sleep time (us): %12.0f (%3.0f us/iteration avg.)\n", + tot_iter, + (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter, tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz, idle_iter, 100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles, busy_iter, - 100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles); + 100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles, + sleep_iter, tot_iter ? 100.0 * sleep_iter / tot_iter : 0, + tot_sleep_cycles * us_per_cycle, + sleep_iter ? 
(tot_sleep_cycles * us_per_cycle) / sleep_iter : 0); if (rx_packets > 0) { ds_put_format(str, " Rx packets: %12"PRIu64" (%.0f Kpps, %.0f cycles/pkt)\n" @@ -518,14 +526,15 @@ OVS_REQUIRES(s->stats_mutex) void pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets, - int tx_packets, bool full_metrics) + int tx_packets, uint64_t sleep_cycles, + bool full_metrics) { uint64_t now_tsc = cycles_counter_update(s); struct iter_stats *cum_ms; uint64_t cycles, cycles_per_pkt = 0; char *reason = NULL; - cycles = now_tsc - s->start_tsc; + cycles = now_tsc - s->start_tsc - sleep_cycles; s->current.timestamp = s->iteration_cnt; s->current.cycles = cycles; s->current.pkts = rx_packets; @@ -539,6 +548,11 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets, histogram_add_sample(&s->cycles, cycles); histogram_add_sample(&s->pkts, rx_packets); + if (sleep_cycles) { + pmd_perf_update_counter(s, PMD_SLEEP_ITER, 1); + pmd_perf_update_counter(s, PMD_CYCLES_SLEEP, sleep_cycles); + } + if (!full_metrics) { return; } diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index 9673dddd8..84beced15 100644 --- a/lib/dpif-netdev-perf.h +++ b/lib/dpif-netdev-perf.h @@ -80,6 +80,8 @@ enum pmd_stat_type { PMD_CYCLES_ITER_IDLE, /* Cycles spent in idle iterations. */ PMD_CYCLES_ITER_BUSY, /* Cycles spent in busy iterations. */ PMD_CYCLES_UPCALL, /* Cycles spent processing upcalls. */ + PMD_SLEEP_ITER, /* Iterations where a sleep has taken place. */ + PMD_CYCLES_SLEEP, /* Total cycles slept to save power. */ PMD_N_STATS }; @@ -408,7 +410,8 @@ void pmd_perf_start_iteration(struct pmd_perf_stats *s); void pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets, - int tx_packets, bool full_metrics); + int tx_packets, uint64_t sleep_cycles, + bool full_metrics); /* Formatting the output of commands. 
*/ diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 7127068fe..a47d54c6f 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -171,6 +171,11 @@ static struct odp_support dp_netdev_support = { /* Time in microseconds to try RCU quiescing. */ #define PMD_RCU_QUIESCE_INTERVAL 10000LL +/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */ +#define PMD_SLEEP_THRESH (NETDEV_MAX_BURST / 2) +/* Time in uS to increment a pmd thread sleep time. */ +#define PMD_SLEEP_INC_US 10 + struct dpcls { struct cmap_node node; /* Within dp_netdev_pmd_thread.classifiers */ odp_port_t in_port; @@ -279,6 +284,8 @@ struct dp_netdev { atomic_uint32_t emc_insert_min; /* Enable collection of PMD performance metrics. */ atomic_bool pmd_perf_metrics; + /* Max load based sleep request. */ + atomic_uint64_t pmd_max_sleep; /* Enable the SMC cache from ovsdb config */ atomic_bool smc_enable_db; @@ -4821,8 +4828,10 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config) uint64_t rebalance_intvl; uint8_t cur_rebalance_load; uint32_t rebalance_load, rebalance_improve; + uint64_t pmd_max_sleep, cur_pmd_max_sleep; bool log_autolb = false; enum sched_assignment_type pmd_rxq_assign_type; + static bool first_set_config = true; tx_flush_interval = smap_get_int(other_config, "tx-flush-interval", DEFAULT_TX_FLUSH_INTERVAL); @@ -4969,6 +4978,19 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config) bool autolb_state = smap_get_bool(other_config, "pmd-auto-lb", false); set_pmd_auto_lb(dp, autolb_state, log_autolb); + + pmd_max_sleep = smap_get_ullong(other_config, "pmd-maxsleep", 0); + pmd_max_sleep = ROUND_UP(pmd_max_sleep, 10); + pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep); + atomic_read_relaxed(&dp->pmd_max_sleep, &cur_pmd_max_sleep); + if (first_set_config || pmd_max_sleep != cur_pmd_max_sleep) { + atomic_store_relaxed(&dp->pmd_max_sleep, pmd_max_sleep); + VLOG_INFO("PMD max sleep request is %"PRIu64" usecs.", 
pmd_max_sleep); + VLOG_INFO("PMD load based sleeps are %s.", + pmd_max_sleep ? "enabled" : "disabled" ); + } + + first_set_config = false; return 0; } @@ -6929,6 +6951,7 @@ pmd_thread_main(void *f_) int poll_cnt; int i; int process_packets = 0; + uint64_t sleep_time = 0; poll_list = NULL; @@ -6989,10 +7012,13 @@ reload: ovs_mutex_lock(&pmd->perf_stats.stats_mutex); for (;;) { uint64_t rx_packets = 0, tx_packets = 0; + uint64_t time_slept = 0; + uint64_t max_sleep; pmd_perf_start_iteration(s); atomic_read_relaxed(&pmd->dp->smc_enable_db, &pmd->ctx.smc_enable_db); + atomic_read_relaxed(&pmd->dp->pmd_max_sleep, &max_sleep); for (i = 0; i < poll_cnt; i++) { @@ -7011,6 +7037,9 @@ reload: dp_netdev_process_rxq_port(pmd, poll_list[i].rxq, poll_list[i].port_no); rx_packets += process_packets; + if (process_packets >= PMD_SLEEP_THRESH) { + sleep_time = 0; + } } if (!rx_packets) { @@ -7018,7 +7047,30 @@ reload: * Check if we need to send something. * There was no time updates on current iteration. */ pmd_thread_ctx_time_update(pmd); - tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false); + tx_packets = dp_netdev_pmd_flush_output_packets(pmd, + max_sleep && sleep_time + ? true : false); + } + + if (max_sleep) { + /* Check if a sleep should happen on this iteration. */ + if (sleep_time) { + struct cycle_timer sleep_timer; + + cycle_timer_start(&pmd->perf_stats, &sleep_timer); + xnanosleep_no_quiesce(sleep_time * 1000); + time_slept = cycle_timer_stop(&pmd->perf_stats, &sleep_timer); + pmd_thread_ctx_time_update(pmd); + } + if (sleep_time < max_sleep) { + /* Increase sleep time for next iteration. */ + sleep_time += PMD_SLEEP_INC_US; + } else { + sleep_time = max_sleep; + } + } else { + /* Reset sleep time as max sleep policy may have been changed. */ + sleep_time = 0; } /* Do RCU synchronization at fixed interval. 
This ensures that @@ -7058,7 +7110,7 @@ reload: break; } - pmd_perf_end_iteration(s, rx_packets, tx_packets, + pmd_perf_end_iteration(s, rx_packets, tx_packets, time_slept, pmd_perf_metrics_enabled(pmd)); } ovs_mutex_unlock(&pmd->perf_stats.stats_mutex); @@ -9909,7 +9961,7 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd, struct polled_queue *poll_list, int poll_cnt) { struct dpcls *cls; - uint64_t tot_idle = 0, tot_proc = 0; + uint64_t tot_idle = 0, tot_proc = 0, tot_sleep = 0; unsigned int pmd_load = 0; if (pmd->ctx.now > pmd->next_cycle_store) { @@ -9926,10 +9978,13 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd, pmd->prev_stats[PMD_CYCLES_ITER_IDLE]; tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] - pmd->prev_stats[PMD_CYCLES_ITER_BUSY]; + tot_sleep = pmd->perf_stats.counters.n[PMD_CYCLES_SLEEP] - + pmd->prev_stats[PMD_CYCLES_SLEEP]; if (pmd_alb->is_enabled && !pmd->isolated) { if (tot_proc) { - pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc)); + pmd_load = ((tot_proc * 100) / + (tot_idle + tot_proc + tot_sleep)); } atomic_read_relaxed(&pmd_alb->rebalance_load_thresh, @@ -9946,6 +10001,8 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd, pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE]; pmd->prev_stats[PMD_CYCLES_ITER_BUSY] = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY]; + pmd->prev_stats[PMD_CYCLES_SLEEP] = + pmd->perf_stats.counters.n[PMD_CYCLES_SLEEP]; /* Get the cycles that were used to process each queue and store. 
*/ for (unsigned i = 0; i < poll_cnt; i++) { diff --git a/tests/pmd.at b/tests/pmd.at index ed90f88c4..e0f58f7a6 100644 --- a/tests/pmd.at +++ b/tests/pmd.at @@ -1254,3 +1254,49 @@ ovs-appctl: ovs-vswitchd: server returned an error OVS_VSWITCHD_STOP AT_CLEANUP + +dnl Check default state +AT_SETUP([PMD - pmd sleep]) +OVS_VSWITCHD_START + +dnl Check default +OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."]) +OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD load based sleeps are disabled."]) + +dnl Check low value max sleep +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="1"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10 usecs."]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) + +dnl Check high value max sleep +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10000"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10000 usecs."]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) + +dnl Check setting max sleep to zero +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="0"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are disabled."]) + +dnl Check above high value max sleep +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10001"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10000 usecs."]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) + +dnl Check rounding +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . 
other_config:pmd-maxsleep="490"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 490 usecs."]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) +dnl Check rounding +get_log_next_line_num +AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="491"]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 500 usecs."]) +OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."]) + +OVS_VSWITCHD_STOP +AT_CLEANUP diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index f9bdb2d92..8c4acfb18 100644 --- a/vswitchd/vswitch.xml +++ b/vswitchd/vswitch.xml @@ -788,6 +788,32 @@ The default value is <code>25%</code>. </p> </column> + <column name="other_config" key="pmd-maxsleep" + type='{"type": "integer", + "minInteger": 0, "maxInteger": 10000}'> + <p> + Specifies the maximum sleep time that will be requested in + microseconds per iteration for a PMD thread which has received zero + or a small amount of packets from the Rx queues it is polling. + </p> + <p> + The actual sleep time requested is based on the load + of the Rx queues that the PMD polls and may be less than + the maximum value. + </p> + <p> + The default value is <code>0 microseconds</code>, which means + that the PMD will not sleep regardless of the load from the + Rx queues that it polls. + </p> + <p> + To avoid requesting very small sleeps (e.g. less than 10 us) the + value will be rounded up to the nearest 10 us. + </p> + <p> + The maximum value is <code>10000 microseconds</code>. + </p> + </column> <column name="other_config" key="userspace-tso-enable" type='{"type": "boolean"}'> <p> |