-rwxr-xr-x | .travis/linux-build.sh | 2
-rw-r--r-- | Documentation/automake.mk | 1
-rw-r--r-- | Documentation/faq/releases.rst | 4
-rw-r--r-- | Documentation/howto/dpdk.rst | 22
-rw-r--r-- | Documentation/intro/install/dpdk.rst | 14
-rw-r--r-- | Documentation/topics/dpdk/index.rst | 1
-rw-r--r-- | Documentation/topics/dpdk/memory.rst | 216
-rw-r--r-- | Documentation/topics/dpdk/vhost-user.rst | 6
-rw-r--r-- | NEWS | 3
-rw-r--r-- | lib/dp-packet.h | 13
-rw-r--r-- | lib/dpdk-stub.c | 6
-rw-r--r-- | lib/dpdk.c | 12
-rw-r--r-- | lib/dpdk.h | 1
-rw-r--r-- | lib/dpif-netdev.c | 499
-rw-r--r-- | lib/flow.c | 168
-rw-r--r-- | lib/flow.h | 1
-rw-r--r-- | lib/netdev-dpdk.c | 1029
-rw-r--r-- | lib/netdev.h | 6
-rw-r--r-- | vswitchd/vswitch.xml | 17 |
19 files changed, 1873 insertions, 148 deletions
diff --git a/.travis/linux-build.sh b/.travis/linux-build.sh index 2e611f809..4b9fc4ac1 100755 --- a/.travis/linux-build.sh +++ b/.travis/linux-build.sh @@ -83,7 +83,7 @@ fi if [ "$DPDK" ]; then if [ -z "$DPDK_VER" ]; then - DPDK_VER="17.11.2" + DPDK_VER="17.11.3" fi install_dpdk $DPDK_VER if [ "$CC" = "clang" ]; then diff --git a/Documentation/automake.mk b/Documentation/automake.mk index bc728dff3..244479490 100644 --- a/Documentation/automake.mk +++ b/Documentation/automake.mk @@ -36,6 +36,7 @@ DOC_SOURCE = \ Documentation/topics/dpdk/index.rst \ Documentation/topics/dpdk/bridge.rst \ Documentation/topics/dpdk/jumbo-frames.rst \ + Documentation/topics/dpdk/memory.rst \ Documentation/topics/dpdk/pdump.rst \ Documentation/topics/dpdk/phy.rst \ Documentation/topics/dpdk/pmd.rst \ diff --git a/Documentation/faq/releases.rst b/Documentation/faq/releases.rst index fab93b188..7021cec59 100644 --- a/Documentation/faq/releases.rst +++ b/Documentation/faq/releases.rst @@ -163,9 +163,9 @@ Q: What DPDK version does each Open vSwitch release work with? 2.4.x 2.0 2.5.x 2.2 2.6.x 16.07.2 - 2.7.x 16.11.6 + 2.7.x 16.11.7 2.8.x 17.05.2 - 2.9.x 17.11.2 + 2.9.x 17.11.3 ============ ======= Q: Are all the DPDK releases that OVS versions work with maintained? diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst index 380181db0..82596f557 100644 --- a/Documentation/howto/dpdk.rst +++ b/Documentation/howto/dpdk.rst @@ -358,6 +358,28 @@ devices to bridge ``br0``. Once complete, follow the below steps: $ cat /proc/interrupts | grep virtio +.. _dpdk-flow-hardware-offload: + +Flow Hardware Offload (Experimental) +------------------------------------ + +Flow hardware offload is disabled by default and can be enabled by:: + + $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true + +So far only partial flow offload is implemented. Moreover, it only works +with PMD drivers that support the rte_flow "MARK + RSS" actions. + +The validated NICs are: + +- Mellanox (ConnectX-4, ConnectX-4 Lx, ConnectX-5) +- Napatech (NT200B01) + +Supported protocols for hardware offload are: + +- L2: Ethernet, VLAN +- L3: IPv4, IPv6 +- L4: TCP, UDP, SCTP, ICMP + Further Reading --------------- diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst index 085e47990..2468c641b 100644 --- a/Documentation/intro/install/dpdk.rst +++ b/Documentation/intro/install/dpdk.rst @@ -40,7 +40,7 @@ Build requirements In addition to the requirements described in :doc:`general`, building Open vSwitch with DPDK will require the following: -- DPDK 17.11.2 +- DPDK 17.11.3 - A `DPDK supported NIC`_ @@ -69,9 +69,9 @@ Install DPDK #. Download the `DPDK sources`_, extract the file and set ``DPDK_DIR``:: $ cd /usr/src/ - $ wget http://fast.dpdk.org/rel/dpdk-17.11.2.tar.xz - $ tar xf dpdk-17.11.2.tar.xz - $ export DPDK_DIR=/usr/src/dpdk-stable-17.11.2 + $ wget http://fast.dpdk.org/rel/dpdk-17.11.3.tar.xz + $ tar xf dpdk-17.11.3.tar.xz + $ export DPDK_DIR=/usr/src/dpdk-stable-17.11.3 $ cd $DPDK_DIR #. (Optional) Configure DPDK as a shared library @@ -170,6 +170,12 @@ Mount the hugepages, if not already mounted by default:: $ mount -t hugetlbfs none /dev/hugepages +.. note:: + + The amount of hugepage memory required can be affected by various + aspects of the datapath and device configuration. Refer to + :doc:`/topics/dpdk/memory` for more details. + ..
_dpdk-vfio: Setup DPDK devices using VFIO diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst index 181f61abb..cf24a7b6d 100644 --- a/Documentation/topics/dpdk/index.rst +++ b/Documentation/topics/dpdk/index.rst @@ -40,3 +40,4 @@ The DPDK Datapath /topics/dpdk/qos /topics/dpdk/pdump /topics/dpdk/jumbo-frames + /topics/dpdk/memory diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst new file mode 100644 index 000000000..e5fb166d5 --- /dev/null +++ b/Documentation/topics/dpdk/memory.rst @@ -0,0 +1,216 @@ +.. + Copyright (c) 2018 Intel Corporation + + Licensed under the Apache License, Version 2.0 (the "License"); you may + not use this file except in compliance with the License. You may obtain + a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + License for the specific language governing permissions and limitations + under the License. + + Convention for heading levels in Open vSwitch documentation: + + ======= Heading 0 (reserved for the title in a document) + ------- Heading 1 + ~~~~~~~ Heading 2 + +++++++ Heading 3 + ''''''' Heading 4 + + Avoid deeper levels because they do not render well. + +========================= +DPDK Device Memory Models +========================= + +DPDK device memory can be allocated in one of two ways in OVS DPDK, +**shared memory** or **per port memory**. The specifics of both are +detailed below. + +Shared Memory +------------- + +By default OVS DPDK uses a shared memory model. This means that multiple +ports can share the same mempool. For example, when a port is added it will +have a given MTU and socket ID associated with it. If a mempool has been +created previously for an existing port that has the same MTU and socket ID, +that mempool is used for both ports. If there is no existing mempool +supporting these parameters then a new mempool is created. + +Per Port Memory +--------------- + +In the per port memory model, mempools are created per device and are not +shared. The benefit of this is a more transparent memory model where mempools +will not be exhausted by other DPDK devices. However, this comes at a potential +increase in cost for memory dimensioning for a given deployment. Users should +be aware of the memory requirements for their deployment before using this +model and allocate the required hugepage memory. + +Per port mempool support may be enabled via a global config value, +``per-port-memory``. Setting this to true enables the per port memory +model for all DPDK devices in OVS:: + + $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true + +.. important:: + + This value should be set before setting dpdk-init=true. If set after + dpdk-init=true then the daemon must be restarted to use per-port-memory. + +Calculating Memory Requirements +------------------------------- + +The amount of memory required for a given mempool can be calculated by the +**number of mbufs in the mempool \* mbuf size**. + +Users should be aware of the following: + +* The **number of mbufs** per mempool will differ between memory models. + +* The **size of each mbuf** will be affected by the requested **MTU** size. + +..
important:: + + An mbuf size in bytes is always larger than the requested MTU size due to + alignment and rounding needed in OVS DPDK. + +Below are a number of examples of memory requirement calculations for both +shared and per port memory models. + +Shared Memory Calculations +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In the shared memory model the number of mbufs requested is directly +affected by the requested MTU size as described in the table below. + ++--------------------+-------------+ +| MTU Size | Num MBUFS | ++====================+=============+ +| 1500 or greater | 262144 | ++--------------------+-------------+ +| Less than 1500 | 16384 | ++--------------------+-------------+ + +.. important:: + + If a deployment does not have enough memory to provide 262144 mbufs then + the requested amount is halved until it reaches 16384. + +Example 1 ++++++++++ +:: + + MTU = 1500 Bytes + Number of mbufs = 262144 + Mbuf size = 3008 Bytes + Memory required = 262144 * 3008 = 788 MB + +Example 2 ++++++++++ +:: + + MTU = 1800 Bytes + Number of mbufs = 262144 + Mbuf size = 3008 Bytes + Memory required = 262144 * 3008 = 788 MB + +.. note:: + + Assuming the same socket is in use for example 1 and 2, the same mempool + would be shared. + +Example 3 ++++++++++ +:: + + MTU = 6000 Bytes + Number of mbufs = 262144 + Mbuf size = 8128 Bytes + Memory required = 262144 * 8128 = 2130 MB + +Example 4 ++++++++++ +:: + + MTU = 9000 Bytes + Number of mbufs = 262144 + Mbuf size = 10176 Bytes + Memory required = 262144 * 10176 = 2667 MB + +Per Port Memory Calculations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The number of mbufs requested in the per port model is more complicated and +accounts for multiple dynamic factors in the datapath and device +configuration. + +A rough estimation of the number of mbufs required for a port is: +:: + + packets required to fill the device rxqs + + packets that could be stuck on other ports txqs + + packets on the pmd threads + + additional corner case memory. + +The algorithm in OVS used to calculate this is as follows: +:: + + requested number of rxqs * requested rxq size + + requested number of txqs * requested txq size + + min(RTE_MAX_LCORE, requested number of rxqs) * netdev_max_burst + + MIN_NB_MBUF. + +where: + +* **requested number of rxqs**: Number of requested receive queues for a + device. +* **requested rxq size**: The number of descriptors requested for a rx queue. +* **requested number of txqs**: Number of requested transmit queues for a + device. Calculated as the number of PMDs configured + 1. +* **requested txq size**: the number of descriptors requested for a tx queue. +* **min(RTE_MAX_LCORE, requested number of rxqs)**: Compare the maximum + number of lcores supported by DPDK to the number of requested receive + queues for the device and use the lesser of the two values. +* **NETDEV_MAX_BURST**: Maximum number of packets in a burst, defined as + 32. +* **MIN_NB_MBUF**: Additional memory for corner cases, defined as 16384.
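The estimation above can be sanity-checked with a few lines of C. The sketch below is purely illustrative: ``per_port_mbufs()`` is a hypothetical helper, not an OVS function, and the constants simply mirror the assumed values used in the examples that follow::

    #include <stdint.h>

    #define RTE_MAX_LCORE    128          /* assumed DPDK build-time value */
    #define NETDEV_MAX_BURST 32           /* maximum packets in a burst */
    #define MIN_NB_MBUF      (4096 * 4)   /* additional corner case memory */

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Rough per port mbuf estimate; 'n_txq' is the number of PMDs
     * configured plus one. */
    static uint32_t
    per_port_mbufs(uint32_t n_rxq, uint32_t rxq_size,
                   uint32_t n_txq, uint32_t txq_size)
    {
        return n_rxq * rxq_size                          /* fill device rxqs */
               + n_txq * txq_size               /* stuck on other ports txqs */
               + MIN(RTE_MAX_LCORE, n_rxq) * NETDEV_MAX_BURST  /* on PMDs */
               + MIN_NB_MBUF;                       /* corner case headroom */
    }

For instance, ``per_port_mbufs(1, 2048, 2, 2048)`` evaluates to 2048 + 4096 + 32 + 16384 = 22560, matching Example 1 below.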
+ +For all examples below assume the following values: + +* requested_rxq_size = 2048 +* requested_txq_size = 2048 +* RTE_MAX_LCORE = 128 +* netdev_max_burst = 32 +* MIN_NB_MBUF = 16384 + +Example 1: (1 rxq, 1 PMD, 1500 MTU) ++++++++++++++++++++++++++++++++++++ +:: + + MTU = 1500 + Number of mbufs = (1 * 2048) + (2 * 2048) + (1 * 32) + (16384) = 22560 + Mbuf size = 3008 Bytes + Memory required = 22560 * 3008 = 67 MB + +Example 2: (1 rxq, 2 PMD, 6000 MTU) ++++++++++++++++++++++++++++++++++++ +:: + + MTU = 6000 + Number of mbufs = (1 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 24608 + Mbuf size = 8128 Bytes + Memory required = 24608 * 8128 = 200 MB + +Example 3: (2 rxq, 2 PMD, 9000 MTU) ++++++++++++++++++++++++++++++++++++ +:: + + MTU = 9000 + Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656 + Mbuf size = 10176 Bytes + Memory required = 26656 * 10176 = 271 MB diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst index c5b69fabd..b1e2285dc 100644 --- a/Documentation/topics/dpdk/vhost-user.rst +++ b/Documentation/topics/dpdk/vhost-user.rst @@ -320,9 +320,9 @@ To begin, instantiate a guest as described in :ref:`dpdk-vhost-user` or DPDK sources to VM and build DPDK:: $ cd /root/dpdk/ - $ wget http://fast.dpdk.org/rel/dpdk-17.11.2.tar.xz - $ tar xf dpdk-17.11.2.tar.xz - $ export DPDK_DIR=/root/dpdk/dpdk-stable-17.11.2 + $ wget http://fast.dpdk.org/rel/dpdk-17.11.3.tar.xz + $ tar xf dpdk-17.11.3.tar.xz + $ export DPDK_DIR=/root/dpdk/dpdk-stable-17.11.3 $ export DPDK_TARGET=x86_64-native-linuxapp-gcc $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET $ cd $DPDK_DIR diff --git a/NEWS b/NEWS --- a/NEWS +++ b/NEWS @@ -36,6 +36,8 @@ Post-v2.9.0 See Testing topic for the details. * Add LSC interrupt support for DPDK physical devices. * Allow init to fail and record DPDK status/version in OVS database. + * Add experimental flow hardware offload support. + * Support both shared and per port mempools for DPDK devices. - Userspace datapath: * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD * Detailed PMD performance metrics available with new command @@ -112,7 +114,6 @@ v2.9.0 - 19 Feb 2018 * New appctl command 'dpif-netdev/pmd-rxq-rebalance' to rebalance rxq to pmd assignments. * Add rxq utilization of pmd to appctl 'dpif-netdev/pmd-rxq-show'. - * Add support for vHost dequeue zero copy (experimental) - Userspace datapath: * Output packet batching support. - vswitchd: diff --git a/lib/dp-packet.h b/lib/dp-packet.h index 596cfe691..ba91e5891 100644 --- a/lib/dp-packet.h +++ b/lib/dp-packet.h @@ -691,6 +691,19 @@ reset_dp_packet_checksum_ol_flags(struct dp_packet *p) #define reset_dp_packet_checksum_ol_flags(arg) #endif +static inline bool +dp_packet_has_flow_mark(struct dp_packet *p OVS_UNUSED, + uint32_t *mark OVS_UNUSED) +{ +#ifdef DPDK_NETDEV + if (p->mbuf.ol_flags & PKT_RX_FDIR_ID) { + *mark = p->mbuf.hash.fdir.hi; + return true; + } +#endif + return false; +} + enum { NETDEV_MAX_BURST = 32 }; /* Maximum number packets in a batch.
*/ struct dp_packet_batch { diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c index 1df1c5848..1e0f46101 100644 --- a/lib/dpdk-stub.c +++ b/lib/dpdk-stub.c @@ -56,6 +56,12 @@ dpdk_vhost_iommu_enabled(void) return false; } +bool +dpdk_per_port_memory(void) +{ + return false; +} + void print_dpdk_version(void) { diff --git a/lib/dpdk.c b/lib/dpdk.c index 5c68ce430..0ee3e19c6 100644 --- a/lib/dpdk.c +++ b/lib/dpdk.c @@ -48,6 +48,7 @@ static char *vhost_sock_dir = NULL; /* Location of vhost-user sockets */ static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */ static bool dpdk_initialized = false; /* Indicates successful initialization * of DPDK. */ +static bool per_port_memory = false; /* Status of per port memory support */ static int process_vhost_flags(char *flag, const char *default_val, int size, @@ -384,6 +385,11 @@ dpdk_init__(const struct smap *ovs_other_config) VLOG_INFO("IOMMU support for vhost-user-client %s.", vhost_iommu_enabled ? "enabled" : "disabled"); + per_port_memory = smap_get_bool(ovs_other_config, + "per-port-memory", false); + VLOG_INFO("Per port memory for DPDK devices %s.", + per_port_memory ? "enabled" : "disabled"); + argv = grow_argv(&argv, 0, 1); argc = 1; argv[0] = xstrdup(ovs_get_program_name()); @@ -541,6 +547,12 @@ dpdk_vhost_iommu_enabled(void) return vhost_iommu_enabled; } +bool +dpdk_per_port_memory(void) +{ + return per_port_memory; +} + void dpdk_set_lcore_id(unsigned cpu) { diff --git a/lib/dpdk.h b/lib/dpdk.h index efdaa637c..bbb89d4e6 100644 --- a/lib/dpdk.h +++ b/lib/dpdk.h @@ -39,6 +39,7 @@ void dpdk_init(const struct smap *ovs_other_config); void dpdk_set_lcore_id(unsigned cpu); const char *dpdk_get_vhost_sock_dir(void); bool dpdk_vhost_iommu_enabled(void); +bool dpdk_per_port_memory(void); void print_dpdk_version(void); void dpdk_status(const struct ovsrec_open_vswitch *); #endif /* dpdk.h */ diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 9390fff68..8e0067d56 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -76,6 +76,7 @@ #include "tnl-ports.h" #include "unixctl.h" #include "util.h" +#include "uuid.h" VLOG_DEFINE_THIS_MODULE(dpif_netdev); @@ -347,6 +348,37 @@ enum rxq_cycles_counter_type { RXQ_N_CYCLES }; +enum { + DP_NETDEV_FLOW_OFFLOAD_OP_ADD, + DP_NETDEV_FLOW_OFFLOAD_OP_MOD, + DP_NETDEV_FLOW_OFFLOAD_OP_DEL, +}; + +struct dp_flow_offload_item { + struct dp_netdev_pmd_thread *pmd; + struct dp_netdev_flow *flow; + int op; + struct match match; + struct nlattr *actions; + size_t actions_len; + + struct ovs_list node; +}; + +struct dp_flow_offload { + struct ovs_mutex mutex; + struct ovs_list list; + pthread_cond_t cond; +}; + +static struct dp_flow_offload dp_flow_offload = { + .mutex = OVS_MUTEX_INITIALIZER, + .list = OVS_LIST_INITIALIZER(&dp_flow_offload.list), +}; + +static struct ovsthread_once offload_thread_once + = OVSTHREAD_ONCE_INITIALIZER; + #define XPS_TIMEOUT 500000LL /* In microseconds. */ /* Contained by struct dp_netdev_port's 'rxqs' member. */ @@ -434,7 +466,9 @@ struct dp_netdev_flow { /* Hash table index by unmasked flow. */ const struct cmap_node node; /* In owning dp_netdev_pmd_thread's */ /* 'flow_table'. */ + const struct cmap_node mark_node; /* In owning flow_mark's mark_to_flow */ const ovs_u128 ufid; /* Unique flow identifier. */ + const ovs_u128 mega_ufid; /* Unique mega flow identifier. */ const unsigned pmd_id; /* The 'core_id' of pmd thread owning this */ /* flow. 
*/ @@ -445,6 +479,7 @@ struct dp_netdev_flow { struct ovs_refcount ref_cnt; bool dead; + uint32_t mark; /* Unique flow mark assigned to a flow */ /* Statistics. */ struct dp_netdev_flow_stats stats; @@ -724,6 +759,8 @@ static void emc_clear_entry(struct emc_entry *ce); static void dp_netdev_request_reconfigure(struct dp_netdev *dp); static inline bool pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd); +static void queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd, + struct dp_netdev_flow *flow); static void emc_cache_init(struct emc_cache *flow_cache) @@ -1941,6 +1978,415 @@ dp_netdev_pmd_find_dpcls(struct dp_netdev_pmd_thread *pmd, return cls; } +#define MAX_FLOW_MARK (UINT32_MAX - 1) +#define INVALID_FLOW_MARK (UINT32_MAX) + +struct megaflow_to_mark_data { + const struct cmap_node node; + ovs_u128 mega_ufid; + uint32_t mark; +}; + +struct flow_mark { + struct cmap megaflow_to_mark; + struct cmap mark_to_flow; + struct id_pool *pool; +}; + +static struct flow_mark flow_mark = { + .megaflow_to_mark = CMAP_INITIALIZER, + .mark_to_flow = CMAP_INITIALIZER, +}; + +static uint32_t +flow_mark_alloc(void) +{ + uint32_t mark; + + if (!flow_mark.pool) { + /* Haven't initialized yet, do it here */ + flow_mark.pool = id_pool_create(0, MAX_FLOW_MARK); + } + + if (id_pool_alloc_id(flow_mark.pool, &mark)) { + return mark; + } + + return INVALID_FLOW_MARK; +} + +static void +flow_mark_free(uint32_t mark) +{ + id_pool_free_id(flow_mark.pool, mark); +} + +/* associate megaflow with a mark, which is a 1:1 mapping */ +static void +megaflow_to_mark_associate(const ovs_u128 *mega_ufid, uint32_t mark) +{ + size_t hash = dp_netdev_flow_hash(mega_ufid); + struct megaflow_to_mark_data *data = xzalloc(sizeof(*data)); + + data->mega_ufid = *mega_ufid; + data->mark = mark; + + cmap_insert(&flow_mark.megaflow_to_mark, + CONST_CAST(struct cmap_node *, &data->node), hash); +} + +/* disassociate megaflow from a mark */ +static void +megaflow_to_mark_disassociate(const ovs_u128 *mega_ufid) +{ + size_t hash = dp_netdev_flow_hash(mega_ufid); + struct megaflow_to_mark_data *data; + + CMAP_FOR_EACH_WITH_HASH (data, node, hash, &flow_mark.megaflow_to_mark) { + if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) { + cmap_remove(&flow_mark.megaflow_to_mark, + CONST_CAST(struct cmap_node *, &data->node), hash); + free(data); + return; + } + } + + VLOG_WARN("Masked ufid "UUID_FMT" is not associated with a mark?\n", + UUID_ARGS((struct uuid *)mega_ufid)); +} + +static inline uint32_t +megaflow_to_mark_find(const ovs_u128 *mega_ufid) +{ + size_t hash = dp_netdev_flow_hash(mega_ufid); + struct megaflow_to_mark_data *data; + + CMAP_FOR_EACH_WITH_HASH (data, node, hash, &flow_mark.megaflow_to_mark) { + if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) { + return data->mark; + } + } + + VLOG_WARN("Mark id for ufid "UUID_FMT" was not found\n", + UUID_ARGS((struct uuid *)mega_ufid)); + return INVALID_FLOW_MARK; +} + +/* associate mark with a flow, which is 1:N mapping */ +static void +mark_to_flow_associate(const uint32_t mark, struct dp_netdev_flow *flow) +{ + dp_netdev_flow_ref(flow); + + cmap_insert(&flow_mark.mark_to_flow, + CONST_CAST(struct cmap_node *, &flow->mark_node), + hash_int(mark, 0)); + flow->mark = mark; + + VLOG_DBG("Associated dp_netdev flow %p with mark %u\n", flow, mark); +} + +static bool +flow_mark_has_no_ref(uint32_t mark) +{ + struct dp_netdev_flow *flow; + + CMAP_FOR_EACH_WITH_HASH (flow, mark_node, hash_int(mark, 0), + &flow_mark.mark_to_flow) { + if (flow->mark == mark) { + return false; + } + }
+ + return true; +} + +static int +mark_to_flow_disassociate(struct dp_netdev_pmd_thread *pmd, + struct dp_netdev_flow *flow) +{ + int ret = 0; + uint32_t mark = flow->mark; + struct cmap_node *mark_node = CONST_CAST(struct cmap_node *, + &flow->mark_node); + + cmap_remove(&flow_mark.mark_to_flow, mark_node, hash_int(mark, 0)); + flow->mark = INVALID_FLOW_MARK; + + /* + * If no flow is referencing the mark any more, remove the flow + * from hardware and free the mark. + */ + if (flow_mark_has_no_ref(mark)) { + struct dp_netdev_port *port; + odp_port_t in_port = flow->flow.in_port.odp_port; + + ovs_mutex_lock(&pmd->dp->port_mutex); + port = dp_netdev_lookup_port(pmd->dp, in_port); + if (port) { + ret = netdev_flow_del(port->netdev, &flow->mega_ufid, NULL); + } + ovs_mutex_unlock(&pmd->dp->port_mutex); + + flow_mark_free(mark); + VLOG_DBG("Freed flow mark %u\n", mark); + + megaflow_to_mark_disassociate(&flow->mega_ufid); + } + dp_netdev_flow_unref(flow); + + return ret; +} + +static void +flow_mark_flush(struct dp_netdev_pmd_thread *pmd) +{ + struct dp_netdev_flow *flow; + + CMAP_FOR_EACH (flow, mark_node, &flow_mark.mark_to_flow) { + if (flow->pmd_id == pmd->core_id) { + queue_netdev_flow_del(pmd, flow); + } + } +} + +static struct dp_netdev_flow * +mark_to_flow_find(const struct dp_netdev_pmd_thread *pmd, + const uint32_t mark) +{ + struct dp_netdev_flow *flow; + + CMAP_FOR_EACH_WITH_HASH (flow, mark_node, hash_int(mark, 0), + &flow_mark.mark_to_flow) { + if (flow->mark == mark && flow->pmd_id == pmd->core_id && + flow->dead == false) { + return flow; + } + } + + return NULL; +} + +static struct dp_flow_offload_item * +dp_netdev_alloc_flow_offload(struct dp_netdev_pmd_thread *pmd, + struct dp_netdev_flow *flow, + int op) +{ + struct dp_flow_offload_item *offload; + + offload = xzalloc(sizeof(*offload)); + offload->pmd = pmd; + offload->flow = flow; + offload->op = op; + + dp_netdev_flow_ref(flow); + dp_netdev_pmd_try_ref(pmd); + + return offload; +} + +static void +dp_netdev_free_flow_offload(struct dp_flow_offload_item *offload) +{ + dp_netdev_pmd_unref(offload->pmd); + dp_netdev_flow_unref(offload->flow); + + free(offload->actions); + free(offload); +} + +static void +dp_netdev_append_flow_offload(struct dp_flow_offload_item *offload) +{ + ovs_mutex_lock(&dp_flow_offload.mutex); + ovs_list_push_back(&dp_flow_offload.list, &offload->node); + xpthread_cond_signal(&dp_flow_offload.cond); + ovs_mutex_unlock(&dp_flow_offload.mutex); +} + +static int +dp_netdev_flow_offload_del(struct dp_flow_offload_item *offload) +{ + return mark_to_flow_disassociate(offload->pmd, offload->flow); +} + +/* + * There are two flow offload operations here: addition and modification. + * + * For flow addition, this function does: + * 1. allocate a new flow mark id + * 2. perform hardware flow offload + * 3. associate the flow mark with flow and mega flow + * + * For flow modification, both flow mark and the associations are still + * valid, thus only item 2 is needed.
+ */ +static int +dp_netdev_flow_offload_put(struct dp_flow_offload_item *offload) +{ + struct dp_netdev_port *port; + struct dp_netdev_pmd_thread *pmd = offload->pmd; + struct dp_netdev_flow *flow = offload->flow; + odp_port_t in_port = flow->flow.in_port.odp_port; + bool modification = offload->op == DP_NETDEV_FLOW_OFFLOAD_OP_MOD; + struct offload_info info; + uint32_t mark; + int ret; + + if (flow->dead) { + return -1; + } + + if (modification) { + mark = flow->mark; + ovs_assert(mark != INVALID_FLOW_MARK); + } else { + /* + * If a mega flow has already been offloaded (from other PMD + * instances), do not offload it again. + */ + mark = megaflow_to_mark_find(&flow->mega_ufid); + if (mark != INVALID_FLOW_MARK) { + VLOG_DBG("Flow has already been offloaded with mark %u\n", mark); + if (flow->mark != INVALID_FLOW_MARK) { + ovs_assert(flow->mark == mark); + } else { + mark_to_flow_associate(mark, flow); + } + return 0; + } + + mark = flow_mark_alloc(); + if (mark == INVALID_FLOW_MARK) { + VLOG_ERR("Failed to allocate flow mark!\n"); + } + } + info.flow_mark = mark; + + ovs_mutex_lock(&pmd->dp->port_mutex); + port = dp_netdev_lookup_port(pmd->dp, in_port); + if (!port) { + ovs_mutex_unlock(&pmd->dp->port_mutex); + return -1; + } + ret = netdev_flow_put(port->netdev, &offload->match, + CONST_CAST(struct nlattr *, offload->actions), + offload->actions_len, &flow->mega_ufid, &info, + NULL); + ovs_mutex_unlock(&pmd->dp->port_mutex); + + if (ret) { + if (!modification) { + flow_mark_free(mark); + } else { + mark_to_flow_disassociate(pmd, flow); + } + return -1; + } + + if (!modification) { + megaflow_to_mark_associate(&flow->mega_ufid, mark); + mark_to_flow_associate(mark, flow); + } + + return 0; +} + +static void * +dp_netdev_flow_offload_main(void *data OVS_UNUSED) +{ + struct dp_flow_offload_item *offload; + struct ovs_list *list; + const char *op; + int ret; + + for (;;) { + ovs_mutex_lock(&dp_flow_offload.mutex); + if (ovs_list_is_empty(&dp_flow_offload.list)) { + ovsrcu_quiesce_start(); + ovs_mutex_cond_wait(&dp_flow_offload.cond, + &dp_flow_offload.mutex); + } + list = ovs_list_pop_front(&dp_flow_offload.list); + offload = CONTAINER_OF(list, struct dp_flow_offload_item, node); + ovs_mutex_unlock(&dp_flow_offload.mutex); + + switch (offload->op) { + case DP_NETDEV_FLOW_OFFLOAD_OP_ADD: + op = "add"; + ret = dp_netdev_flow_offload_put(offload); + break; + case DP_NETDEV_FLOW_OFFLOAD_OP_MOD: + op = "modify"; + ret = dp_netdev_flow_offload_put(offload); + break; + case DP_NETDEV_FLOW_OFFLOAD_OP_DEL: + op = "delete"; + ret = dp_netdev_flow_offload_del(offload); + break; + default: + OVS_NOT_REACHED(); + } + + VLOG_DBG("%s to %s netdev flow\n", + ret == 0 ? 
"succeed" : "failed", op); + dp_netdev_free_flow_offload(offload); + } + + return NULL; +} + +static void +queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd, + struct dp_netdev_flow *flow) +{ + struct dp_flow_offload_item *offload; + + if (ovsthread_once_start(&offload_thread_once)) { + xpthread_cond_init(&dp_flow_offload.cond, NULL); + ovs_thread_create("dp_netdev_flow_offload", + dp_netdev_flow_offload_main, NULL); + ovsthread_once_done(&offload_thread_once); + } + + offload = dp_netdev_alloc_flow_offload(pmd, flow, + DP_NETDEV_FLOW_OFFLOAD_OP_DEL); + dp_netdev_append_flow_offload(offload); +} + +static void +queue_netdev_flow_put(struct dp_netdev_pmd_thread *pmd, + struct dp_netdev_flow *flow, struct match *match, + const struct nlattr *actions, size_t actions_len) +{ + struct dp_flow_offload_item *offload; + int op; + + if (!netdev_is_flow_api_enabled()) { + return; + } + + if (ovsthread_once_start(&offload_thread_once)) { + xpthread_cond_init(&dp_flow_offload.cond, NULL); + ovs_thread_create("dp_netdev_flow_offload", + dp_netdev_flow_offload_main, NULL); + ovsthread_once_done(&offload_thread_once); + } + + if (flow->mark != INVALID_FLOW_MARK) { + op = DP_NETDEV_FLOW_OFFLOAD_OP_MOD; + } else { + op = DP_NETDEV_FLOW_OFFLOAD_OP_ADD; + } + offload = dp_netdev_alloc_flow_offload(pmd, flow, op); + offload->match = *match; + offload->actions = xmalloc(actions_len); + memcpy(offload->actions, actions, actions_len); + offload->actions_len = actions_len; + + dp_netdev_append_flow_offload(offload); +} + static void dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd, struct dp_netdev_flow *flow) @@ -1954,6 +2400,9 @@ dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd, ovs_assert(cls != NULL); dpcls_remove(cls, &flow->cr); cmap_remove(&pmd->flow_table, node, dp_netdev_flow_hash(&flow->ufid)); + if (flow->mark != INVALID_FLOW_MARK) { + queue_netdev_flow_del(pmd, flow); + } flow->dead = true; dp_netdev_flow_unref(flow); @@ -2534,6 +2983,19 @@ out: return error; } +static void +dp_netdev_get_mega_ufid(const struct match *match, ovs_u128 *mega_ufid) +{ + struct flow masked_flow; + size_t i; + + for (i = 0; i < sizeof(struct flow); i++) { + ((uint8_t *)&masked_flow)[i] = ((uint8_t *)&match->flow)[i] & + ((uint8_t *)&match->wc)[i]; + } + dpif_flow_hash(NULL, &masked_flow, sizeof(struct flow), mega_ufid); +} + static struct dp_netdev_flow * dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd, struct match *match, const ovs_u128 *ufid, @@ -2569,12 +3031,14 @@ dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd, memset(&flow->stats, 0, sizeof flow->stats); flow->dead = false; flow->batch = NULL; + flow->mark = INVALID_FLOW_MARK; *CONST_CAST(unsigned *, &flow->pmd_id) = pmd->core_id; *CONST_CAST(struct flow *, &flow->flow) = match->flow; *CONST_CAST(ovs_u128 *, &flow->ufid) = *ufid; ovs_refcount_init(&flow->ref_cnt); ovsrcu_set(&flow->actions, dp_netdev_actions_create(actions, actions_len)); + dp_netdev_get_mega_ufid(match, CONST_CAST(ovs_u128 *, &flow->mega_ufid)); netdev_flow_key_init_masked(&flow->cr.flow, &match->flow, &mask); /* Select dpcls for in_port. Relies on in_port to be exact match. 
*/ @@ -2584,6 +3048,8 @@ dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd, cmap_insert(&pmd->flow_table, CONST_CAST(struct cmap_node *, &flow->node), dp_netdev_flow_hash(&flow->ufid)); + queue_netdev_flow_put(pmd, flow, match, actions, actions_len); + if (OVS_UNLIKELY(!VLOG_DROP_DBG((&upcall_rl)))) { struct ds ds = DS_EMPTY_INITIALIZER; struct ofpbuf key_buf, mask_buf; @@ -2671,6 +3137,9 @@ flow_put_on_pmd(struct dp_netdev_pmd_thread *pmd, old_actions = dp_netdev_flow_get_actions(netdev_flow); ovsrcu_set(&netdev_flow->actions, new_actions); + queue_netdev_flow_put(pmd, netdev_flow, match, + put->actions, put->actions_len); + if (stats) { get_dpif_flow_stats(netdev_flow, stats); } @@ -3792,6 +4261,7 @@ reload_affected_pmds(struct dp_netdev *dp) CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { if (pmd->need_reload) { + flow_mark_flush(pmd); dp_netdev_reload_pmd__(pmd); pmd->need_reload = false; } @@ -5081,10 +5551,10 @@ struct packet_batch_per_flow { static inline void packet_batch_per_flow_update(struct packet_batch_per_flow *batch, struct dp_packet *packet, - const struct miniflow *mf) + uint16_t tcp_flags) { batch->byte_count += dp_packet_size(packet); - batch->tcp_flags |= miniflow_get_tcp_flags(mf); + batch->tcp_flags |= tcp_flags; batch->array.packets[batch->array.count++] = packet; } @@ -5118,7 +5588,7 @@ packet_batch_per_flow_execute(struct packet_batch_per_flow *batch, static inline void dp_netdev_queue_batches(struct dp_packet *pkt, - struct dp_netdev_flow *flow, const struct miniflow *mf, + struct dp_netdev_flow *flow, uint16_t tcp_flags, struct packet_batch_per_flow *batches, size_t *n_batches) { @@ -5129,7 +5599,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt, packet_batch_per_flow_init(batch, flow); } - packet_batch_per_flow_update(batch, pkt, mf); + packet_batch_per_flow_update(batch, pkt, tcp_flags); } /* Try to process all ('cnt') the 'packets' using only the exact match cache @@ -5160,6 +5630,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd, const size_t cnt = dp_packet_batch_size(packets_); uint32_t cur_min; int i; + uint16_t tcp_flags; atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min); pmd_perf_update_counter(&pmd->perf_stats, @@ -5168,6 +5639,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd, DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) { struct dp_netdev_flow *flow; + uint32_t mark; if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) { dp_packet_delete(packet); @@ -5185,6 +5657,18 @@ emc_processing(struct dp_netdev_pmd_thread *pmd, if (!md_is_valid) { pkt_metadata_init(&packet->md, port_no); } + + if ((*recirc_depth_get() == 0) && + dp_packet_has_flow_mark(packet, &mark)) { + flow = mark_to_flow_find(pmd, mark); + if (flow) { + tcp_flags = parse_tcp_flags(packet); + dp_netdev_queue_batches(packet, flow, tcp_flags, batches, + n_batches); + continue; + } + } + miniflow_extract(packet, &key->mf); key->len = 0; /* Not computed yet. */ /* If EMC is disabled skip hash computation and emc_lookup */ @@ -5200,7 +5684,8 @@ emc_processing(struct dp_netdev_pmd_thread *pmd, flow = NULL; } if (OVS_LIKELY(flow)) { - dp_netdev_queue_batches(packet, flow, &key->mf, batches, + tcp_flags = miniflow_get_tcp_flags(&key->mf); + dp_netdev_queue_batches(packet, flow, tcp_flags, batches, n_batches); } else { /* Exact match cache missed. 
Group missed packets together at @@ -5387,7 +5872,9 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd, flow = dp_netdev_flow_cast(rules[i]); emc_probabilistic_insert(pmd, &keys[i], flow); - dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches); + dp_netdev_queue_batches(packet, flow, + miniflow_get_tcp_flags(&keys[i].mf), + batches, n_batches); } pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT, diff --git a/lib/flow.c b/lib/flow.c index 75ca45672..a785e63a8 100644 --- a/lib/flow.c +++ b/lib/flow.c @@ -624,6 +624,70 @@ flow_extract(struct dp_packet *packet, struct flow *flow) miniflow_expand(&m.mf, flow); } +static inline bool +ipv4_sanity_check(const struct ip_header *nh, size_t size, + int *ip_lenp, uint16_t *tot_lenp) +{ + int ip_len; + uint16_t tot_len; + + if (OVS_UNLIKELY(size < IP_HEADER_LEN)) { + return false; + } + ip_len = IP_IHL(nh->ip_ihl_ver) * 4; + + if (OVS_UNLIKELY(ip_len < IP_HEADER_LEN || size < ip_len)) { + return false; + } + + tot_len = ntohs(nh->ip_tot_len); + if (OVS_UNLIKELY(tot_len > size || ip_len > tot_len || + size - tot_len > UINT8_MAX)) { + return false; + } + + *ip_lenp = ip_len; + *tot_lenp = tot_len; + + return true; +} + +static inline uint8_t +ipv4_get_nw_frag(const struct ip_header *nh) +{ + uint8_t nw_frag = 0; + + if (OVS_UNLIKELY(IP_IS_FRAGMENT(nh->ip_frag_off))) { + nw_frag = FLOW_NW_FRAG_ANY; + if (nh->ip_frag_off & htons(IP_FRAG_OFF_MASK)) { + nw_frag |= FLOW_NW_FRAG_LATER; + } + } + + return nw_frag; +} + +static inline bool +ipv6_sanity_check(const struct ovs_16aligned_ip6_hdr *nh, size_t size) +{ + uint16_t plen; + + if (OVS_UNLIKELY(size < sizeof *nh)) { + return false; + } + + plen = ntohs(nh->ip6_plen); + if (OVS_UNLIKELY(plen > size)) { + return false; + } + /* Jumbo Payload option not supported yet. */ + if (OVS_UNLIKELY(size - plen > UINT8_MAX)) { + return false; + } + + return true; +} + /* Caller is responsible for initializing 'dst' with enough storage for * FLOW_U64S * 8 bytes. */ void @@ -748,22 +812,7 @@ miniflow_extract(struct dp_packet *packet, struct miniflow *dst) int ip_len; uint16_t tot_len; - if (OVS_UNLIKELY(size < IP_HEADER_LEN)) { - goto out; - } - ip_len = IP_IHL(nh->ip_ihl_ver) * 4; - - if (OVS_UNLIKELY(ip_len < IP_HEADER_LEN)) { - goto out; - } - if (OVS_UNLIKELY(size < ip_len)) { - goto out; - } - tot_len = ntohs(nh->ip_tot_len); - if (OVS_UNLIKELY(tot_len > size || ip_len > tot_len)) { - goto out; - } - if (OVS_UNLIKELY(size - tot_len > UINT8_MAX)) { + if (OVS_UNLIKELY(!ipv4_sanity_check(nh, size, &ip_len, &tot_len))) { goto out; } dp_packet_set_l2_pad_size(packet, size - tot_len); @@ -786,31 +835,19 @@ miniflow_extract(struct dp_packet *packet, struct miniflow *dst) nw_tos = nh->ip_tos; nw_ttl = nh->ip_ttl; nw_proto = nh->ip_proto; - if (OVS_UNLIKELY(IP_IS_FRAGMENT(nh->ip_frag_off))) { - nw_frag = FLOW_NW_FRAG_ANY; - if (nh->ip_frag_off & htons(IP_FRAG_OFF_MASK)) { - nw_frag |= FLOW_NW_FRAG_LATER; - } - } + nw_frag = ipv4_get_nw_frag(nh); data_pull(&data, &size, ip_len); } else if (dl_type == htons(ETH_TYPE_IPV6)) { - const struct ovs_16aligned_ip6_hdr *nh; + const struct ovs_16aligned_ip6_hdr *nh = data; ovs_be32 tc_flow; uint16_t plen; - if (OVS_UNLIKELY(size < sizeof *nh)) { + if (OVS_UNLIKELY(!ipv6_sanity_check(nh, size))) { goto out; } - nh = data_pull(&data, &size, sizeof *nh); + data_pull(&data, &size, sizeof *nh); plen = ntohs(nh->ip6_plen); - if (OVS_UNLIKELY(plen > size)) { - goto out; - } - /* Jumbo Payload option not supported yet. 
*/ - if (OVS_UNLIKELY(size - plen > UINT8_MAX)) { - goto out; - } dp_packet_set_l2_pad_size(packet, size - plen); size = plen; /* Never pull padding. */ @@ -982,6 +1019,73 @@ parse_dl_type(const struct eth_header *data_, size_t size) return parse_ethertype(&data, &size); } +uint16_t +parse_tcp_flags(struct dp_packet *packet) +{ + const void *data = dp_packet_data(packet); + const char *frame = (const char *)data; + size_t size = dp_packet_size(packet); + ovs_be16 dl_type; + uint8_t nw_frag = 0, nw_proto = 0; + + if (packet->packet_type != htonl(PT_ETH)) { + return 0; + } + + dp_packet_reset_offsets(packet); + + data_pull(&data, &size, ETH_ADDR_LEN * 2); + dl_type = parse_ethertype(&data, &size); + if (OVS_UNLIKELY(eth_type_mpls(dl_type))) { + packet->l2_5_ofs = (char *)data - frame; + } + if (OVS_LIKELY(dl_type == htons(ETH_TYPE_IP))) { + const struct ip_header *nh = data; + int ip_len; + uint16_t tot_len; + + if (OVS_UNLIKELY(!ipv4_sanity_check(nh, size, &ip_len, &tot_len))) { + return 0; + } + dp_packet_set_l2_pad_size(packet, size - tot_len); + packet->l3_ofs = (uint16_t)((char *)nh - frame); + nw_proto = nh->ip_proto; + nw_frag = ipv4_get_nw_frag(nh); + + size = tot_len; /* Never pull padding. */ + data_pull(&data, &size, ip_len); + } else if (dl_type == htons(ETH_TYPE_IPV6)) { + const struct ovs_16aligned_ip6_hdr *nh = data; + uint16_t plen; + + if (OVS_UNLIKELY(!ipv6_sanity_check(nh, size))) { + return 0; + } + packet->l3_ofs = (uint16_t)((char *)nh - frame); + data_pull(&data, &size, sizeof *nh); + + plen = ntohs(nh->ip6_plen); /* Never pull padding. */ + dp_packet_set_l2_pad_size(packet, size - plen); + size = plen; + nw_proto = nh->ip6_nxt; + if (!parse_ipv6_ext_hdrs__(&data, &size, &nw_proto, &nw_frag)) { + return 0; + } + } else { + return 0; + } + + packet->l4_ofs = (uint16_t)((char *)data - frame); + if (!(nw_frag & FLOW_NW_FRAG_LATER) && nw_proto == IPPROTO_TCP && + size >= TCP_HEADER_LEN) { + const struct tcp_header *tcp = data; + + return TCP_FLAGS(tcp->tcp_ctl); + } + + return 0; +} + /* For every bit of a field that is wildcarded in 'wildcards', sets the * corresponding bit in 'flow' to zero. */ void diff --git a/lib/flow.h b/lib/flow.h index 5b6585f11..af7b5e921 100644 --- a/lib/flow.h +++ b/lib/flow.h @@ -133,6 +133,7 @@ bool parse_ipv6_ext_hdrs(const void **datap, size_t *sizep, uint8_t *nw_proto, uint8_t *nw_frag); ovs_be16 parse_dl_type(const struct eth_header *data_, size_t size); bool parse_nsh(const void **datap, size_t *sizep, struct ovs_key_nsh *key); +uint16_t parse_tcp_flags(struct dp_packet *packet); static inline uint64_t flow_get_xreg(const struct flow *flow, int idx) diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index c223e5b52..bb4d60f26 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -38,7 +38,9 @@ #include <rte_pci.h> #include <rte_vhost.h> #include <rte_version.h> +#include <rte_flow.h> +#include "cmap.h" #include "dirs.h" #include "dp-packet.h" #include "dpdk.h" @@ -51,6 +53,7 @@ #include "openvswitch/list.h" #include "openvswitch/ofp-print.h" #include "openvswitch/vlog.h" +#include "openvswitch/match.h" #include "ovs-numa.h" #include "ovs-thread.h" #include "ovs-rcu.h" @@ -60,6 +63,7 @@ #include "sset.h" #include "unaligned.h" #include "timeval.h" +#include "uuid.h" #include "unixctl.h" enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM}; @@ -91,13 +95,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); #define NETDEV_DPDK_MBUF_ALIGN 1024 #define NETDEV_DPDK_MAX_PKT_LEN 9728 -/* Min number of packets in the mempool.
OVS tries to allocate a mempool with - * roughly estimated number of mbufs: if this fails (because the system doesn't - * have enough hugepages) we keep halving the number until the allocation - * succeeds or we reach MIN_NB_MBUF */ +/* Max and min number of packets in the mempool. OVS tries to allocate a + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have + * enough hugepages) we keep halving the number until the allocation succeeds + * or we reach MIN_NB_MBUF */ + +#define MAX_NB_MBUF (4096 * 64) #define MIN_NB_MBUF (4096 * 4) #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */ +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF) + == 0); + +/* The smallest possible NB_MBUF that we're going to try should be a multiple + * of MP_CACHE_SZ. This is advised by DPDK documentation. */ +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)) + % MP_CACHE_SZ == 0); + /* * DPDK XSTATS Counter names definition */ @@ -171,6 +186,17 @@ static const struct rte_eth_conf port_conf = { }; /* + * A mapping from ufid to dpdk rte_flow. + */ +static struct cmap ufid_to_rte_flow = CMAP_INITIALIZER; + +struct ufid_to_rte_flow_data { + struct cmap_node node; + ovs_u128 ufid; + struct rte_flow *rte_flow; +}; + +/* * These callbacks allow virtio-net devices to be added to vhost ports when * configuration has been fully completed. */ @@ -297,12 +323,14 @@ static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex) = OVS_MUTEX_INITIALIZER; /* Contains all 'struct dpdk_mp's. */ -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex) - = OVS_LIST_INITIALIZER(&dpdk_mp_free_list); +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex) + = OVS_LIST_INITIALIZER(&dpdk_mp_list); -/* Wrapper for a mempool released but not yet freed. */ struct dpdk_mp { struct rte_mempool *mp; + int mtu; + int socket_id; + int refcount; struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex); }; @@ -384,7 +412,7 @@ struct netdev_dpdk { PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1, struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex); - struct rte_mempool *mp; + struct dpdk_mp *dpdk_mp; /* virtio identifier for vhost devices */ ovsrcu_index vid; @@ -550,68 +578,89 @@ dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex) /* Free unused mempools. */ static void -dpdk_mp_sweep(void) +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex) { struct dpdk_mp *dmp, *next; - ovs_mutex_lock(&dpdk_mp_mutex); - LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) { - if (dpdk_mp_full(dmp->mp)) { + LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) { + if (!dmp->refcount && dpdk_mp_full(dmp->mp)) { VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name); ovs_list_remove(&dmp->list_node); rte_mempool_free(dmp->mp); rte_free(dmp); } } - ovs_mutex_unlock(&dpdk_mp_mutex); } -/* Ensure a mempool will not be freed. */ -static void -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex) +/* Calculating the required number of mbufs differs depending on the + * mempool model being used. Check if per port memory is in use before + * calculating. 
+ */ +static uint32_t +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp) { - struct dpdk_mp *dmp, *next; + uint32_t n_mbufs; - LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) { - if (dmp->mp == mp) { - VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name); - ovs_list_remove(&dmp->list_node); - rte_free(dmp); - break; + if (!per_port_mp) { + /* Shared memory is being used. + * XXX: this is a really rough method of provisioning memory. + * It's impossible to determine what the exact memory requirements are + * when the number of ports and rxqs that utilize a particular mempool + * can change dynamically at runtime. For now, use this rough + * heuristic. + */ + if (mtu >= ETHER_MTU) { + n_mbufs = MAX_NB_MBUF; + } else { + n_mbufs = MIN_NB_MBUF; } + } else { + /* Per port memory is being used. + * XXX: rough estimation of number of mbufs required for this port: + * <packets required to fill the device rxqs> + * + <packets that could be stuck on other ports txqs> + * + <packets in the pmd threads> + * + <additional memory for corner cases> + */ + n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size + + dev->requested_n_txq * dev->requested_txq_size + + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST + + MIN_NB_MBUF; } + + return n_mbufs; } -/* Returns a valid pointer when either of the following is true: - * - a new mempool was just created; - * - a matching mempool already exists. */ -static struct rte_mempool * -dpdk_mp_create(struct netdev_dpdk *dev, int mtu) +static struct dpdk_mp * +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp) { char mp_name[RTE_MEMPOOL_NAMESIZE]; const char *netdev_name = netdev_get_name(&dev->up); int socket_id = dev->requested_socket_id; uint32_t n_mbufs; uint32_t hash = hash_string(netdev_name, 0); - struct rte_mempool *mp = NULL; + struct dpdk_mp *dmp = NULL; + int ret; - /* - * XXX: rough estimation of number of mbufs required for this port: - * <packets required to fill the device rxqs> - * + <packets that could be stuck on other ports txqs> - * + <packets in the pmd threads> - * + <additional memory for corner cases> - */ - n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size - + dev->requested_n_txq * dev->requested_txq_size - + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST - + MIN_NB_MBUF; + dmp = dpdk_rte_mzalloc(sizeof *dmp); + if (!dmp) { + return NULL; + } + dmp->socket_id = socket_id; + dmp->mtu = mtu; + dmp->refcount = 1; + + n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp); - ovs_mutex_lock(&dpdk_mp_mutex); do { /* Full DPDK memory pool name must be unique and cannot be - * longer than RTE_MEMPOOL_NAMESIZE. */ - int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, + * longer than RTE_MEMPOOL_NAMESIZE. Note that for the shared + * mempool case this can result in one device using a mempool + * which references a different device in its name. However, as + * mempool names are hashed, the device name will not be readable + * so this is not an issue for tasks such as debugging.
+ */ + ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs%08x%02d%05d%07u", hash, socket_id, mtu, n_mbufs); if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) { @@ -627,96 +676,159 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu) netdev_name, n_mbufs, socket_id, dev->requested_n_rxq, dev->requested_n_txq); - mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ, - sizeof (struct dp_packet) - sizeof (struct rte_mbuf), - MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id); + dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, + MP_CACHE_SZ, + sizeof (struct dp_packet) + - sizeof (struct rte_mbuf), + MBUF_SIZE(mtu) + - sizeof(struct dp_packet), + socket_id); - if (mp) { + if (dmp->mp) { VLOG_DBG("Allocated \"%s\" mempool with %u mbufs", mp_name, n_mbufs); /* rte_pktmbuf_pool_create has done some initialization of the - * rte_mbuf part of each dp_packet. Some OvS specific fields - * of the packet still need to be initialized by - * ovs_rte_pktmbuf_init. */ - rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL); + * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init + * initializes some OVS specific fields of dp_packet. + */ + rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL); + return dmp; } else if (rte_errno == EEXIST) { /* A mempool with the same name already exists. We just * retrieve its pointer to be returned to the caller. */ - mp = rte_mempool_lookup(mp_name); + dmp->mp = rte_mempool_lookup(mp_name); /* As the mempool create returned EEXIST we can expect the * lookup has returned a valid pointer. If for some reason * that's not the case we keep track of it. */ VLOG_DBG("A mempool with name \"%s\" already exists at %p.", - mp_name, mp); - /* Ensure this reused mempool will not be freed. */ - dpdk_mp_do_not_free(mp); + mp_name, dmp->mp); + return dmp; } else { - VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs", - mp_name, n_mbufs); + VLOG_DBG("Failed to create mempool \"%s\" with a request of " + "%u mbufs, retrying with %u mbufs", + mp_name, n_mbufs, n_mbufs / 2); } - } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF); + } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF); - ovs_mutex_unlock(&dpdk_mp_mutex); - return mp; + VLOG_ERR("Failed to create mempool \"%s\" with a request of %u mbufs", + mp_name, n_mbufs); + + rte_free(dmp); + return NULL; } -/* Release an existing mempool. */ -static void -dpdk_mp_release(struct rte_mempool *mp) +static struct dpdk_mp * +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp) { - if (!mp) { - return; - } + struct dpdk_mp *dmp, *next; + bool reuse = false; ovs_mutex_lock(&dpdk_mp_mutex); - if (dpdk_mp_full(mp)) { - VLOG_DBG("Freeing mempool \"%s\"", mp->name); - rte_mempool_free(mp); - } else { - struct dpdk_mp *dmp; + /* Check if shared memory is being used, if so check existing mempools + * to see if reuse is possible. */ + if (!per_port_mp) { + LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) { + if (dmp->socket_id == dev->requested_socket_id + && dmp->mtu == mtu) { + VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name); + dmp->refcount++; + reuse = true; + break; + } + } + } + /* Sweep mempools after reuse or before create. 
*/ + dpdk_mp_sweep(); - dmp = dpdk_rte_mzalloc(sizeof *dmp); + if (!reuse) { + dmp = dpdk_mp_create(dev, mtu, per_port_mp); if (dmp) { - dmp->mp = mp; - ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node); + /* Shared memory will hit the reuse case above so will not + * request a mempool that already exists but we need to check + * for the EEXIST case for per port memory case. Compare the + * mempool returned by dmp to each entry in dpdk_mp_list. If a + * match is found, free dmp as a new entry is not required, set + * dmp to point to the existing entry and increment the refcount + * to avoid being freed at a later stage. + */ + if (per_port_mp && rte_errno == EEXIST) { + LIST_FOR_EACH (next, list_node, &dpdk_mp_list) { + if (dmp->mp == next->mp) { + rte_free(dmp); + dmp = next; + dmp->refcount++; + } + } + } else { + ovs_list_push_back(&dpdk_mp_list, &dmp->list_node); + } } } + + ovs_mutex_unlock(&dpdk_mp_mutex); + + return dmp; } -/* Tries to allocate a new mempool - or re-use an existing one where - * appropriate - on requested_socket_id with a size determined by - * requested_mtu and requested Rx/Tx queues. - * On success - or when re-using an existing mempool - the new configuration - * will be applied. +/* Decrement reference to a mempool. */ +static void +dpdk_mp_put(struct dpdk_mp *dmp) +{ + if (!dmp) { + return; + } + + ovs_mutex_lock(&dpdk_mp_mutex); + ovs_assert(dmp->refcount); + dmp->refcount--; + ovs_mutex_unlock(&dpdk_mp_mutex); +} + +/* Depending on the memory model being used this function tries to + * identify and reuse an existing mempool or tries to allocate a new + * mempool on requested_socket_id with mbuf size corresponding to the + * requested_mtu. On success, a new configuration will be applied. * On error, device will be left unchanged. */ static int netdev_dpdk_mempool_configure(struct netdev_dpdk *dev) OVS_REQUIRES(dev->mutex) { uint32_t buf_size = dpdk_buf_size(dev->requested_mtu); - struct rte_mempool *mp; + struct dpdk_mp *dmp; int ret = 0; + bool per_port_mp = dpdk_per_port_memory(); - dpdk_mp_sweep(); + /* With shared memory we do not need to configure a mempool if the MTU + * and socket ID have not changed, the previous configuration is still + * valid so return 0 */ + if (!per_port_mp && dev->mtu == dev->requested_mtu + && dev->socket_id == dev->requested_socket_id) { + return ret; + } - mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size)); - if (!mp) { + dmp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp); + if (!dmp) { VLOG_ERR("Failed to create memory pool for netdev " "%s, with MTU %d on socket %d: %s\n", dev->up.name, dev->requested_mtu, dev->requested_socket_id, rte_strerror(rte_errno)); ret = rte_errno; } else { - /* If a new MTU was requested and its rounded value equals the one - * that is currently used, then the existing mempool is returned. */ - if (dev->mp != mp) { - /* A new mempool was created, release the previous one. */ - dpdk_mp_release(dev->mp); - } else { - ret = EEXIST; + /* Check for any pre-existing dpdk_mp for the device before accessing + * the associated mempool. + */ + if (dev->dpdk_mp != NULL) { + /* A new MTU was requested, decrement the reference count for the + * devices current dpdk_mp. This is required even if a pointer to + * same dpdk_mp is returned by dpdk_mp_get. The refcount for dmp + * has already been incremented by dpdk_mp_get at this stage so it + * must be decremented to keep an accurate refcount for the + * dpdk_mp. 
+ */ + dpdk_mp_put(dev->dpdk_mp); } - dev->mp = mp; + dev->dpdk_mp = dmp; dev->mtu = dev->requested_mtu; dev->socket_id = dev->requested_socket_id; dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); @@ -855,7 +967,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq) for (i = 0; i < n_rxq; i++) { diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size, - dev->socket_id, NULL, dev->mp); + dev->socket_id, NULL, + dev->dpdk_mp->mp); if (diag) { VLOG_INFO("Interface %s unable to setup rxq(%d): %s", dev->up.name, i, rte_strerror(-diag)); @@ -950,7 +1063,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN); rte_eth_link_get_nowait(dev->port_id, &dev->link); - mbp_priv = rte_mempool_get_priv(dev->mp); + mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp); dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM; /* Get the Flow control configuration for DPDK-ETH */ @@ -1204,7 +1317,7 @@ common_destruct(struct netdev_dpdk *dev) OVS_EXCLUDED(dev->mutex) { rte_free(dev->tx_q); - dpdk_mp_release(dev->mp); + dpdk_mp_put(dev->dpdk_mp); ovs_list_remove(&dev->list_node); free(ovsrcu_get_protected(struct ingress_policer *, @@ -1957,7 +2070,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq, return EAGAIN; } - nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp, + nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp, (struct rte_mbuf **) batch->packets, NETDEV_MAX_BURST); if (!nb_rx) { @@ -2196,7 +2309,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch) continue; } - pkts[txcnt] = rte_pktmbuf_alloc(dev->mp); + pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp); if (OVS_UNLIKELY(!pkts[txcnt])) { dropped += cnt - i; break; @@ -3075,7 +3188,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn, ovs_mutex_lock(&dev->mutex); ovs_mutex_lock(&dpdk_mp_mutex); - rte_mempool_dump(stream, dev->mp); + rte_mempool_dump(stream, dev->dpdk_mp->mp); ovs_mutex_unlock(&dpdk_mp_mutex); ovs_mutex_unlock(&dev->mutex); @@ -3772,7 +3885,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev) err = netdev_dpdk_mempool_configure(dev); if (!err) { - /* A new mempool was created. */ + /* A new mempool was created or re-used. */ netdev_change_seq_changed(&dev->up); } else if (err != EEXIST){ return err; @@ -3883,6 +3996,724 @@ unlock: return err; } + +/* Find rte_flow with @ufid */ +static struct rte_flow * +ufid_to_rte_flow_find(const ovs_u128 *ufid) { + size_t hash = hash_bytes(ufid, sizeof(*ufid), 0); + struct ufid_to_rte_flow_data *data; + + CMAP_FOR_EACH_WITH_HASH (data, node, hash, &ufid_to_rte_flow) { + if (ovs_u128_equals(*ufid, data->ufid)) { + return data->rte_flow; + } + } + + return NULL; +} + +static inline void +ufid_to_rte_flow_associate(const ovs_u128 *ufid, + struct rte_flow *rte_flow) { + size_t hash = hash_bytes(ufid, sizeof(*ufid), 0); + struct ufid_to_rte_flow_data *data = xzalloc(sizeof(*data)); + + /* + * We should not simply overwrite an existing rte flow. + * We should have deleted it first before re-adding it. + * Thus, if following assert triggers, something is wrong: + * the rte_flow is not destroyed. 
+ */ + ovs_assert(ufid_to_rte_flow_find(ufid) == NULL); + + data->ufid = *ufid; + data->rte_flow = rte_flow; + + cmap_insert(&ufid_to_rte_flow, + CONST_CAST(struct cmap_node *, &data->node), hash); +} + +static inline void +ufid_to_rte_flow_disassociate(const ovs_u128 *ufid) { + size_t hash = hash_bytes(ufid, sizeof(*ufid), 0); + struct ufid_to_rte_flow_data *data; + + CMAP_FOR_EACH_WITH_HASH (data, node, hash, &ufid_to_rte_flow) { + if (ovs_u128_equals(*ufid, data->ufid)) { + cmap_remove(&ufid_to_rte_flow, + CONST_CAST(struct cmap_node *, &data->node), hash); + free(data); + return; + } + } + + VLOG_WARN("ufid "UUID_FMT" is not associated with an rte flow\n", + UUID_ARGS((struct uuid *)ufid)); +} + +/* + * To avoid individual xrealloc calls for each new element, a 'current_max' + * is used to keep track of the currently allocated number of elements. It + * starts at 8 and doubles on each xrealloc call. + */ +struct flow_patterns { + struct rte_flow_item *items; + int cnt; + int current_max; +}; + +struct flow_actions { + struct rte_flow_action *actions; + int cnt; + int current_max; +}; + +static void +dump_flow_pattern(struct rte_flow_item *item) +{ + if (item->type == RTE_FLOW_ITEM_TYPE_ETH) { + const struct rte_flow_item_eth *eth_spec = item->spec; + const struct rte_flow_item_eth *eth_mask = item->mask; + + VLOG_DBG("rte flow eth pattern:\n"); + if (eth_spec) { + VLOG_DBG(" Spec: src="ETH_ADDR_FMT", dst="ETH_ADDR_FMT", " + "type=0x%04" PRIx16"\n", + eth_spec->src.addr_bytes[0], eth_spec->src.addr_bytes[1], + eth_spec->src.addr_bytes[2], eth_spec->src.addr_bytes[3], + eth_spec->src.addr_bytes[4], eth_spec->src.addr_bytes[5], + eth_spec->dst.addr_bytes[0], eth_spec->dst.addr_bytes[1], + eth_spec->dst.addr_bytes[2], eth_spec->dst.addr_bytes[3], + eth_spec->dst.addr_bytes[4], eth_spec->dst.addr_bytes[5], + ntohs(eth_spec->type)); + } else { + VLOG_DBG(" Spec = null\n"); + } + if (eth_mask) { + VLOG_DBG(" Mask: src="ETH_ADDR_FMT", dst="ETH_ADDR_FMT", " + "type=0x%04"PRIx16"\n", + eth_mask->src.addr_bytes[0], eth_mask->src.addr_bytes[1], + eth_mask->src.addr_bytes[2], eth_mask->src.addr_bytes[3], + eth_mask->src.addr_bytes[4], eth_mask->src.addr_bytes[5], + eth_mask->dst.addr_bytes[0], eth_mask->dst.addr_bytes[1], + eth_mask->dst.addr_bytes[2], eth_mask->dst.addr_bytes[3], + eth_mask->dst.addr_bytes[4], eth_mask->dst.addr_bytes[5], + eth_mask->type); + } else { + VLOG_DBG(" Mask = null\n"); + } + } + + if (item->type == RTE_FLOW_ITEM_TYPE_VLAN) { + const struct rte_flow_item_vlan *vlan_spec = item->spec; + const struct rte_flow_item_vlan *vlan_mask = item->mask; + + VLOG_DBG("rte flow vlan pattern:\n"); + if (vlan_spec) { + VLOG_DBG(" Spec: tpid=0x%"PRIx16", tci=0x%"PRIx16"\n", + ntohs(vlan_spec->tpid), ntohs(vlan_spec->tci)); + } else { + VLOG_DBG(" Spec = null\n"); + } + + if (vlan_mask) { + VLOG_DBG(" Mask: tpid=0x%"PRIx16", tci=0x%"PRIx16"\n", + vlan_mask->tpid, vlan_mask->tci); + } else { + VLOG_DBG(" Mask = null\n"); + } + } + + if (item->type == RTE_FLOW_ITEM_TYPE_IPV4) { + const struct rte_flow_item_ipv4 *ipv4_spec = item->spec; + const struct rte_flow_item_ipv4 *ipv4_mask = item->mask; + + VLOG_DBG("rte flow ipv4 pattern:\n"); + if (ipv4_spec) { + VLOG_DBG(" Spec: tos=0x%"PRIx8", ttl=%"PRIx8", proto=0x%"PRIx8 + ", src="IP_FMT", dst="IP_FMT"\n", + ipv4_spec->hdr.type_of_service, + ipv4_spec->hdr.time_to_live, + ipv4_spec->hdr.next_proto_id, + IP_ARGS(ipv4_spec->hdr.src_addr), + IP_ARGS(ipv4_spec->hdr.dst_addr)); + } else { + VLOG_DBG(" Spec = null\n"); + } + if (ipv4_mask) { + VLOG_DBG(" Mask:
tos=0x%"PRIx8", ttl=%"PRIx8", proto=0x%"PRIx8 + ", src="IP_FMT", dst="IP_FMT"\n", + ipv4_mask->hdr.type_of_service, + ipv4_mask->hdr.time_to_live, + ipv4_mask->hdr.next_proto_id, + IP_ARGS(ipv4_mask->hdr.src_addr), + IP_ARGS(ipv4_mask->hdr.dst_addr)); + } else { + VLOG_DBG(" Mask = null\n"); + } + } + + if (item->type == RTE_FLOW_ITEM_TYPE_UDP) { + const struct rte_flow_item_udp *udp_spec = item->spec; + const struct rte_flow_item_udp *udp_mask = item->mask; + + VLOG_DBG("rte flow udp pattern:\n"); + if (udp_spec) { + VLOG_DBG(" Spec: src_port=%"PRIu16", dst_port=%"PRIu16"\n", + ntohs(udp_spec->hdr.src_port), + ntohs(udp_spec->hdr.dst_port)); + } else { + VLOG_DBG(" Spec = null\n"); + } + if (udp_mask) { + VLOG_DBG(" Mask: src_port=0x%"PRIx16", dst_port=0x%"PRIx16"\n", + udp_mask->hdr.src_port, + udp_mask->hdr.dst_port); + } else { + VLOG_DBG(" Mask = null\n"); + } + } + + if (item->type == RTE_FLOW_ITEM_TYPE_SCTP) { + const struct rte_flow_item_sctp *sctp_spec = item->spec; + const struct rte_flow_item_sctp *sctp_mask = item->mask; + + VLOG_DBG("rte flow sctp pattern:\n"); + if (sctp_spec) { + VLOG_DBG(" Spec: src_port=%"PRIu16", dst_port=%"PRIu16"\n", + ntohs(sctp_spec->hdr.src_port), + ntohs(sctp_spec->hdr.dst_port)); + } else { + VLOG_DBG(" Spec = null\n"); + } + if (sctp_mask) { + VLOG_DBG(" Mask: src_port=0x%"PRIx16", dst_port=0x%"PRIx16"\n", + sctp_mask->hdr.src_port, + sctp_mask->hdr.dst_port); + } else { + VLOG_DBG(" Mask = null\n"); + } + } + + if (item->type == RTE_FLOW_ITEM_TYPE_ICMP) { + const struct rte_flow_item_icmp *icmp_spec = item->spec; + const struct rte_flow_item_icmp *icmp_mask = item->mask; + + VLOG_DBG("rte flow icmp pattern:\n"); + if (icmp_spec) { + VLOG_DBG(" Spec: icmp_type=%"PRIu8", icmp_code=%"PRIu8"\n", + ntohs(icmp_spec->hdr.icmp_type), + ntohs(icmp_spec->hdr.icmp_code)); + } else { + VLOG_DBG(" Spec = null\n"); + } + if (icmp_mask) { + VLOG_DBG(" Mask: icmp_type=0x%"PRIx8", icmp_code=0x%"PRIx8"\n", + icmp_spec->hdr.icmp_type, + icmp_spec->hdr.icmp_code); + } else { + VLOG_DBG(" Mask = null\n"); + } + } + + if (item->type == RTE_FLOW_ITEM_TYPE_TCP) { + const struct rte_flow_item_tcp *tcp_spec = item->spec; + const struct rte_flow_item_tcp *tcp_mask = item->mask; + + VLOG_DBG("rte flow tcp pattern:\n"); + if (tcp_spec) { + VLOG_DBG(" Spec: src_port=%"PRIu16", dst_port=%"PRIu16 + ", data_off=0x%"PRIx8", tcp_flags=0x%"PRIx8"\n", + ntohs(tcp_spec->hdr.src_port), + ntohs(tcp_spec->hdr.dst_port), + tcp_spec->hdr.data_off, + tcp_spec->hdr.tcp_flags); + } else { + VLOG_DBG(" Spec = null\n"); + } + if (tcp_mask) { + VLOG_DBG(" Mask: src_port=%"PRIx16", dst_port=%"PRIx16 + ", data_off=0x%"PRIx8", tcp_flags=0x%"PRIx8"\n", + tcp_mask->hdr.src_port, + tcp_mask->hdr.dst_port, + tcp_mask->hdr.data_off, + tcp_mask->hdr.tcp_flags); + } else { + VLOG_DBG(" Mask = null\n"); + } + } +} + +static void +add_flow_pattern(struct flow_patterns *patterns, enum rte_flow_item_type type, + const void *spec, const void *mask) { + int cnt = patterns->cnt; + + if (cnt == 0) { + patterns->current_max = 8; + patterns->items = xcalloc(patterns->current_max, + sizeof(struct rte_flow_item)); + } else if (cnt == patterns->current_max) { + patterns->current_max *= 2; + patterns->items = xrealloc(patterns->items, patterns->current_max * + sizeof(struct rte_flow_item)); + } + + patterns->items[cnt].type = type; + patterns->items[cnt].spec = spec; + patterns->items[cnt].mask = mask; + patterns->items[cnt].last = NULL; + dump_flow_pattern(&patterns->items[cnt]); + patterns->cnt++; +} + +static void 
+
+static void
+add_flow_action(struct flow_actions *actions, enum rte_flow_action_type type,
+                const void *conf)
+{
+    int cnt = actions->cnt;
+
+    if (cnt == 0) {
+        actions->current_max = 8;
+        actions->actions = xcalloc(actions->current_max,
+                                   sizeof(struct rte_flow_action));
+    } else if (cnt == actions->current_max) {
+        actions->current_max *= 2;
+        actions->actions = xrealloc(actions->actions, actions->current_max *
+                                    sizeof(struct rte_flow_action));
+    }
+
+    actions->actions[cnt].type = type;
+    actions->actions[cnt].conf = conf;
+    actions->cnt++;
+}
+
+static struct rte_flow_action_rss *
+add_flow_rss_action(struct flow_actions *actions,
+                    struct netdev *netdev)
+{
+    int i;
+    struct rte_flow_action_rss *rss;
+
+    rss = xmalloc(sizeof(*rss) + sizeof(uint16_t) * netdev->n_rxq);
+    /*
+     * Setting it to NULL will let the driver use the default RSS
+     * configuration we have set: &port_conf.rx_adv_conf.rss_conf.
+     */
+    rss->rss_conf = NULL;
+    rss->num = netdev->n_rxq;
+
+    for (i = 0; i < rss->num; i++) {
+        rss->queue[i] = i;
+    }
+
+    add_flow_action(actions, RTE_FLOW_ACTION_TYPE_RSS, rss);
+
+    return rss;
+}
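
A NULL ``rss_conf`` defers to the RSS configuration already programmed on
the port. If a caller instead needed explicit hash settings, it could point
``rss_conf`` at a ``struct rte_eth_rss_conf`` (DPDK 17.11 API, from
``rte_ethdev.h``). A sketch only; the function name and hash-function
choices below are illustrative, not part of the patch::

    /* Hypothetical alternative: pin the RSS hash functions instead of
     * relying on the device default (rss_conf == NULL). */
    static void
    example_pin_rss_conf(struct rte_flow_action_rss *rss)
    {
        static const struct rte_eth_rss_conf conf = {
            .rss_key = NULL,   /* Keep the device's default hash key. */
            .rss_hf = ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP,
        };

        rss->rss_conf = &conf;
    }
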
+
+static int
+netdev_dpdk_add_rte_flow_offload(struct netdev *netdev,
+                                 const struct match *match,
+                                 struct nlattr *nl_actions OVS_UNUSED,
+                                 size_t actions_len OVS_UNUSED,
+                                 const ovs_u128 *ufid,
+                                 struct offload_info *info)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    const struct rte_flow_attr flow_attr = {
+        .group = 0,
+        .priority = 0,
+        .ingress = 1,
+        .egress = 0
+    };
+    struct flow_patterns patterns = { .items = NULL, .cnt = 0 };
+    struct flow_actions actions = { .actions = NULL, .cnt = 0 };
+    struct rte_flow *flow;
+    struct rte_flow_error error;
+    uint8_t *ipv4_next_proto_mask = NULL;
+    int ret = 0;
+
+    /* Eth */
+    struct rte_flow_item_eth eth_spec;
+    struct rte_flow_item_eth eth_mask;
+    memset(&eth_spec, 0, sizeof(eth_spec));
+    memset(&eth_mask, 0, sizeof(eth_mask));
+    if (!eth_addr_is_zero(match->wc.masks.dl_src) ||
+        !eth_addr_is_zero(match->wc.masks.dl_dst)) {
+        rte_memcpy(&eth_spec.dst, &match->flow.dl_dst, sizeof(eth_spec.dst));
+        rte_memcpy(&eth_spec.src, &match->flow.dl_src, sizeof(eth_spec.src));
+        eth_spec.type = match->flow.dl_type;
+
+        rte_memcpy(&eth_mask.dst, &match->wc.masks.dl_dst,
+                   sizeof(eth_mask.dst));
+        rte_memcpy(&eth_mask.src, &match->wc.masks.dl_src,
+                   sizeof(eth_mask.src));
+        eth_mask.type = match->wc.masks.dl_type;
+
+        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_ETH,
+                         &eth_spec, &eth_mask);
+    } else {
+        /*
+         * If the user specifies a flow (like a UDP flow) without L2
+         * patterns, OVS will at least set the dl_type. Normally, it's
+         * enough to create an eth pattern with just that. Unfortunately,
+         * some Intel NICs (such as the XL710) don't support that. Below
+         * is a workaround, which simply matches any L2 packet.
+         */
+        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_ETH, NULL, NULL);
+    }
+
+    /* VLAN */
+    struct rte_flow_item_vlan vlan_spec;
+    struct rte_flow_item_vlan vlan_mask;
+    memset(&vlan_spec, 0, sizeof(vlan_spec));
+    memset(&vlan_mask, 0, sizeof(vlan_mask));
+    if (match->wc.masks.vlans[0].tci && match->flow.vlans[0].tci) {
+        vlan_spec.tci = match->flow.vlans[0].tci & ~htons(VLAN_CFI);
+        vlan_mask.tci = match->wc.masks.vlans[0].tci & ~htons(VLAN_CFI);
+
+        /* Match any protocol. */
+        vlan_mask.tpid = 0;
+
+        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_VLAN,
+                         &vlan_spec, &vlan_mask);
+    }
+
+    /* IP v4 */
+    uint8_t proto = 0;
+    struct rte_flow_item_ipv4 ipv4_spec;
+    struct rte_flow_item_ipv4 ipv4_mask;
+    memset(&ipv4_spec, 0, sizeof(ipv4_spec));
+    memset(&ipv4_mask, 0, sizeof(ipv4_mask));
+    if (match->flow.dl_type == ntohs(ETH_TYPE_IP)) {
+        ipv4_spec.hdr.type_of_service = match->flow.nw_tos;
+        ipv4_spec.hdr.time_to_live = match->flow.nw_ttl;
+        ipv4_spec.hdr.next_proto_id = match->flow.nw_proto;
+        ipv4_spec.hdr.src_addr = match->flow.nw_src;
+        ipv4_spec.hdr.dst_addr = match->flow.nw_dst;
+
+        ipv4_mask.hdr.type_of_service = match->wc.masks.nw_tos;
+        ipv4_mask.hdr.time_to_live = match->wc.masks.nw_ttl;
+        ipv4_mask.hdr.next_proto_id = match->wc.masks.nw_proto;
+        ipv4_mask.hdr.src_addr = match->wc.masks.nw_src;
+        ipv4_mask.hdr.dst_addr = match->wc.masks.nw_dst;
+
+        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_IPV4,
+                         &ipv4_spec, &ipv4_mask);
+
+        /* Save proto for L4 protocol setup. */
+        proto = ipv4_spec.hdr.next_proto_id &
+                ipv4_mask.hdr.next_proto_id;
+
+        /* Remember the proto mask address for later modification. */
+        ipv4_next_proto_mask = &ipv4_mask.hdr.next_proto_id;
+    }
+
+    if (proto != IPPROTO_ICMP && proto != IPPROTO_UDP &&
+        proto != IPPROTO_SCTP && proto != IPPROTO_TCP &&
+        (match->wc.masks.tp_src ||
+         match->wc.masks.tp_dst ||
+         match->wc.masks.tcp_flags)) {
+        VLOG_DBG("L4 Protocol (%u) not supported", proto);
+        ret = -1;
+        goto out;
+    }
+
+    if ((match->wc.masks.tp_src && match->wc.masks.tp_src != 0xffff) ||
+        (match->wc.masks.tp_dst && match->wc.masks.tp_dst != 0xffff)) {
+        ret = -1;
+        goto out;
+    }
+
+    struct rte_flow_item_tcp tcp_spec;
+    struct rte_flow_item_tcp tcp_mask;
+    memset(&tcp_spec, 0, sizeof(tcp_spec));
+    memset(&tcp_mask, 0, sizeof(tcp_mask));
+    if (proto == IPPROTO_TCP) {
+        tcp_spec.hdr.src_port = match->flow.tp_src;
+        tcp_spec.hdr.dst_port = match->flow.tp_dst;
+        tcp_spec.hdr.data_off = ntohs(match->flow.tcp_flags) >> 8;
+        tcp_spec.hdr.tcp_flags = ntohs(match->flow.tcp_flags) & 0xff;
+
+        tcp_mask.hdr.src_port = match->wc.masks.tp_src;
+        tcp_mask.hdr.dst_port = match->wc.masks.tp_dst;
+        tcp_mask.hdr.data_off = ntohs(match->wc.masks.tcp_flags) >> 8;
+        tcp_mask.hdr.tcp_flags = ntohs(match->wc.masks.tcp_flags) & 0xff;
+
+        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_TCP,
+                         &tcp_spec, &tcp_mask);
+
+        /* proto == TCP and ITEM_TYPE_TCP, thus no need for proto match. */
+        if (ipv4_next_proto_mask) {
+            *ipv4_next_proto_mask = 0;
+        }
+        goto end_proto_check;
+    }
+
+    struct rte_flow_item_udp udp_spec;
+    struct rte_flow_item_udp udp_mask;
+    memset(&udp_spec, 0, sizeof(udp_spec));
+    memset(&udp_mask, 0, sizeof(udp_mask));
+    if (proto == IPPROTO_UDP) {
+        udp_spec.hdr.src_port = match->flow.tp_src;
+        udp_spec.hdr.dst_port = match->flow.tp_dst;
+
+        udp_mask.hdr.src_port = match->wc.masks.tp_src;
+        udp_mask.hdr.dst_port = match->wc.masks.tp_dst;
+
+        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_UDP,
+                         &udp_spec, &udp_mask);
+
+        /* proto == UDP and ITEM_TYPE_UDP, thus no need for proto match. */
+        if (ipv4_next_proto_mask) {
+            *ipv4_next_proto_mask = 0;
+        }
+        goto end_proto_check;
+    }
+
+    struct rte_flow_item_sctp sctp_spec;
+    struct rte_flow_item_sctp sctp_mask;
+    memset(&sctp_spec, 0, sizeof(sctp_spec));
+    memset(&sctp_mask, 0, sizeof(sctp_mask));
+    if (proto == IPPROTO_SCTP) {
+        sctp_spec.hdr.src_port = match->flow.tp_src;
+        sctp_spec.hdr.dst_port = match->flow.tp_dst;
+
+        sctp_mask.hdr.src_port = match->wc.masks.tp_src;
+        sctp_mask.hdr.dst_port = match->wc.masks.tp_dst;
+
+        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_SCTP,
+                         &sctp_spec, &sctp_mask);
+
+        /* proto == SCTP and ITEM_TYPE_SCTP, thus no need for proto match. */
+        if (ipv4_next_proto_mask) {
+            *ipv4_next_proto_mask = 0;
+        }
+        goto end_proto_check;
+    }
+
+    struct rte_flow_item_icmp icmp_spec;
+    struct rte_flow_item_icmp icmp_mask;
+    memset(&icmp_spec, 0, sizeof(icmp_spec));
+    memset(&icmp_mask, 0, sizeof(icmp_mask));
+    if (proto == IPPROTO_ICMP) {
+        icmp_spec.hdr.icmp_type = (uint8_t)ntohs(match->flow.tp_src);
+        icmp_spec.hdr.icmp_code = (uint8_t)ntohs(match->flow.tp_dst);
+
+        icmp_mask.hdr.icmp_type = (uint8_t)ntohs(match->wc.masks.tp_src);
+        icmp_mask.hdr.icmp_code = (uint8_t)ntohs(match->wc.masks.tp_dst);
+
+        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_ICMP,
+                         &icmp_spec, &icmp_mask);
+
+        /* proto == ICMP and ITEM_TYPE_ICMP, thus no need for proto match. */
+        if (ipv4_next_proto_mask) {
+            *ipv4_next_proto_mask = 0;
+        }
+        goto end_proto_check;
+    }
+
+end_proto_check:
+
+    add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_END, NULL, NULL);
+
+    struct rte_flow_action_mark mark;
+    mark.id = info->flow_mark;
+    add_flow_action(&actions, RTE_FLOW_ACTION_TYPE_MARK, &mark);
+
+    struct rte_flow_action_rss *rss;
+    rss = add_flow_rss_action(&actions, netdev);
+    add_flow_action(&actions, RTE_FLOW_ACTION_TYPE_END, NULL);
+
+    flow = rte_flow_create(dev->port_id, &flow_attr, patterns.items,
+                           actions.actions, &error);
+    free(rss);
+    if (!flow) {
+        VLOG_ERR("rte flow create error: %u : message : %s\n",
+                 error.type, error.message);
+        ret = -1;
+        goto out;
+    }
+    ufid_to_rte_flow_associate(ufid, flow);
+    VLOG_DBG("installed flow %p by ufid "UUID_FMT"\n",
+             flow, UUID_ARGS((struct uuid *)ufid));
+
+out:
+    free(patterns.items);
+    free(actions.actions);
+    return ret;
+}
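
On the receive side, the MARK action programmed above comes back attached to
the packet's mbuf. A sketch of how a datapath can recover it, per DPDK
17.11's mbuf layout (``PKT_RX_FDIR_ID`` flag, ``hash.fdir.hi`` field); the
function name is hypothetical::

    #include <stdbool.h>
    #include <stdint.h>
    #include <rte_mbuf.h>

    /* Hypothetical illustration: returns true and stores the flow mark
     * if the NIC tagged this packet via RTE_FLOW_ACTION_TYPE_MARK. */
    static bool
    example_read_flow_mark(const struct rte_mbuf *mbuf, uint32_t *mark)
    {
        if (mbuf->ol_flags & PKT_RX_FDIR_ID) {
            *mark = mbuf->hash.fdir.hi;
            return true;
        }
        return false;
    }
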
+
+static bool
+is_all_zero(const void *addr, size_t n)
+{
+    size_t i;
+    const uint8_t *p = (const uint8_t *) addr;
+
+    for (i = 0; i < n; i++) {
+        if (p[i] != 0) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
+/*
+ * Check if any unsupported flow patterns are specified.
+ */
+static int
+netdev_dpdk_validate_flow(const struct match *match)
+{
+    struct match match_zero_wc;
+
+    /* Create a wc-zeroed version of the flow. */
+    match_init(&match_zero_wc, &match->flow, &match->wc);
+
+    if (!is_all_zero(&match_zero_wc.flow.tunnel,
+                     sizeof(match_zero_wc.flow.tunnel))) {
+        goto err;
+    }
+
+    if (match->wc.masks.metadata ||
+        match->wc.masks.skb_priority ||
+        match->wc.masks.pkt_mark ||
+        match->wc.masks.dp_hash) {
+        goto err;
+    }
+
+    /* recirc id must be zero. */
+    if (match_zero_wc.flow.recirc_id) {
+        goto err;
+    }
+
+    if (match->wc.masks.ct_state ||
+        match->wc.masks.ct_nw_proto ||
+        match->wc.masks.ct_zone ||
+        match->wc.masks.ct_mark ||
+        match->wc.masks.ct_label.u64.hi ||
+        match->wc.masks.ct_label.u64.lo) {
+        goto err;
+    }
+
+    if (match->wc.masks.conj_id ||
+        match->wc.masks.actset_output) {
+        goto err;
+    }
+
+    /* Unsupported L2. */
+    if (!is_all_zero(&match->wc.masks.mpls_lse,
+                     sizeof(match_zero_wc.flow.mpls_lse))) {
+        goto err;
+    }
+
+    /* Unsupported L3. */
+    if (match->wc.masks.ipv6_label ||
+        match->wc.masks.ct_nw_src ||
+        match->wc.masks.ct_nw_dst ||
+        !is_all_zero(&match->wc.masks.ipv6_src, sizeof(struct in6_addr)) ||
+        !is_all_zero(&match->wc.masks.ipv6_dst, sizeof(struct in6_addr)) ||
+        !is_all_zero(&match->wc.masks.ct_ipv6_src, sizeof(struct in6_addr)) ||
+        !is_all_zero(&match->wc.masks.ct_ipv6_dst, sizeof(struct in6_addr)) ||
+        !is_all_zero(&match->wc.masks.nd_target, sizeof(struct in6_addr)) ||
+        !is_all_zero(&match->wc.masks.nsh, sizeof(struct ovs_key_nsh)) ||
+        !is_all_zero(&match->wc.masks.arp_sha, sizeof(struct eth_addr)) ||
+        !is_all_zero(&match->wc.masks.arp_tha, sizeof(struct eth_addr))) {
+        goto err;
+    }
+
+    /* If fragmented, then don't HW accelerate - for now. */
+    if (match_zero_wc.flow.nw_frag) {
+        goto err;
+    }
+
+    /* Unsupported L4. */
+    if (match->wc.masks.igmp_group_ip4 ||
+        match->wc.masks.ct_tp_src ||
+        match->wc.masks.ct_tp_dst) {
+        goto err;
+    }
+
+    return 0;
+
+err:
+    VLOG_ERR("cannot HW accelerate this flow due to unsupported protocols");
+    return -1;
+}
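
The wc-zeroed copy is what makes the ``recirc_id`` and ``nw_frag`` checks
above work: ``match_init()`` zeroes every wildcarded field, so a field that
is still non-zero afterwards is both matched on and non-zero, which is
exactly what this offload path cannot express. A hypothetical illustration
(the function name is not part of the patch)::

    /* Hypothetical illustration of the wc-zeroed check above. */
    static bool
    example_requires_nonzero_recirc(const struct match *match)
    {
        struct match match_zero_wc;

        match_init(&match_zero_wc, &match->flow, &match->wc);
        return match_zero_wc.flow.recirc_id != 0;
    }
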
+
+static int
+netdev_dpdk_destroy_rte_flow(struct netdev_dpdk *dev,
+                             const ovs_u128 *ufid,
+                             struct rte_flow *rte_flow)
+{
+    struct rte_flow_error error;
+    int ret;
+
+    ret = rte_flow_destroy(dev->port_id, rte_flow, &error);
+    if (ret == 0) {
+        ufid_to_rte_flow_disassociate(ufid);
+        VLOG_DBG("removed rte flow %p associated with ufid " UUID_FMT "\n",
+                 rte_flow, UUID_ARGS((struct uuid *)ufid));
+    } else {
+        VLOG_ERR("rte flow destroy error: %u : message : %s\n",
+                 error.type, error.message);
+    }
+
+    return ret;
+}
+
+static int
+netdev_dpdk_flow_put(struct netdev *netdev, struct match *match,
+                     struct nlattr *actions, size_t actions_len,
+                     const ovs_u128 *ufid, struct offload_info *info,
+                     struct dpif_flow_stats *stats OVS_UNUSED)
+{
+    struct rte_flow *rte_flow;
+    int ret;
+
+    /*
+     * If an old rte_flow exists, it means it's a flow modification.
+     * Here destroy the old rte_flow first before adding a new one.
+     */
+    rte_flow = ufid_to_rte_flow_find(ufid);
+    if (rte_flow) {
+        ret = netdev_dpdk_destroy_rte_flow(netdev_dpdk_cast(netdev),
+                                           ufid, rte_flow);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    ret = netdev_dpdk_validate_flow(match);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return netdev_dpdk_add_rte_flow_offload(netdev, match, actions,
+                                            actions_len, ufid, info);
+}
+
+static int
+netdev_dpdk_flow_del(struct netdev *netdev, const ovs_u128 *ufid,
+                     struct dpif_flow_stats *stats OVS_UNUSED)
+{
+    struct rte_flow *rte_flow = ufid_to_rte_flow_find(ufid);
+
+    if (!rte_flow) {
+        return -1;
+    }
+
+    return netdev_dpdk_destroy_rte_flow(netdev_dpdk_cast(netdev),
+                                        ufid, rte_flow);
+}
+
+#define DPDK_FLOW_OFFLOAD_API                       \
+    NULL,                   /* flow_flush */        \
+    NULL,                   /* flow_dump_create */  \
+    NULL,                   /* flow_dump_destroy */ \
+    NULL,                   /* flow_dump_next */    \
+    netdev_dpdk_flow_put,                           \
+    NULL,                   /* flow_get */          \
+    netdev_dpdk_flow_del,                           \
+    NULL                    /* init_flow_api */
+
 #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,  \
                           SET_CONFIG, SET_TX_MULTIQ, SEND,  \
                           GET_CARRIER, GET_STATS,           \
@@ -3957,7 +4788,7 @@ unlock:
     RXQ_RECV,                                 \
     NULL,                     /* rx_wait */   \
     NULL,                     /* rxq_drain */ \
-    NO_OFFLOAD_API,                           \
+    DPDK_FLOW_OFFLOAD_API,                    \
     NULL                      /* get_block_id */ \
 }
diff --git a/lib/netdev.h b/lib/netdev.h
index c941f1e1e..556676046 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -201,6 +201,12 @@ void netdev_send_wait(struct netdev *, int qid);
 struct offload_info {
     const struct dpif_class *dpif_class;
     ovs_be16 tp_dst_port; /* Destination port for tunnel in SET action */
+
+    /*
+     * The flow mark id assigned to the flow. If any packets hit the flow,
+     * the mark will be set in the packet metadata.
+     */
+    uint32_t flow_mark;
 };
 struct dpif_class;
 struct netdev_flow_dump;
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 8f4263d16..63a3a2ed1 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -360,6 +360,23 @@
       </p>
     </column>
 
+    <column name="other_config" key="per-port-memory"
+            type='{"type": "boolean"}'>
+      <p>
+        By default OVS DPDK uses a shared memory model, wherein devices
+        that have the same MTU and socket values can share the same
+        mempool. Setting this value to <code>true</code> changes this
+        behaviour: per-port memory allows each DPDK device to use a
+        private mempool. This can provide greater transparency with
+        regard to memory usage, but potentially at the cost of greater
+        memory requirements.
+      </p>
+      <p>
+        Changing this value requires restarting the daemon if dpdk-init has
+        already been set to true.
+      </p>
+    </column>
+
     <column name="other_config" key="tx-flush-interval"
             type='{"type": "integer", "minInteger": 0,
                    "maxInteger": 1000000}'>