summaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* datapath-windows, conntrack: Fix conntrack new stateRui Cao2020-06-274-5/+17
| | | | | | | | | | | | | | | | On windows, if we send a connection setup packet in one direction twice, it will make the connection to be in established state. The same issue happened in Linux userspace conntrack module and has been fixed. This patch port the following previous fixes to windows datapath to fix the issue: - a867c010ee9183885ee9d3eb76a0005c075c4d2e - ac23d20fc90da3b1c9b2117d1e22102e99fba006 Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: Rui Cao <rcao@vmware.com> Signed-off-by: William Tu <u9012063@gmail.com>
* bridge: Fix null dereference on ct_timeout_policy recordYi-Hung Wei2020-06-273-5/+19
| | | | | | | | | | | | | | Accoridng to vswitch.ovsschema, each CT_Zone record may have zero or one associcated CT_Timeout_policy. Thus, this patch checks if ovsrec_ct_timeout_policy exist before accesses the record. VMWare-BZ: 2585825 Fixes: 45339539f69d ("ovs-vsctl: Add conntrack zone commands.") Fixes: 993cae678bca ("ofproto-dpif: Consume CT_Zone, and CT_Timeout_Policy tables") Reported-by: Yang Song <yangsong@vmware.com> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* rhel: Fix syntax error when matching version.William Tu2020-06-221-1/+0
| | | | | | | | | | Remove the extra 'fi' in the script. VMware-BZ: #2582834 Fixed: fecb28051b35 ("rhel: Support RHEL 7.8 kernel module rpm build.") Reported-by: Abhijeet Malawade <amalawade@vmware.com> Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* AUTHORS: Add Sriharsha Basavapatna.Ilya Maximets2020-06-221-0/+1
| | | | Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-offload-dpdk: Support offload of VLAN PUSH/POP actions.Sriharsha Basavapatna2020-06-223-0/+69
| | | | | | | | Parse VLAN PUSH/POP OVS datapath actions and add respective RTE actions. Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Acked-by: Eli Britstein <elibr@mellanox.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ofproto-dpif.at: Add unit test for lb_output action.Matteo Croce2020-06-221-3/+28
| | | | | | | | | Extend the balance-tcp one so it tests lb-output action too. The test checks that that the option is shown in bond/show, and that the lb_output action is programmed in the datapath. Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* userspace: Avoid dp_hash recirculation for balance-tcp bond mode.Vishal Deep Ajmera2020-06-2220-66/+657
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: In OVS, flows with output over a bond interface of type “balance-tcp” gets translated by the ofproto layer into "HASH" and "RECIRC" datapath actions. After recirculation, the packet is forwarded to the bond member port based on 8-bits of the datapath hash value computed through dp_hash. This causes performance degradation in the following ways: 1. The recirculation of the packet implies another lookup of the packet’s flow key in the exact match cache (EMC) and potentially Megaflow classifier (DPCLS). This is the biggest cost factor. 2. The recirculated packets have a new “RSS” hash and compete with the original packets for the scarce number of EMC slots. This implies more EMC misses and potentially EMC thrashing causing costly DPCLS lookups. 3. The 256 extra megaflow entries per bond for dp_hash bond selection put additional load on the revalidation threads. Owing to this performance degradation, deployments stick to “balance-slb” bond mode even though it does not do active-active load balancing for VXLAN- and GRE-tunnelled traffic because all tunnel packet have the same source MAC address. Proposed optimization: This proposal introduces a new load-balancing output action instead of recirculation. Maintain one table per-bond (could just be an array of uint16's) and program it the same way internal flows are created today for each possible hash value (256 entries) from ofproto layer. Use this table to load-balance flows as part of output action processing. Currently xlate_normal() -> output_normal() -> bond_update_post_recirc_rules() -> bond_may_recirc() and compose_output_action__() generate 'dp_hash(hash_l4(0))' and 'recirc(<RecircID>)' actions. In this case the RecircID identifies the bond. For the recirculated packets the ofproto layer installs megaflow entries that match on RecircID and masked dp_hash and send them to the corresponding output port. Instead, we will now generate action as 'lb_output(<bond id>)' This combines hash computation (only if needed, else re-use RSS hash) and inline load-balancing over the bond. This action is used *only* for balance-tcp bonds in userspace datapath (the OVS kernel datapath remains unchanged). Example: Current scheme: With 8 UDP flows (with random UDP src port): flow-dump from pmd on cpu core: 2 recirc_id(0),in_port(7),<...> actions:hash(hash_l4(0)),recirc(0x1) recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),<...> actions:2 recirc_id(0x1),dp_hash(0xb236c260/0xff),<...> actions:1 recirc_id(0x1),dp_hash(0x7d89eb18/0xff),<...> actions:1 recirc_id(0x1),dp_hash(0xa78d75df/0xff),<...> actions:2 recirc_id(0x1),dp_hash(0xb58d846f/0xff),<...> actions:2 recirc_id(0x1),dp_hash(0x24534406/0xff),<...> actions:1 recirc_id(0x1),dp_hash(0x3cf32550/0xff),<...> actions:1 New scheme: We can do with a single flow entry (for any number of new flows): in_port(7),<...> actions:lb_output(1) A new CLI has been added to dump datapath bond cache as given below. # ovs-appctl dpif-netdev/bond-show [dp] Bond cache: bond-id 1 : bucket 0 - slave 2 bucket 1 - slave 1 bucket 2 - slave 2 bucket 3 - slave 1 Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com> Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com> Tested-by: Matteo Croce <mcroce@redhat.com> Tested-by: Adrian Moreno <amorenoz@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-offload-tc: Revert tunnel src/dst port masks handlingRoi Dayan2020-06-196-57/+6
| | | | | | | | | | | | | | | | The cited commit intended to add tc support for masking tunnel src/dst ips and ports. It's not possible to do tunnel ports masking with openflow rules and the default mask for tunnel ports set to 0 in tnl_wc_init(), unlike tunnel ports default mask which is full mask. So instead of never passing tunnel ports to tc, revert the changes to tunnel ports to always pass the tunnel port. In sw classification is done by the kernel, but for hw we must match the tunnel dst port. Fixes: 5f568d049130 ("netdev-offload-tc: Allow to match the IP and port mask of tunnel") Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Eli Britstein <elibr@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
* rhel: Support RHEL 7.8 kernel module rpm build.William Tu2020-06-182-3/+11
| | | | | | | | Add support for RHEL7.8 GA release with kernel 3.10.0-1127. VMware-BZ: #2582834 Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* docs: Add note for AF_XDP installationYi-Hung Wei2020-06-171-0/+9
| | | | | | | | Add notes about some configuration issues when enabling AF_XDP support. Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* ofproto-dpif-trace: Improve NAT tracing.Dumitru Ceara2020-06-164-27/+131
| | | | | | | | | | | | | | | | | | | | | | | | | When ofproto/trace detects a recirc action it resumes execution at the specified next table. However, if the ct action performs SNAT/DNAT, e.g., ct(commit,nat(src=1.1.1.1:4000),table=42), the src/dst IPs and ports in the oftrace_recirc_node->flow field are not updated. This leads to misleading outputs from ofproto/trace as real packets would actually first get NATed and might match different flows when recirculated. Assume the first IP/port from the NAT src/dst action will be used by conntrack for the translation and update the oftrace_recirc_node->flow accordingly. This is not entirely correct as conntrack might choose a different IP/port but the result is more realistic than before. This fix covers new connections. However, for reply traffic that executes actions of the form ct(nat, table=42) we still don't update the flow as we don't have any information about conntrack state when tracing. Also move the oftrace_recirc_node processing out of ofproto_trace() and to its own function, ofproto_trace_recirc_node() for better readability/ Signed-off-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
* odp-util.c: Fix dp_hash execution with slowpath actions.Han Zhou2020-06-161-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When dp_hash is executed with slowpath actions, it results in endless recirc loop in kernel datapath, and finally drops the packet, with kernel logs: openvswitch: ovs-system: deferred action limit reached, drop recirc action The root cause is that the dp_hash value calculated by slowpath is not passed to datapath when executing the recirc action, thus when the recirced packet miss upcall comes to userspace again, it generates the dp_hash and recirc action again, with same recirc_id, which in turn generates a megaflow with recirc action with the recird_id same as the recirc_id in its match condition, which causes a loop in datapath. For example, this can be reproduced with below setup of OVN environment: LS1 LS2 | | |------R1------| VIF--LS0---R0-----| |------R3 |------R2------| Assume there is a route from the VIF to R3: R0 -> R1 -> R3, and there are two routes (ECMP) from R3 to the VIF: R3 -> R1 -> R0 R3 -> R2 -> R0 Now if we ping from the VIF to R3, the OVS flow execution on the HV of the VIF will hit the R3's datapath which has flows that responds to the ICMP packet by setting ICMP fields, which requires slowpath actions, and in later flow tables it will hit the "group" action that selects between the ECMP routes. By default OVN uses "dp_hash" method for the "group" action. For the first miss upcall packet, dp_hash value is empty, so the group action will be translated to "dp_hash" and "recirc". During action execution, because of the previous actions that sets ICMP fields, the whole execution requires slowpath, so it tries to execute all actions in userspace in odp_execute_actions(), including dp_hash action, except the recirc action, which can only be executed in datapath. So the dp_hash value is calculated in userspace, and then the packet is injected to datapath for recirc action execution. However, the dp_hash calculated by the userspace is not passed to datapath. Because of this, the packet recirc in datapath doesn't have dp_hash value, and the miss upcall for the recirced packet hits the same flow tables and triggers same "dp_hash" and "recirc" action again, with exactly same recirc_id! This time, the new upcall doesn't require any slowpath execution, so both the dp_hash and recirc actions are executed in datapath, after creating a datapath megaflow like: recirc_id(XYZ),..., actions:hash(l4(0)),recirc(XYZ) with match recirc_id equals the recirc id in the action, thus creating a loop. This patch fixes the problem by passing the calculated dp_hash value to datapath in odp_key_from_dp_packet(). Fixes: 572f732ab078 ("dpif-netdev: user space datapath recirculation") Signed-off-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb idl: Try committing the pending txn in ovsdb_idl_loop_run.Numan Siddique2020-06-161-50/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function ovsdb_idl_loop_run(), after calling ovsdb_idl_run(), returns a transaction object (of type 'struct ovsdb_idl_txn'). The returned transaction object can be NULL if there is a pending transaction (loop->committing_txn) in the idl loop object. Normally the clients of idl library, first call ovsdb_idl_loop_run(), then do their own processing and create any idl transactions during this processing and then finally call ovsdb_idl_loop_commit_and_wait(). If ovsdb_idl_loop_run() returns NULL transaction object, then much of the processing done by the client gets wasted as in the case of ovn-controller. The client (in this case ovn-controller), can skip the processing and instead call ovsdb_idl_loop_commit_and_wait() if the transaction oject is NULL. But ovn-controller uses IDL tracking and it may loose the tracked changes in that run. This patch tries to improve this scenario, by checking if the pending transaction can be committed in the ovsdb_idl_loop_run() itself and if the pending transaction is cleared (because of the response messages from ovsdb-server due to a transaction message in the previous run), ovsdb_idl_loop_run() can return a valid transaction object. CC: Han Zhou <hzhou@ovn.org> Signed-off-by: Numan Siddique <numans@ovn.org> Signed-off-by: Ben Pfaff <blp@ovn.org>
* dpif-netlink: Fix Windows incompatibility when setting new featureRui Cao2020-06-161-0/+1
| | | | | | | | | | | | | | OVS_DP_ATTR_NAME field is required when sending OVS_DP_CMD_SET to windows kernel driver. The function "dpif_netlink_set_features" dose not set the OVS_DP_ATTR_NAME field which will cause set feature failure and ovs-vswitchd will exist. This patch fixes the issue by setting "request.name" in request. Reported-at: https://github.com/openvswitch/ovs-issues/issues/187 Submitted-at: https://github.com/openvswitch/ovs/pull/319 Signed-off-by: Rui Cao <rcao@vmware.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-idl: Avoid inconsistent IDL state with OVSDB_MONITOR_V3.Dumitru Ceara2020-06-153-25/+206
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Assuming an ovsdb client connected to a database using OVSDB_MONITOR_V3 (i.e., "monitor_cond_since" method) with the initial monitor condition MC1. Assuming the following two transactions are executed on the ovsdb-server: TXN1: "insert record R1 in table T1" TXN2: "insert record R2 in table T2" If the client's monitor condition MC1 for table T2 matches R2 then the client will receive the following update3 message: method="update3", "insert record R2 in table T2", last-txn-id=TXN2 At this point, if the presence of the new record R2 in the IDL triggers the client to update its monitor condition to MC2 and add a clause for table T1 which matches R1, a monitor_cond_change message is sent to the server: method="monitor_cond_change", "clauses from MC2" In normal operation the ovsdb-server will reply with a new update3 message of the form: method="update3", "insert record R1 in table T1", last-txn-id=TXN2 However, if the connection drops in the meantime, this last update might get lost. It might happen that during the reconnect a new transaction happens that modifies the original record R1: TXN3: "modify record R1 in table T1" When the client reconnects, it will try to perform a fast resync by sending: method="monitor_cond_since", "clauses from MC2", last-txn-id=TXN2 Because TXN2 is still in the ovsdb-server transaction history, the server replies with the changes from the most recent transactions only, i.e., TXN3: result="true", last-txbb-id=TXN3, "modify record R1 in table T1" This causes the IDL on the client in to end up in an inconsistent state because it has never seen the update that created R1. Such a scenario is described in: https://bugzilla.redhat.com/show_bug.cgi?id=1808580#c22 To avoid this issue, the IDL will now maintain (up to) 3 different types of conditions for each DB table: - new_cond: condition that has been set by the IDL client but has not yet been sent to the server through monitor_cond_change. - req_cond: condition that has been sent to the server but the reply acknowledging the change hasn't been received yet. - ack_cond: condition that has been acknowledged by the server. Whenever the IDL FSM is restarted (e.g., voluntary or involuntary disconnect): - if there is a known last_id txn-id the code ensures that new_cond will contain the most recent condition set by the IDL client (either req_cond if there was a request in flight, or new_cond if the IDL client set a condition while the IDL was disconnected) - if there is no known last_id txn-id the code ensures that ack_cond will contain the most recent conditions set by the IDL client regardless whether they were acked by the server or not. When monitor_cond_since/monitor_cond requests are sent they will always include ack_cond and if new_cond is not NULL a follow up monitor_cond_change will be generated afterwards. On the other hand ovsdb_idl_db_set_condition() will always modify new_cond. This ensures that updates of type "insert" that happened before the last transaction known by the IDL but didn't match old monitor conditions are sent upon reconnect if the monitor condition has changed to include them in the meantime. Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when connection reset.") Signed-off-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-server.7: Mention update3 as replies to monitor_cond_change.Dumitru Ceara2020-06-151-3/+3
| | | | | | | | | | | Monitor_cond_change might trigger updates to be sent to clients as results to condition changes. These updates can be either update2 (for monitor_cond monitors) or update3 (for monitor_cond_since monitors). The documentation used to mention only update2. Signed-off-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovs-rcu: Avoid flushing callbacks during postponing.Ilya Maximets2020-06-102-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ovsrcu_flush_cbset() call during ovsrcu_postpone() could cause use after free in case the caller sets new pointer only after postponing free for the old one: ------------------ ------------------ ------------------- Thread 1 Thread 2 RCU Thread ------------------ ------------------ ------------------- pointer = A ovsrcu_quiesce(): thread->seqno = 30 global_seqno = 31 quiesced read pointer A postpone(free(A)): flush cbset pop flushed_cbsets ovsrcu_synchronize: target_seqno = 31 ovsrcu_quiesce(): thread->seqno = 31 global_seqno = 32 quiesced read pointer A use pointer A ovsrcu_quiesce(): thread->seqno = 32 global_seqno = 33 quiesced read pointer A pointer = B ovsrcu_quiesce(): thread->seqno = 33 global_seqno = 34 quiesced target_seqno exceeded by all threads call cbs to free A use pointer A (use after free) ----------------------------------------------------------- Fix that by using dynamically re-allocated array without flushing to the global flushed_cbsets until writer enters quiescent state. Fixes: 0f2ea84841e1 ("ovs-rcu: New library.") Reported-by: Linhaifeng <haifeng.lin@huawei.com> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2020-June/371265.html Acked-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovs-actions.xml: Fix a typo in the description of check_pkt_larger.Numan Siddique2020-06-101-1/+1
| | | | | | Signed-off-by: Numan Siddique <numans@ovn.org> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-offload-tc: Allow installing arp rules to TC dp.Tonghao Zhang2020-06-086-1/+146
| | | | | | | | | | | | | | | | | | | | | | | | | This patch allows to install arp rules to tc dp. In the future, arp will be offloaded to hardware to be processed. So OvS enable this now. $ ovs-appctl dpctl/add-flow 'recirc_id(0),in_port(3),eth(),\ eth_type(0x0806),arp(op=2,tha=00:50:56:e1:4b:ab,tip=10.255.1.116)' 2 $ ovs-appctl dpctl/dump-flows ... arp(tip=10.255.1.116,op=2,tha=00:50:56:e1:4b:ab) ... $ tc filter show dev <ethx> ingress ... eth_type arp arp_tip 10.255.1.116 arp_op reply arp_tha 00:50:56:e1:4b:ab not_in_hw action order 1: mirred (Egress Redirect to device <ethy>) stolen ... Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
* tc: Support new terse dump kernel APIVlad Buslov2020-06-053-17/+51
| | | | | | | | | | | When dumping flows in terse mode set TCA_DUMP_FLAGS attribute to TCA_DUMP_FLAGS_TERSE flag to prevent unnecessary copying of data between kernel and user spaces. Only expect kernel to provide cookie, stats and flags when dumping filters in terse mode. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
* netdev-offload: Implement terse dump supportVlad Buslov2020-06-056-58/+122
| | | | | | | | | | | | | | | In order to improve revalidator performance by minimizing unnecessary copying of data, extend netdev-offloads to support terse dump mode. Extend netdev_flow_api->flow_dump_create() with 'terse' bool argument. Implement support for terse dump in functions that convert netlink to flower and flower to match. Set flow stats "used" value based on difference in number of flow packets because lastuse timestamp is not included in TC terse dump. Kernel API support is implemented in following patch. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
* netdev-offload-tc: Expand tunnel source IPs masked matchTonghao Zhang2020-06-033-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | To support more use case, for example, DDOS, which packets should be dropped in hardware, this patch allows users to match only the tunnel source IPs with masked value. $ ovs-appctl dpctl/add-flow "tunnel(src=2.2.2.0/255.255.255.0,tp_dst=4789,ttl=64),\ recirc_id(2),in_port(3),eth(),eth_type(0x0800),ipv4()" "" $ ovs-appctl dpctl/dump-flows tunnel(src=2.2.2.0/255.255.255.0,ttl=64,tp_dst=4789) ... actions:drop $ tc filter show dev vxlan_sys_4789 ingress ... eth_type ipv4 enc_src_ip 2.2.2.0/24 enc_dst_port 4789 enc_ttl 64 in_hw in_hw_count 2 action order 1: gact action drop ... Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
* netdev-offload-tc: Allow to match the IP and port mask of tunnelTonghao Zhang2020-06-036-15/+123
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows users to offload the TC flower rules with tunnel mask. This patch allows masked match of the following, where previously supported an exact match was supported: * Remote (dst) tunnel endpoint address * Local (src) tunnel endpoint address * Remote (dst) tunnel endpoint UDP port And also allows masked match of the following, where previously no match was supported: * Local (src) tunnel endpoint UDP port In some case, mask is useful as wildcards. For example, DDOS, in that case, we don’t want to allow specified hosts IPs or only source Ports to access the targeted host. For example: $ ovs-appctl dpctl/add-flow "tunnel(dst=2.2.2.100,src=2.2.2.0/255.255.255.0,tp_dst=4789),\ recirc_id(0),in_port(3),eth(),eth_type(0x0800),ipv4()" "" $ tc filter show dev vxlan_sys_4789 ingress ... eth_type ipv4 enc_dst_ip 2.2.2.100 enc_src_ip 2.2.2.0/24 enc_dst_port 4789 enc_ttl 64 in_hw in_hw_count 2 action order 1: gact action drop ... Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
* netdev-offload-tc: Use ipv6_addr_is_set instead of is_all_zerosTonghao Zhang2020-06-032-8/+4
| | | | | | | | Not bugfix, make the codes more readable. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
* dpif-netlink: Generate ufids for installing TC flowersTonghao Zhang2020-06-031-0/+45
| | | | | | | | | | | | | To support installing the TC flowers to HW, via "ovs-appctl dpctl/add-flow" command, there should be an ufid. This patch will check whether ufid exists, if not, generate an ufid. Should to know that when processing upcall packets, ufid is generated in parse_odp_packet for kernel datapath. Configuring the max-idle/max-revalidator, may help testing this patch. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
* ovsdb: Fix timeout type for wait operation.Ilya Maximets2020-06-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to RFC 7047, 'timeout' is an integer field: 5.2.6. Wait The "wait" object contains the following members: "op": "wait" required "timeout": <integer> optional ... For some reason initial implementation treated it as a real number. This causes a build issue with clang that complains that LLONG_MAX could not be represented as double: ovsdb/execution.c:733:32: error: implicit conversion from 'long long' to 'double' changes value from 9223372036854775807 to 9223372036854775808 timeout_msec = MIN(LLONG_MAX, json_real(timeout)); ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /usr/include/sys/limits.h:69:19: note: expanded from macro 'LLONG_MAX' #define LLONG_MAX __LLONG_MAX /* max for a long long */ ^~~~~~~~~~~ /usr/include/x86/_limits.h:74:21: note: expanded from macro '__LLONG_MAX' #define __LLONG_MAX 0x7fffffffffffffffLL /* max value for a long long */ ^~~~~~~~~~~~~~~~~~~~ ./lib/util.h:90:21: note: expanded from macro 'MIN' #define MIN(X, Y) ((X) < (Y) ? (X) : (Y)) ^ ~ Fix that by changing parser to treat 'timeout' as integer. Fixes clang build on FreeBSD 12.1 in CirrusCI. Fixes: f85f8ebbfac9 ("Initial implementation of OVSDB.") Acked-by: Han Zhou <hzhou@ovn.org> Acked-by: Numan Siddique <numans@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-idl: Add function to reset min_index.Mark Michelson2020-05-282-0/+11
| | | | | | | | | | | | | | | | | | | | | If an administrator removes all of the databases in a cluster from disk, then ovsdb IDL clients will have a problem. The databases will all reset their stored indexes to 0, so The IDL client's min_index will be higher than the indexes of all databases in the cluster. This results in the client constantly connecting to databases, detecting the data as "stale", and then attempting to connect to another. This function provides a way to reset the IDL to an initial state with min_index of 0. This way, the client will not wrongly detect the database data as stale and will recover properly. Notice that this function is not actually used anywhere in this patch. This will be used by OVN, though, since OVN is the primary user of clustered OVSDB. Signed-off-by: Mark Michelson <mmichels@redhat.com> Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* datapath: Add hash info to upcall.Han Zhou2020-05-284-1/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch backports below upstream patches, and add __skb_set_hash to compat for older kernels. commit b5ab1f1be6180a2e975eede18731804b5164a05d Author: Jakub Kicinski <kuba@kernel.org> Date: Mon Mar 2 21:05:18 2020 -0800 openvswitch: add missing attribute validation for hash Add missing attribute validation for OVS_PACKET_ATTR_HASH to the netlink policy. Fixes: bd1903b7c459 ("net: openvswitch: add hash info to upcall") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit bd1903b7c4596ba6f7677d0dfefd05ba5876707d Author: Tonghao Zhang <xiangxia.m.yue@gmail.com> Date: Wed Nov 13 23:04:49 2019 +0800 net: openvswitch: add hash info to upcall When using the kernel datapath, the upcall don't include skb hash info relatived. That will introduce some problem, because the hash of skb is important in kernel stack. For example, VXLAN module uses it to select UDP src port. The tx queue selection may also use the hash in stack. Hash is computed in different ways. Hash is random for a TCP socket, and hash may be computed in hardware, or software stack. Recalculation hash is not easy. Hash of TCP socket is computed: tcp_v4_connect -> sk_set_txhash (is random) __tcp_transmit_skb -> skb_set_hash_from_sk There will be one upcall, without information of skb hash, to ovs-vswitchd, for the first packet of a TCP session. The rest packets will be processed in Open vSwitch modules, hash kept. If this tcp session is forward to VXLAN module, then the UDP src port of first tcp packet is different from rest packets. TCP packets may come from the host or dockers, to Open vSwitch. To fix it, we store the hash info to upcall, and restore hash when packets sent back. +---------------+ +-------------------------+ | Docker/VMs | | ovs-vswitchd | +----+----------+ +-+--------------------+--+ | ^ | | | | | | upcall v restore packet hash (not recalculate) | +-+--------------------+--+ | tap netdev | | vxlan module +---------------> +--> Open vSwitch ko +--> or internal type | | +-------------------------+ Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/364062.html Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Aliasgar Ginwala <aginwala@ebay.com> Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* AUTHORS: Add Eiichi Tsukata.Ilya Maximets2020-05-281-0/+1
| | | | Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* classifier: Prevent tries vs n_tries race leading to NULL dereference.Eiichi Tsukata2020-05-283-21/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently classifier tries and n_tries can be updated not atomically, there is a race condition which can lead to NULL dereference. The race can happen when main thread updates a classifier tries and n_tries in classifier_set_prefix_fields() and at the same time revalidator or handler thread try to lookup them in classifier_lookup__(). Such race can be triggered when user changes prefixes of flow_table. Race(user changes flow_table prefixes: ip_dst,ip_src => none): [main thread] [revalidator/handler thread] =========================================================== /* cls->n_tries == 2 */ for (int i = 0; i < cls->n_tries; i++) { trie_init(cls, i, NULL); /* n_tries == 0 */ cls->n_tries = n_tries; /* cls->tries[i]->feild is NULL */ trie_ctx_init(&trie_ctx[i],&cls->tries[i]); /* trie->field is NULL */ ctx->be32ofs = trie->field->flow_be32ofs; To prevent the race, instead of re-introducing internal mutex implemented in the commit fccd7c092e09 ("classifier: Remove internal mutex."), this patch makes trie field RCU protected and checks it after read. Fixes: fccd7c092e09 ("classifier: Remove internal mutex.") Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Avoid sending equal snapshots.Ilya Maximets2020-05-283-1/+43
| | | | | | | | | | | | | | | | | | | | | | | | | Snapshots are huge. In some cases we could receive several outdated append replies from the remote server. This could happen in high scale cases if the remote server is overloaded and not able to process all the raft requests in time. As an action to each outdated append reply we're sending full database snapshot. While remote server is already overloaded those snapshots will stuck in jsonrpc backlog for a long time making it grow up to few GB. Since remote server wasn't able to timely process incoming messages it will likely not able to process snapshots leading to the same situation with low chances to recover. Remote server will likely stuck in 'candidate' state, other servers will grow their memory consumption due to growing jsonrpc backlogs: jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.16.0.3:6644, num of msgs: 3795, backlog: 8838994624. This patch is trying to avoid that situation by avoiding sending of equal snapshot install requests. This helps maintain reasonable memory consumption and allows the cluster to recover on a larger scale. Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-server: Fix schema leak while reading db.Ilya Maximets2020-05-281-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | parse_txn() function doesn't always take ownership of the 'schema' passed. So, if the schema of the clustered db has same version as the one that already in use, parse_txn() will not use it, resulting with a memory leak: 7,827 (56 direct, 7,771 indirect) bytes in 1 blocks are definitely lost at 0x483BB1A: calloc (vg_replace_malloc.c:762) by 0x44AD02: xcalloc (util.c:121) by 0x40E70E: ovsdb_schema_create (ovsdb.c:41) by 0x40EA6D: ovsdb_schema_from_json (ovsdb.c:217) by 0x415EDD: ovsdb_storage_read (storage.c:280) by 0x408968: read_db (ovsdb-server.c:607) by 0x40733D: main_loop (ovsdb-server.c:227) by 0x40733D: main (ovsdb-server.c:469) While we could put ovsdb_schema_destroy() in a few places inside 'parse_txn()', from the users' point of view it seems better to have a constant argument and just clone the 'schema' if needed. The caller will be responsible for destroying the 'schema' it owns. Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.") Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* pvector: Document that multiple elements with a given priority are allowed.Ben Pfaff2020-05-271-4/+6
| | | | | Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb: Add raft memory usage to memory report.Ilya Maximets2020-05-255-0/+35
| | | | | | | | | | | | | | | | | | Memory reports could be found in logs or by calling 'memory/show' appctl command. For ovsdb-server it includes information about db cells, monitor connections with their backlog size, etc. But it doesn't contain any information about memory consumed by raft. Backlogs of raft connections could be insanely large because of snapshot installation requests that simply contains the whole database. In not that healthy clusters where one of ovsdb servers is not able to timely handle all the incoming raft traffic, backlog on a sender's side could cause significant memory consumption issues. Adding new 'raft-connections' and 'raft-backlog' counters to the memory report to better track such conditions. Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* compat: Backport ipv6_stub changeGreg Rose2020-05-243-2/+29
| | | | | | | | | | | | | | | | | A patch backported to the Linux stable 4.14 tree and present in the latest stable 4.14.181 kernel breaks ipv6_stub usage. The commit is 8ab8786f78c3 ("net ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup"). Create the compat layer define to check for it and fixup usage in vxlan and geneve modules. Passes Travis here: https://travis-ci.org/github/gvrose8192/ovs-experimental/builds/689798733 Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* netdev-vport: Fix typo in log message.Ben Pfaff2020-05-211-1/+1
| | | | | Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-idl: Fix NULL deref reported by Coverity.William Tu2020-05-201-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When 'datum.values' or 'datum.keys' is NULL, some code path calling into ovsdb_idl_txn_write__ triggers NULL deref. An example: ovsrec_open_vswitch_set_cur_cfg(const struct ovsrec_open_vswitch { struct ovsdb_datum datum; union ovsdb_atom key; datum.n = 1; datum.keys = &key; key.integer = cur_cfg; // 1. assign_zero: Assigning: datum.values = NULL. datum.values = NULL; // CID 1421356 (#1 of 1): Explicit null dereferenced (FORWARD_NULL) // 2. var_deref_model: Passing &datum to ovsdb_idl_txn_write_clone,\ // which dereferences null datum.values. ovsdb_idl_txn_write_clone(&row->header_, &ovsrec_open_vswitch_col } And with the following calls: ovsdb_idl_txn_write_clone ovsdb_idl_txn_write__ 6. deref_parm_in_call: Function ovsdb_datum_destroy dereferences datum->values ovsdb_datum_destroy Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* ovs-bugtool: Add -m option to dump-flows.William Tu2020-05-201-0/+1
| | | | | | | | | | This patch adds 'ovs-appctl dpctl/dump-flows -m' to bugtool, the output will include wildcarded fields and the miniflow bits, such as 'dp-extra-info:miniflow_bits(4,1)'. Cc: Emma Finn <emma.finn@intel.com> Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* Documentation: Fix kernel support matrixGreg Rose2020-05-201-4/+8
| | | | | | | | | | | | | | | The documentation matrix for OVS branches and which kernels they support is out of date. Update it to show that since 2.10 the lowest kernel that we test and support is Linux 3.16. RHEL and CentOS kernels based upon the original 3.10 kernel are still supported. Reported-by: Han Zhou <hzhou@ovn.org> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2020-May/370742.html Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* netdev-offload-tc: Re-fetch block ID after probing.Aaron Conole2020-05-161-0/+2
| | | | | | | | | | | | | It's possible that block_id could changes after the probe for block support. Therefore, fetch the block_id again after the probe. Fixes: edc2055a2bf7 ("netdev-offload-tc: Flush rules on ingress block when init tc flow api") Cc: Dmytro Linkin <dmitrolin@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Co-authored-by: Marcelo Leitner <mleitner@redhat.com> Signed-off-by: Marcelo Leitner <mleitner@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* netdev-linux: Update LAG in all cases.Aaron Conole2020-05-161-6/+5
| | | | | | | | | | | | | | | | | | | | In some cases, when processing a netlink change event, it's possible for an alternate part of OvS (like the IPv6 endpoint processing) to hold an active netdev interface. This creates a race-condition, where sometimes the OvS change processing will take the normal path. This doesn't work because the netdev device object won't actually be enslaved to the ovs-system (for instance, a linux bond) and ingress qdisc entries will be missing. To address this, we update the LAG information in ALL cases where LAG information could come in. Fixes: d22f8927c3c9 ("netdev-linux: monitor and offload LAG slaves to TC") Cc: Marcelo Leitner <mleitner@redhat.com> Cc: John Hurley <john.hurley@netronome.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* debian: Add python3-sphinx to ovs build dependenciesAnsis Atteka2020-05-151-0/+1
| | | | | | | | | | | | | python3-sphinx has become mandatory build dependency since patch 39b5e46 ("Documentation: Convert multiple manpages to ReST."), because, otherwise, without this dependency installed, packaging of OVS debian packages fails with an error that generated man pages can't be found. Fixes: 39b5e46312 ("Documentation: Convert multiple manpages to ReST.") CC: Ben Pfaff <blp@ovn.org> Signed-off-by: Ansis Atteka <aatteka@ovn.org> Reported-by: Artem Teleshev <artem.teleshev@gmail.com> Acked-by: Greg Rose <gvrose8192@gmail.com>
* ofproto: Fix statistics of removed flow.Ilya Maximets2020-05-151-2/+2
| | | | | | | | | 'fr' is a new variable on the stack. '+=' here adds the real statistics to a random stack memory. Fixes: 164413156cf9 ("Add offload packets statistics") Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* debian: Fix package dependenciesRoi Dayan2020-05-141-2/+2
| | | | | | | | | | In python2 package was python-twisted-conch but it looks like for python3 it's just python3-twisted. For zope interface the python3 package name is python3-zope.interface. Fixes: 1ca0323e7c29 ("Require Python 3 and remove support for Python 2.") Signed-off-by: Roi Dayan <roid@mellanox.com> Acked-by: Ansis Atteka <aatteka@ovn.org>
* oss-fuzz: Fix miniflow_target.c.William Tu2020-05-141-1/+2
| | | | | | | | | | | | | | Clang reports: tests/oss-fuzz/miniflow_target.c:209:26: error: suggest braces around \ initialization of subobject [-Werror,-Wmissing-braces] struct flow flow2 = {0}; Fix it by using memset. Cc: Bhargava Shastry <bshastry@sect.tu-berlin.de> Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* metaflow: Fix maskable conntrack orig tuple fieldsYi-Hung Wei2020-05-142-6/+80
| | | | | | | | | | | | From man ovs-fields(7), the conntrack origin tuple fields ct_nw_src/dst, ct_ipv6_src/dst, and ct_tp_src/dst are supposed to be bitwise maskable, but they are not. This patch enables those fields to be maskable, and adds a regression test. Fixes: daf4d3c18da4 ("odp: Support conntrack orig tuple key.") Reported-by: Wenying Dong <wenyingd@vmware.com> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
* tests: Add tests using tap device.William Tu2020-05-143-0/+36
| | | | | | | | | Similar to using veth across namespaces, this patch creates tap devices, assigns to namespaces, and allows traffic to go through different test cases. Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: William Tu <u9012063@gmail.com>
* userspace: Enable TSO support for non-DPDK.William Tu2020-05-147-325/+340
| | | | | | | | | | | | | | | | | This patch enables TSO support for non-DPDK use cases, and also add check-system-tso testsuite. Before TSO, we have to disable checksum offload, allowing the kernel to calculate the TCP/UDP packet checsum. With TSO, we can skip the checksum validation by enabling checksum offload, and with large packet size, we see better performance. Consider container to container use cases: iperf3 -c (ns0) -> veth peer -> OVS -> veth peer -> iperf3 -s (ns1) And I got around 6Gbps, similar to TSO with DPDK-enabled. Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: William Tu <u9012063@gmail.com>
* debian: Fix broken build after some man pages became generated from RSTAnsis Atteka2020-05-133-7/+7
| | | | | | | | | | | | | | | | | | | | | | As far as I know, the official way to build debian packages is by invoking following command: > fakeroot debian/rules binary However, that command started to fail with these errors: dh_installman --language=C dh_installman: Cannot find (any matches for) "utilities/ovs-appctl.8" (tried in .) dh_installman: Cannot find (any matches for) "utilities/ovs-l3ping.8" (tried in .) dh_installman: Cannot find (any matches for) "utilities/ovs-tcpdump.8" (tried in .) because the generated manpages are not part of the source tree anymore. This patch updates debian *.manpages files to point to the generted files. Fixes: 39b5e46312 ("Documentation: Convert multiple manpages to ReST.") CC: Ben Pfaff <blp@ovn.org> Signed-off-by: Ansis Atteka <aatteka@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
* RAFT: Add clarifying note for cluster/leave operation.Mark Michelson2020-05-121-0/+5
| | | | | | | | | | | | | | | | | | | | | We had a user express confusion about the state of a cluster after using cluster/leave. The user had a three server cluster and used cluster/leave to remove two servers from the cluster. The user expected that the single server left would not function since the quorum of two servers for a three server cluster was not met. In actuality, cluster/leave removes the server from the cluster and alters the cluster size in the process. Thus the single remaining server continued to function since quorum was reached. This documentation change makes it a bit more explicit that cluster/leave alters the size of the cluster and cites the three server down to one server case as an example. Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1798158 Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Mark Michelson <mmichels@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>