summaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
...
* ovsdb: Don't add ovsdb-server.c to libovsdb.Gurucharan Shetty2014-07-182-1/+1
| | | | | | | | | | Without this change, with shared libraries, VLOG constructor for ovsdb-server would get called twice corrupting the 'vlog_modules' list causing an infinite loop. Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Reported-by: Gur Stavi <gstavi@mrv.com> Acked-by: Ben Pfaff <blp@nicira.com>
* lib/classifier: Clarify subtable skipping.Jarno Rajahalme2014-07-181-14/+12
| | | | | | | | | | Clarify comments for trie-based subtable skipping. Perform the cheaper check first. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com> Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
* lib/classifier: Return all matching prefix lengths from trie lookup.Jarno Rajahalme2014-07-181-36/+48
| | | | | | | | | | | Previously we only returned the last matching prefix length encountered during a trie lookup, and skipped subtables that had prefixes longer than that. This patch changes the trie lookup functions to return all matching prefix lengths seen, so that all non-matching prefix lengths can be skipped. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
* lib/classifier: Change local variable names.Jarno Rajahalme2014-07-181-12/+13
| | | | | | | These stylistic changes makes the following patch a bit simpler. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
* lib/classifier: Unify struct classifier and cls_classifier.Jarno Rajahalme2014-07-183-108/+76
| | | | | | | | | | | Now that it is clear that struct cls_classifier itself does not need RCU indirection and pvector is defined in its own header, it is possible get rid of the indirection from struct classifier to struct cls_classifier. Suggested-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
* netdev-dpdk: Refactor dpdk_class_init()Daniele Di Proietto2014-07-171-24/+16
| | | | | | | | | | | | | | | | | | | The following changes were made: - Since we have two dpdk classes, we should split the initial operations needed by both classes from the initialization needed by each class. - The dpdk_ring class does not need an initialization function: it has been removed. This also prevents many testcase from failing, because dpdk_ring_class_init() was printing an unexpected log message (OVS_VSWITCHD_START at tests/ofproto-macros.at:54 check for a specific set of startup log messages) - If the user doesn't pass the --dpdk option we do not register the dpdk* classes - Do not call VLOG_ERR if there are 0 dpdk ethernet device. OVS can now be used with dpdk_ring devices. Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
* ofproto: Report controller rate limiting statistics in database.Ben Pfaff2014-07-175-32/+136
| | | | | Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>
* smap: New function smap_add_nocopy().Ben Pfaff2014-07-172-2/+12
| | | | | | | | An upcoming commit will add a caller that needs to format both the key and the value. That isn't cleanly possible with the current interface. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>
* Simplify ofproto_controller_info by using a struct smap in place of array.Ben Pfaff2014-07-173-34/+12
| | | | | | | | | The only client for ofproto_controller_info was transforming the array of pairs into an smap anyway. It's easy for the code that fills in the array to generate it as an smap directly, and it's also easier to extend later. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>
* pinsched: Report queued packet count correctly.Ben Pfaff2014-07-171-7/+3
| | | | | | | | | | | 'n_txq' was initialized to 0 and never modified, so pinsched_count_txqlen() always returned 0. Instead, return the correct number. This only affected the results of "ovs-appctl memory/show", and only if controller rate limiting was turned on, so it is not a serious bug. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>
* vswitch.xml: Fix typo in documentation.Ben Pfaff2014-07-171-1/+1
| | | | | Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>
* stp: Make stp-disabled port forward stp bpdu packets.Alex Wang2014-07-163-15/+32
| | | | | | | | | | | | | | | | | | | | | Commit 0d1cee123a84 (stp: Fix bpdu tx problem in listening state) makes ovs drop the stp bpdu packets if stp is not enabled on the input port. However, when pif bridge is used and stp is enabled on the integration bridge. The flow translation of stp bpdu packets will go through a level of resubmission which changes the input port to the corresponding peer port. Since, the patch port on the pif bridge does not have stp enabled, the flow translation will drop the bpdu packets. This commit fixes the issue by making ovs forward stp bpdu packets on stp-disabled port. VMware-BZ: #1284695 Signed-off-by: Alex Wang <alexw@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com> Acked-by: Joe Stringer <joestringer@nicira.com>
* cmap: Fix cmap_next_position()Daniele Di Proietto2014-07-162-2/+20
| | | | | | | | | cmap_next_position() didn't update the node pointer while iterating through a list of nodes with the same hash. This commit fixes the bug and improve test-cmap to detect it. Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
* netdev-dpdk: add dpdk rings to netdev-dpdkmaryam.tahhan2014-07-167-99/+545
| | | | | | | | | | | | | | | | | Shared memory ring patch This patch enables the client dpdk rings within the netdev-dpdk. It adds a new dpdk device called dpdkr (other naming suggestions?). This allows for the use of shared memory to communicate with other dpdk applications, on the host or within a virtual machine. Instructions for use are in INSTALL.DPDK. This has been tested on Intel multi-core platforms and with the client application within the host. Signed-off-by: Gerald Rogers <gerald.rogers@intel.com> Signed-off-by: Maryam Tahhan <maryam.tahhan@intel.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
* netlink-socket: Simplify multithreaded dumping to match Linux reality.Ben Pfaff2014-07-163-84/+107
| | | | | | | | | | | | | | | | | | | | | | Commit 0791315e4d (netlink-socket: Work around kernel Netlink dump thread races.) introduced a simple workaround for Linux kernel races in Netlink dumps. However, the code remained more complicated than needed. This commit simplifies it. The main reason for complication in the code was 'status_seq' in nl_dump. This member was there to allow a thread to wait for some other thread to refill the socket buffer with another dump message (although we did not understand the reason at the time it was introduced). Now that we know that Netlink dumps properly need to be serialized to work in existing Linux kernels, there's no additional value in having 'status_seq', because serialized recvmsg() calls always refill the socket buffer properly. This commit updates nl_msg_next() to clear its buffer argument on error. This is a more convenient interface for the new version of the Netlink dump code. nl_msg_next() doesn't have any other callers. Signed-off-by: Ben Pfaff <blp@nicira.com>
* Factor the ovsdb-server main loop into a new functionEric Sesterhenn2014-07-162-66/+79
| | | | | | | | This refactors the ovsdb-server main loop into a new function, which allows to call it from multiple places. Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de> Signed-off-by: Ben Pfaff <blp@nicira.com>
* CodingStyle: Add suggested GNU indent options.Ben Pfaff2014-07-162-0/+8
| | | | | | Suggested-by: Gerald Rogers <gerald.rogers@intel.com> Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
* dpif: Update documentation for RCU-protected actions.Joe Stringer2014-07-163-6/+12
| | | | | | | | | | | The userspace datapath returns RCU-protected actions from flow_get() and flow_dump_next(). This doesn't cause any trouble for current users of these functions, but it imposes additional constraints on their use. This patch makes the dpif documentation more explicit about how the results of these functions can be used. Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
* tests: Use the 'LARGE_MSECS' variation of time/warp at more places.Gurucharan Shetty2014-07-155-31/+18
| | | | | | | | | | | commit 8661af798(timeval: Provide a variation for time/warp command) provided a variation of time/warp command to prevent multiple invocations of time/warp. That commit changed only the users in bfd.at and cfm.at as they used it the most. Since we haven't had any negative confequences because of the change, add the remaining users. Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
* debian: Automatically start openvswitch before first invocation of ovs-vsctl.Gurucharan Shetty2014-07-152-0/+31
| | | | | | | | | | | | | | | | | | | In the 'interfaces' file, if an admin adds the openvswitch interface in 'auto', ifupdown will try to create OVS interfaces even before openvswitch has started. In a case like that, assume that the admin knows what he is doing and try to start openvswitch. The negatives I see are 1. /usr is NFS mounted, in which case expect the admin to mount it through initramfs. 2. syslog through network may not be accessible. This is similar to what rhel does with commit 602453000e28ec10 (rhel: Automatically start openvswitch service before bringing an ovs interface up or down) Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
* debian: Option to create patch ports through 'interfaces'Gurucharan Shetty2014-07-152-6/+40
| | | | | | | | This is a port of commit d7aab661 ( rhel: Add Patch Port support to initscripts.) from rhel to debian's ifupdown script. Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
* lib/coverage: Removed set but not used variablesDaniele Di Proietto2014-07-151-4/+0
| | | | | | | This removes a GCC 4.9 warning (unused-but-set-variable) Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
* datapath: Check for NULL upcall_portids.Pravin B Shelar2014-07-151-0/+3
| | | | | | | | Following patch adds NULL check for memory allocated by kmalloc. Reported-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
* netlink-socket: Fix handling socket allocation failure in nl_dump_start().Ben Pfaff2014-07-151-7/+6
| | | | | | | | | | | If nl_pool_alloc() failed, then 'dump' was not initialized at all and further use of the dump would access uninitialized data, probably causing a crash. Found by inspection. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Joe Stringer <joestringer@nicira.com>
* netlink-socket: Refill comment to fit within 79 columns.Ben Pfaff2014-07-151-8/+8
| | | | | Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Joe Stringer <joestringer@nicira.com>
* dpif-linux: Avoid null dereference if all ports disappear.Ben Pfaff2014-07-151-1/+1
| | | | | | | | | | | | | | | | | | | | | When dpif_linux_refresh_channels() refreshes the set of channels when the number of handlers changes, it destroys all the dpif's channels and sets dpif->uc_array_size to 0. If the port dump later in the function turns up no ports (which generally indicates a bug), then no channels will be allocated and thus dpif->uc_array_size will remain 0 and 'channels' will be null in each handler. This is self-consistent, at least, but dpif_linux_port_get_pid__() was still willing in this situation to try to access element 0 of the set of channels, dereferencing a null pointer. This fixes the problem. I encountered this while looking at a bug that I had introduced during development that caused the port dump to always be empty. It would be difficult to encounter in normal use. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Joe Stringer <joestringer@nicira.com>
* ofp-msgs: Correct code for queue configuration messages in OpenFlow 1.0.Ben Pfaff2014-07-153-4/+5
| | | | | Reported-by: Simon Jouet <simon.jouet@gmail.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
* Fix documentation error that led user to wrong file to install dependency ↵Kirkland Spector2014-07-152-7/+9
| | | | | | | | packages. Signed-off-by: Kirkland Spector <kspector@salesforce.com> Acked-by: Andrey Falko <afalko@salesforce.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
* Drop assignments whose values are never used.Ben Pfaff2014-07-154-8/+7
| | | | | | | Found by clang-analyzer. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
* coverage: Move m_idx, h_idx to an inner scope in coverage_run().Ben Pfaff2014-07-151-5/+3
| | | | | | | | | | These variables were initialized in an outer scope and then immediately changed in an inner one, so they might as well be farther in. Found by clang-analyzer. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
* ofproto: Avoid theoretical double free of large rule collections.Ben Pfaff2014-07-151-0/+3
| | | | | | | | | | | | | | | | collect_rules_strict() and collect_rules_loose() destroy the rule collections that they create if they return an error, and some of their callers then go on to destroy them again. This could cause a double-free in the case where rule_collection_destroy() actually calls free(). That never happens in the current tree, because free() is only necessary if malloc() was called and there's a 64-entry stub that none of the current code in collect_rules_*() can fill up in their error cases. Still, it seems better to fix the problem. Found by clang-analyzer. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
* ofp-util: Fix null pointer dereference in ofputil_pull_buckets().Ben Pfaff2014-07-151-0/+1
| | | | | | | Found by clang-analyzer. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
* revalidator: Revalidate missed flows.Joe Stringer2014-07-151-5/+29
| | | | | | | | | | | If the datapath doesn't dump a flow for some reason, and the current dump is expected to revalidate all flows in the datapath, then perform revalidation for those flows by fetching them during the sweep phase. If revalidation is not required, then leave the flow in the datapath and don't revalidate it. Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
* dpif: Support fetching flow mask via dpif_flow_get().Joe Stringer2014-07-155-41/+73
| | | | | | | | | Change the interface to allow implementations to pass back a buffer, and allow callers to specify which of actions, mask, and stats they wish to receive. This will be used in the next commit. Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
* ovsdb-server.at: Skip tests that use ovsdb-server's "--run" on Windows.Gurucharan Shetty2014-07-141-0/+2
| | | | | | | | | | | ovsdb-server's port on Windows does not support the "--run" option. The two tests skipped in this commit make use of "--run" option to test ovsdb-server's truncating of corrupt log or bad transaction. It looks a little tricky to get this test running on Windows without the "--run" option implemented. Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
* datapath: Refactor ovs_flow_cmd_fill_info().Joe Stringer2014-07-141-29/+66
| | | | | | | | | Split up ovs_flow_cmd_fill_info() to make it easier to cache parts of a dump reply. This will be used to streamline flow_dump in a future patch. Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Thomas Graf <tgraf@noironetworks.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
* tests: Disable glibc memory checking under glibc <= 2.11.Ben Pfaff2014-07-111-4/+18
| | | | | | | | | | | | | | We noticed that the unit tests sometimes fail on XenServer inside glibc's memory checker, in the free_check() function. It turns out that the glibc memory checker in glibc 2.11 and earlier had an internal race that caused false positives in multithreaded programs. This commit avoids the problem by disabling the glibc memory checker in glibc 2.11 and earlier. VMware-BZ: #1267127 Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>
* datapath/flow_netlink: Create right mask with disabled megaflowsDaniele Di Proietto2014-07-111-23/+53
| | | | | | | | | | | | | | | | | | | | | | | | If megaflows are disabled, the userspace does not send the netlink attribute OVS_FLOW_ATTR_MASK, and the kernel must create an exact match mask. sw_flow_mask_set() sets every bytes (in 'range') of the mask to 0xff, even the bytes that represent padding for struct sw_flow, or the bytes that represent fields that may not be set during ovs_flow_extract(). This is a problem, because when we extract a flow from a packet, we do not memset() anymore the struct sw_flow to 0 (since commit 9cef26ac6a71). This commit gets rid of sw_flow_mask_set() and introduces mask_set_nlattr(), which operates on the netlink attributes rather than on the mask key. Using this approach we are sure that only the bytes that the user provided in the flow are matched. Also, if the parse_flow_mask_nlattrs() for the mask ENCAP attribute fails, we now return with an error. Reported-by: Alex Wang <alexw@nicira.com> Suggested-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
* datapath: Enable tunnel GSO features.Pravin B Shelar2014-07-116-1/+86
| | | | | | | | | Following patch enables all available tunnel GSO features for OVS bridge device so that ovs can use hardware offloads available to underling device. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
* lib/hash: Use CRC32 for hashing.Jarno Rajahalme2014-07-113-30/+216
| | | | | | | | | | | Use CRC32 intrinsics for hash computations when building for X86_64 with SSE4_2. Add a new hash_words64() and change hash_words() to be inlined when 'n_words' is a compile-time constant. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
* lib/classifier: Lockless lookups.Jarno Rajahalme2014-07-119-149/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that all the relevant classifier structures use RCU and internal mutual exclusion for modifications, we can remove the fat-rwlock and thus make the classifier lookups lockless. As the readers are operating concurrently with the writers, a concurrent reader may or may not see a new rule being added by a writer, depending on how the concurrent events overlap with each other. Overall, this is no different from the former locked behavior, but there the visibility of the new rule only depended on the timing of the locking functions. A new rule is first added to the segment indices, so the readers may find the rule in the indices before the rule is visible in the subtables 'rules' map. This may result in us losing the opportunity to quit lookups earlier, resulting in sub-optimal wildcarding. This will be fixed by forthcoming revalidation always scheduled after flow table changes. Similar behavior may happen due to us removing the overlapping rule (if any) from the indices only after the corresponding new rule has been added. The subtable's max priority is updated only after a rule is inserted to the maps, so the concurrent readers may not see the rule, as the updated priority ordered subtable list will only be visible after the subtable's max priority is updated. Similarly, the classifier's partitions are updated by the caller after the rule is inserted to the maps, so the readers may keep skipping the subtable until they see the updated partitions. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
* lib/classifier: RCUify prefix trie code.Jarno Rajahalme2014-07-114-79/+208
| | | | | | | | cls_set_prefix_fields() now synchronizes explicitly with the readers, waiting them to finish using the old configuration before changing to the new configuration. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
* lib/classifier: Use internal mutex.Jarno Rajahalme2014-07-113-57/+118
| | | | | | | | | | | | | | | Add an internal mutex to struct cls_classifier, and reorganize classifier internal structures according to the user of each field, marking the fields that need to be protected by the mutex. This makes locking requirements easier to track, and may make lookup more memory efficient. After this patch there is some double locking, as callers are taking the fat-rwlock, and we take the mutex internally. A following patch will remove the classifier fat-rwlock, removing the (double) locking overhead. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
* lib/classifier: Stylistic change.Jarno Rajahalme2014-07-111-30/+30
| | | | | | | Rename 'nbits' as 'n_bits' to be more consistent with other count-like fields. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
* lib/ovs-rcu: Export ovsrcu_synchronize().Jarno Rajahalme2014-07-112-2/+5
| | | | | | A following patch will add the first external user. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
* lib/classifier: Simplify iteration with C99 declaration.Jarno Rajahalme2014-07-116-119/+156
| | | | | | | | Hide the cursor from the classifier iteration users and move locking to the iterators. This will make following RCU changes simpler, as the call sites of the iterators need not be changed at that point. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
* lib/classifier: Use cmap.Jarno Rajahalme2014-07-113-95/+112
| | | | | | | Use cmap instead of hmap & hindex in classifier. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com> Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
* netlink-socket: Work around kernel Netlink dump thread races.Ben Pfaff2014-07-102-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Linux kernel Netlink implementation has two races that cause problems for processes that attempt to dump a table in a multithreaded manner. The first race is in the structure of the kernel netlink_recv() function. This function pulls a message from the socket queue and, if there is none, reports EAGAIN: skb = skb_recv_datagram(sk, flags, noblock, &err); if (skb == NULL) goto out; Only if a message is successfully read from the socket queue does the function, toward the end, try to queue up a new message to be dumped: if (nlk->cb && atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) { ret = netlink_dump(sk); if (ret) { sk->sk_err = ret; sk->sk_error_report(sk); } } This means that if thread A reads a message from a dump, then thread B attempts to read one before A queues up the next, B will get EAGAIN. This means that, following EAGAIN, B needs to wait until A returns to userspace before it tries to read the socket again. nl_dump_next() already does this, using 'dump->status_seq' (although the need for it has never been explained clearly, to my knowledge). The second race is more serious. Suppose thread X and thread Y both simultaneously attempt to queue up a new message to be dumped, using the call to netlink_dump() quoted above. netlink_dump() begins with: mutex_lock(nlk->cb_mutex); cb = nlk->cb; if (cb == NULL) { err = -EINVAL; goto errout_skb; } Suppose that X gets cb_mutex first and finds that the dump is complete. It will therefore, toward the end of netlink_dump(), clear nlk->cb to NULL to indicate that no dump is in progress and release the mutex: nlk->cb = NULL; mutex_unlock(nlk->cb_mutex); When Y grabs cb_mutex afterward, it will see that nlk->cb is NULL and return -EINVAL as quoted above. netlink_recv() stuffs -EINVAL in sk_err, but that error is not reported immediately; instead, it is saved for the next read from the socket. Since Open vSwitch maintains a pool of Netlink sockets, that next failure can crop up pretty much anywhere. One of the worst places for it to crop up is in the execution of a later transaction (e.g. in nl_sock_transact_multiple__()), because userspace treats Netlink transactions as idempotent and will re-execute them when socket errors occur. For a transaction that sends a packet, this causes packet duplication, which we actually observed in practice. (ENOBUFS should actually cause transactions to be re-executed in many cases, but EINVAL should not; this is a separate bug in the userspace netlink code.) VMware-BZ: #1283188 Reported-and-tested-by: Alex Wang <alexw@nicira.com> Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Alex Wang <alexw@nicira.com>
* netlink-socket: Fix sign of error code.Ben Pfaff2014-07-101-1/+1
| | | | | | | | | Commit 8f20fd98db (netlink-socket: Work around upstream kernel Netlink bug.) got the sign of the error code wrong, so that it reported e.g. -22 for EINVAL to nl_sock_recv__()'s caller, instead of 22. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Alex Wang <alexw@nicira.com>
* datapath: refactor do_output() to move skb_clone NULL check out of fast pathAndy Zhou2014-07-101-14/+11
| | | | | | | | | | | | | skb_clone() NULL check is implemented in do_output(), as past of the common (fast) path. Refactoring so that NULL check is done in the slow path, immediately after skb_clone() is called. Besides optimization, this change also improves code readability by making the skb_clone() NULL check consistent within OVS datapath module. Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>