netlink-socket: Work around kernel Netlink dump thread races.

The Linux kernel Netlink implementation has two races that cause problems for processes that attempt to dump a table in a multithreaded manner. The first race is in the structure of the kernel netlink_recv() function. This function pulls a message from the socket queue and, if there is none, reports EAGAIN: skb = skb_recv_datagram(sk, flags, noblock, &err); if (skb == NULL) goto out; Only if a message is successfully read from the socket queue does the function, toward the end, try to queue up a new message to be dumped: if (nlk->cb && atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) { ret = netlink_dump(sk); if (ret) { sk->sk_err = ret; sk->sk_error_report(sk); } } This means that if thread A reads a message from a dump, then thread B attempts to read one before A queues up the next, B will get EAGAIN. This means that, following EAGAIN, B needs to wait until A returns to userspace before it tries to read the socket again. nl_dump_next() already does this, using 'dump->status_seq' (although the need for it has never been explained clearly, to my knowledge). The second race is more serious. Suppose thread X and thread Y both simultaneously attempt to queue up a new message to be dumped, using the call to netlink_dump() quoted above. netlink_dump() begins with: mutex_lock(nlk->cb_mutex); cb = nlk->cb; if (cb == NULL) { err = -EINVAL; goto errout_skb; } Suppose that X gets cb_mutex first and finds that the dump is complete. It will therefore, toward the end of netlink_dump(), clear nlk->cb to NULL to indicate that no dump is in progress and release the mutex: nlk->cb = NULL; mutex_unlock(nlk->cb_mutex); When Y grabs cb_mutex afterward, it will see that nlk->cb is NULL and return -EINVAL as quoted above. netlink_recv() stuffs -EINVAL in sk_err, but that error is not reported immediately; instead, it is saved for the next read from the socket. Since Open vSwitch maintains a pool of Netlink sockets, that next failure can crop up pretty much anywhere. One of the worst places for it to crop up is in the execution of a later transaction (e.g. in nl_sock_transact_multiple__()), because userspace treats Netlink transactions as idempotent and will re-execute them when socket errors occur. For a transaction that sends a packet, this causes packet duplication, which we actually observed in practice. (ENOBUFS should actually cause transactions to be re-executed in many cases, but EINVAL should not; this is a separate bug in the userspace netlink code.) VMware-BZ: #1283188 Reported-and-tested-by: Alex Wang <alexw@nicira.com> Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Alex Wang <alexw@nicira.com>
author: Ben Pfaff <blp@nicira.com> 2014-07-10 16:48:16 -0700
committer: Ben Pfaff <blp@nicira.com> 2014-07-10 16:50:44 -0700
commit: 8cad00cce10839844a827d727358f86a4e3ec1a3 (patch)
tree: 382a2e028991f6bbf8bd61e2bd51cdf09e9e0762
parent: bebdded84e5232e78d1cbb134257eb85ee049b40 (diff)
download: openvswitch-8cad00cce10839844a827d727358f86a4e3ec1a3.tar.gz
2 files changed, 12 insertions, 0 deletions
diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
index 2c3eadff0..a769de8f8 100644
--- a/lib/netlink-socket.c
+++ b/lib/netlink-socket.c
@@ -714,6 +714,7 @@ nl_dump_start(struct nl_dump *dump, int protocol, const struct ofpbuf *request)
     atomic_init(&dump->status, status << 1);
     dump->nl_seq = nl_msg_nlmsghdr(request)->nlmsg_seq;
     dump->status_seq = seq_create();
+    ovs_mutex_init(&dump->mutex);
 }
 
 /* Attempts to retrieve another reply from 'dump' into 'buffer'. 'dump' must
@@ -756,7 +757,15 @@ nl_dump_next(struct nl_dump *dump, struct ofpbuf *reply, struct ofpbuf *buffer)
             return false;
         }
 
+        /* Take the mutex here to avoid an in-kernel race.  If two threads try
+         * to read from a Netlink dump socket at once, then the socket error
+         * can be set to EINVAL, which will be encountered on the next recv on
+         * that socket, which could be anywhere due to the way that we pool
+         * Netlink sockets.  Serializing the recv calls avoids the issue. */
+        ovs_mutex_lock(&dump->mutex);
         retval = nl_sock_recv__(dump->sock, buffer, false);
+        ovs_mutex_unlock(&dump->mutex);
+
         if (retval) {
             ofpbuf_clear(buffer);
             if (retval == EAGAIN) {
@@ -848,6 +857,7 @@ nl_dump_done(struct nl_dump *dump)
     }
     nl_pool_release(dump->sock);
     seq_destroy(dump->status_seq);
+    ovs_mutex_destroy(&dump->mutex);
     return status >> 1;
 }
 
diff --git a/lib/netlink-socket.h b/lib/netlink-socket.h
index 1f06b97eb..8c3c1aa56 100644
--- a/lib/netlink-socket.h
+++ b/lib/netlink-socket.h
@@ -52,6 +52,7 @@
 #include <stdint.h>
 #include "ofpbuf.h"
 #include "ovs-atomic.h"
+#include "ovs-thread.h"
 
 struct nl_sock;
 
@@ -114,6 +115,7 @@ struct nl_dump {
     atomic_uint status;         /* Low bit set if we read final message.
                                  * Other bits hold an errno (0 for success). */
     struct seq *status_seq;     /* Tracks changes to the above 'status'. */
+    struct ovs_mutex mutex;
 };
 
 void nl_dump_start(struct nl_dump *, int protocol,
author	Ben Pfaff <blp@nicira.com>	2014-07-10 16:48:16 -0700
committer	Ben Pfaff <blp@nicira.com>	2014-07-10 16:50:44 -0700
commit	8cad00cce10839844a827d727358f86a4e3ec1a3 (patch)
tree	382a2e028991f6bbf8bd61e2bd51cdf09e9e0762
parent	bebdded84e5232e78d1cbb134257eb85ee049b40 (diff)
download	openvswitch-8cad00cce10839844a827d727358f86a4e3ec1a3.tar.gz