From 0791315e4dbf82e293a759086ee17c84b12b1d6e Mon Sep 17 00:00:00 2001
From: Ben Pfaff <blp@nicira.com>
Date: Thu, 10 Jul 2014 16:48:16 -0700
Subject: netlink-socket: Work around kernel Netlink dump thread races.

The Linux kernel Netlink implementation has two races that cause problems
for processes that attempt to dump a table in a multithreaded manner.

The first race is in the structure of the kernel netlink_recv() function.
This function pulls a message from the socket queue and, if there is none,
reports EAGAIN:
	skb = skb_recv_datagram(sk, flags, noblock, &err);
	if (skb == NULL)
		goto out;
Only if a message is successfully read from the socket queue does the
function, toward the end, try to queue up a new message to be dumped:
	if (nlk->cb && atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) {
		ret = netlink_dump(sk);
		if (ret) {
			sk->sk_err = ret;
			sk->sk_error_report(sk);
		}
	}
This means that if thread A reads a message from a dump, then thread B
attempts to read one before A queues up the next, B will get EAGAIN.  This
means that, following EAGAIN, B needs to wait until A returns to userspace
before it tries to read the socket again.  nl_dump_next() already does
this, using 'dump->status_seq' (although the need for it has never been
explained clearly, to my knowledge).

The second race is more serious.  Suppose thread X and thread Y both
simultaneously attempt to queue up a new message to be dumped, using the
call to netlink_dump() quoted above.  netlink_dump() begins with:
	mutex_lock(nlk->cb_mutex);

	cb = nlk->cb;
	if (cb == NULL) {
		err = -EINVAL;
		goto errout_skb;
	}
Suppose that X gets cb_mutex first and finds that the dump is complete.  It
will therefore, toward the end of netlink_dump(), clear nlk->cb to NULL to
indicate that no dump is in progress and release the mutex:
	nlk->cb = NULL;
	mutex_unlock(nlk->cb_mutex);
When Y grabs cb_mutex afterward, it will see that nlk->cb is NULL and
return -EINVAL as quoted above.  netlink_recv() stuffs -EINVAL in sk_err,
but that error is not reported immediately; instead, it is saved for the
next read from the socket.  Since Open vSwitch maintains a pool of Netlink
sockets, that next failure can crop up pretty much anywhere.  One of the
worst places for it to crop up is in the execution of a later transaction
(e.g. in nl_sock_transact_multiple__()), because userspace treats Netlink
transactions as idempotent and will re-execute them when socket errors
occur.  For a transaction that sends a packet, this causes packet
duplication, which we actually observed in practice.  (ENOBUFS should
actually cause transactions to be re-executed in many cases, but EINVAL
should not; this is a separate bug in the userspace netlink code.)

VMware-BZ: #1283188
Reported-and-tested-by: Alex Wang <alexw@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Alex Wang <alexw@nicira.com>
---
 lib/netlink-socket.h | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'lib/netlink-socket.h')

diff --git a/lib/netlink-socket.h b/lib/netlink-socket.h
index dd3240907..8ac201aac 100644
--- a/lib/netlink-socket.h
+++ b/lib/netlink-socket.h
@@ -52,6 +52,7 @@
 #include <stdint.h>
 #include "ofpbuf.h"
 #include "ovs-atomic.h"
+#include "ovs-thread.h"
 
 struct nl_sock;
 
@@ -114,6 +115,7 @@ struct nl_dump {
     atomic_uint status;         /* Low bit set if we read final message.
                                  * Other bits hold an errno (0 for success). */
     struct seq *status_seq;     /* Tracks changes to the above 'status'. */
+    struct ovs_mutex mutex;
 };
 
 void nl_dump_start(struct nl_dump *, int protocol,
-- 
cgit v1.2.1