correct essay & comments regarding the 'delayed confirm' rationale

Matthew has confirmed that the "we don't know the msg_seq_no until we receive the msg from the channel" reason is bogus. The msg_seq_no is allocated by the channel prior to routing and thus is the same across the master and all slaves. Hence the 'publish' via gm contains all the information we need to issue a confirm. Nevertheless we cannot actually issue the confirm until we've received the message from the channel. The essay now explains the real reason.
author: Matthias Radestock <matthias@rabbitmq.com> 2012-10-12 23:31:15 +0100
committer: Matthias Radestock <matthias@rabbitmq.com> 2012-10-12 23:31:15 +0100
commit: 7562d6dbacb54e30e4307eff63632e0ca3ec25fa (patch)
tree: ce9d873ee27301aa1d0f99d75ca4f333a80cd682 /src
parent: 100a85a5a47f229624c9e82694d0aaa77649b391 (diff)
download: rabbitmq-server-7562d6dbacb54e30e4307eff63632e0ca3ec25fa.tar.gz
2 files changed, 22 insertions, 23 deletions
diff --git a/src/rabbit_mirror_queue_coordinator.erl b/src/rabbit_mirror_queue_coordinator.erl
index 72dcfc95..6cd71fc3 100644
--- a/src/rabbit_mirror_queue_coordinator.erl
+++ b/src/rabbit_mirror_queue_coordinator.erl
@@ -101,19 +101,25 @@
 %% channel during a publish, only some of the mirrors may receive that
 %% publish. As a result of this problem, the messages broadcast over
 %% the gm contain published content, and thus slaves can operate
-%% successfully on messages that they only receive via the gm. The key
-%% purpose of also sending messages directly from the channels to the
-%% slaves is that without this, in the event of the death of the
-%% master, messages could be lost until a suitable slave is promoted.
+%% successfully on messages that they only receive via the gm.
 %%
-%% However, that is not the only reason. For example, if confirms are
-%% in use, then there is no guarantee that every slave will see the
-%% delivery with the same msg_seq_no. As a result, the slaves have to
-%% wait until they've seen both the publish via gm, and the publish
-%% via the channel before they have enough information to be able to
-%% perform the publish to their own bq, and subsequently issue the
-%% confirm, if necessary. Either form of publish can arrive first, and
-%% a slave can be upgraded to the master at any point during this
+%% The key purpose of also sending messages directly from the channels
+%% to the slaves is that without this, in the event of the death of
+%% the master, messages could be lost until a suitable slave is
+%% promoted. However, that is not the only reason. A slave cannot send
+%% confirms for a message until it has seen it from the
+%% channel. Otherwise, it might send a confirm to a channel for a
+%% message that it might *never* receive from that channel. This can
+%% happen because new slaves join the gm ring (and thus receive
+%% messages from the master) before inserting themselves in the
+%% queue's mnesia record (which is what channels look at for routing).
+%% As it turns out, channels will simply ignore such bogus confirms,
+%% but relying on that would introduce a dangerously tight coupling.
+%%
+%% Hence the slaves have to wait until they've seen both the publish
+%% via gm, and the publish via the channel before they issue the
+%% confirm. Either form of publish can arrive first, and a slave can
+%% be upgraded to the master at any point during this
 %% process. Confirms continue to be issued correctly, however.
 %%
 %% Because the slave is a full process, it impersonates parts of the
diff --git a/src/rabbit_mirror_queue_slave.erl b/src/rabbit_mirror_queue_slave.erl
index 0530fa7f..f4679184 100644
--- a/src/rabbit_mirror_queue_slave.erl
+++ b/src/rabbit_mirror_queue_slave.erl
@@ -634,15 +634,11 @@ maybe_enqueue_message(
             SQ1 = dict:store(ChPid, {MQ1, PendingCh}, SQ),
             State1 #state { sender_queues = SQ1 };
         {ok, confirmed} ->
-            %% BQ has confirmed it but we didn't know what the
-            %% msg_seq_no was at the time. We do now!
             ok = rabbit_misc:confirm_to_sender(ChPid, [MsgSeqNo]),
             SQ1 = remove_from_pending_ch(MsgId, ChPid, SQ),
             State1 #state { msg_id_status = dict:erase(MsgId, MS),
                             sender_queues = SQ1 };
         {ok, published} ->
-            %% It was published to the BQ and we didn't know the
-            %% msg_seq_no so couldn't confirm it at the time.
             {MS1, SQ1} =
                 case needs_confirming(Delivery, State1) of
                     never       -> {dict:erase(MsgId, MS),
@@ -686,13 +682,10 @@ process_instruction(
                    msg_id_status       = MS }) ->
 
     %% We really are going to do the publish right now, even though we
-    %% may not have seen it directly from the channel. As a result, we
-    %% may know that it needs confirming without knowing its
-    %% msg_seq_no, which means that we can see the confirmation come
-    %% back from the backing queue without knowing the msg_seq_no,
-    %% which means that we're going to have to hang on to the fact
-    %% that we've seen the msg_id confirmed until we can associate it
-    %% with a msg_seq_no.
+    %% may not have seen it directly from the channel. But we cannot
+    %% issue confirms until the latter has happened. So we need to
+    %% keep track of the MsgId and its confirmation status in the
+    %% meantime.
     State1 = ensure_monitoring(ChPid, State),
     {MQ, PendingCh} = get_sender_queue(ChPid, SQ),
     {MQ1, PendingCh1, MS1} =
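
The rule described in the essay (a slave confirms only after it has seen a message both via gm and directly from the channel, in either order) can be illustrated with a small stand-alone model. This is a minimal sketch, not the code in rabbit_mirror_queue_slave.erl: the module name, the event tuples and the status atoms are all hypothetical, and it uses a map where the real module uses a dict. For simplicity the sketch takes the msg_seq_no from the channel delivery, although the gm publish carries it as well.

```erlang
-module(delayed_confirm_sketch).
-export([new/0, handle/2, demo/0]).

%% Hypothetical, simplified model of the "delayed confirm" rule.
%% Per-MsgId status values (all names are illustrative):
%%   {channel, SeqNo}    seen from the channel only
%%   published           published via gm, not yet seen from the channel
%%   confirmed           published via gm and confirmed by the backing
%%                       queue, still not seen from the channel
%%   {published, SeqNo}  seen via both routes, awaiting the bq confirm

new() -> #{}.

%% handle(Event, Status) -> {SeqNosToConfirm, NewStatus}
handle({gm_publish, MsgId}, MS) ->
    case maps:find(MsgId, MS) of
        error                -> {[], MS#{MsgId => published}};
        {ok, {channel, Seq}} -> {[], MS#{MsgId => {published, Seq}}}
    end;
handle({bq_confirm, MsgId}, MS) ->
    %% A bq confirm can only follow a gm publish, so other states are
    %% deliberately left to crash in this sketch.
    case maps:find(MsgId, MS) of
        {ok, published}        -> {[], MS#{MsgId => confirmed}};
        {ok, {published, Seq}} -> {[Seq], maps:remove(MsgId, MS)}
    end;
handle({channel_delivery, MsgId, Seq}, MS) ->
    case maps:find(MsgId, MS) of
        error           -> {[], MS#{MsgId => {channel, Seq}}};
        {ok, published} -> {[], MS#{MsgId => {published, Seq}}};
        %% Only now is it safe to confirm: the channel really did
        %% route this message to us.
        {ok, confirmed} -> {[Seq], maps:remove(MsgId, MS)}
    end.

demo() ->
    %% gm publish and bq confirm arrive before the channel delivery
    {[], S1}  = handle({gm_publish, m1}, new()),
    {[], S2}  = handle({bq_confirm, m1}, S1),
    {[7], S3} = handle({channel_delivery, m1, 7}, S2),
    %% channel delivery arrives first
    {[], S4}  = handle({channel_delivery, m2, 8}, S3),
    {[], S5}  = handle({gm_publish, m2}, S4),
    {[8], S6} = handle({bq_confirm, m2}, S5),
    0 = maps:size(S6),
    ok.
```

In both orderings exercised by demo/0 the confirm list stays empty until the channel delivery has been seen, which is the property the essay argues for: a slave that has joined the gm ring but is not yet in the queue's mnesia record never gets a channel delivery, so in this model it never emits a confirm.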