1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
|
Introduction
------------
TMCast (stands for Transaction MultiCast) is an implementation of a
transactional multicast protocol. In essence, the idea is to represent
each message delivery to members of a multicast group as a transaction
- an atomic, consistent and isolated action. A multicast transaction
can be viewed as an atomic transition of the group members to a new
state. If we define [Mo] as a set of operational (non-faulty) members
of the group, [Mf] as a set of faulty members of the group, [Ma] as a
set of members that view transition [Tn] as aborted and [Mc] as a set
of members that view transition [Tn] as committed, then this atomic
transition [Tn] should satisfy one of the following equations:
Mo(Tn-1) = Ma(T) U Mf(T)
Mo(Tn-1) = Mc(T) U Mf(T)
Or, in other words, after transaction T has been committed (aborted),
all operational (before transaction T) members are either in the
committed (aborted) or failed state.
Thus, for each member of the group, outcome of the transaction can be
commit, abort or a member failure. It is important for a member to
exhibit a failfast (error latency is less than transaction cycle)
behavior. Or, in other words, if a member transitioned into a wrong
state, it is guaranteed to fail instead of delivering a wrong result.
In order to achieve such an error detection in a decentralized
environment, certain limitations were imposed. One of the most
user-visible limitation is the fact that the lifetime of the group
with only one member is very short. This is because there is not way
for a member to distinguish "no members yet" case from "my link to the
group is down". In such a situation, the member assumes the latter
case. There is also a military saying that puts it quite nicely: two
is one, one is nothing.
State of Implementation
-----------------------
The current implementation is in a prototypical stage. The following
parts are not implemented or still under development:
* Handling of network partitioning (TODO)
* Redundant network support (TODO)
* Member failure detection (partial implementation)
Examples
--------
There is a simple example available in examples/TMCast/Member with
the corresponding README.
Architecture
------------
Primary goals of the protocol are to (1) mask transient failures of the
underlying multicast protocol (or, more precisely, allow to recover
from transient failures) and (2) exhibit failfast behavior in cases of
permanent failures.
The distinction between transient and permanent failures is based on
timeouts thus what can be a transient failure in one configuration of
the protocol could be a permanent failure in the other.
[Maybe talk more about a transient/permanent threshold and its effect
on performance/resource utilization/etc.]
[Maybe add a terminology section.]
Each member of a multicast group has its unique (group-wise) id:
struct MemberId
{
char id[MEMBER_ID_LENGTH];
};
Each payload delivery is part of a transaction. Each transaction is
identified by TransactionId:
typedef unsigned short TransactionId;
Each transaction has a status code which identifies its state, as viewed by
a group member:
typedef unsigned char TransactionStatus;
TransactionStatus const TS_BEGIN = 1;
TransactionStatus const TS_COMMIT = 2;
TransactionStatus const TS_ABORT = 3;
TransactionStatus const TS_COMMITTED = 4;
TransactionStatus const TS_ABORTED = 5;
Thus each transaction is described by its id and status:
struct Transaction
{
TransactionId id;
TransactionStatus status;
};
The outcome of some predefined number of recent transactions is stored
in TransactionList:
typedef Transaction TransactionList[TL_LENGTH];
Each message sent to a multicast group has the following header:
struct MessageHeader
{
unsigned long length;
unsigned long check_sum;
MemberId member_id;
Transaction current;
TransactionList transaction_list;
};
[Maybe describe each field here.]
A new member joins the group with transaction id 0 and status
TS_COMMITTED.
Each member sends a periodic 'pulse' messages with some predefined interval
advertising its current view of the group. This includes the state of the
current transaction and the history of the recent transactions.
If a member of the group needs a payload delivery it starts a new
transaction by sending a message with current transaction set to
{++current_id, TS_BEGIN}
and payload appended after the header.
Each member joins a transaction in one of the following ways:
* A member that began the transaction joins it 'to commit' (TS_COMMIT)
* A member that received TS_BEGIN joins current transaction 'to commit'
(TS_COMMIT).
* A member that received TS_COMMIT or TS_ABORT but did not receive TS_BEGIN
joins current transaction 'to abort' (TS_ABORT).
After a member has joined the transaction it starts participating in the
transaction's voting phase. On this phase members of the group decide the
fate of the transaction. Each member sends a predefined number of messages
where it announces its vote. In between those messages the member is receiving
and processing votes from other members and can be influenced by their
'opinion'.
In their decision-making members follow the principle of the majority. As
the voting progresses (and comes close to an end) members become more and
more reluctant to deviate from the decision of the majority.
[Maybe add an equation that measures member's willingness to change
its mind.]
At the end of the voting phase each member declares the current transaction
either committed (TS_COMMITTED) or aborted (TS_ABORTED). If this decision does
not agree with the majority the member declares itself failed.
In addition, each member builds a 'majority view' of the transaction history
(based on transaction_list). If it deviates from the member's own history the
member declares itself failed.
Here are some example scenarios of how the protocol behaves in different
situations. Let's say we have three members of the group S, R1, R2. S
initiates a transaction. R1 and R2 join it.
Scenario 1. (two-step voting)
1. S initiates a transaction (TS_BEGIN)
2a. R1 receives TS_BEGIN, joins for commit
2b. R2 receives TS_BEGIN, joins for commit
3a. S announces TS_COMMIT (first vote)
3b. R1 announces TS_COMMIT (first vote)
3c. R2 announces TS_COMMIT (first vote)
4a. S announces TS_COMMIT (second vote)
4b. R1 announces TS_COMMIT (second vote)
4c. R2 announces TS_COMMIT (second vote)
5a. S announces TS_COMMITTED (end of vote)
5b. R1 announces TS_COMMITTED (end of vote)
5c. R2 announces TS_COMMITTED (end of vote)
Scenario 2. (two-step voting)
1. S initiates a transaction (TS_BEGIN)
2a. R1 receives TS_BEGIN, joins for commit
2b. R2 didn't receive TS_BEGIN
3a. S announces TS_COMMIT (first vote)
3b. R1 announces TS_COMMIT (first vote)
3c. R2 received R1's TS_COMMIT announces TS_ABORT (first vote)
4a. S received R2's TS_ABORT announces TS_ABORT (second vote)
4b. R1 received R2's TS_ABORT announces TS_ABORT (second vote)
4c. R2 announces TS_ABORT (second vote)
5a. S announces TS_ABORTED (end of vote)
5b. R1 announces TS_ABORTED (end of vote)
5c. R2 announces TS_ABORTED (end of vote)
Scenario 3. (three-step voting)
1. S initiates a transaction (TS_BEGIN)
2a. R1 receives TS_BEGIN, joins for commit
2b. R2 didn't receive TS_BEGIN
3a. S announces TS_COMMIT (first vote)
3b. R1 announces TS_COMMIT (first vote)
3c. R2 still didn't receive anything
4a. S announces TS_COMMIT (second vote)
4b. R1 announces TS_COMMIT (second vote)
4c. R2 received R1's TS_COMMIT, announces TS_ABORT (first vote)
5a. S received R2's TS_ABORT but it is the end of the voting phase and
majority (S and R1) vote for commit, announces TS_COMMIT (third vote)
5b. R1 received R2's TS_ABORT but it is the end of the voting phase and
majority (S and R1) vote for commit, announces TS_COMMIT (third vote)
5c. R2 announces TS_ABORT (second vote)
6a. S announces TS_COMMITTED (end of vote)
6b. R1 announces TS_COMMITTED (end of vote)
6c. R2 discovers that the the majority has declared current transaction
committed and thus declares itself failed.
--
Boris Kolpackov <boris@dre.vanderbilt.edu>
|