==================
Server maintenance
==================

General assumptions
~~~~~~~~~~~~~~~~~~~

-  It is assumed that anyone attempting to replace hardware components
   will have already read and understood the appropriate maintenance and
   service guides.

-  It is assumed that where servers need to be taken off-line for
   hardware replacement, this will be done in series, bringing each
   server back on-line before taking the next off-line.

-  It is assumed that the operations-directed procedure will be used
   for identifying hardware for replacement.

Assessing the health of Swift
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can run the ``swift-recon`` tool on a Swift proxy node to get a
quick check of how Swift is doing. Note that the numbers below are
necessarily somewhat subjective: parameters for which we say 'low
values are good' will sometimes show high values for a time, and often
things get better if you wait a while.

For example:

.. code:: console

   $ sudo swift-recon -rla
   ===============================================================================
   [2012-03-10 12:57:21] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 1, avg: 0, total: 1
   ===============================================================================

   [2012-03-10 12:57:22] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.4113877813, longest: 36.8293570836, avg: 4.86278064749
   ===============================================================================

   [2012-03-10 12:57:22] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.22, highest: 9.5, avg: 4.59578125
   [15m load average] lowest: 2.36, highest: 9.45, avg: 4.62622395833
   [1m load average] lowest: 1.84, highest: 9.57, avg: 4.5696875
   ===============================================================================

In the example above we ask for information on replication times
(``-r``), load averages (``-l``), and async pendings (``-a``). This is
a healthy Swift system. Rules-of-thumb for 'good' recon output are:

-  Nodes that respond are up and running Swift. If all nodes respond,
   that is a good sign. But some nodes may time out. For example:

   .. code:: console

      -> [http://<redacted>.29:6200/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
      -> [http://<redacted>.31:6200/recon/load:] <urlopen error timed out>

   That could be okay or could require investigation.

-  Low values (say < 10 for high and average) for async pendings are
   good. Higher values occur when disks are down and/or when the system
   is heavily loaded. Many simultaneous PUTs to the same container can
   drive async pendings up. This may be normal, and may resolve itself
   after a while. If it persists, one way to track down the problem is
   to find a node with high async pendings (with ``swift-recon -av |
   sort -n -k4``), then check its Swift logs. Often async pendings are
   high because a node cannot write to a container on another node,
   typically because that node or disk is offline or bad. This may be
   okay if we know about it.

-  Low values for replication times are good. These values rise when new
   rings are pushed, and when nodes and devices are brought back on
   line.

-  Our 'high' load average values are typically in the 9-15 range. If
   they are a lot bigger it is worth having a look at the systems
   pushing the average up. Run ``swift-recon -av`` to get the individual
   averages. To sort the entries with the highest at the end,
   run ``swift-recon -av | sort -n -k4`` (see the sketch below).
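
A minimal sketch of that pipeline, trimmed to the five highest entries
(``tail`` is standard coreutils; this assumes the value of interest is
in column 4, as above):

.. code:: console

   $ sudo swift-recon -av | sort -n -k4 | tail -n 5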

For comparison, here is the recon output for the same system when two
entire racks of Swift are down:

.. code:: console

   [2012-03-10 16:56:33] Checking async pendings on 384 hosts...
   -> http://<redacted>.22:6200/recon/async: <urlopen error timed out>
   -> http://<redacted>.18:6200/recon/async: <urlopen error timed out>
   -> http://<redacted>.16:6200/recon/async: <urlopen error timed out>
   -> http://<redacted>.13:6200/recon/async: <urlopen error timed out>
   -> http://<redacted>.30:6200/recon/async: <urlopen error timed out>
   -> http://<redacted>.6:6200/recon/async: <urlopen error timed out>
   .........
   -> http://<redacted>.5:6200/recon/async: <urlopen error timed out>
   -> http://<redacted>.15:6200/recon/async: <urlopen error timed out>
   -> http://<redacted>.9:6200/recon/async: <urlopen error timed out>
   -> http://<redacted>.27:6200/recon/async: <urlopen error timed out>
   -> http://<redacted>.4:6200/recon/async: <urlopen error timed out>
   -> http://<redacted>.8:6200/recon/async: <urlopen error timed out>
   Async stats: low: 243, high: 659, avg: 413, total: 132275
   ===============================================================================
   [2012-03-10 16:57:48] Checking replication times on 384 hosts...
   -> http://<redacted>.22:6200/recon/replication: <urlopen error timed out>
   -> http://<redacted>.18:6200/recon/replication: <urlopen error timed out>
   -> http://<redacted>.16:6200/recon/replication: <urlopen error timed out>
   -> http://<redacted>.13:6200/recon/replication: <urlopen error timed out>
   -> http://<redacted>.30:6200/recon/replication: <urlopen error timed out>
   -> http://<redacted>.6:6200/recon/replication: <urlopen error timed out>
   ............
   -> http://<redacted>.5:6200/recon/replication: <urlopen error timed out>
   -> http://<redacted>.15:6200/recon/replication: <urlopen error timed out>
   -> http://<redacted>.9:6200/recon/replication: <urlopen error timed out>
   -> http://<redacted>.27:6200/recon/replication: <urlopen error timed out>
   -> http://<redacted>.4:6200/recon/replication: <urlopen error timed out>
   -> http://<redacted>.8:6200/recon/replication: <urlopen error timed out>
   [Replication Times] shortest: 1.38144306739, longest: 112.620954418, avg: 10.2859475361
   ===============================================================================
   [2012-03-10 16:59:03] Checking load avg's on 384 hosts...
   -> http://<redacted>.22:6200/recon/load: <urlopen error timed out>
   -> http://<redacted>.18:6200/recon/load: <urlopen error timed out>
   -> http://<redacted>.16:6200/recon/load: <urlopen error timed out>
   -> http://<redacted>.13:6200/recon/load: <urlopen error timed out>
   -> http://<redacted>.30:6200/recon/load: <urlopen error timed out>
   -> http://<redacted>.6:6200/recon/load: <urlopen error timed out>
   ............
   -> http://<redacted>.15:6200/recon/load: <urlopen error timed out>
   -> http://<redacted>.9:6200/recon/load: <urlopen error timed out>
   -> http://<redacted>.27:6200/recon/load: <urlopen error timed out>
   -> http://<redacted>.4:6200/recon/load: <urlopen error timed out>
   -> http://<redacted>.8:6200/recon/load: <urlopen error timed out>
   [5m load average] lowest: 1.71, highest: 4.91, avg: 2.486375
   [15m load average] lowest: 1.79, highest: 5.04, avg: 2.506125
   [1m load average] lowest: 1.46, highest: 4.55, avg: 2.4929375
   ===============================================================================

.. note::

   The replication times and load averages are within reasonable
   parameters, even with 80 object stores down. Async pendings, however,
   are quite high. This is because the containers on the servers which
   are down cannot be updated. When those servers come back up, async
   pendings should drop. If async pendings were at this level without an
   explanation, we would have a problem.
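
Other recon checks can also help assess health. For example, a sketch
of two additional queries that ``swift-recon`` supports (verify the
flags with ``swift-recon --help`` on your release):

.. code:: console

   $ sudo swift-recon -d      # per-disk usage across the cluster
   $ sudo swift-recon --md5   # compare each server's ring md5sum to the local copy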

Recon examples
~~~~~~~~~~~~~~

Here is an example of noting and tracking down a problem with recon.

Running recon shows some async pendings:

.. code:: console

   $ ssh -q <redacted>.132.7 sudo swift-recon -alr
   ===============================================================================
   [2012-03-14 17:25:55] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 23, avg: 8, total: 3356
   ===============================================================================
   [2012-03-14 17:25:55] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
   ===============================================================================
   [2012-03-14 17:25:56] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333
   [15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625
   [1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667
   ===============================================================================

Why? Running ``swift-recon -av`` again (not shown here) tells us that
the node with the highest async pending count (23) is <redacted>.72.61.
Looking at the log files on <redacted>.72.61 we see:

.. code:: console

   $ sudo tail -f /var/log/swift/background.log | grep -i ERROR
   Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
   Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
   Mar 14 17:28:09 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
   Mar 14 17:28:11 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
   Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
   Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
   Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
   Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
   Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
   Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
   Mar 14 17:28:20 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
   Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
   Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
   Mar 14 17:28:22 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}

That is why this node has a lot of async pendings: a bunch of disks are
not mounted on <redacted> and <redacted>. There may be other issues,
but clearing this up will likely drop the async pendings a fair bit,
since other nodes will be hitting the same unmounted drives.
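
Rather than tailing logs node by node, a quick cluster-wide check for
unmounted drives is also possible with ``swift-recon`` (a sketch;
verify the flag with ``swift-recon --help`` on your release):

.. code:: console

   $ sudo swift-recon -u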

Assessing the availability risk when multiple storage servers are down
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   This procedure will tell you if you have a problem; in practice,
   however, you will not use it frequently.

If three storage nodes (or, more precisely, three disks on three
different storage nodes) are down, there is a small but nonzero
probability that user objects, containers, or accounts will not be
available.

Procedure
---------

.. note::

   Swift has three rings: one each for objects, containers, and accounts.
   This procedure should be run three times, each time specifying the
   appropriate ``*.builder`` file.

#. Check whether all three nodes are in different Swift zones by
   running the ring builder on a proxy node to see which zones the
   storage nodes are in. For example:

   .. code:: console

      % sudo swift-ring-builder /etc/swift/object.builder
      /etc/swift/object.builder, build version 1467
      2097152 partitions, 3 replicas, 5 zones, 1320 devices, 0.02 balance
      The minimum number of hours before a partition can be reassigned is 24
      Devices:    id  zone     ip address    port     name  weight  partitions balance meta
                   0     1     <redacted>.4  6200     disk0 1708.00       4259   -0.00
                   1     1     <redacted>.4  6200     disk1 1708.00       4260    0.02
                   2     1     <redacted>.4  6200     disk2 1952.00       4868    0.01
                   3     1     <redacted>.4  6200     disk3 1952.00       4868    0.01
                   4     1     <redacted>.4  6200     disk4 1952.00       4867   -0.01

#. Here, node <redacted>.4 is in zone 1. If two or more of the three
   nodes under consideration are in the same Swift zone, they do not
   have any ring partitions in common; there is little or no data
   availability risk if all three nodes are down.

#. If the nodes are in three distinct Swift zones it is necessary to
   determine whether the nodes have ring partitions in common. Run
   ``swift-ring-builder`` again, this time with the ``list_parts``
   option, and specify the nodes under consideration. For example:

   .. code:: console

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2
      Partition   Matches
      91           2
      729          2
      3754         2
      3769         2
      3947         2
      5818         2
      7918         2
      8733         2
      9509         2
      10233        2

#. The ``list_parts`` option to the ring builder indicates how many ring
   partitions the nodes have in common. If, as in this case, the
   first entry in the list has a 'Matches' column of 2 or less, there
   is no data availability risk if all three nodes are down.

#. If the 'Matches' column has entries equal to 3, there is some data
   availability risk if all three nodes are down. The risk is generally
   small, and is proportional to the number of entries that have a 3 in
   the Matches column. For example:

   .. code:: console

      Partition   Matches
      26865          3
      362367         3
      745940         3
      778715         3
      797559         3
      820295         3
      822118         3
      839603         3
      852332         3
      855965         3
      858016         3

#. A quick way to count the number of rows with 3 matches is:

   .. code:: console

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l

      30

#. In this case the nodes have 30 out of a total of 2097152 partitions
   in common; about 0.0014% (see the arithmetic sketch after this
   list). The risk here is small but nonzero. Recall that a partition
   is simply a portion of the ring mapping space, not actual data, so
   having partitions in common is a necessary but not sufficient
   condition for data unavailability.

   .. note::

      We should not bring down a node for repair if it shows
      Matches entries of 3 with other nodes that are also down.

      If three nodes that have 3 partitions in common are all down, there is
      a nonzero probability that data are unavailable and we should work to
      bring some or all of the nodes up ASAP.
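
To make the percentage above concrete, the fraction of common
partitions can be computed directly (a minimal sketch; any Python 3
will do):

.. code:: console

   % python3 -c "print(30 / 2097152 * 100)"
   0.001430511474609375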

Swift startup/shutdown
~~~~~~~~~~~~~~~~~~~~~~

-  Use ``reload`` - not stop/start/restart.

-  Try to roll sets of servers (especially proxies) in groups of fewer
   than 20% of your servers, as in the sketch below.
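
For example, a graceful rolling reload with ``swift-init`` (a sketch;
service names may differ if you manage Swift through systemd units):

.. code:: console

   # On each server in the current batch:
   $ sudo swift-init proxy-server reload   # graceful restart of the proxy
   $ sudo swift-init all reload            # or reload every Swift service on the node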