summaryrefslogtreecommitdiff
path: root/doc/source/ops_runbook/troubleshooting.rst
blob: cb7553fc669ca3f5ebeea36e382584e4c5b0de98 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
====================
Troubleshooting tips
====================

Diagnose: Customer complains they receive a HTTP status 500 when trying to browse containers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This entry is prompted by a real customer issue and exclusively focused on how
that problem was identified.
There are many reasons why a http status of 500 could be returned. If
there are no obvious problems with the swift object store, then it may
be necessary to take a closer look at the users transactions.
After finding the users swift account, you can
search the swift proxy logs on each swift proxy server for
transactions from this user. The linux ``bzgrep`` command can be used to
search all the proxy log files on a node including the ``.bz2`` compressed
files. For example:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139] \
     'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' |  dshbak -c
    .
    .
    ----------------
    <redacted>.132.6
    ----------------
    Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132
    <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af
    /%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - -
    tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130

This shows a ``GET`` operation on the users account.

.. note::

   The HTTP status returned is 404, Not found, rather than 500 as reported by the user.

Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3`` you can
search the swift object servers log files for this transaction ID:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.72.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.204.[4-131] \
     'sudo bzgrep tx429fc3be354f434ab7f9c6c4206c1dc3 /var/log/swift/server.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.72.16
   ----------------
   Feb 29 08:51:57 sw-aw2az1-object013 account-server <redacted>.132.6 - -

   [29/Feb/2012:08:51:57 +0000|] "GET /disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-"

   0.0016 ""
   ----------------
   <redacted>.31
   ----------------
   Feb 29 08:51:57 node-az2-object060 account-server <redacted>.132.6 - -
   [29/Feb/2012:08:51:57 +0000|] "GET /disk6/198875/AUTH_redacted-4962-
   4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0011 ""
   ----------------
   <redacted>.204.70
   ----------------

   Feb 29 08:51:57 sw-aw2az3-object0067 account-server <redacted>.132.6 - -
   [29/Feb/2012:08:51:57 +0000|] "GET /disk6/198875/AUTH_redacted-4962-
   4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0014 ""

.. note::

   The 3 GET operations to 3 different object servers that hold the 3
   replicas of this users account. Each ``GET`` returns a HTTP status of 404,
   Not found.

Next, use the ``swift-get-nodes`` command to determine exactly where the
user's account data is stored:

.. code::

   $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-4962-4692-98fb-52ddda82a5af
   Account AUTH_redacted-4962-4692-98fb-52ddda82a5af
   Container None
   Object None

   Partition 198875
   Hash 1846d99185f8a0edaf65cfbf37439696

   Server:Port Device <redacted>.31:6202 disk6
   Server:Port Device <redacted>.204.70:6202 disk6
   Server:Port Device <redacted>.72.16:6202 disk9
   Server:Port Device <redacted>.204.64:6202 disk11 [Handoff]
   Server:Port Device <redacted>.26:6202 disk11 [Handoff]
   Server:Port Device <redacted>.72.27:6202 disk11 [Handoff]

   curl -I -XHEAD "`http://<redacted>.31:6202/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   <http://15.185.138.31:6202/disk6/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_
   curl -I -XHEAD "`http://<redacted>.204.70:6202/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   <http://15.185.204.70:6202/disk6/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_
   curl -I -XHEAD "`http://<redacted>.72.16:6202/disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   <http://15.185.72.16:6202/disk9/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_
   curl -I -XHEAD "`http://<redacted>.204.64:6202/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   <http://15.185.204.64:6202/disk11/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_ # [Handoff]
   curl -I -XHEAD "`http://<redacted>.26:6202/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   <http://15.185.136.26:6202/disk11/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_ # [Handoff]
   curl -I -XHEAD "`http://<redacted>.72.27:6202/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   <http://15.185.72.27:6202/disk11/198875/AUTH_db0050ad-4962-4692-98fb-52ddda82a5af>`_ # [Handoff]

   ssh <redacted>.31 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.70 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.72.16 "ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.64 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
   ssh <redacted>.26 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
   ssh <redacted>.72.27 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]

Check each of the primary servers, <redacted>.31, <redacted>.204.70  and <redacted>.72.16, for
this users account. For example on <redacted>.72.16:

.. code::

   $ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
   total 1.0M
   drwxrwxrwx 2 swift swift 98 2012-02-23 14:49 .
   drwxrwxrwx 3 swift swift 45 2012-02-03 23:28 ..
   -rw------- 1 swift swift 15K 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db
   -rw-rw-rw- 1 swift swift 0 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db.pending

So this users account db, an sqlite db is present. Use sqlite to
checkout the account:

.. code::

   $ sudo cp /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db /tmp
   $ sudo sqlite3 /tmp/1846d99185f8a0edaf65cfbf37439696.db
   sqlite> .mode line
   sqlite> select * from account_stat;
   account = AUTH_redacted-4962-4692-98fb-52ddda82a5af
   created_at = 1328311738.42190
   put_timestamp = 1330000873.61411
   delete_timestamp = 1330001026.00514
   container_count = 0
   object_count = 0
   bytes_used = 0
   hash = eb7e5d0ea3544d9def940b19114e8b43
   id = 2de8c8a8-cef9-4a94-a421-2f845802fe90
   status = DELETED
   status_changed_at = 1330001026.00514
   metadata =

.. note:

   The status is ``DELETED``. So this account was deleted. This explains
   why the GET operations are returning 404, not found. Check the account
   delete date/time:

   .. code::

      $ python

      >>> import time
      >>> time.ctime(1330001026.00514)
      'Thu Feb 23 12:43:46 2012'

Next try and find the ``DELETE`` operation for this account in the proxy
server logs:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139|4-11,132-139] \
     'sudo bzgrep AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log* \
     | grep -w DELETE | awk "{print $3,$10,$12}"' |- dshbak -c
   .
   .
   Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server <redacted> <redacted>.66.7 23/Feb/2012/12/43/46 DELETE /v1.0/AUTH_redacted-4962-4692-98fb-
   52ddda82a5af/ HTTP/1.0 204 - Apache-HttpClient/4.1.2%20%28java%201.5%29 <REDACTED>_4f458ee4e4b02a869c3aad02 - - -
   tx4471188b0b87406899973d297c55ab53 - 0.0086

From this you can see the operation that resulted in the account being deleted.

Procedure: Deleting objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Simple case - deleting small number of objects and containers
-------------------------------------------------------------

.. note::

   ``swift-direct`` is specific to the Hewlett Packard Enterprise Helion Public Cloud.
   Use ``swiftly`` as an alternative.

.. note::

   Object and container names are in UTF8. Swift direct accepts UTF8
   directly, not URL-encoded UTF8 (the REST API expects UTF8 and then
   URL-encoded). In practice cut and paste of foreign language strings to
   a terminal window will produce the right result.

   Hint: Use the ``head`` command before any destructive commands.

To delete a small number of objects, log into any proxy node and proceed
as follows:

Examine the object in question:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct head 132345678912345 container_name obj_name

See if ``X-Object-Manifest`` or ``X-Static-Large-Object`` is set,
then this is the manifest object and segment objects may be in another
container.

If the ``X-Object-Manifest`` attribute is set, you need to find the
name of the objects this means it is a DLO. For example,
if ``X-Object-Manifest`` is ``container2/seg-blah``, list the contents
of the container container2 as follows:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show 132345678912345 container2

Pick out the objects whose names start with ``seg-blah``.
Delete the segment objects as follows:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah01
   $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah02
   etc

If ``X-Static-Large-Object`` is set, you need to read the contents. Do this by:

-  Using swift-get-nodes to get the details of the object's location.
-  Change the ``-X HEAD`` to ``-X GET`` and run ``curl`` against one copy.
-  This lists a JSON body listing containers and object names
-  Delete the objects as described above for DLO segments

Once the segments are deleted, you can delete the object using
``swift-direct`` as described above.

Finally, use ``swift-direct`` to delete the container.

Procedure: Decommissioning swift nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Should Swift nodes need to be decommissioned (e.g.,, where they are being
re-purposed), it is very important to follow the following steps.

#. In the case of object servers, follow the procedure for removing
   the node from the rings.
#. In the case of swift proxy servers, have the network team remove
   the node from the load balancers.
#. Open a network ticket to have the node removed from network
   firewalls.
#. Make sure that you remove the ``/etc/swift`` directory and everything in it.