Diffstat (limited to 'doc/source/ops_runbook')
-rw-r--r-- doc/source/ops_runbook/diagnose.rst | 291
-rw-r--r-- doc/source/ops_runbook/maintenance.rst | 24
-rw-r--r-- doc/source/ops_runbook/procedures.rst | 78
-rw-r--r-- doc/source/ops_runbook/troubleshooting.rst | 38
4 files changed, 217 insertions, 214 deletions
diff --git a/doc/source/ops_runbook/diagnose.rst b/doc/source/ops_runbook/diagnose.rst
index 2de368128..976cdb70d 100644
--- a/doc/source/ops_runbook/diagnose.rst
+++ b/doc/source/ops_runbook/diagnose.rst
@@ -36,11 +36,11 @@ External monitoring
We use pingdom.com to monitor the external Swift API. We suggest the
following:
- - Do a GET on ``/healthcheck``
+- Do a GET on ``/healthcheck``
- - Create a container, make it public (x-container-read:
- .r*,.rlistings), create a small file in the container; do a GET
- on the object
+- Create a container, make it public (``x-container-read:
+ .r*,.rlistings``), create a small file in the container; do a GET
+ on the object
Diagnose: General approach
--------------------------
@@ -82,11 +82,11 @@ if any servers are down. We suggest you run it regularly
to the last report without having to wait for a long-running command
to complete.
-Diagnose: Is system responding to /healthcheck?
------------------------------------------------
+Diagnose: Is system responding to ``/healthcheck``?
+---------------------------------------------------
When you want to establish if a swift endpoint is running, run ``curl -k``
-against https://*[ENDPOINT]*/healthcheck.
+against ``https://$ENDPOINT/healthcheck``.
.. _swift_logs:
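The ``curl -k`` check described above can also be scripted. The following is a minimal sketch, not part of the runbook: the function names ``healthcheck_url`` and ``is_responding`` are hypothetical, the endpoint is assumed to be a bare ``host[:port]`` served over HTTPS, and only an HTTP 200 reply is treated as healthy.

```python
import ssl
import urllib.error
import urllib.request


def healthcheck_url(endpoint):
    """Build the /healthcheck URL for a host[:port] endpoint (HTTPS assumed)."""
    return "https://" + endpoint.strip("/") + "/healthcheck"


def is_responding(endpoint, timeout=5.0):
    """Return True if a GET on /healthcheck answers with HTTP 200."""
    # Mirror ``curl -k``: skip certificate verification.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(
            healthcheck_url(endpoint), timeout=timeout, context=context
        ) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Refused connections, timeouts and error replies (e.g. 503) land here.
        return False
```

Calling ``is_responding("swift.example.com")`` (a placeholder hostname) returns ``False`` for any network failure or error status, so it suits a simple monitoring loop.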
@@ -209,11 +209,11 @@ Diagnose: Parted reports the backup GPT table is corrupt
- If a GPT table is broken, a message like the following should be
observed when the following command is run:
- .. code::
+ .. code:: console
$ sudo parted -l
- .. code::
+ .. code:: console
Error: The backup GPT table is corrupt, but the primary appears OK,
so that will be used.
@@ -232,40 +232,40 @@ invalid filesystem label. In such cases proceed as follows:
#. Verify that the disk labels are correct:
- .. code::
+ .. code:: console
- FS=/dev/sd#1
+ $ FS=/dev/sd#1
- sudo parted -l | grep object
+ $ sudo parted -l | grep object
#. If partition labels are inconsistent, resolve the disk label issues
before proceeding:
- .. code::
+ .. code:: console
- sudo parted -s ${FS} name ${PART_NO} ${PART_NAME} #Partition Label
- #PART_NO is 1 for object disks and 3 for OS disks
- #PART_NAME follows the convention seen in "sudo parted -l | grep object"
+ $ sudo parted -s ${FS} name ${PART_NO} ${PART_NAME} #Partition Label
+ $ # PART_NO is 1 for object disks and 3 for OS disks
+ $ # PART_NAME follows the convention seen in "sudo parted -l | grep object"
#. If the Filesystem label is missing then create it with care:
- .. code::
+ .. code:: console
- sudo xfs_admin -l ${FS} #Filesystem label (12 Char limit)
+ $ sudo xfs_admin -l ${FS} #Filesystem label (12 Char limit)
- #Check for the existence of a FS label
+ $ # Check for the existence of a FS label
- OBJNO=<3 Length Object No.>
+ $ OBJNO=<3 Length Object No.>
- #I.E OBJNO for sw-stbaz3-object0007 would be 007
+ $ # I.E OBJNO for sw-stbaz3-object0007 would be 007
- DISKNO=<3 Length Disk No.>
+ $ DISKNO=<3 Length Disk No.>
- #I.E DISKNO for /dev/sdb would be 001, /dev/sdc would be 002 etc.
+ $ # I.E DISKNO for /dev/sdb would be 001, /dev/sdc would be 002 etc.
- sudo xfs_admin -L "obj${OBJNO}dsk${DISKNO}" ${FS}
+ $ sudo xfs_admin -L "obj${OBJNO}dsk${DISKNO}" ${FS}
- #Create a FS Label
+ $ # Create a FS Label
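The label convention in the step above can be checked before running ``xfs_admin -L``. This is a sketch under stated assumptions: the helper name ``fs_label`` is hypothetical, the hostname is assumed to end in ``object`` followed by digits, and the disk number is assumed to follow the letter order implied by the examples (``/dev/sdb`` is 001, ``/dev/sdc`` is 002).

```python
import re


def fs_label(hostname, device):
    """Build an ``obj<OBJNO>dsk<DISKNO>`` label and enforce the 12-char limit."""
    host = re.search(r"object(\d+)$", hostname)
    if host is None:
        raise ValueError("hostname does not end in object<digits>: %r" % hostname)
    # OBJNO is the last three digits, e.g. object0007 -> 007.
    objno = host.group(1)[-3:].zfill(3)

    dev = re.fullmatch(r"/dev/sd([a-z])", device)
    if dev is None:
        raise ValueError("unexpected device name: %r" % device)
    # Assumption: sdb -> 001, sdc -> 002, continuing in letter order.
    diskno = "%03d" % (ord(dev.group(1)) - ord("a"))

    label = "obj%sdsk%s" % (objno, diskno)
    if len(label) > 12:
        raise ValueError("label exceeds the 12-character XFS limit: %r" % label)
    return label
```

For the runbook's own example, ``fs_label("sw-stbaz3-object0007", "/dev/sdb")`` yields ``obj007dsk001``, which is exactly 12 characters.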
Diagnose: Failed LUNs
---------------------
@@ -293,9 +293,9 @@ Otherwise the lun can be re-enabled as follows:
LUN. You will come back later and grep this file for more details, but
just generate it for now.
- .. code::
+ .. code:: console
- sudo hpssacli controller all diag file=/tmp/hpacu.diag ris=on xml=off zip=off
+ $ sudo hpssacli controller all diag file=/tmp/hpacu.diag ris=on xml=off zip=off
Export the following variables using the below instructions before
proceeding further.
@@ -304,16 +304,16 @@ proceeding further.
failed drive's number and array value (example output: "array A
logicaldrive 1..." would be exported as LDRIVE=1):
- .. code::
+ .. code:: console
- sudo hpssacli controller slot=1 ld all show
+ $ sudo hpssacli controller slot=1 ld all show
#. Export the number of the logical drive that was retrieved from the
previous command into the LDRIVE variable:
- .. code::
+ .. code:: console
- export LDRIVE=<LogicalDriveNumber>
+ $ export LDRIVE=<LogicalDriveNumber>
#. Print the array value and Port:Box:Bay for all drives and take note of
the Port:Box:Bay for the failed drive (example output: " array A
@@ -324,9 +324,9 @@ proceeding further.
in the case of "array c"), but we will run a different command to be sure
we are operating on the correct device.
- .. code::
+ .. code:: console
- sudo hpssacli controller slot=1 pd all show
+ $ sudo hpssacli controller slot=1 pd all show
.. note::
@@ -339,24 +339,24 @@ proceeding further.
#. Export the Port:Box:Bay for the failed drive into the PBOX variable:
- .. code::
+ .. code:: console
- export PBOX=<Port:Box:Bay>
+ $ export PBOX=<Port:Box:Bay>
#. Print the physical device information and take note of the Disk Name
(example output: "Disk Name: /dev/sdk" would be exported as
DEV=/dev/sdk):
- .. code::
+ .. code:: console
- sudo hpssacli controller slot=1 ld ${LDRIVE} show detail | grep -i "Disk Name"
+ $ sudo hpssacli controller slot=1 ld ${LDRIVE} show detail | grep -i "Disk Name"
#. Export the device name variable from the preceding command (example:
/dev/sdk):
- .. code::
+ .. code:: console
- export DEV=<Device>
+ $ export DEV=<Device>
#. Export the filesystem variable. Disks that are split between the
operating system and data storage, typically sda and sdb, should only
@@ -367,39 +367,39 @@ proceeding further.
data filesystem for the device in question as the export. For example:
/dev/sdk1.
- .. code::
+ .. code:: console
- export FS=<Filesystem>
+ $ export FS=<Filesystem>
#. Verify the LUN is failed, and the device is not:
- .. code::
+ .. code:: console
- sudo hpssacli controller slot=1 ld all show
- sudo hpssacli controller slot=1 pd all show
- sudo hpssacli controller slot=1 ld ${LDRIVE} show detail
- sudo hpssacli controller slot=1 pd ${PBOX} show detail
+ $ sudo hpssacli controller slot=1 ld all show
+ $ sudo hpssacli controller slot=1 pd all show
+ $ sudo hpssacli controller slot=1 ld ${LDRIVE} show detail
+ $ sudo hpssacli controller slot=1 pd ${PBOX} show detail
#. Stop the swift and rsync services:
- .. code::
+ .. code:: console
- sudo service rsync stop
- sudo swift-init shutdown all
+ $ sudo service rsync stop
+ $ sudo swift-init shutdown all
#. Unmount the problem drive, fix the LUN and the filesystem:
- .. code::
+ .. code:: console
- sudo umount ${FS}
+ $ sudo umount ${FS}
#. If umount fails, run ``lsof`` to search for the mountpoint and
kill any lingering processes before repeating the unmount:
- .. code::
+ .. code:: console
- sudo hpacucli controller slot=1 ld ${LDRIVE} modify reenable
- sudo xfs_repair ${FS}
+ $ sudo hpacucli controller slot=1 ld ${LDRIVE} modify reenable
+ $ sudo xfs_repair ${FS}
#. If ``xfs_repair`` complains about possible journal data, use the
``xfs_repair -L`` option to zero the journal log.
@@ -407,21 +407,21 @@ proceeding further.
#. Once complete test-mount the filesystem, and tidy up its lost and
found area.
- .. code::
+ .. code:: console
- sudo mount ${FS} /mnt
- sudo rm -rf /mnt/lost+found/
- sudo umount /mnt
+ $ sudo mount ${FS} /mnt
+ $ sudo rm -rf /mnt/lost+found/
+ $ sudo umount /mnt
#. Mount the filesystem and restart swift and rsync.
#. Run the following to determine if a DC ticket is needed to check the
cables on the node:
- .. code::
+ .. code:: console
- grep -y media.exchanged /tmp/hpacu.diag
- grep -y hot.plug.count /tmp/hpacu.diag
+ $ grep -y media.exchanged /tmp/hpacu.diag
+ $ grep -y hot.plug.count /tmp/hpacu.diag
#. If the output reports any non-0x00 values, it suggests that the cables
should be checked. For example, log a DC ticket to check the SAS cables
@@ -440,7 +440,7 @@ If the diagnostics report a message such as ``sda: drive is slow``, you
should log onto the node and run the following command (remove ``-c 1`` option to continuously monitor
the data):
-.. code::
+.. code:: console
$ /usr/bin/collectl -s D -c 1
waiting for 1 second sample...
@@ -475,7 +475,7 @@ otherwise hardware replacement is needed.
Another way to look at the data is as follows:
-.. code::
+.. code:: console
$ /opt/hp/syseng/disk-anal.pl -d
Disk: sda Wait: 54580 371 65 25 12 6 6 0 1 2 0 46
@@ -524,7 +524,7 @@ historical data. You can look at recent data as follows. It only looks
at data from 13:15 to 14:15. As you can see, this is a relatively clean
system (few if any long wait or service times):
-.. code::
+.. code:: console
$ /opt/hp/syseng/disk-anal.pl -d -t 13:15-14:15
Disk: sda Wait: 3600 0 0 0 0 0 0 0 0 0 0 0
@@ -582,21 +582,21 @@ Running tests
#. Prepare the ``target`` node as follows:
- .. code::
+ .. code:: console
- sudo iptables -I INPUT -p tcp -j ACCEPT
+ $ sudo iptables -I INPUT -p tcp -j ACCEPT
Or, do:
- .. code::
+ .. code:: console
- sudo ufw allow 12866/tcp
+ $ sudo ufw allow 12866/tcp
#. On the ``source`` node, run the following command to check
throughput. Note the double-dash before the -P option.
The command takes 10 seconds to complete. The ``target`` node is 192.168.245.5.
- .. code::
+ .. code:: console
$ netperf -H 192.168.245.5 -- -P 12866
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12866 AF_INET to
@@ -609,7 +609,7 @@ Running tests
#. On the ``source`` node, run the following command to check latency:
- .. code::
+ .. code:: console
$ netperf -H 192.168.245.5 -t TCP_RR -- -P 12866
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12866
@@ -644,21 +644,21 @@ Diagnose: Remapping sectors experiencing UREs
#. Set the environment variables SEC, DEV & FS, for example:
- .. code::
+ .. code:: console
- SEC=2930954256
- DEV=/dev/sdi
- FS=/dev/sdi1
+ $ SEC=2930954256
+ $ DEV=/dev/sdi
+ $ FS=/dev/sdi1
#. Verify that the sector is bad:
- .. code::
+ .. code:: console
- sudo dd if=${DEV} of=/dev/null bs=512 count=1 skip=${SEC}
+ $ sudo dd if=${DEV} of=/dev/null bs=512 count=1 skip=${SEC}
#. If the sector is bad this command will output an input/output error:
- .. code::
+ .. code:: console
dd: reading `/dev/sdi`: Input/output error
0+0 records in
@@ -667,28 +667,28 @@ Diagnose: Remapping sectors experiencing UREs
#. Prevent chef from attempting to re-mount the filesystem while the
repair is in progress:
- .. code::
+ .. code:: console
- sudo mv /etc/chef/client.pem /etc/chef/xx-client.xx-pem
+ $ sudo mv /etc/chef/client.pem /etc/chef/xx-client.xx-pem
#. Stop the swift and rsync services:
- .. code::
+ .. code:: console
- sudo service rsync stop
- sudo swift-init shutdown all
+ $ sudo service rsync stop
+ $ sudo swift-init shutdown all
#. Unmount the problem drive:
- .. code::
+ .. code:: console
- sudo umount ${FS}
+ $ sudo umount ${FS}
#. Overwrite/remap the bad sector:
- .. code::
+ .. code:: console
- sudo dd_rescue -d -A -m8b -s ${SEC}b ${DEV} ${DEV}
+ $ sudo dd_rescue -d -A -m8b -s ${SEC}b ${DEV} ${DEV}
#. This command should report an input/output error the first time
it is run. Run the command a second time, if it successfully remapped
@@ -696,9 +696,9 @@ Diagnose: Remapping sectors experiencing UREs
#. Verify the sector is now readable:
- .. code::
+ .. code:: console
- sudo dd if=${DEV} of=/dev/null bs=512 count=1 skip=${SEC}
+ $ sudo dd if=${DEV} of=/dev/null bs=512 count=1 skip=${SEC}
#. If the sector is now readable this command should not report an
input/output error.
@@ -706,24 +706,24 @@ Diagnose: Remapping sectors experiencing UREs
#. If more than one problem sector is listed, set the SEC environment
variable to the next sector in the list:
- .. code::
+ .. code:: console
- SEC=123456789
+ $ SEC=123456789
#. Repeat from step 8.
#. Repair the filesystem:
- .. code::
+ .. code:: console
- sudo xfs_repair ${FS}
+ $ sudo xfs_repair ${FS}
#. If ``xfs_repair`` reports that the filesystem has valuable filesystem
changes:
- .. code::
+ .. code:: console
- sudo xfs_repair ${FS}
+ $ sudo xfs_repair ${FS}
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
@@ -739,11 +739,11 @@ Diagnose: Remapping sectors experiencing UREs
#. You should attempt to mount the filesystem, and clear the lost+found
area:
- .. code::
+ .. code:: console
- sudo mount $FS /mnt
- sudo rm -rf /mnt/lost+found/*
- sudo umount /mnt
+ $ sudo mount $FS /mnt
+ $ sudo rm -rf /mnt/lost+found/*
+ $ sudo umount /mnt
#. If the filesystem fails to mount then you will need to use the
``xfs_repair -L`` option to force log zeroing.
@@ -752,16 +752,16 @@ Diagnose: Remapping sectors experiencing UREs
#. If ``xfs_repair`` reports that an additional input/output error has been
encountered, get the sector details as follows:
- .. code::
+ .. code:: console
- sudo grep "I/O error" /var/log/kern.log | grep sector | tail -1
+ $ sudo grep "I/O error" /var/log/kern.log | grep sector | tail -1
#. If a new input/output error is reported, set the SEC environment
variable to the problem sector number:
- .. code::
+ .. code:: console
- SEC=234567890
+ $ SEC=234567890
#. Repeat from step 8
@@ -806,31 +806,31 @@ errors, it may well indicate a cable, switch, or network issue.
Get an overview of the interface with:
-.. code::
+.. code:: console
- sudo ifconfig eth{n}
- sudo ethtool eth{n}
+ $ sudo ifconfig eth{n}
+ $ sudo ethtool eth{n}
The ``Link Detected:`` indicator will read ``yes`` if the NIC is
cabled.
Establish the adapter type with:
-.. code::
+.. code:: console
- sudo ethtool -i eth{n}
+ $ sudo ethtool -i eth{n}
Gather the interface statistics with:
-.. code::
+.. code:: console
- sudo ethtool -S eth{n}
+ $ sudo ethtool -S eth{n}
If the NIC supports a self-test, this can be performed with:
-.. code::
+.. code:: console
- sudo ethtool -t eth{n}
+ $ sudo ethtool -t eth{n}
Self-tests should read ``PASS`` if the NIC is operating correctly.
@@ -853,9 +853,9 @@ A replicator reports in its log that remaining time exceeds
making progress. Another useful way to check this is with the
'swift-recon -r' command on a swift proxy server:
-.. code::
+.. code:: console
- sudo swift-recon -r
+ $ sudo swift-recon -r
===============================================================================
--> Starting reconnaissance on 384 hosts
@@ -877,9 +877,9 @@ You can further check if the object replicator is stuck by logging on
the object server and checking the object replicator progress with
the following command:
-.. code::
+.. code:: console
- # sudo grep object-rep /var/log/swift/background.log | grep -e "Starting object replication" -e "Object replication complete" -e "partitions rep"
+ $ sudo grep object-rep /var/log/swift/background.log | grep -e "Starting object replication" -e "Object replication complete" -e "partitions rep"
Jul 16 06:25:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69018.48s (0.22/sec, 22h remaining)
Jul 16 06:30:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69318.58s (0.22/sec, 22h remaining)
Jul 16 06:35:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69618.63s (0.22/sec, 23h remaining)
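The log lines above show the replicated partition count staying at 15344 across consecutive reports, which is the sign of a stuck replicator. A rough sketch of automating that comparison follows. It is illustrative only: the function names and the three-sample threshold are assumptions, and the pattern is based solely on the log format shown here.

```python
import re

# Matches e.g. "object-replicator 15344/16450 (93.28%) partitions replicated".
PROGRESS = re.compile(
    r"object-replicator (\d+)/(\d+) \(([\d.]+)%\) partitions replicated"
)


def replicated_counts(lines):
    """Extract the 'partitions replicated so far' count from each progress line."""
    counts = []
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def looks_stuck(lines, min_samples=3):
    """Return True if the last ``min_samples`` progress reports show no change."""
    recent = replicated_counts(lines)[-min_samples:]
    return len(recent) == min_samples and len(set(recent)) == 1
```

Feeding it the output of the ``grep`` command above (for example via ``sys.stdin``) gives a quick yes/no answer. A ``True`` result is only a hint to investigate further, not proof of a hang.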
@@ -912,9 +912,9 @@ One of the reasons for the object replicator hanging like this is
filesystem corruption on the drive. The following is a typical log entry
of a corrupted filesystem detected by the object replicator:
-.. code::
+.. code:: console
- # sudo bzgrep "Remote I/O error" /var/log/swift/background.log* |grep srv | - tail -1
+ $ sudo bzgrep "Remote I/O error" /var/log/swift/background.log* | grep srv | tail -1
Jul 12 03:33:30 192.168.245.4 object-replicator STDOUT: ERROR:root:Error hashing suffix#012Traceback (most recent call last):#012 File
"/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 199, in get_hashes#012 hashes[suffix] = hash_suffix(suffix_dir,
reclaim_age)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 84, in hash_suffix#012 path_contents =
@@ -922,9 +922,9 @@ of a corrupted filesystem detected by the object replicator:
An ``ls`` of the problem file or directory usually shows something like the following:
-.. code::
+.. code:: console
- # ls -l /srv/node/disk4/objects/1643763/b51
+ $ ls -l /srv/node/disk4/objects/1643763/b51
ls: cannot access /srv/node/disk4/objects/1643763/b51: Remote I/O error
If no entry with ``Remote I/O error`` occurs in the ``background.log`` it is
@@ -935,27 +935,27 @@ restart the object-replicator.
#. Stop the object-replicator:
- .. code::
+ .. code:: console
# sudo swift-init object-replicator stop
#. Make sure the object replicator has stopped; if it has hung, the stop
command will not stop the hung process:
- .. code::
+ .. code:: console
# ps auxww | grep swift-object-replicator
#. If the previous ps shows the object-replicator is still running, kill
the process:
- .. code::
+ .. code:: console
# kill -9 <pid-of-swift-object-replicator>
#. Start the object-replicator:
- .. code::
+ .. code:: console
# sudo swift-init object-replicator start
@@ -964,14 +964,14 @@ to repair the problem filesystem.
#. Stop swift and rsync:
- .. code::
+ .. code:: console
# sudo swift-init all shutdown
# sudo service rsync stop
#. Make sure all swift processes have stopped:
- .. code::
+ .. code:: console
# ps auxww | grep swift | grep python
@@ -979,13 +979,13 @@ to repair the problem filesystem.
#. Unmount the problem filesystem:
- .. code::
+ .. code:: console
# sudo umount /srv/node/disk4
#. Repair the filesystem:
- .. code::
+ .. code:: console
# sudo xfs_repair -P /dev/sde1
@@ -1002,7 +1002,7 @@ The CPU load average on an object server, as shown with the
'uptime' command, is typically under 10 when the server is
lightly to moderately loaded:
-.. code::
+.. code:: console
$ uptime
07:59:26 up 99 days, 5:57, 1 user, load average: 8.59, 8.39, 8.32
@@ -1014,7 +1014,7 @@ However, sometimes the CPU load average can increase significantly. The
following is an example of an object server that has extremely high CPU
load:
-.. code::
+.. code:: console
$ uptime
07:44:02 up 18:22, 1 user, load average: 407.12, 406.36, 404.59
@@ -1050,9 +1050,9 @@ Further issues and resolutions
given server.
- Run this command:
- .. code::
+ .. code:: console
- sudo swift-init all start
+ $ sudo swift-init all start
Examine messages in the swift log files to see if there are any
error messages related to any of the swift processes since the time you
@@ -1080,9 +1080,9 @@ Further issues and resolutions
- Restart the swift processes on the affected node:
- .. code::
+ .. code:: console
- % sudo swift-init all reload
+ $ sudo swift-init all reload
Urgency:
If known performance problem: Immediate
@@ -1135,18 +1135,18 @@ Further issues and resolutions
For example, it is running at 100 Mb/s and the NIC is a 1Ge NIC.
- 1. Try resetting the interface with:
- .. code::
+ .. code:: console
- sudo ethtool -s eth0 speed 1000
+ $ sudo ethtool -s eth0 speed 1000
- ... and then run:
+ ... and then run:
- .. code::
+ .. code:: console
- sudo lshw -class
+ $ sudo lshw -class network
- See if size goes to the expected speed. Failing
- that, check hardware (NIC cable/switch port).
+ See if size goes to the expected speed. Failing
+ that, check hardware (NIC cable/switch port).
2. If persistent, consider shutting down the server (especially if a proxy)
until the problem is identified and resolved. If you leave this server
@@ -1183,9 +1183,11 @@ Further issues and resolutions
- Urgency: Medium
This may have been triggered by a recent restart of the rsyslog daemon.
Restart the service with:
- .. code::
- sudo swift-init <service> reload
+ .. code:: console
+
+ $ sudo swift-init <service> reload
+
* - Object replicator: Reports the remaining time and that time is more than 100 hours.
- Each replication cycle the object replicator writes a log message to its log
reporting statistics about the current cycle. This includes an estimate for the
@@ -1193,9 +1195,10 @@ Further issues and resolutions
100 hours, there is a problem with the replication process.
- Urgency: Medium
Restart the service with:
- .. code::
- sudo swift-init object-replicator reload
+ .. code:: console
+
+ $ sudo swift-init object-replicator reload
Check that the remaining replication time is going down.
diff --git a/doc/source/ops_runbook/maintenance.rst b/doc/source/ops_runbook/maintenance.rst
index a2a9cbb10..c63feb7bd 100644
--- a/doc/source/ops_runbook/maintenance.rst
+++ b/doc/source/ops_runbook/maintenance.rst
@@ -27,9 +27,9 @@ if you wait a while things get better.
For example:
-.. code::
+.. code:: console
- sudo swift-recon -rla
+ $ sudo swift-recon -rla
===============================================================================
[2012-03-10 12:57:21] Checking async pendings on 384 hosts...
Async stats: low: 0, high: 1, avg: 0, total: 1
@@ -52,7 +52,7 @@ system. Rules-of-thumb for 'good' recon output are:
- Nodes that respond are up and running Swift. If all nodes respond,
that is a good sign. But some nodes may time out. For example:
- .. code::
+ .. code:: console
-> [http://<redacted>.29:6200/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
-> [http://<redacted>.31:6200/recon/load:] <urlopen error timed out>
@@ -83,7 +83,7 @@ system. Rules-of-thumb for 'good' recon output are:
For comparison here is the recon output for the same system above when
two entire racks of Swift are down:
-.. code::
+.. code:: console
[2012-03-10 16:56:33] Checking async pendings on 384 hosts...
-> http://<redacted>.22:6200/recon/async: <urlopen error timed out>
@@ -152,9 +152,9 @@ Here is an example of noting and tracking down a problem with recon.
Running recon shows some async pendings:
-.. code::
+.. code:: console
- bob@notso:~/swift-1.4.4/swift$ ssh -q <redacted>.132.7 sudo swift-recon -alr
+ $ ssh -q <redacted>.132.7 sudo swift-recon -alr
===============================================================================
[2012-03-14 17:25:55] Checking async pendings on 384 hosts...
Async stats: low: 0, high: 23, avg: 8, total: 3356
@@ -172,9 +172,9 @@ Why? Running recon again with -av swift (not shown here) tells us that
the node with the highest (23) is <redacted>.72.61. Looking at the log
files on <redacted>.72.61 we see:
-.. code::
+.. code:: console
- souzab@<redacted>:~$ sudo tail -f /var/log/swift/background.log | - grep -i ERROR
+ $ sudo tail -f /var/log/swift/background.log | grep -i ERROR
Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
@@ -235,7 +235,7 @@ Procedure
running the ring builder on a proxy node to determine which zones
the storage nodes are in. For example:
- .. code::
+ .. code:: console
% sudo swift-ring-builder /etc/swift/object.builder
/etc/swift/object.builder, build version 1467
@@ -258,7 +258,7 @@ Procedure
builder again, this time with the ``list_parts`` option and specify
the nodes under consideration. For example:
- .. code::
+ .. code:: console
% sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2
Partition Matches
@@ -283,7 +283,7 @@ Procedure
small, and is proportional to the number of entries that have a 3 in
the Matches column. For example:
- .. code::
+ .. code:: console
Partition Matches
26865 3
@@ -300,7 +300,7 @@ Procedure
#. A quick way to count the number of rows with 3 matches is:
- .. code::
+ .. code:: console
% sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l
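The ``grep "3$" | wc -l`` count above matches any line that ends in 3. As a sketch of a stricter alternative, the hypothetical helper below compares the ``Matches`` column exactly. It assumes ``list_parts`` output is a header line followed by two whitespace-separated integer columns, as in the example shown earlier.

```python
def count_full_matches(list_parts_output, replicas=3):
    """Count rows whose Matches column equals ``replicas``."""
    count = 0
    for line in list_parts_output.splitlines():
        fields = line.split()
        # Skip the "Partition Matches" header and any non-tabular lines.
        if len(fields) != 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        if int(fields[1]) == replicas:
            count += 1
    return count
```

The captured output of ``swift-ring-builder ... list_parts`` can be passed in as a single string. A result of 0 means no partition has all of its replicas on the nodes under consideration.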
diff --git a/doc/source/ops_runbook/procedures.rst b/doc/source/ops_runbook/procedures.rst
index af28e020c..1d84d5969 100644
--- a/doc/source/ops_runbook/procedures.rst
+++ b/doc/source/ops_runbook/procedures.rst
@@ -10,13 +10,13 @@ Fix broken GPT table (broken disk partition)
- If a GPT table is broken, a message like the following should be
observed when the command...
- .. code::
+ .. code:: console
$ sudo parted -l
- ... is run.
- .. code::
+ .. code:: console
...
Error: The backup GPT table is corrupt, but the primary appears OK, so that will
@@ -25,13 +25,13 @@ Fix broken GPT table (broken disk partition)
#. To fix this, first install the ``gdisk`` program:
- .. code::
+ .. code:: console
$ sudo aptitude install gdisk
#. Run ``gdisk`` for the particular drive with the damaged partition:
- .. code:
+ .. code:: console
$ sudo gdisk /dev/sd*a-l*
GPT fdisk (gdisk) version 0.6.14
@@ -57,7 +57,7 @@ Fix broken GPT table (broken disk partition)
and finally ``w`` (write table to disk and exit). You will also need to
enter ``Y`` when prompted to confirm actions.
- .. code::
+ .. code:: console
Command (? for help): r
@@ -92,7 +92,7 @@ Fix broken GPT table (broken disk partition)
#. Running the command:
- .. code::
+ .. code:: console
$ sudo parted /dev/sd#
@@ -100,7 +100,7 @@ Fix broken GPT table (broken disk partition)
#. Finally, uninstall ``gdisk`` from the node:
- .. code::
+ .. code:: console
$ sudo aptitude remove gdisk
@@ -112,20 +112,20 @@ Procedure: Fix broken XFS filesystem
#. A filesystem may be corrupt or broken if the following output is
observed when checking its label:
- .. code::
+ .. code:: console
$ sudo xfs_admin -l /dev/sd#
- cache_node_purge: refcount was 1, not zero (node=0x25d5ee0)
- xfs_admin: cannot read root inode (117)
- cache_node_purge: refcount was 1, not zero (node=0x25d92b0)
- xfs_admin: cannot read realtime bitmap inode (117)
- bad sb magic # 0 in AG 1
- failed to read label in AG 1
+ cache_node_purge: refcount was 1, not zero (node=0x25d5ee0)
+ xfs_admin: cannot read root inode (117)
+ cache_node_purge: refcount was 1, not zero (node=0x25d92b0)
+ xfs_admin: cannot read realtime bitmap inode (117)
+ bad sb magic # 0 in AG 1
+ failed to read label in AG 1
#. Run the following commands to remove the broken or corrupt filesystem and replace it
(this example uses the filesystem ``/dev/sdb2``). First, replace the partition:
- .. code::
+ .. code:: console
$ sudo parted
GNU Parted 2.3
@@ -167,7 +167,7 @@ Procedure: Fix broken XFS filesystem
#. The next step is to scrub the filesystem and format it:
- .. code::
+ .. code:: console
$ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024*1024)) count=1
1+0 records in
@@ -175,19 +175,19 @@ Procedure: Fix broken XFS filesystem
1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s
$ sudo /sbin/mkfs.xfs -f -i size=1024 /dev/sdb2
meta-data=/dev/sdb2 isize=1024 agcount=4, agsize=106811524 blks
- = sectsz=512 attr=2, projid32bit=0
- data = bsize=4096 blocks=427246093, imaxpct=5
- = sunit=0 swidth=0 blks
- naming =version 2 bsize=4096 ascii-ci=0
- log =internal log bsize=4096 blocks=208616, version=2
- = sectsz=512 sunit=0 blks, lazy-count=1
- realtime =none extsz=4096 blocks=0, rtextents=0
+ = sectsz=512 attr=2, projid32bit=0
+ data = bsize=4096 blocks=427246093, imaxpct=5
+ = sunit=0 swidth=0 blks
+ naming =version 2 bsize=4096 ascii-ci=0
+ log =internal log bsize=4096 blocks=208616, version=2
+ = sectsz=512 sunit=0 blks, lazy-count=1
+ realtime =none extsz=4096 blocks=0, rtextents=0
#. You should now label and mount your filesystem.
#. You can now check whether the filesystem is mounted using the command:
- .. code::
+ .. code:: console
$ mount
@@ -204,7 +204,7 @@ Procedure: Checking if an account is okay
You must know the tenant/project ID. You can check if the account is okay as follows from a proxy.
-.. code::
+.. code:: console
$ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_<project-id>
@@ -214,7 +214,7 @@ containers, or an error indicating that the resource could not be found.
Alternatively, you can use ``swift-get-nodes`` to find the account database
files. Run the following on a proxy:
-.. code::
+.. code:: console
$ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_<project-id>
@@ -239,7 +239,7 @@ Log onto one of the swift proxy servers.
Use swift-direct to show this account's usage:
-.. code::
+.. code:: console
$ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_<project-id>
Status: 200
@@ -288,7 +288,7 @@ re-create the account as follows:
servers). The output has been truncated so we can focus on the important pieces
of data:
- .. code::
+ .. code:: console
$ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_4ebe3039674d4864a11fe0864ae4d905
...
@@ -308,7 +308,7 @@ re-create the account as follows:
#. Before proceeding check that the account is really deleted by using curl. Execute the
commands printed by ``swift-get-nodes``. For example:
- .. code::
+ .. code:: console
$ curl -I -XHEAD "http://192.168.245.5:6202/disk1/3934/AUTH_4ebe3039674d4864a11fe0864ae4d905"
HTTP/1.1 404 Not Found
@@ -323,7 +323,7 @@ re-create the account as follows:
#. Use the ssh commands printed by ``swift-get-nodes`` to check if database
files exist. For example:
- .. code::
+ .. code:: console
$ ssh 192.168.245.5 "ls -lah ${DEVICE:-/srv/node*}/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052"
total 20K
@@ -344,7 +344,7 @@ re-create the account as follows:
#. Delete the database files. For example:
- .. code::
+ .. code:: console
$ ssh 192.168.245.5
$ cd /srv/node/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052
@@ -374,9 +374,9 @@ balancers, customer's are not impacted by the misbehaving proxy.
#. Shut down Swift as follows:
- .. code::
+ .. code:: console
- sudo swift-init proxy shutdown
+ $ sudo swift-init proxy shutdown
.. note::
@@ -384,15 +384,15 @@ balancers, customer's are not impacted by the misbehaving proxy.
#. Create the ``/etc/swift/disabled-by-file`` file. For example:
- .. code::
+ .. code:: console
- sudo touch /etc/swift/disabled-by-file
+ $ sudo touch /etc/swift/disabled-by-file
#. Optionally, restart Swift:
- .. code::
+ .. code:: console
- sudo swift-init proxy start
+ $ sudo swift-init proxy start
It works because the healthcheck middleware looks for /etc/swift/disabled-by-file.
If it exists, the middleware will return 503/error instead of 200/OK. This means the load balancer
@@ -403,9 +403,9 @@ Procedure: Ad-Hoc disk performance test
You can get an idea of how a disk drive is performing as follows:
-.. code::
+.. code:: console
- sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-later
+ $ sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-later
You can expect ~600MB/sec. If you get a low number, repeat many times as
Swift itself may also read or write to the disk, hence giving a lower
diff --git a/doc/source/ops_runbook/troubleshooting.rst b/doc/source/ops_runbook/troubleshooting.rst
index cb7553fc6..75511010c 100644
--- a/doc/source/ops_runbook/troubleshooting.rst
+++ b/doc/source/ops_runbook/troubleshooting.rst
@@ -16,20 +16,20 @@ transactions from this user. The linux ``bzgrep`` command can be used to
search all the proxy log files on a node including the ``.bz2`` compressed
files. For example:
-.. code::
+.. code:: console
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
-w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139] \
'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' | dshbak -c
- .
- .
- ----------------
- <redacted>.132.6
- ----------------
- Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132
- <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af
- /%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - -
- tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130
+ .
+ .
+ ----------------
+ <redacted>.132.6
+ ----------------
+ Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132
+ <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af
+ /%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - -
+ tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130
This shows a ``GET`` operation on the user's account.
@@ -40,7 +40,7 @@ This shows a ``GET`` operation on the users account.
Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3``, you can
search the swift object server log files for this transaction ID:
-.. code::
+.. code:: console
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
-w <redacted>.72.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.204.[4-131] \
@@ -79,7 +79,7 @@ search the swift object servers log files for this transaction ID:
Next, use the ``swift-get-nodes`` command to determine exactly where the
user's account data is stored:
-.. code::
+.. code:: console
$ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-4962-4692-98fb-52ddda82a5af
Account AUTH_redacted-4962-4692-98fb-52ddda82a5af
@@ -119,7 +119,7 @@ user's account data is stored:
Check each of the primary servers, <redacted>.31, <redacted>.204.70 and <redacted>.72.16, for
this user's account. For example on <redacted>.72.16:
-.. code::
+.. code:: console
$ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
total 1.0M
@@ -131,7 +131,7 @@ this users account. For example on <redacted>.72.16:
So this user's account db, an sqlite db, is present. Use sqlite to
check out the account:
-.. code::
+.. code:: console
$ sudo cp /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db /tmp
$ sudo sqlite3 /tmp/1846d99185f8a0edaf65cfbf37439696.db
@@ -156,7 +156,7 @@ checkout the account:
why the GET operations are returning 404, not found. Check the account
delete date/time:
- .. code::
+ .. code:: console
$ python
@@ -167,7 +167,7 @@ checkout the account:
Next try and find the ``DELETE`` operation for this account in the proxy
server logs:
-.. code::
+.. code:: console
$ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
-w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139|4-11,132-139] \
@@ -206,7 +206,7 @@ as follows:
Examine the object in question:
-.. code::
+.. code:: console
$ sudo -u swift /opt/hp/swift/bin/swift-direct head 132345678912345 container_name obj_name
@@ -219,14 +219,14 @@ name of the objects this means it is a DLO. For example,
if ``X-Object-Manifest`` is ``container2/seg-blah``, list the contents
of the container container2 as follows:
-.. code::
+.. code:: console
$ sudo -u swift /opt/hp/swift/bin/swift-direct show 132345678912345 container2
Pick out the objects whose names start with ``seg-blah``.
Delete the segment objects as follows:
-.. code::
+.. code:: console
$ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah01
$ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah02