Diffstat (limited to 'doc/source/ops_runbook')
-rw-r--r-- | doc/source/ops_runbook/diagnose.rst | 291
-rw-r--r-- | doc/source/ops_runbook/maintenance.rst | 24
-rw-r--r-- | doc/source/ops_runbook/procedures.rst | 78
-rw-r--r-- | doc/source/ops_runbook/troubleshooting.rst | 38
4 files changed, 217 insertions, 214 deletions
diff --git a/doc/source/ops_runbook/diagnose.rst b/doc/source/ops_runbook/diagnose.rst index 2de368128..976cdb70d 100644 --- a/doc/source/ops_runbook/diagnose.rst +++ b/doc/source/ops_runbook/diagnose.rst @@ -36,11 +36,11 @@ External monitoring We use pingdom.com to monitor the external Swift API. We suggest the following: - - Do a GET on ``/healthcheck`` +- Do a GET on ``/healthcheck`` - - Create a container, make it public (x-container-read: - .r*,.rlistings), create a small file in the container; do a GET - on the object +- Create a container, make it public (``x-container-read: + .r*,.rlistings``), create a small file in the container; do a GET + on the object Diagnose: General approach -------------------------- @@ -82,11 +82,11 @@ if any servers are down. We suggest you run it regularly to the last report without having to wait for a long-running command to complete. -Diagnose: Is system responding to /healthcheck? ------------------------------------------------ +Diagnose: Is system responding to ``/healthcheck``? +--------------------------------------------------- When you want to establish if a swift endpoint is running, run ``curl -k`` -against https://*[ENDPOINT]*/healthcheck. +against ``https://$ENDPOINT/healthcheck``. .. _swift_logs: @@ -209,11 +209,11 @@ Diagnose: Parted reports the backup GPT table is corrupt - If a GPT table is broken, a message like the following should be observed when the following command is run: - .. code:: + .. code:: console $ sudo parted -l - .. code:: + .. code:: console Error: The backup GPT table is corrupt, but the primary appears OK, so that will be used. @@ -232,40 +232,40 @@ invalid filesystem label. In such cases proceed as follows: #. Verify that the disk labels are correct: - .. code:: + .. code:: console - FS=/dev/sd#1 + $ FS=/dev/sd#1 - sudo parted -l | grep object + $ sudo parted -l | grep object #. If partition labels are inconsistent then, resolve the disk label issues before proceeding: - .. code:: + .. 
code:: console - sudo parted -s ${FS} name ${PART_NO} ${PART_NAME} #Partition Label - #PART_NO is 1 for object disks and 3 for OS disks - #PART_NAME follows the convention seen in "sudo parted -l | grep object" + $ sudo parted -s ${FS} name ${PART_NO} ${PART_NAME} #Partition Label + $ # PART_NO is 1 for object disks and 3 for OS disks + $ # PART_NAME follows the convention seen in "sudo parted -l | grep object" #. If the Filesystem label is missing then create it with care: - .. code:: + .. code:: console - sudo xfs_admin -l ${FS} #Filesystem label (12 Char limit) + $ sudo xfs_admin -l ${FS} #Filesystem label (12 Char limit) - #Check for the existence of a FS label + $ # Check for the existence of a FS label - OBJNO=<3 Length Object No.> + $ OBJNO=<3 Length Object No.> - #I.E OBJNO for sw-stbaz3-object0007 would be 007 + $ # I.E OBJNO for sw-stbaz3-object0007 would be 007 - DISKNO=<3 Length Disk No.> + $ DISKNO=<3 Length Disk No.> - #I.E DISKNO for /dev/sdb would be 001, /dev/sdc would be 002 etc. + $ # I.E DISKNO for /dev/sdb would be 001, /dev/sdc would be 002 etc. - sudo xfs_admin -L "obj${OBJNO}dsk${DISKNO}" ${FS} + $ sudo xfs_admin -L "obj${OBJNO}dsk${DISKNO}" ${FS} - #Create a FS Label + $ # Create a FS Label Diagnose: Failed LUNs --------------------- @@ -293,9 +293,9 @@ Otherwise the lun can be re-enabled as follows: LUN. You will come back later and grep this file for more details, but just generate it for now. - .. code:: + .. code:: console - sudo hpssacli controller all diag file=/tmp/hpacu.diag ris=on xml=off zip=off + $ sudo hpssacli controller all diag file=/tmp/hpacu.diag ris=on xml=off zip=off Export the following variables using the below instructions before proceeding further. @@ -304,16 +304,16 @@ proceeding further. failed drive's number and array value (example output: "array A logicaldrive 1..." would be exported as LDRIVE=1): - .. code:: + .. 
code:: console - sudo hpssacli controller slot=1 ld all show + $ sudo hpssacli controller slot=1 ld all show #. Export the number of the logical drive that was retrieved from the previous command into the LDRIVE variable: - .. code:: + .. code:: console - export LDRIVE=<LogicalDriveNumber> + $ export LDRIVE=<LogicalDriveNumber> #. Print the array value and Port:Box:Bay for all drives and take note of the Port:Box:Bay for the failed drive (example output: " array A @@ -324,9 +324,9 @@ proceeding further. in the case of "array c"), but we will run a different command to be sure we are operating on the correct device. - .. code:: + .. code:: console - sudo hpssacli controller slot=1 pd all show + $ sudo hpssacli controller slot=1 pd all show .. note:: @@ -339,24 +339,24 @@ proceeding further. #. Export the Port:Box:Bay for the failed drive into the PBOX variable: - .. code:: + .. code:: console - export PBOX=<Port:Box:Bay> + $ export PBOX=<Port:Box:Bay> #. Print the physical device information and take note of the Disk Name (example output: "Disk Name: /dev/sdk" would be exported as DEV=/dev/sdk): - .. code:: + .. code:: console - sudo hpssacli controller slot=1 ld ${LDRIVE} show detail | grep -i "Disk Name" + $ sudo hpssacli controller slot=1 ld ${LDRIVE} show detail | grep -i "Disk Name" #. Export the device name variable from the preceding command (example: /dev/sdk): - .. code:: + .. code:: console - export DEV=<Device> + $ export DEV=<Device> #. Export the filesystem variable. Disks that are split between the operating system and data storage, typically sda and sdb, should only @@ -367,39 +367,39 @@ proceeding further. data filesystem for the device in question as the export. For example: /dev/sdk1. - .. code:: + .. code:: console - export FS=<Filesystem> + $ export FS=<Filesystem> #. Verify the LUN is failed, and the device is not: - .. code:: + .. 
code:: console - sudo hpssacli controller slot=1 ld all show - sudo hpssacli controller slot=1 pd all show - sudo hpssacli controller slot=1 ld ${LDRIVE} show detail - sudo hpssacli controller slot=1 pd ${PBOX} show detail + $ sudo hpssacli controller slot=1 ld all show + $ sudo hpssacli controller slot=1 pd all show + $ sudo hpssacli controller slot=1 ld ${LDRIVE} show detail + $ sudo hpssacli controller slot=1 pd ${PBOX} show detail #. Stop the swift and rsync service: - .. code:: + .. code:: console - sudo service rsync stop - sudo swift-init shutdown all + $ sudo service rsync stop + $ sudo swift-init shutdown all #. Unmount the problem drive, fix the LUN and the filesystem: - .. code:: + .. code:: console - sudo umount ${FS} + $ sudo umount ${FS} #. If umount fails, you should run lsof search for the mountpoint and kill any lingering processes before repeating the unpount: - .. code:: + .. code:: console - sudo hpacucli controller slot=1 ld ${LDRIVE} modify reenable - sudo xfs_repair ${FS} + $ sudo hpacucli controller slot=1 ld ${LDRIVE} modify reenable + $ sudo xfs_repair ${FS} #. If the ``xfs_repair`` complains about possible journal data, use the ``xfs_repair -L`` option to zeroise the journal log. @@ -407,21 +407,21 @@ proceeding further. #. Once complete test-mount the filesystem, and tidy up its lost and found area. - .. code:: + .. code:: console - sudo mount ${FS} /mnt - sudo rm -rf /mnt/lost+found/ - sudo umount /mnt + $ sudo mount ${FS} /mnt + $ sudo rm -rf /mnt/lost+found/ + $ sudo umount /mnt #. Mount the filesystem and restart swift and rsync. #. Run the following to determine if a DC ticket is needed to check the cables on the node: - .. code:: + .. code:: console - grep -y media.exchanged /tmp/hpacu.diag - grep -y hot.plug.count /tmp/hpacu.diag + $ grep -y media.exchanged /tmp/hpacu.diag + $ grep -y hot.plug.count /tmp/hpacu.diag #. If the output reports any non 0x00 values, it suggests that the cables should be checked. 
For example, log a DC ticket to check the sas cables @@ -440,7 +440,7 @@ If the diagnostics report a message such as ``sda: drive is slow``, you should log onto the node and run the following command (remove ``-c 1`` option to continuously monitor the data): -.. code:: +.. code:: console $ /usr/bin/collectl -s D -c 1 waiting for 1 second sample... @@ -475,7 +475,7 @@ otherwise hardware replacement is needed. Another way to look at the data is as follows: -.. code:: +.. code:: console $ /opt/hp/syseng/disk-anal.pl -d Disk: sda Wait: 54580 371 65 25 12 6 6 0 1 2 0 46 @@ -524,7 +524,7 @@ historical data. You can look at recent data as follows. It only looks at data from 13:15 to 14:15. As you can see, this is a relatively clean system (few if any long wait or service times): -.. code:: +.. code:: console $ /opt/hp/syseng/disk-anal.pl -d -t 13:15-14:15 Disk: sda Wait: 3600 0 0 0 0 0 0 0 0 0 0 0 @@ -582,21 +582,21 @@ Running tests #. Prepare the ``target`` node as follows: - .. code:: + .. code:: console - sudo iptables -I INPUT -p tcp -j ACCEPT + $ sudo iptables -I INPUT -p tcp -j ACCEPT Or, do: - .. code:: + .. code:: console - sudo ufw allow 12866/tcp + $ sudo ufw allow 12866/tcp #. On the ``source`` node, run the following command to check throughput. Note the double-dash before the -P option. The command takes 10 seconds to complete. The ``target`` node is 192.168.245.5. - .. code:: + .. code:: console $ netperf -H 192.168.245.5 -- -P 12866 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12866 AF_INET to @@ -609,7 +609,7 @@ Running tests #. On the ``source`` node, run the following command to check latency: - .. code:: + .. code:: console $ netperf -H 192.168.245.5 -t TCP_RR -- -P 12866 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12866 @@ -644,21 +644,21 @@ Diagnose: Remapping sectors experiencing UREs #. Set the environment variables SEC, DEV & FS, for example: - .. code:: + .. 
code:: console - SEC=2930954256 - DEV=/dev/sdi - FS=/dev/sdi1 + $ SEC=2930954256 + $ DEV=/dev/sdi + $ FS=/dev/sdi1 #. Verify that the sector is bad: - .. code:: + .. code:: console - sudo dd if=${DEV} of=/dev/null bs=512 count=1 skip=${SEC} + $ sudo dd if=${DEV} of=/dev/null bs=512 count=1 skip=${SEC} #. If the sector is bad this command will output an input/output error: - .. code:: + .. code:: console dd: reading `/dev/sdi`: Input/output error 0+0 records in @@ -667,28 +667,28 @@ Diagnose: Remapping sectors experiencing UREs #. Prevent chef from attempting to re-mount the filesystem while the repair is in progress: - .. code:: + .. code:: console - sudo mv /etc/chef/client.pem /etc/chef/xx-client.xx-pem + $ sudo mv /etc/chef/client.pem /etc/chef/xx-client.xx-pem #. Stop the swift and rsync service: - .. code:: + .. code:: console - sudo service rsync stop - sudo swift-init shutdown all + $ sudo service rsync stop + $ sudo swift-init shutdown all #. Unmount the problem drive: - .. code:: + .. code:: console - sudo umount ${FS} + $ sudo umount ${FS} #. Overwrite/remap the bad sector: - .. code:: + .. code:: console - sudo dd_rescue -d -A -m8b -s ${SEC}b ${DEV} ${DEV} + $ sudo dd_rescue -d -A -m8b -s ${SEC}b ${DEV} ${DEV} #. This command should report an input/output error the first time it is run. Run the command a second time, if it successfully remapped @@ -696,9 +696,9 @@ Diagnose: Remapping sectors experiencing UREs #. Verify the sector is now readable: - .. code:: + .. code:: console - sudo dd if=${DEV} of=/dev/null bs=512 count=1 skip=${SEC} + $ sudo dd if=${DEV} of=/dev/null bs=512 count=1 skip=${SEC} #. If the sector is now readable this command should not report an input/output error. @@ -706,24 +706,24 @@ Diagnose: Remapping sectors experiencing UREs #. If more than one problem sector is listed, set the SEC environment variable to the next sector in the list: - .. code:: + .. code:: console - SEC=123456789 + $ SEC=123456789 #. Repeat from step 8. #. 
Repair the filesystem: - .. code:: + .. code:: console - sudo xfs_repair ${FS} + $ sudo xfs_repair ${FS} #. If ``xfs_repair`` reports that the filesystem has valuable filesystem changes: - .. code:: + .. code:: console - sudo xfs_repair ${FS} + $ sudo xfs_repair ${FS} Phase 1 - find and verify superblock... Phase 2 - using internal log - zero log... @@ -739,11 +739,11 @@ Diagnose: Remapping sectors experiencing UREs #. You should attempt to mount the filesystem, and clear the lost+found area: - .. code:: + .. code:: console - sudo mount $FS /mnt - sudo rm -rf /mnt/lost+found/* - sudo umount /mnt + $ sudo mount $FS /mnt + $ sudo rm -rf /mnt/lost+found/* + $ sudo umount /mnt #. If the filesystem fails to mount then you will need to use the ``xfs_repair -L`` option to force log zeroing. @@ -752,16 +752,16 @@ Diagnose: Remapping sectors experiencing UREs #. If ``xfs_repair`` reports that an additional input/output error has been encountered, get the sector details as follows: - .. code:: + .. code:: console - sudo grep "I/O error" /var/log/kern.log | grep sector | tail -1 + $ sudo grep "I/O error" /var/log/kern.log | grep sector | tail -1 #. If new input/output error is reported then set the SEC environment variable to the problem sector number: - .. code:: + .. code:: console - SEC=234567890 + $ SEC=234567890 #. Repeat from step 8 @@ -806,31 +806,31 @@ errors, it may well indicate a cable, switch, or network issue. Get an overview of the interface with: -.. code:: +.. code:: console - sudo ifconfig eth{n} - sudo ethtool eth{n} + $ sudo ifconfig eth{n} + $ sudo ethtool eth{n} The ``Link Detected:`` indicator will read ``yes`` if the nic is cabled. Establish the adapter type with: -.. code:: +.. code:: console - sudo ethtool -i eth{n} + $ sudo ethtool -i eth{n} Gather the interface statistics with: -.. code:: +.. code:: console - sudo ethtool -S eth{n} + $ sudo ethtool -S eth{n} If the nick supports self test, this can be performed with: -.. code:: +.. 
code:: console - sudo ethtool -t eth{n} + $ sudo ethtool -t eth{n} Self tests should read ``PASS`` if the nic is operating correctly. @@ -853,9 +853,9 @@ A replicator reports in its log that remaining time exceeds making progress. Another useful way to check this is with the 'swift-recon -r' command on a swift proxy server: -.. code:: +.. code:: console - sudo swift-recon -r + $ sudo swift-recon -r =============================================================================== --> Starting reconnaissance on 384 hosts @@ -877,9 +877,9 @@ You can further check if the object replicator is stuck by logging on the object server and checking the object replicator progress with the following command: -.. code:: +.. code:: console - # sudo grep object-rep /var/log/swift/background.log | grep -e "Starting object replication" -e "Object replication complete" -e "partitions rep" + $ sudo grep object-rep /var/log/swift/background.log | grep -e "Starting object replication" -e "Object replication complete" -e "partitions rep" Jul 16 06:25:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69018.48s (0.22/sec, 22h remaining) Jul 16 06:30:46 192.168.245.4object-replicator 15344/16450 (93.28%) partitions replicated in 69318.58s (0.22/sec, 22h remaining) Jul 16 06:35:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69618.63s (0.22/sec, 23h remaining) @@ -912,9 +912,9 @@ One of the reasons for the object replicator hanging like this is filesystem corruption on the drive. The following is a typical log entry of a corrupted filesystem detected by the object replicator: -.. code:: +.. 
code:: console - # sudo bzgrep "Remote I/O error" /var/log/swift/background.log* |grep srv | - tail -1 + $ sudo bzgrep "Remote I/O error" /var/log/swift/background.log* |grep srv | - tail -1 Jul 12 03:33:30 192.168.245.4 object-replicator STDOUT: ERROR:root:Error hashing suffix#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 199, in get_hashes#012 hashes[suffix] = hash_suffix(suffix_dir, reclaim_age)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 84, in hash_suffix#012 path_contents = @@ -922,9 +922,9 @@ of a corrupted filesystem detected by the object replicator: An ``ls`` of the problem file or directory usually shows something like the following: -.. code:: +.. code:: console - # ls -l /srv/node/disk4/objects/1643763/b51 + $ ls -l /srv/node/disk4/objects/1643763/b51 ls: cannot access /srv/node/disk4/objects/1643763/b51: Remote I/O error If no entry with ``Remote I/O error`` occurs in the ``background.log`` it is @@ -935,27 +935,27 @@ restart the object-replicator. #. Stop the object-replicator: - .. code:: + .. code:: console # sudo swift-init object-replicator stop #. Make sure the object replicator has stopped, if it has hung, the stop command will not stop the hung process: - .. code:: + .. code:: console # ps auxww | - grep swift-object-replicator #. If the previous ps shows the object-replicator is still running, kill the process: - .. code:: + .. code:: console # kill -9 <pid-of-swift-object-replicator> #. Start the object-replicator: - .. code:: + .. code:: console # sudo swift-init object-replicator start @@ -964,14 +964,14 @@ to repair the problem filesystem. #. Stop swift and rsync: - .. code:: + .. code:: console # sudo swift-init all shutdown # sudo service rsync stop #. Make sure all swift process have stopped: - .. code:: + .. code:: console # ps auxww | grep swift | grep python @@ -979,13 +979,13 @@ to repair the problem filesystem. #. 
Unmount the problem filesystem: - .. code:: + .. code:: console # sudo umount /srv/node/disk4 #. Repair the filesystem: - .. code:: + .. code:: console # sudo xfs_repair -P /dev/sde1 @@ -1002,7 +1002,7 @@ The CPU load average on an object server, as shown with the 'uptime' command, is typically under 10 when the server is lightly-moderately loaded: -.. code:: +.. code:: console $ uptime 07:59:26 up 99 days, 5:57, 1 user, load average: 8.59, 8.39, 8.32 @@ -1014,7 +1014,7 @@ However, sometimes the CPU load average can increase significantly. The following is an example of an object server that has extremely high CPU load: -.. code:: +.. code:: console $ uptime 07:44:02 up 18:22, 1 user, load average: 407.12, 406.36, 404.59 @@ -1050,9 +1050,9 @@ Further issues and resolutions given server. - Run this command: - .. code:: + .. code:: console - sudo swift-init all start + $ sudo swift-init all start Examine messages in the swift log files to see if there are any error messages related to any of the swift processes since the time you @@ -1080,9 +1080,9 @@ Further issues and resolutions - Restart the swift processes on the affected node: - .. code:: + .. code:: console - % sudo swift-init all reload + $ sudo swift-init all reload Urgency: If known performance problem: Immediate @@ -1135,18 +1135,18 @@ Further issues and resolutions For example, it is running at 100 Mb/s and the NIC is a 1Ge NIC. - 1. Try resetting the interface with: - .. code:: + .. code:: console - sudo ethtool -s eth0 speed 1000 + $ sudo ethtool -s eth0 speed 1000 - ... and then run: + ... and then run: - .. code:: + .. code:: console - sudo lshw -class + $ sudo lshw -class - See if size goes to the expected speed. Failing - that, check hardware (NIC cable/switch port). + See if size goes to the expected speed. Failing + that, check hardware (NIC cable/switch port). 2. If persistent, consider shutting down the server (especially if a proxy) until the problem is identified and resolved. 
If you leave this server @@ -1183,9 +1183,11 @@ Further issues and resolutions - Urgency: Medium This may have been triggered by a recent restart of the rsyslog daemon. Restart the service with: - .. code:: - sudo swift-init <service> reload + .. code:: console + + $ sudo swift-init <service> reload + * - Object replicator: Reports the remaining time and that time is more than 100 hours. - Each replication cycle the object replicator writes a log message to its log reporting statistics about the current cycle. This includes an estimate for the @@ -1193,9 +1195,10 @@ Further issues and resolutions 100 hours, there is a problem with the replication process. - Urgency: Medium Restart the service with: - .. code:: - sudo swift-init object-replicator reload + .. code:: console + + $ sudo swift-init object-replicator reload Check that the remaining replication time is going down. diff --git a/doc/source/ops_runbook/maintenance.rst b/doc/source/ops_runbook/maintenance.rst index a2a9cbb10..c63feb7bd 100644 --- a/doc/source/ops_runbook/maintenance.rst +++ b/doc/source/ops_runbook/maintenance.rst @@ -27,9 +27,9 @@ if you wait a while things get better. For example: -.. code:: +.. code:: console - sudo swift-recon -rla + $ sudo swift-recon -rla =============================================================================== [2012-03-10 12:57:21] Checking async pendings on 384 hosts... Async stats: low: 0, high: 1, avg: 0, total: 1 @@ -52,7 +52,7 @@ system. Rules-of-thumb for 'good' recon output are: - Nodes that respond are up and running Swift. If all nodes respond, that is a good sign. But some nodes may time out. For example: - .. code:: + .. code:: console -> [http://<redacted>.29:6200/recon/load:] <urlopen error [Errno 111] ECONNREFUSED> -> [http://<redacted>.31:6200/recon/load:] <urlopen error timed out> @@ -83,7 +83,7 @@ system. 
Rules-of-thumb for 'good' recon output are: For comparison here is the recon output for the same system above when two entire racks of Swift are down: -.. code:: +.. code:: console [2012-03-10 16:56:33] Checking async pendings on 384 hosts... -> http://<redacted>.22:6200/recon/async: <urlopen error timed out> @@ -152,9 +152,9 @@ Here is an example of noting and tracking down a problem with recon. Running reccon shows some async pendings: -.. code:: +.. code:: console - bob@notso:~/swift-1.4.4/swift$ ssh -q <redacted>.132.7 sudo swift-recon -alr + $ ssh -q <redacted>.132.7 sudo swift-recon -alr =============================================================================== [2012-03-14 17:25:55] Checking async pendings on 384 hosts... Async stats: low: 0, high: 23, avg: 8, total: 3356 @@ -172,9 +172,9 @@ Why? Running recon again with -av swift (not shown here) tells us that the node with the highest (23) is <redacted>.72.61. Looking at the log files on <redacted>.72.61 we see: -.. code:: +.. code:: console - souzab@<redacted>:~$ sudo tail -f /var/log/swift/background.log | - grep -i ERROR + $ sudo tail -f /var/log/swift/background.log | - grep -i ERROR Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201} Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted @@ -235,7 +235,7 @@ Procedure running the ring builder on a proxy node to determine which zones the storage nodes are in. For example: - .. code:: + .. code:: console % sudo swift-ring-builder /etc/swift/object.builder /etc/swift/object.builder, build version 1467 @@ -258,7 +258,7 @@ Procedure builder again, this time with the ``list_parts`` option and specify the nodes under consideration. For example: - .. code:: + .. 
code:: console % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 Partition Matches @@ -283,7 +283,7 @@ Procedure small, and is proportional to the number of entries that have a 3 in the Matches column. For example: - .. code:: + .. code:: console Partition Matches 26865 3 @@ -300,7 +300,7 @@ Procedure #. A quick way to count the number of rows with 3 matches is: - .. code:: + .. code:: console % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l diff --git a/doc/source/ops_runbook/procedures.rst b/doc/source/ops_runbook/procedures.rst index af28e020c..1d84d5969 100644 --- a/doc/source/ops_runbook/procedures.rst +++ b/doc/source/ops_runbook/procedures.rst @@ -10,13 +10,13 @@ Fix broken GPT table (broken disk partition) - If a GPT table is broken, a message like the following should be observed when the command... - .. code:: + .. code:: console $ sudo parted -l - ... is run. - .. code:: + .. code:: console ... Error: The backup GPT table is corrupt, but the primary appears OK, so that will @@ -25,13 +25,13 @@ Fix broken GPT table (broken disk partition) #. To fix this, firstly install the ``gdisk`` program to fix this: - .. code:: + .. code:: console $ sudo aptitude install gdisk #. Run ``gdisk`` for the particular drive with the damaged partition: - .. code: + .. code: console $ sudo gdisk /dev/sd*a-l* GPT fdisk (gdisk) version 0.6.14 @@ -57,7 +57,7 @@ Fix broken GPT table (broken disk partition) and finally ``w`` (write table to disk and exit). Will also need to enter ``Y`` when prompted in order to confirm actions. - .. code:: + .. code:: console Command (? for help): r @@ -92,7 +92,7 @@ Fix broken GPT table (broken disk partition) #. Running the command: - .. code:: + .. code:: console $ sudo parted /dev/sd# @@ -100,7 +100,7 @@ Fix broken GPT table (broken disk partition) #. Finally, uninstall ``gdisk`` from the node: - .. code:: + .. 
code:: console $ sudo aptitude remove gdisk @@ -112,20 +112,20 @@ Procedure: Fix broken XFS filesystem #. A filesystem may be corrupt or broken if the following output is observed when checking its label: - .. code:: + .. code:: console $ sudo xfs_admin -l /dev/sd# - cache_node_purge: refcount was 1, not zero (node=0x25d5ee0) - xfs_admin: cannot read root inode (117) - cache_node_purge: refcount was 1, not zero (node=0x25d92b0) - xfs_admin: cannot read realtime bitmap inode (117) - bad sb magic # 0 in AG 1 - failed to read label in AG 1 + cache_node_purge: refcount was 1, not zero (node=0x25d5ee0) + xfs_admin: cannot read root inode (117) + cache_node_purge: refcount was 1, not zero (node=0x25d92b0) + xfs_admin: cannot read realtime bitmap inode (117) + bad sb magic # 0 in AG 1 + failed to read label in AG 1 #. Run the following commands to remove the broken/corrupt filesystem and replace. (This example uses the filesystem ``/dev/sdb2``) Firstly need to replace the partition: - .. code:: + .. code:: console $ sudo parted GNU Parted 2.3 @@ -167,7 +167,7 @@ Procedure: Fix broken XFS filesystem #. Next step is to scrub the filesystem and format: - .. code:: + .. 
code:: console $ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024*1024)) count=1 1+0 records in @@ -175,19 +175,19 @@ Procedure: Fix broken XFS filesystem 1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s $ sudo /sbin/mkfs.xfs -f -i size=1024 /dev/sdb2 meta-data=/dev/sdb2 isize=1024 agcount=4, agsize=106811524 blks - = sectsz=512 attr=2, projid32bit=0 - data = bsize=4096 blocks=427246093, imaxpct=5 - = sunit=0 swidth=0 blks - naming =version 2 bsize=4096 ascii-ci=0 - log =internal log bsize=4096 blocks=208616, version=2 - = sectsz=512 sunit=0 blks, lazy-count=1 - realtime =none extsz=4096 blocks=0, rtextents=0 + = sectsz=512 attr=2, projid32bit=0 + data = bsize=4096 blocks=427246093, imaxpct=5 + = sunit=0 swidth=0 blks + naming =version 2 bsize=4096 ascii-ci=0 + log =internal log bsize=4096 blocks=208616, version=2 + = sectsz=512 sunit=0 blks, lazy-count=1 + realtime =none extsz=4096 blocks=0, rtextents=0 #. You should now label and mount your filesystem. #. Can now check to see if the filesystem is mounted using the command: - .. code:: + .. code:: console $ mount @@ -204,7 +204,7 @@ Procedure: Checking if an account is okay You must know the tenant/project ID. You can check if the account is okay as follows from a proxy. -.. code:: +.. code:: console $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_<project-id> @@ -214,7 +214,7 @@ containers, or an error indicating that the resource could not be found. Alternatively, you can use ``swift-get-nodes`` to find the account database files. Run the following on a proxy: -.. code:: +.. code:: console $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_<project-id> @@ -239,7 +239,7 @@ Log onto one of the swift proxy servers. Use swift-direct to show this accounts usage: -.. code:: +.. code:: console $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_<project-id> Status: 200 @@ -288,7 +288,7 @@ re-create the account as follows: servers). 
The output has been truncated so we can focus on the import pieces of data: - .. code:: + .. code:: console $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_4ebe3039674d4864a11fe0864ae4d905 ... @@ -308,7 +308,7 @@ re-create the account as follows: #. Before proceeding check that the account is really deleted by using curl. Execute the commands printed by ``swift-get-nodes``. For example: - .. code:: + .. code:: console $ curl -I -XHEAD "http://192.168.245.5:6202/disk1/3934/AUTH_4ebe3039674d4864a11fe0864ae4d905" HTTP/1.1 404 Not Found @@ -323,7 +323,7 @@ re-create the account as follows: #. Use the ssh commands printed by ``swift-get-nodes`` to check if database files exist. For example: - .. code:: + .. code:: console $ ssh 192.168.245.5 "ls -lah ${DEVICE:-/srv/node*}/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052" total 20K @@ -344,7 +344,7 @@ re-create the account as follows: #. Delete the database files. For example: - .. code:: + .. code:: console $ ssh 192.168.245.5 $ cd /srv/node/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052 @@ -374,9 +374,9 @@ balancers, customer's are not impacted by the misbehaving proxy. #. Shut down Swift as follows: - .. code:: + .. code:: console - sudo swift-init proxy shutdown + $ sudo swift-init proxy shutdown .. note:: @@ -384,15 +384,15 @@ balancers, customer's are not impacted by the misbehaving proxy. #. Create the ``/etc/swift/disabled-by-file`` file. For example: - .. code:: + .. code:: console - sudo touch /etc/swift/disabled-by-file + $ sudo touch /etc/swift/disabled-by-file #. Optional, restart Swift: - .. code:: + .. code:: console - sudo swift-init proxy start + $ sudo swift-init proxy start It works because the healthcheck middleware looks for /etc/swift/disabled-by-file. If it exists, the middleware will return 503/error instead of 200/OK. 
This means the load balancer @@ -403,9 +403,9 @@ Procedure: Ad-Hoc disk performance test You can get an idea whether a disk drive is performing as follows: -.. code:: +.. code:: console - sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-later + $ sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-later You can expect ~600MB/sec. If you get a low number, repeat many times as Swift itself may also read or write to the disk, hence giving a lower diff --git a/doc/source/ops_runbook/troubleshooting.rst b/doc/source/ops_runbook/troubleshooting.rst index cb7553fc6..75511010c 100644 --- a/doc/source/ops_runbook/troubleshooting.rst +++ b/doc/source/ops_runbook/troubleshooting.rst @@ -16,20 +16,20 @@ transactions from this user. The linux ``bzgrep`` command can be used to search all the proxy log files on a node including the ``.bz2`` compressed files. For example: -.. code:: +.. code:: console $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \ -w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139] \ 'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' | dshbak -c - . - . - ---------------- - <redacted>.132.6 - ---------------- - Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132 - <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af - /%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - - - tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130 + . + . + ---------------- + <redacted>.132.6 + ---------------- + Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132 + <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af + /%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - - + tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130 This shows a ``GET`` operation on the users account. 
@@ -40,7 +40,7 @@ This shows a ``GET`` operation on the users account. Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3`` you can search the swift object servers log files for this transaction ID: -.. code:: +.. code:: console $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \ -w <redacted>.72.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.204.[4-131] \ @@ -79,7 +79,7 @@ search the swift object servers log files for this transaction ID: Next, use the ``swift-get-nodes`` command to determine exactly where the user's account data is stored: -.. code:: +.. code:: console $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-4962-4692-98fb-52ddda82a5af Account AUTH_redacted-4962-4692-98fb-52ddda82a5af @@ -119,7 +119,7 @@ user's account data is stored: Check each of the primary servers, <redacted>.31, <redacted>.204.70 and <redacted>.72.16, for this users account. For example on <redacted>.72.16: -.. code:: +.. code:: console $ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/ total 1.0M @@ -131,7 +131,7 @@ this users account. For example on <redacted>.72.16: So this users account db, an sqlite db is present. Use sqlite to checkout the account: -.. code:: +.. code:: console $ sudo cp /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db /tmp $ sudo sqlite3 /tmp/1846d99185f8a0edaf65cfbf37439696.db @@ -156,7 +156,7 @@ checkout the account: why the GET operations are returning 404, not found. Check the account delete date/time: - .. code:: + .. code:: console $ python @@ -167,7 +167,7 @@ checkout the account: Next try and find the ``DELETE`` operation for this account in the proxy server logs: -.. code:: +.. 
code:: console $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \ -w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139|4-11,132-139] \ @@ -206,7 +206,7 @@ as follows: Examine the object in question: -.. code:: +.. code:: console $ sudo -u swift /opt/hp/swift/bin/swift-direct head 132345678912345 container_name obj_name @@ -219,14 +219,14 @@ name of the objects this means it is a DLO. For example, if ``X-Object-Manifest`` is ``container2/seg-blah``, list the contents of the container container2 as follows: -.. code:: +.. code:: console $ sudo -u swift /opt/hp/swift/bin/swift-direct show 132345678912345 container2 Pick out the objects whose names start with ``seg-blah``. Delete the segment objects as follows: -.. code:: +.. code:: console $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah01 $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah02 |
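Review note: the change in this patch is largely mechanical (each bare ``.. code::`` becomes ``.. code:: console`` and commands gain a ``$`` prompt), so it lends itself to an automated check. The sketch below is not part of the patch. The function name ``find_directive_issues`` and the two rules it applies are assumptions made for illustration. It flags bare ``.. code::`` directives, and also the single-colon form ``.. code:`` seen in the ``gdisk`` hunk of ``procedures.rst``. With only one colon, reStructuredText would most likely treat that line as a comment rather than a directive.

```python
import re

# Matches an explicit-markup line that names the "code" directive.
# "colons" captures one or two colons; "arg" captures the optional language.
_CODE_LINE = re.compile(
    r"^\s*\.\.\s+code(?P<colons>::?)(?:\s+(?P<arg>\S+))?\s*$"
)


def find_directive_issues(text):
    """Return (line_number, message) pairs for suspicious code directives."""
    issues = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = _CODE_LINE.match(line)
        if match is None:
            continue
        if match.group("colons") == ":":
            # One colon only: not a directive in reStructuredText.
            issues.append((number, "single colon: not a directive"))
        elif match.group("arg") is None:
            # Valid directive, but no language for the highlighter.
            issues.append((number, "bare directive: no language given"))
    return issues


if __name__ == "__main__":
    import sys

    # Usage: python check_code_directives.py file1.rst file2.rst ...
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for number, message in find_directive_issues(handle.read()):
                print(f"{path}:{number}: {message}")
```

Run over the four files under ``doc/source/ops_runbook``, a check of this kind should report nothing once every directive carries a language and two colons. It does not inspect the ``$`` prompts themselves.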