From 64a3bade2c8a515da82d3394cb8940316d34a85e Mon Sep 17 00:00:00 2001
From: Mike Christie <michael.christie@oracle.com>
Date: Tue, 14 Jun 2022 10:32:37 -0500
Subject: Update README's error handler/timeout section

The README's error handling and timeout section is outdated or
incorrect. This patch updates it.
---
 README | 214 ++++++++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 166 insertions(+), 48 deletions(-)
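
Note: the error-code table this patch adds to section 8.1.3 could also be
used as a small lookup helper when reading kernel logs. The following is a
minimal sketch and is not part of the patch or of open-iscsi. The function
names conn_err_name and decode_conn_err_line are hypothetical. Handling of
values above 1000 rests on the assumption that some kernels log the value with
the ISCSI_ERR_BASE offset (1000) added, so that 1022 would correspond to 22 in
the table.

```shell
#!/bin/sh
# Hypothetical helper: map the integer from "detected conn error (N)"
# to the kernel define listed in the README table.
conn_err_name() {
    case "$1" in
        1)  echo ISCSI_ERR_DATASN ;;
        2)  echo ISCSI_ERR_DATA_OFFSET ;;
        3)  echo ISCSI_ERR_MAX_CMDSN ;;
        4)  echo ISCSI_ERR_EXP_CMDSN ;;
        5)  echo ISCSI_ERR_BAD_OPCODE ;;
        6)  echo ISCSI_ERR_DATALEN ;;
        7)  echo ISCSI_ERR_AHSLEN ;;
        8)  echo ISCSI_ERR_PROTO ;;
        9)  echo ISCSI_ERR_LUN ;;
        10) echo ISCSI_ERR_BAD_ITT ;;
        11) echo ISCSI_ERR_CONN_FAILED ;;
        12) echo ISCSI_ERR_R2TSN ;;
        13) echo ISCSI_ERR_SESSION_FAILED ;;
        14) echo ISCSI_ERR_HDR_DGST ;;
        15) echo ISCSI_ERR_DATA_DGST ;;
        16) echo ISCSI_ERR_PARAM_NOT_FOUND ;;
        17) echo ISCSI_ERR_NO_SCSI_CMD ;;
        18) echo ISCSI_ERR_INVALID_HOST ;;
        19) echo ISCSI_ERR_XMIT_FAILED ;;
        20) echo ISCSI_ERR_TCP_CONN_CLOSE ;;
        21) echo ISCSI_ERR_SCSI_EH_SESSION_RST ;;
        22) echo ISCSI_ERR_NOP_TIMEDOUT ;;
        *)  echo UNKNOWN ;;
    esac
}

# Hypothetical helper: pull the code out of one log line and print
# "<code> <define>". Returns 1 if the line has no conn error message.
decode_conn_err_line() {
    code=$(printf '%s\n' "$1" |
        sed -n 's/.*detected conn error (\([0-9][0-9]*\)).*/\1/p')
    [ -n "$code" ] || return 1
    # Assumption: values above 1000 carry the ISCSI_ERR_BASE offset.
    if [ "$code" -gt 1000 ]; then
        code=$((code - 1000))
    fi
    printf '%s %s\n' "$code" "$(conn_err_name "$code")"
}
```

As a usage example, something like
dmesg | while IFS= read -r line; do decode_conn_err_line "$line"; done
would print one decoded entry per matching line, assuming the kernel log is
readable by the current user.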

diff --git a/README b/README
index 08b2419..5008b36 100644
--- a/README
+++ b/README
@@ -4,7 +4,7 @@
 
 =================================================================
 
-                                                   Mar 30, 2022
+                                                   Jun 6, 2022
 Contents
 ========
 
@@ -1431,11 +1431,11 @@ queued if all paths are failed in the multipath layer.
 =================================
 To quickly detect problems in the network, the iSCSI layer will send iSCSI
 pings (iSCSI NOP-Out requests) to the target. If a NOP-Out times out, the
-iSCSI layer will respond by failing running commands and asking the SCSI
-layer to requeue them if possible (SCSI disk commands get 5 retries if not
-using multipath). If dm-multipath is being used the SCSI layer will fail
-the command to the multipath layer instead of retrying. The multipath layer
-will then retry the command on another path.
+iSCSI layer will respond by failing the connection and starting the
+replacement_timeout. It will then tell the SCSI layer to stop the device queues
+so no new IO will be sent to the iSCSI layer and to requeue and retry the
+commands that were running if possible (see the next section on retrying
+commands and the replacement_timeout).
 
 To control how often a NOP-Out is sent, the following value can be set:
 	node.conn[0].timeo.noop_out_interval = X
@@ -1451,41 +1451,178 @@ Normally for these values you can use:
 	node.conn[0].timeo.noop_out_interval = 5
 	node.conn[0].timeo.noop_out_timeout = 10
 
-If there are a lot of IO error messages, then the above values may be too
-aggressive and you may need to increase the values for your network conditions
-and workload, or you may need to check your network for possible problems.
+If there are a lot of IO error messages like
 
+detected conn error (22)
 
-8.1.2 replacement_timeout
-=========================
+in the kernel log then the above values may be too aggressive. You may need to
+increase the values for your network conditions and workload, or you may need
+to check your network for possible problems.
 
-The next iSCSI timer that will need to be tweaked is:
+8.1.2 SCSI command retries
+==========================
 
-node.session.timeo.replacement_timeout = X
+SCSI disk commands get 5 retries by default. In newer kernels this can be
+controlled via the sysfs file:
+
+/sys/block/$sdX/device/scsi_disk/$host:$bus:$target:$LUN/max_retries
 
-Here X is in seconds.
+by writing an integer lower than 5 to reduce retries or setting it to -1 for
+infinite retries.
 
-replacement_timeout will control how long to wait for session re-establishment
-before failing pending SCSI commands and commands that are being operated on by
-the SCSI layer's error handler up to a higher level like multipath or to
-an application if multipath is not being used.
+The number of actual retries a command gets may be less than 5 or what is
+requested in max_retries if the replacement timeout expires. When that timer
+expires it tells the SCSI layer to fail all new and queued commands.
 
 
-8.1.2.1 Running Commands, the SCSI Error Handler, and replacement_timeout
-=========================================================================
+8.1.3 replacement_timeout
+=========================
+
+The iSCSI layer timer:
 
-Remember from the Nop-out discussion that if a network problem is detected,
-the running commands are failed immediately. There is one exception to this,
-and that is when the SCSI layer's error handler is running. To check if
-the SCSI error handler is running, iscsiadm can be run as:
+node.session.timeo.replacement_timeout = X
+
+controls how long to wait for session re-establishment before failing all SCSI
+commands:
+
+1. commands that have been requeued and awaiting a retry
+2. commands that are being operated on by the SCSI layer's error handler
+3. all new commands that are queued to the device
+
+up to a higher level like multipath, the filesystem layer, or the application.
+
+The setting is in seconds. Zero means to fail immediately. -1 means an infinite
+timeout which will wait until iscsid does a relogin, the user runs the iscsiadm
+logout command or until the node.session.reopen_max limit is hit.
+
+When this timer is started, the iSCSI layer will stop new IO from executing
+and requeue running commands to the Block/SCSI layer. The new and requeued
+commands will then sit in the Block/SCSI layer queue until the timeout has
+expired, there is userspace intervention like an iscsiadm logout command, or
+there is a successful relogin. If the command has run out of retries, the
+command will be failed instead of being requeued.
+
+After this timer has expired iscsid can continue to try to relogin. By default
+iscsid will continue to try to relogin until there is a successful relogin or
+until the user runs the iscsiadm logout command. The number of relogin retries
+is controlled by the open-iscsi setting node.session.reopen_max. If that is set
+too low, iscsid may give up and forcefully log out the session (equivalent to
+running the iscsiadm logout command on a failed session) before
+replacement_timeout seconds have passed. This will result in all commands
+being failed at that time. The user would then have to manually relogin.
+
+This timer starts when you see the connection error message:
+
+detected conn error (%d)
+
+in the kernel log. The %d will be an integer with the following mappings
+and meanings:
+
+Int     Kernel define           Description
+value
+------------------------------------------------------------------------------
+1	ISCSI_ERR_DATASN	Low level iSCSI protocol error where a data
+				sequence value did not match the expected value.
+2	ISCSI_ERR_DATA_OFFSET	There was an error where we were asked to
+				read/write past a buffer's length.
+3	ISCSI_ERR_MAX_CMDSN	Low level iSCSI protocol error where we got an
+				invalid MaxCmdSN value.
+4	ISCSI_ERR_EXP_CMDSN	Low level iSCSI protocol error where the
+				ExpCmdSN from the target didn't match the
+				expected value.
+5	ISCSI_ERR_BAD_OPCODE	The iSCSI Target has sent an invalid or unknown
+				opcode.
+6	ISCSI_ERR_DATALEN	The iSCSI target has sent a PDU with a data
+				length that is invalid.
+7	ISCSI_ERR_AHSLEN	The iSCSI target has sent a PDU with an invalid
+				Additional Header Length.
+8	ISCSI_ERR_PROTO		The iSCSI target has performed an operation that
+				violated the iSCSI RFC.
+9	ISCSI_ERR_LUN		The iSCSI target has requested an invalid LUN.
+10	ISCSI_ERR_BAD_ITT       The iSCSI target has sent an invalid Initiator
+				Task Tag.
+11	ISCSI_ERR_CONN_FAILED   Generic error that can indicate the transmission
+				of a PDU, like a SCSI cmd or task management
+				function, has timed out. Or, we are not able to
+				transmit a PDU because the network layer has
+				returned an error, or we have detected a
+				network error like a link down. It can
+				sometimes be an error that does not fit the
+				other error codes, such as when a kernel
+				function has returned a failure and there is no
+				other way to recover from it except to try and
+				kill the existing session and relogin.
+12	ISCSI_ERR_R2TSN		Low level iSCSI protocol error where the R2T
+				sequence numbers do not match.
+13	ISCSI_ERR_SESSION_FAILED
+				Unused.
+14	ISCSI_ERR_HDR_DGST	iSCSI Header Digest error.
+15	ISCSI_ERR_DATA_DGST	iSCSI Data Digest error.
+16	ISCSI_ERR_PARAM_NOT_FOUND
+				Userspace has passed the kernel an unknown
+				setting.
+17	ISCSI_ERR_NO_SCSI_CMD	The iSCSI target has sent an ITT for an unknown
+				task.
+18	ISCSI_ERR_INVALID_HOST	The iSCSI Host is no longer present or being
+				removed.
+19	ISCSI_ERR_XMIT_FAILED	The software iSCSI initiator or cxgb was not
+				able to transmit a PDU because of a network
+				layer error.
+20	ISCSI_ERR_TCP_CONN_CLOSE
+				The iSCSI target has closed the connection.
+21	ISCSI_ERR_SCSI_EH_SESSION_RST
+				The SCSI layer's Error Handler has timed out
+				the SCSI cmd, tried to abort it and possibly
+				tried to send a LUN RESET, and it's now
+				going to drop the session.
+22	ISCSI_ERR_NOP_TIMEDOUT	An iSCSI NOP-Out sent as a ping has timed out.
+
+
+8.1.4 Running Commands, the SCSI Error Handler, and replacement_timeout
+=======================================================================
+
+Each SCSI command has a timer controlled by
+
+/sys/block/sdX/device/timeout
+
+The value is in seconds and the default ranges from 30 to 60 seconds
+depending on the distro's udev scripts.
+
+When a command is sent to the iSCSI layer the timer is started, and when it's
+returned to the SCSI layer the timer is stopped. This could be for successful
+completion or because of a retry/requeue caused by a conn error as described
+previously. If a command is retried the timer is reset.
+
+When the command timer fires, the SCSI layer will ask the iSCSI layer to abort
+the command by sending an ABORT_TASK task management request. If the abort
+is successful the SCSI layer retries the command if it has enough retries left.
+If the abort times out, the iSCSI layer will report failure to the SCSI layer
+and will fire an ISCSI_ERR_SCSI_EH_SESSION_RST error. In the logs you will
+see:
+
+detected conn error (21)
+
+The ISCSI_ERR_SCSI_EH_SESSION_RST will cause the connection/session to be
+dropped and the iSCSI layer will start the replacement_timeout operations
+described in that section.
+
+The SCSI layer will then eventually call the iSCSI layer's target/session reset
+callout which will wait for the replacement timeout to expire, a successful
+relogin to occur, or for userspace to log out the session.
+
+- If the replacement timeout fires, then commands will be failed upwards as
+described in the replacement timeout section. The SCSI devices will be put
+into an offline state until iscsid performs a relogin.
+
+- If a relogin occurs before the timer fires, commands will be retried if
+possible.
+
+To check if the SCSI error handler is running, iscsiadm can be run as:
 	iscsiadm -m session -P 3
 
-You will then see:
+and you will see:
 	Host Number: X State: Recovery
 
-When the SCSI EH is running, commands will not be failed until
-node.session.timeo.replacement_timeout seconds.
-
 To modify the timer that starts the SCSI EH, you can either write
 directly to the device's sysfs file:
 	echo X > /sys/block/sdX/device/timeout
@@ -1506,26 +1643,7 @@ is not being used. If udev is used the default is the above value which
 is normally 60 seconds.
 
 
-8.1.2.2 Pending Commands and replacement_timeout
-================================================
-
-Commonly, the SCSI/BLOCK layer will queue 256 commands, but the path can
-only take 32. When a network problem is detected, the 32 commands
-in flight will be sent back to the SCSI layer immediately and because
-multipath is being used, this will cause the commands to be sent to the multipath
-layer for execution on another path. However, the other 96 commands that were
-still in the SCSI/BLOCK queue will remain there until the session is
-re-established or until node.session.timeo.replacement_timeout seconds has
-gone by. After replacement_timeout seconds, the pending commands will be
-failed to the multipath layer, and all new incoming commands will be
-immediately failed back to the multipath layer. If a session is later
-re-established, then new commands will be queued and executed. Normally,
-multipathd's path tester mechanism will detect that the session has been
-re-established and the path is accessible again, and it will inform
-dm-multipath.
-
-
-8.1.3 Optimal replacement_timeout Value
+8.1.5 Optimal replacement_timeout Value
 =======================================
 
 The default value for replacement_timeout is 120 seconds, but because
-- 
cgit v1.2.1