device: Basic config and setup to support async I/O.

author: Alasdair G Kergon <agk@redhat.com> 2018-01-22 17:45:12 +0000
committer: Alasdair G Kergon <agk@redhat.com> 2018-02-08 20:15:14 +0000
commit: 8c7bbcfb0f6771a10acd238057fdb7931c3fbb12 (patch)
tree: 89b9d4bf11569d2da9f59396074adde5a6fdcecc /doc
parent: 7a9af3cd0e609e10840202ae1754ab76b6add511 (diff)
download: lvm2-8c7bbcfb0f6771a10acd238057fdb7931c3fbb12.tar.gz
1 files changed, 215 insertions, 0 deletions
diff --git a/doc/aio_design.txt b/doc/aio_design.txt
new file mode 100644
index 000000000..c6eb44352
--- /dev/null
+++ b/doc/aio_design.txt
@@ -0,0 +1,215 @@
+Introducing asynchronous I/O to LVM
+===================================
+
+Issuing I/O asynchronously means instructing the kernel to perform specific
+I/O and return immediately without waiting for it to complete.  The data
+is collected from the kernel later.
+
+Advantages
+----------
+
+A1. While waiting for the I/O to happen, the program could perform other
+operations.
+
+A2. When LVM is searching for its Physical Volumes, it issues a small amount of
+I/O to a large number of disks.  If this were issued in parallel, the overall
+runtime might be shorter while there should be little effect on the CPU time.
+
+A3. If more than one timeout occurs when accessing any devices, these can be
+taken in parallel, again reducing the runtime.  This applies globally,
+not just while the code is searching for Physical Volumes, so reading,
+writing and committing the metadata may occasionally benefit too.  There
+are probably also maintenance advantages in using the same method of I/O
+throughout the main body of the code.
+
+A4. By introducing a simple callback function mechanism, the conversion can be
+performed largely incrementally by first refactoring and continuing to
+use synchronous I/O with the callbacks performed immediately.  This allows the
+callbacks to be introduced without changing the running sequence of the code
+initially.  Future projects could refactor some of the calling sites to
+simplify the code structure and even eliminate some of the nesting.
+This allows each part of what might ultimately amount to a large change to be
+introduced and tested independently.
+
+
+Disadvantages
+-------------
+
+D1. The resulting code may be more complex with more failure modes to
+handle.  Mitigate by thorough auditing and testing, rolling out
+gradually, and offering a simple switch to revert to the old behaviour.
+
+D2. The Linux asynchronous I/O implementation is less mature than
+its synchronous I/O implementation and might show up problems that
+depend on the version of the kernel or library used.  Fixes or
+workarounds for some of these might require kernel changes.  For
+example, there are suggestions that despite being supposedly async,
+there are still cases where system calls can block.  There might be
+resource dependencies on other processes running on the system that make
+it unsuitable for use while any devices are suspended.  Mitigation
+as for D1.
+
+D3. The error handling within callbacks becomes more complicated.
+However we know that existing call paths can already sometimes discard
+errors, sometimes deliberately, sometimes not, so this aspect is in need
+of a complete review anyway and the new approach will make the error
+handling more transparent.  Aim initially for overall behaviour that is
+no worse than that of the existing code, then work on improving it
+later.
+
+D4. The work will take a few weeks to code and test.  This leads to a
+significant opportunity cost when compared against other enhancements
+that could be achieved in that time.  However, the proof-of-concept work
+performed while writing this design has satisfied me that the work could
+proceed and be committed incrementally as a background task.
+
+
+Observations regarding LVM's I/O Architecture
+---------------------------------------------
+
+H1. All device, metadata and config file I/O is constrained to pass through a
+single route in lib/device.
+
+H2. The first step of the analysis was to instrument this code path with
+log_debug messages.  I/O is split into the following categories:
+
+        "dev signatures",
+        "PV labels",
+        "VG metadata header",
+        "VG metadata content",
+        "extra VG metadata header",
+        "extra VG metadata content",
+        "LVM1 metadata",
+        "pool metadata",
+        "LV content",
+        "logging",
+
+H3. A bounce buffer is used for most I/O.
+
+H4. Most callers finish using the supplied data before any further I/O is
+issued.  The few that don't could be converted trivially to do so.
+
+H5. There is one stream of I/O per metadata area on each device.
+
+H6. Some reads fall at offsets close to immediately preceding reads, so it's
+possible to avoid these by caching one "block" per metadata area I/O stream.
+
+H7. Simple analysis suggests a minimum aligned read size of 8k would deliver
+immediate gains from this caching.  A larger size might perform worse because
+almost all the time the extra data read would not be used, but this can be
+re-examined and tuned after the code is in place.
+
+
+Proposal
+--------
+
+P1. Retain the "single I/O path" but offer an asynchronous option.
+
+P2. Eliminate the bounce buffer in most cases by improving alignment.
+
+P3. Reduce the number of reads by always reading a minimum of an aligned
+8k block.
+
+P4. Eliminate repeated reads by caching the last block read and changing
+the lib/device interface to return a pointer to read-only data within
+this block.
+
+P5. Only perform these interface changes for code on the critical path
+for now by converting other code sites to use wrappers around the new
+interface.
+
+P6. Treat asynchronous I/O as the interface of choice and optimise only
+for this case.
+
+P7. Convert the callers on the critical path to pass callback functions
+to the device layer.  These functions will be called later with the
+read-only data, a context pointer and a success/failure indicator.
+Where an existing function performs a sequence of I/O, this has the
+advantage of breaking up the large function into smaller ones and
+wrapping the parameters used into structures.  While this might look
+rather messy and ad-hoc in the short-term, it's a first step towards
+breaking up confusingly long functions into component parts and wrapping
+the existing long parameter lists into more appropriate structures and
+refactoring these parts of the code.
+
+P8. Limit the resources used by the asynchronous I/O by using two
+tunable parameters, one limiting the number of outstanding I/Os issued
+and another limiting the total amount of memory used.
+
+P9. Provide a fallback option if asynchronous I/O is unavailable by
+sharing the code paths but issuing the I/O synchronously and calling the
+callback immediately.
+
+P10. Only allocate the buffer for the I/O at the point where the I/O is
+about to be issued.
+
+P11. If the thresholds are exceeded, add the request to a simple queue,
+and process it later after some I/O has completed.
+
+
+Future work
+-----------
+F1. Perform a complete review of the error tracking so that device
+failures are handled and reported more cleanly, extending the existing
+basic error counting mechanism.
+
+F2. Consider whether some of the nested callbacks can be eliminated,
+which would allow for additional simplifications.
+
+F3. Adjust the contents of the ad-hoc context structs into more logical
+arrangements and use them more widely.
+
+F4. Perform wider refactoring of these areas of code.
+
+
+Testing considerations
+----------------------
+T1. The changes touch code on the device path, so a thorough re-test of
+the device layer is required.  The new code needs a full audit down
+through the library layer into the kernel to check that all the error
+conditions that are currently implemented (such as EAGAIN) are handled
+sensibly. (LVM's I/O layer needs to remain as solid as we can make it.)
+
+T2. The current test suite provides a reasonably broad range of coverage
+of this area but is far from comprehensive.
+
+
+Acceptance criteria
+-------------------
+A1. The current test suite should pass to the same extent as before the
+changes.
+
+A2. When all debugging and logging is disabled, strace -c must show
+improvements, e.g. the expected reduction in the number of reads.
+
+A3. Running a range of commands under valgrind must not reveal any
+new leaks due to the changes.
+
+A4. All new Coverity reports from the change must be addressed.
+
+A5. CPU time should be similar to that before, as the same work
+is being done overall, just in a different order.
+
+A6. Tests need to show improved behaviour in targeted areas.  For example,
+if several devices are slow and time out, the delays should occur
+in parallel and the elapsed time should be less than before.
+
+
+Release considerations
+----------------------
+R1. Async I/O should be widely available and largely reliable on Linux
+nowadays (even though parts of its interface and implementation remain a
+matter of controversy) so we should try to make its use the default
+wherever it is supported.  If certain types of systems have problems we
+should try to detect those cases and disable it automatically there.
+
+R2. Because the implications of an unexpected problem in the new code
+could be severe for the people affected, the rollout needs to be gentle
+without a deadline to allow us plenty of time to gain confidence in the
+new code.  Our own testing will only be able to cover a tiny fraction of
+the different setups our users have, so we need to look out for problems
+caused by this proactively and encourage people to test it on their own
+systems and report back.  It must go into the tree near the start of a
+release cycle rather than at the end to provide time for our confidence
+in it to grow.
+
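The callback mechanism described in A4, P7 and P9 of the design above can
be sketched roughly as follows.  This is a minimal illustration, not the
code from the LVM tree: the names dev_read_callback, dev_read_callback_fn,
label_ctx and label_read_done are hypothetical, and only the synchronous
fallback of P9 is shown, where the I/O is issued with pread() and the
callback is invoked immediately with the read-only data, a context
pointer and a failure indicator.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type: failure flag, caller context, read-only data. */
typedef void (*dev_read_callback_fn)(int failed, void *context,
                                     const void *data);

/* Hypothetical context struct wrapping what the caller needs afterwards. */
struct label_ctx {
        int called;     /* how many times the callback ran */
        int failed;     /* failure indicator passed by the device layer */
        int found;      /* set if the sample signature was present */
};

/* Example callback: checks for an 8-byte sample signature. */
static void label_read_done(int failed, void *context, const void *data)
{
        struct label_ctx *ctx = context;

        ctx->called++;
        ctx->failed = failed;
        ctx->found = !failed && !memcmp(data, "LABELONE", 8);
}

/*
 * Synchronous fallback: perform the read now and call the callback
 * immediately, so callers share one code path with an async variant.
 * Returns 1 on success, 0 on failure (a short read counts as failure).
 */
static int dev_read_callback(int fd, off_t offset, size_t len, void *buf,
                             dev_read_callback_fn cb, void *context)
{
        ssize_t n;
        int failed;

        assert(cb != NULL);

        n = pread(fd, buf, len, offset);
        failed = (n < 0 || (size_t) n != len);

        /* On failure the callback receives no data pointer. */
        cb(failed, context, failed ? NULL : buf);

        return !failed;
}
```

A caller would fill in a label_ctx, pass label_read_done together with the
context pointer, and inspect the context once the callback has run.  An
asynchronous variant would keep the same callback signature and only defer
the point at which the callback is invoked.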
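The block caching proposed in H6, H7, P3 and P4 of the design above can be
illustrated in a similar way.  The sketch below assumes one cache per
metadata area I/O stream and an 8k aligned block; block_cache, cached_read
and BLOCK_SIZE are hypothetical names, and requests that cross a block
boundary are simply rejected to keep the example short.  The function
returns a pointer to read-only data inside the cached block and counts
the reads actually issued.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192u        /* the 8k minimum aligned read size */

/* Hypothetical per-stream cache holding the last block read. */
struct block_cache {
        int valid;              /* non-zero once data[] holds a block */
        off_t start;            /* aligned device offset of data[0] */
        size_t len;             /* bytes read; may be short at end of device */
        unsigned reads;         /* number of pread() calls issued */
        char data[BLOCK_SIZE];
};

/*
 * Return a pointer to len bytes at offset, reading the enclosing aligned
 * block only if it is not the one already cached.  Returns NULL on error
 * or if the request does not fit inside a single block.
 */
static const void *cached_read(struct block_cache *c, int fd,
                               off_t offset, size_t len)
{
        off_t start;
        size_t skip;
        ssize_t n;

        if (offset < 0)
                return NULL;

        start = offset - (offset % (off_t) BLOCK_SIZE);
        skip = (size_t) (offset - start);
        assert(skip < BLOCK_SIZE);

        if (len > BLOCK_SIZE - skip)
                return NULL;    /* crosses a block boundary: not handled */

        if (!c->valid || c->start != start) {
                c->valid = 0;
                c->reads++;
                n = pread(fd, c->data, BLOCK_SIZE, start);
                if (n < 0)
                        return NULL;
                c->start = start;
                c->len = (size_t) n;
                c->valid = 1;
        }

        if (skip + len > c->len)
                return NULL;    /* request extends past the data read */

        return c->data + skip;
}
```

With this arrangement, a second read at an offset close to the first one
is served from memory, which is the saving that strace -c is expected to
show in the acceptance criteria.  Whether 8k is the best block size is,
as H7 notes, something to tune once the code is in place.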