Diffstat (limited to 'doc/aio_design.txt')
-rw-r--r-- | doc/aio_design.txt | 215 |
1 files changed, 215 insertions, 0 deletions
diff --git a/doc/aio_design.txt b/doc/aio_design.txt
new file mode 100644
index 000000000..c6eb44352
--- /dev/null
+++ b/doc/aio_design.txt
@@ -0,0 +1,215 @@
+Introducing asynchronous I/O to LVM
+===================================
+
+Issuing I/O asynchronously means instructing the kernel to perform specific
+I/O and return immediately without waiting for it to complete. The data
+is collected from the kernel later.
+
+Advantages
+----------
+
+A1. While waiting for the I/O to happen, the program could perform other
+operations.
+
+A2. When LVM is searching for its Physical Volumes, it issues a small amount of
+I/O to a large number of disks. If this were issued in parallel, the overall
+runtime might be shorter while there should be little effect on the CPU time.
+
+A3. If more than one timeout occurs when accessing any devices, these can be
+taken in parallel, again reducing the runtime. This applies globally,
+not just while the code is searching for Physical Volumes, so reading,
+writing and committing the metadata may occasionally benefit too, to some
+extent, and there are probably maintenance advantages in using the same
+method of I/O throughout the main body of the code.
+
+A4. By introducing a simple callback function mechanism, the conversion can be
+performed largely incrementally by first refactoring and continuing to
+use synchronous I/O with the callbacks performed immediately. This allows the
+callbacks to be introduced without changing the running sequence of the code
+initially. Future projects could refactor some of the calling sites to
+simplify the code structure and even eliminate some of the nesting.
+This allows each part of what might ultimately amount to a large change to be
+introduced and tested independently.
+
+
+Disadvantages
+-------------
+
+D1. The resulting code may be more complex with more failure modes to
+handle.  Mitigate by thorough auditing and testing, rolling out
+gradually, and offering a simple switch to revert to the old behaviour.
+
+D2. The Linux asynchronous I/O implementation is less mature than
+its synchronous I/O implementation and might show up problems that
+depend on the version of the kernel or library used. Fixes or
+workarounds for some of these might require kernel changes. For
+example, there are suggestions that despite being supposedly async,
+there are still cases where system calls can block. There might be
+resource dependencies on other processes running on the system that make
+it unsuitable for use while any devices are suspended. Mitigation
+as for D1.
+
+D3. The error handling within callbacks becomes more complicated.
+However, we know that existing call paths can already sometimes discard
+errors, sometimes deliberately, sometimes not, so this aspect is in need
+of a complete review anyway and the new approach will make the error
+handling more transparent. Aim initially for overall behaviour that is
+no worse than that of the existing code, then work on improving it
+later.
+
+D4. The work will take a few weeks to code and test. This leads to a
+significant opportunity cost when compared against other enhancements
+that could be achieved in that time. However, the proof-of-concept work
+performed while writing this design has satisfied me that the work could
+proceed and be committed incrementally as a background task.
+
+
+Observations regarding LVM's I/O Architecture
+---------------------------------------------
+
+H1. All device, metadata and config file I/O is constrained to pass through a
+single route in lib/device.
+
+H2. The first step of the analysis was to instrument this code path with
+log_debug messages.  I/O is split into the following categories:
+
+  "dev signatures",
+  "PV labels",
+  "VG metadata header",
+  "VG metadata content",
+  "extra VG metadata header",
+  "extra VG metadata content",
+  "LVM1 metadata",
+  "pool metadata",
+  "LV content",
+  "logging",
+
+H3. A bounce buffer is used for most I/O.
+
+H4. Most callers finish using the supplied data before any further I/O is
+issued. The few that don't could be converted trivially to do so.
+
+H5. There is one stream of I/O per metadata area on each device.
+
+H6. Some reads fall at offsets close to immediately preceding reads, so it's
+possible to avoid these by caching one "block" per metadata area I/O stream.
+
+H7. Simple analysis suggests a minimum aligned read size of 8k would deliver
+immediate gains from this caching. A larger size might perform worse because
+almost all the time the extra data read would not be used, but this can be
+re-examined and tuned after the code is in place.
+
+
+Proposal
+--------
+
+P1. Retain the "single I/O path" but offer an asynchronous option.
+
+P2. Eliminate the bounce buffer in most cases by improving alignment.
+
+P3. Reduce the number of reads by always reading a minimum of an aligned
+8k block.
+
+P4. Eliminate repeated reads by caching the last block read and changing
+the lib/device interface to return a pointer to read-only data within
+this block.
+
+P5. Only perform these interface changes for code on the critical path
+for now by converting other code sites to use wrappers around the new
+interface.
+
+P6. Treat asynchronous I/O as the interface of choice and optimise only
+for this case.
+
+P7. Convert the callers on the critical path to pass callback functions
+to the device layer. These functions will be called later with the
+read-only data, a context pointer and a success/failure indicator.
+Where an existing function performs a sequence of I/O, this has the
+advantage of breaking up the large function into smaller ones and
+wrapping the parameters used into structures. While this might look
+rather messy and ad-hoc in the short-term, it's a first step towards
+breaking up confusingly long functions into component parts and wrapping
+the existing long parameter lists into more appropriate structures and
+refactoring these parts of the code.
+
+P8. Limit the resources used by the asynchronous I/O by using two
+tunable parameters, one limiting the number of outstanding I/Os issued
+and another limiting the total amount of memory used.
+
+P9. Provide a fallback option if asynchronous I/O is unavailable by
+sharing the code paths but issuing the I/O synchronously and calling the
+callback immediately.
+
+P10. Only allocate the buffer for the I/O at the point where the I/O is
+about to be issued.
+
+P11. If the thresholds are exceeded, add the request to a simple queue,
+and process it later after some I/O has completed.
+
+
+Future work
+-----------
+F1. Perform a complete review of the error tracking so that device
+failures are handled and reported more cleanly, extending the existing
+basic error counting mechanism.
+
+F2. Consider whether some of the nested callbacks can be eliminated,
+which would allow for additional simplifications.
+
+F3. Adjust the contents of the ad hoc context structs into more logical
+arrangements and use them more widely.
+
+F4. Perform wider refactoring of these areas of code.
+
+
+Testing considerations
+----------------------
+T1. The changes touch code on the device path, so a thorough re-test of
+the device layer is required. The new code needs a full audit down
+through the library layer into the kernel to check that all the error
+conditions that are currently implemented (such as EAGAIN) are handled
+sensibly. (LVM's I/O layer needs to remain as solid as we can make it.)
+
+T2.  The current test suite provides a reasonably broad range of coverage
+of this area but is far from comprehensive.
+
+
+Acceptance criteria
+-------------------
+A1. The current test suite should pass to the same extent as before the
+changes.
+
+A2. When all debugging and logging is disabled, strace -c must show
+improvements, e.g. the expected smaller number of reads.
+
+A3. Running a range of commands under valgrind must not reveal any
+new leaks due to the changes.
+
+A4. All new Coverity reports from the change must be addressed.
+
+A5. CPU time should be similar to that before, as the same work
+is being done overall, just in a different order.
+
+A6. Tests need to show improved behaviour in targeted areas. For example,
+if several devices are slow and time out, the delays should occur
+in parallel and the elapsed time should be less than before.
+
+
+Release considerations
+----------------------
+R1. Async I/O should be widely available and largely reliable on Linux
+nowadays (even though parts of its interface and implementation remain a
+matter of controversy) so we should try to make its use the default
+wherever it is supported. If certain types of systems have problems we
+should try to detect those cases and disable it automatically there.
+
+R2. Because the implications of an unexpected problem in the new code
+could be severe for the people affected, the roll out needs to be gentle
+without a deadline to allow us plenty of time to gain confidence in the
+new code. Our own testing will only be able to cover a tiny fraction of
+the different setups our users have, so we need to look out for problems
+caused by this proactively and encourage people to test it on their own
+systems and report back. It must go into the tree near the start of a
+release cycle rather than at the end to provide time for our confidence
+in it to grow.
+
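
Illustrative sketch (not part of the patch). Proposals P3, P4, P7, P9 and P10
describe an interface in which callers pass a callback, the device layer reads
an aligned 8k block, caches the last block per I/O stream, and in the fallback
case calls the callback immediately. The C below is one possible reading of
that description, not the code the author committed. Every name in it
(io_stream, io_stream_read, io_callback_fn, copy_callback, BLOCK_SIZE) is
hypothetical, and it makes the simplifying assumption that a single request
never crosses an aligned block boundary. Only the synchronous path is shown,
using pread().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192u /* minimum aligned read size suggested in H7/P3 */

/* Hypothetical callback type (P7): success flag, read-only data, context. */
typedef void (*io_callback_fn)(int ok, const char *data, size_t len,
			       void *context);

/* Hypothetical per-stream state (H5, P4): caches the last block read. */
struct io_stream {
	int fd;
	char *block;           /* allocated only when I/O is first issued (P10) */
	int valid;             /* non-zero if block holds usable data */
	uint64_t block_start;  /* aligned device offset of the cached block */
	size_t block_len;      /* bytes actually read into block */
	unsigned reads_issued; /* counts real reads, to show the saving */
};

void io_stream_init(struct io_stream *s, int fd)
{
	memset(s, 0, sizeof(*s));
	s->fd = fd;
}

void io_stream_destroy(struct io_stream *s)
{
	free(s->block);
	s->block = NULL;
	s->valid = 0;
}

/*
 * Synchronous fallback (P9): perform the read now and call the callback
 * before returning. Returns 1 on success, 0 on failure.
 * Assumption of this sketch: a request never crosses a block boundary.
 */
int io_stream_read(struct io_stream *s, uint64_t offset, size_t len,
		   io_callback_fn cb, void *context)
{
	uint64_t start = offset - (offset % BLOCK_SIZE);
	size_t skip = (size_t)(offset - start);
	ssize_t n;

	if (len > BLOCK_SIZE - skip)
		goto fail;

	if (!s->valid || s->block_start != start) {
		if (!s->block && !(s->block = malloc(BLOCK_SIZE)))
			goto fail;
		s->valid = 0;
		s->reads_issued++;
		n = pread(s->fd, s->block, BLOCK_SIZE, (off_t)start);
		if (n < 0)
			goto fail;
		s->block_start = start;
		s->block_len = (size_t)n;
		s->valid = 1;
	}

	if (skip + len > s->block_len)
		goto fail; /* short read: requested range is not available */

	cb(1, s->block + skip, len, context);
	return 1;
fail:
	cb(0, NULL, 0, context);
	return 0;
}

/* Example caller state and callback: copies out at most 64 bytes. */
struct copy_ctx { int ok; size_t len; char buf[64]; };

void copy_callback(int ok, const char *data, size_t len, void *context)
{
	struct copy_ctx *c = context;

	c->ok = ok;
	c->len = (ok && len <= sizeof(c->buf)) ? len : 0;
	if (c->len)
		memcpy(c->buf, data, c->len);
}
```

A caller would initialise one io_stream per metadata area, then call
io_stream_read() with its own callback and context struct. Two requests that
fall inside the same aligned 8k block cost a single pread(), which is the kind
of saving acceptance criterion A2 expects strace -c to show. An asynchronous
variant could keep the same callback signature and defer the call until the
kernel reports completion.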