Diffstat (limited to 'doc')
-rw-r--r-- | doc/aio_design.txt | 215 |
1 files changed, 0 insertions, 215 deletions
diff --git a/doc/aio_design.txt b/doc/aio_design.txt
deleted file mode 100644
index c6eb44352..000000000
--- a/doc/aio_design.txt
+++ /dev/null
@@ -1,215 +0,0 @@
-Introducing asynchronous I/O to LVM
-===================================
-
-Issuing I/O asynchronously means instructing the kernel to perform specific
-I/O and return immediately without waiting for it to complete. The data
-is collected from the kernel later.
-
-Advantages
-----------
-
-A1. While waiting for the I/O to happen, the program could perform other
-operations.
-
-A2. When LVM is searching for its Physical Volumes, it issues a small amount of
-I/O to a large number of disks. If this was issued in parallel the overall
-runtime might be shorter while there should be little effect on the cpu time.
-
-A3. If more than one timeout occurs when accessing any devices, these can be
-taken in parallel, again reducing the runtime. This applies globally,
-not just while the code is searching for Physical Volumes, so reading,
-writing and committing the metadata may occasionally benefit too to some
-extent and there are probably maintenance advantages in using the same
-method of I/O throughout the main body of the code.
-
-A4. By introducing a simple callback function mechanism, the conversion can be
-performed largely incrementally by first refactoring and continuing to
-use synchronous I/O with the callbacks performed immediately. This allows the
-callbacks to be introduced without changing the running sequence of the code
-initially. Future projects could refactor some of the calling sites to
-simplify the code structure and even eliminate some of the nesting.
-This allows each part of what might ultimately amount to a large change to be
-introduced and tested independently.
-
-
-Disadvantages
--------------
-
-D1. The resulting code may be more complex with more failure modes to
-handle. Mitigate by thorough auditing and testing, rolling out
-gradually, and offering a simple switch to revert to the old behaviour.
-
-D2. The linux asynchronous I/O implementation is less mature than
-its synchronous I/O implementation and might show up problems that
-depend on the version of the kernel or library used. Fixes or
-workarounds for some of these might require kernel changes. For
-example, there are suggestions that despite being supposedly async,
-there are still cases where system calls can block. There might be
-resource dependencies on other processes running on the system that make
-it unsuitable for use while any devices are suspended. Mitigation
-as for D1.
-
-D3. The error handling within callbacks becomes more complicated.
-However we know that existing call paths can already sometimes discard
-errors, sometimes deliberately, sometimes not, so this aspect is in need
-of a complete review anyway and the new approach will make the error
-handling more transparent. Aim initially for overall behaviour that is
-no worse than that of the existing code, then work on improving it
-later.
-
-D4. The work will take a few weeks to code and test. This leads to a
-significant opportunity cost when compared against other enhancements
-that could be achieved in that time. However, the proof-of-concept work
-performed while writing this design has satisfied me that the work could
-proceed and be committed incrementally as a background task.
-
-
-Observations regarding LVM's I/O Architecture
----------------------------------------------
-
-H1. All device, metadata and config file I/O is constrained to pass through a
-single route in lib/device.
-
-H2. The first step of the analysis was to instrument this code path with
-log_debug messages. I/O is split into the following categories:
-
-  "dev signatures",
-  "PV labels",
-  "VG metadata header",
-  "VG metadata content",
-  "extra VG metadata header",
-  "extra VG metadata content",
-  "LVM1 metadata",
-  "pool metadata",
-  "LV content",
-  "logging",
-
-H3. A bounce buffer is used for most I/O.
-
-H4. Most callers finish using the supplied data before any further I/O is
-issued. The few that don't could be converted trivially to do so.
-
-H5. There is one stream of I/O per metadata area on each device.
-
-H6. Some reads fall at offsets close to immediately preceding reads, so it's
-possible to avoid these by caching one "block" per metadata area I/O stream.
-
-H7. Simple analysis suggests a minimum aligned read size of 8k would deliver
-immediate gains from this caching. A larger size might perform worse because
-almost all the time the extra data read would not be used, but this can be
-re-examined and tuned after the code is in place.
-
-
-Proposal
---------
-
-P1. Retain the "single I/O path" but offer an asynchronous option.
-
-P2. Eliminate the bounce buffer in most cases by improving alignment.
-
-P3. Reduce the number of reads by always reading a minimum of an aligned
-8k block.
-
-P4. Eliminate repeated reads by caching the last block read and changing
-the lib/device interface to return a pointer to read-only data within
-this block.
-
-P5. Only perform these interface changes for code on the critical path
-for now by converting other code sites to use wrappers around the new
-interface.
-
-P6. Treat asynchronous I/O as the interface of choice and optimise only
-for this case.
-
-P7. Convert the callers on the critical path to pass callback functions
-to the device layer. These functions will be called later with the
-read-only data, a context pointer and a success/failure indicator.
-Where an existing function performs a sequence of I/O, this has the
-advantage of breaking up the large function into smaller ones and
-wrapping the parameters used into structures. While this might look
-rather messy and ad-hoc in the short-term, it's a first step towards
-breaking up confusingly long functions into component parts and wrapping
-the existing long parameter lists into more appropriate structures and
-refactoring these parts of the code.
-
-P8. Limit the resources used by the asynchronous I/O by using two
-tunable parameters, one limiting the number of outstanding I/Os issued
-and another limiting the total amount of memory used.
-
-P9. Provide a fallback option if asynchronous I/O is unavailable by
-sharing the code paths but issuing the I/O synchronously and calling the
-callback immediately.
-
-P10. Only allocate the buffer for the I/O at the point where the I/O is
-about to be issued.
-
-P11. If the thresholds are exceeded, add the request to a simple queue,
-and process it later after some I/O has completed.
-
-
-Future work
------------
-F1. Perform a complete review of the error tracking so that device
-failures are handled and reported more cleanly, extending the existing
-basic error counting mechanism.
-
-F2. Consider whether some of the nested callbacks can be eliminated,
-which would allow for additional simplifications.
-
-F3. Adjust the contents of the adhoc context structs into more logical
-arrangements and use them more widely.
-
-F4. Perform wider refactoring of these areas of code.
-
-
-Testing considerations
-----------------------
-T1. The changes touch code on the device path, so a thorough re-test of
-the device layer is required. The new code needs a full audit down
-through the library layer into the kernel to check that all the error
-conditions that are currently implemented (such as EAGAIN) are handled
-sensibly. (LVM's I/O layer needs to remain as solid as we can make it.)
-
-T2. The current test suite provides a reasonably broad range of coverage
-of this area but is far from comprehensive.
-
-
-Acceptance criteria
--------------------
-A1. The current test suite should pass to the same extent as before the
-changes.
-
-A2. When all debugging and logging is disabled, strace -c must show
-improvements e.g. the expected fewer number of reads.
-
-A3. Running a range of commands under valgrind must not reveal any
-new leaks due to the changes.
-
-A4. All new coverity reports from the change must be addressed.
-
-A5. CPU time should be similar to that before, as the same work
-is being done overall, just in a different order.
-
-A6. Tests need to show improved behaviour in targetted areas. For example,
-if several devices are slow and time out, the delays should occur
-in parallel and the elapsed time should be less than before.
-
-
-Release considerations
-----------------------
-R1. Async I/O should be widely available and largely reliable on linux
-nowadays (even though parts of its interface and implementation remain a
-matter of controversy) so we should try to make its use the default
-whereever it is supported. If certain types of systems have problems we
-should try to detect those cases and disable it automatically there.
-
-R2. Because the implications of an unexpected problem in the new code
-could be severe for the people affected, the roll out needs to be gentle
-without a deadline to allow us plenty of time to gain confidence in the
-new code. Our own testing will only be able to cover a tiny fraction of
-the different setups our users have, so we need to look out for problems
-caused by this proactively and encourage people to test it on their own
-systems and report back. It must go into the tree near the start of a
-release cycle rather than at the end to provide time for our confidence
-in it to grow.
-
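
Illustrative sketch (not part of the removed file): the callback scheme in P7 to P11 of the deleted design note may be easier to follow as a small model. The Python below is only a sketch under stated assumptions. It does not use kernel asynchronous I/O and is not LVM's lib/device code. The class name DeviceIO, its method names, and the use of a bytes object as a stand-in for a device are all hypothetical. It shows callers passing a callback and a context (P7), two tunable limits (P8), a synchronous fallback that calls the callback immediately (P9), buffer accounting only at issue time (P10), and a simple queue for requests that exceed the thresholds (P11).

```python
from collections import deque


class DeviceIO:
    """Hypothetical model of a single I/O path with callbacks and limits."""

    def __init__(self, max_outstanding=2, max_memory=16384, use_async=True):
        self.max_outstanding = max_outstanding  # P8: limit on outstanding I/Os
        self.max_memory = max_memory            # P8: limit on buffer memory
        self.use_async = use_async              # False selects the P9 fallback
        self.in_flight = []                     # requests issued, not yet completed
        self.waiting = deque()                  # P11: requests deferred by the limits
        self.memory_in_use = 0

    def read(self, source, offset, length, callback, context):
        """Request a read; callback(ok, data, context) reports the result."""
        request = (source, offset, length, callback, context)
        if not self.use_async:
            # P9: same code path, but the callback runs immediately.
            self._complete(request)
        elif self._within_limits(length):
            self._issue(request)
        else:
            self.waiting.append(request)

    def _within_limits(self, length):
        # Always allow one request so an oversized read cannot wait forever.
        if not self.in_flight:
            return True
        return (len(self.in_flight) < self.max_outstanding
                and self.memory_in_use + length <= self.max_memory)

    def _issue(self, request):
        # P10: the buffer is only accounted for when the I/O is issued.
        self.memory_in_use += request[2]
        self.in_flight.append(request)

    def _complete(self, request):
        source, offset, length, callback, context = request
        data = source[offset:offset + length]
        ok = len(data) == length  # a short read is reported as a failure
        callback(ok, data if ok else None, context)

    def process_completions(self):
        """Stand-in for collecting finished I/O from the kernel later."""
        while self.in_flight:
            request = self.in_flight.pop(0)
            self.memory_in_use -= request[2]
            self._complete(request)
            # P11: completed I/O frees resources for queued requests.
            while self.waiting and self._within_limits(self.waiting[0][2]):
                self._issue(self.waiting.popleft())
```

With max_outstanding=2, three reads issued back to back leave two in flight and one queued. No callback runs until process_completions() is called, at which point all three complete in submission order. With use_async=False each callback runs before read() returns, which mirrors the incremental conversion path described in A4.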