Introducing asynchronous I/O to LVM
===================================

Issuing I/O asynchronously means instructing the kernel to perform specific
I/O and return immediately without waiting for it to complete.  The data
is collected from the kernel later.

Advantages
----------

A1. While waiting for the I/O to happen, the program could perform other
operations.

A2. When LVM is searching for its Physical Volumes, it issues a small amount of
I/O to a large number of disks.  If this were issued in parallel, the overall
runtime might be shorter, while there should be little effect on the CPU time.

A3. If more than one timeout occurs when accessing devices, these can be
taken in parallel, again reducing the runtime.  This applies globally,
not just while the code is searching for Physical Volumes, so reading,
writing and committing the metadata may occasionally benefit too.  There
are probably also maintenance advantages in using the same method of I/O
throughout the main body of the code.

A4. By introducing a simple callback function mechanism, the conversion can be
performed largely incrementally by first refactoring and continuing to
use synchronous I/O with the callbacks performed immediately.  This allows the
callbacks to be introduced without changing the running sequence of the code
initially.  Future projects could refactor some of the calling sites to
simplify the code structure and even eliminate some of the nesting.
This allows each part of what might ultimately amount to a large change to be
introduced and tested independently.
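
As an illustration of A4, the following is a minimal sketch of a read
function that takes a callback but still performs the I/O synchronously,
so the callback runs before the function returns.  It assumes a POSIX
system and a callback that receives a success/failure indicator, a context
pointer and the read-only data.  All the names used (dev_io_callback_fn,
dev_read_callback, struct label_check, label_check_cb) are hypothetical
and are not taken from the LVM source.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for pread() and fileno(); precedes the includes */
#endif
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type: ok is 1 on success and 0 on failure. */
typedef void (*dev_io_callback_fn)(int ok, void *context, const void *data);

/*
 * Synchronous stand-in for an asynchronous read.  The read is done with
 * pread(2) and the callback runs before this function returns, so the
 * running sequence of the caller is unchanged.
 */
int dev_read_callback(int fd, off_t offset, size_t len,
		      dev_io_callback_fn callback, void *context)
{
	char *buf;
	size_t done = 0;
	ssize_t n;
	int ok = 1;

	assert(callback != NULL);

	if (!(buf = (char *) malloc(len ? len : 1))) {
		callback(0, context, NULL);
		return 0;
	}

	while (done < len) {
		n = pread(fd, buf + done, len - done, offset + (off_t) done);
		if (n < 0 && errno == EINTR)
			continue;	/* interrupted: retry */
		if (n <= 0) {		/* error or unexpected end of file */
			ok = 0;
			break;
		}
		done += (size_t) n;
	}

	/* The data is only valid for the duration of the callback. */
	callback(ok, context, ok ? buf : NULL);
	free(buf);

	return ok;
}

/* Example caller state wrapped in a context struct (hypothetical). */
struct label_check {
	int called;
	int ok;
	char first_byte;
};

/* Example callback: records the outcome in the context. */
void label_check_cb(int ok, void *context, const void *data)
{
	struct label_check *lc = (struct label_check *) context;

	lc->called++;
	lc->ok = ok;
	lc->first_byte = ok ? *(const char *) data : '\0';
}
```

A caller would pass label_check_cb and a pointer to a struct label_check,
then inspect the struct afterwards.  Because the callback has already run
when dev_read_callback() returns, existing call sites can be converted one
at a time without reordering their work.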


Disadvantages
-------------

D1. The resulting code may be more complex with more failure modes to
handle.  Mitigate by thorough auditing and testing, rolling out
gradually, and offering a simple switch to revert to the old behaviour.

D2. The Linux asynchronous I/O implementation is less mature than
its synchronous I/O implementation and might show up problems that
depend on the version of the kernel or library used.  Fixes or
workarounds for some of these might require kernel changes.  For
example, there are suggestions that despite being supposedly async,
there are still cases where system calls can block.  There might be
resource dependencies on other processes running on the system that make
it unsuitable for use while any devices are suspended.  Mitigation
as for D1.

D3. The error handling within callbacks becomes more complicated.
However we know that existing call paths can already sometimes discard
errors, sometimes deliberately, sometimes not, so this aspect is in need
of a complete review anyway and the new approach will make the error
handling more transparent.  Aim initially for overall behaviour that is
no worse than that of the existing code, then work on improving it
later.

D4. The work will take a few weeks to code and test.  This leads to a
significant opportunity cost when compared against other enhancements
that could be achieved in that time.  However, the proof-of-concept work
performed while writing this design has satisfied me that the work could
proceed and be committed incrementally as a background task.


Observations regarding LVM's I/O Architecture
---------------------------------------------

H1. All device, metadata and config file I/O is constrained to pass through a
single route in lib/device.

H2. The first step of the analysis was to instrument this code path with
log_debug messages.  I/O is split into the following categories:

        "dev signatures",
        "PV labels",
        "VG metadata header",
        "VG metadata content",
        "extra VG metadata header",
        "extra VG metadata content",
        "LVM1 metadata",
        "pool metadata",
        "LV content",
        "logging",

H3. A bounce buffer is used for most I/O.

H4. Most callers finish using the supplied data before any further I/O is
issued.  The few that don't could be converted trivially to do so.

H5. There is one stream of I/O per metadata area on each device.

H6. Some reads fall at offsets close to immediately preceding reads, so it's
possible to avoid these by caching one "block" per metadata area I/O stream.

H7. Simple analysis suggests a minimum aligned read size of 8k would deliver
immediate gains from this caching.  A larger size might perform worse because
almost all the time the extra data read would not be used, but this can be
re-examined and tuned after the code is in place.
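
The caching described in H6 and H7 can be sketched as follows.  This is
only an illustration under stated assumptions: the device is simulated by
an in-memory array, each stream caches a single aligned 8k block, and a
request that crosses a block boundary is simply rejected to keep the
example short.  The names (struct io_stream, block_start, stream_read)
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8192u	/* minimum aligned read size suggested in H7 */

/* Hypothetical state for one metadata area I/O stream (H5). */
struct io_stream {
	const unsigned char *dev;	/* in-memory stand-in for the device */
	uint64_t dev_size;		/* size of the stand-in in bytes */
	uint64_t cached_start;		/* device offset of the cached block */
	int cache_valid;
	unsigned reads_issued;		/* counts simulated device reads */
	unsigned char block[BLOCK_SIZE];
};

/* Round an offset down to the start of its aligned block. */
uint64_t block_start(uint64_t offset)
{
	return offset - (offset % BLOCK_SIZE);
}

/*
 * Return a pointer to read-only data inside the cached block, reading
 * the enclosing aligned block first if it is not already cached.
 * Returns NULL if the request is empty, runs past the end of the device
 * or crosses a block boundary.
 */
const unsigned char *stream_read(struct io_stream *s, uint64_t offset,
				 size_t len)
{
	uint64_t start = block_start(offset);
	uint64_t avail;

	assert(s != NULL && s->dev != NULL);

	if (!len || offset >= s->dev_size || len > s->dev_size - offset)
		return NULL;	/* empty or beyond the end of the device */
	if (len > BLOCK_SIZE - (offset - start))
		return NULL;	/* crosses a block boundary */

	if (!s->cache_valid || s->cached_start != start) {
		avail = s->dev_size - start;
		memcpy(s->block, s->dev + start,
		       (size_t) (avail < BLOCK_SIZE ? avail : BLOCK_SIZE));
		s->cached_start = start;
		s->cache_valid = 1;
		s->reads_issued++;
	}

	return s->block + (offset - start);
}
```

With this sketch, a 512-byte read at offset 4096 followed by another at
offset 4608 increments reads_issued once, because both fall inside the
block starting at offset 0.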


Proposal
--------

P1. Retain the "single I/O path" but offer an asynchronous option.

P2. Eliminate the bounce buffer in most cases by improving alignment.

P3. Reduce the number of reads by always reading a minimum of an aligned
8k block.

P4. Eliminate repeated reads by caching the last block read and changing
the lib/device interface to return a pointer to read-only data within
this block.

P5. Only perform these interface changes for code on the critical path
for now by converting other code sites to use wrappers around the new
interface.

P6. Treat asynchronous I/O as the interface of choice and optimise only
for this case.

P7. Convert the callers on the critical path to pass callback functions
to the device layer.  These functions will be called later with the
read-only data, a context pointer and a success/failure indicator.
Where an existing function performs a sequence of I/O, this has the
advantage of breaking up the large function into smaller ones and
wrapping the parameters used into structures.  While this might look
rather messy and ad-hoc in the short-term, it's a first step towards
breaking up confusingly long functions into component parts and wrapping
the existing long parameter lists into more appropriate structures and
refactoring these parts of the code.

P8. Limit the resources used by the asynchronous I/O by using two
tunable parameters, one limiting the number of outstanding I/Os issued
and another limiting the total amount of memory used.

P9. Provide a fallback option if asynchronous I/O is unavailable by
sharing the code paths but issuing the I/O synchronously and calling the
callback immediately.

P10. Only allocate the buffer for the I/O at the point where the I/O is
about to be issued.

P11. If the thresholds are exceeded, add the request to a simple queue,
and process it later after some I/O has completed.
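
The interaction of P8, P10 and P11 can be sketched as follows.  This is a
simplified model rather than a proposed implementation: no real I/O is
issued, the queue is processed in FIFO order (an assumption, since the
ordering is not specified above), and every name (struct io_request,
struct io_limits, request_try_issue, request_submit, request_complete) is
hypothetical.  It assumes max_ios is non-zero and that no single request
is larger than max_bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct io_request {
	size_t len;			/* bytes to transfer */
	void *buf;			/* allocated only when issued (P10) */
	struct io_request *next;	/* link used while queued (P11) */
};

struct io_limits {
	unsigned max_ios;		/* tunable: outstanding I/Os (P8) */
	size_t max_bytes;		/* tunable: total buffer memory (P8) */
	unsigned ios_in_flight;
	size_t bytes_in_flight;
	struct io_request *queue_head;
	struct io_request *queue_tail;
};

/* Issue the request if both thresholds allow it.  Returns 1 if issued. */
int request_try_issue(struct io_limits *l, struct io_request *r)
{
	if (l->ios_in_flight >= l->max_ios ||
	    r->len > l->max_bytes - l->bytes_in_flight)
		return 0;
	if (!(r->buf = malloc(r->len)))
		return 0;	/* sketch only: treated as "cannot issue yet" */

	l->ios_in_flight++;
	l->bytes_in_flight += r->len;
	/* A real implementation would hand the I/O to the kernel here. */

	return 1;
}

/* Returns 1 if the request was issued immediately, 0 if it was queued. */
int request_submit(struct io_limits *l, struct io_request *r)
{
	assert(r->len > 0 && r->len <= l->max_bytes);

	r->buf = NULL;
	r->next = NULL;

	/* Requests that are already waiting go first. */
	if (!l->queue_head && request_try_issue(l, r))
		return 1;

	if (l->queue_tail)
		l->queue_tail->next = r;
	else
		l->queue_head = r;
	l->queue_tail = r;

	return 0;
}

/* Call when an issued request completes: release it and drain the queue. */
void request_complete(struct io_limits *l, struct io_request *r)
{
	struct io_request *next;

	free(r->buf);
	r->buf = NULL;
	l->ios_in_flight--;
	l->bytes_in_flight -= r->len;

	while ((next = l->queue_head) && request_try_issue(l, next)) {
		l->queue_head = next->next;
		if (!l->queue_head)
			l->queue_tail = NULL;
		next->next = NULL;
	}
}
```

Allocating the buffer inside request_try_issue() means that queued
requests hold no buffer memory, so the memory limit only has to account
for I/O that is actually outstanding.  A real implementation would also
need to fail a request cleanly when allocation fails rather than leave it
queued.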


Future work
-----------
F1. Perform a complete review of the error tracking so that device
failures are handled and reported more cleanly, extending the existing
basic error counting mechanism.

F2. Consider whether some of the nested callbacks can be eliminated,
which would allow for additional simplifications.

F3. Adjust the contents of the ad-hoc context structs into more logical
arrangements and use them more widely.

F4. Perform wider refactoring of these areas of code.


Testing considerations
----------------------
T1. The changes touch code on the device path, so a thorough re-test of
the device layer is required.  The new code needs a full audit down
through the library layer into the kernel to check that all the error
conditions that are currently implemented (such as EAGAIN) are handled
sensibly. (LVM's I/O layer needs to remain as solid as we can make it.)

T2. The current test suite provides a reasonably broad range of coverage
of this area but is far from comprehensive.


Acceptance criteria
-------------------
A1. The current test suite should pass to the same extent as before the
changes.

A2. When all debugging and logging is disabled, strace -c must show
improvements, e.g. the expected reduction in the number of reads.

A3. Running a range of commands under valgrind must not reveal any
new leaks due to the changes.

A4. All new Coverity reports from the change must be addressed.

A5. CPU time should be similar to that before, as the same work
is being done overall, just in a different order.

A6. Tests need to show improved behaviour in targeted areas.  For example,
if several devices are slow and time out, the delays should occur
in parallel and the elapsed time should be less than before.


Release considerations
----------------------
R1. Async I/O should be widely available and largely reliable on Linux
nowadays (even though parts of its interface and implementation remain a
matter of controversy) so we should try to make its use the default
wherever it is supported.  If certain types of systems have problems we
should try to detect those cases and disable it automatically there.

R2. Because the implications of an unexpected problem in the new code
could be severe for the people affected, the roll out needs to be gentle
without a deadline to allow us plenty of time to gain confidence in the
new code.  Our own testing will only be able to cover a tiny fraction of
the different setups our users have, so we need to look out for problems
caused by this proactively and encourage people to test it on their own
systems and report back.  It must go into the tree near the start of a
release cycle rather than at the end to provide time for our confidence
in it to grow.