Coherent Accelerator Interface (CXL)
====================================
Introduction
============
The coherent accelerator interface is designed to allow the
coherent connection of accelerators (FPGAs and other devices) to a
POWER system. These devices need to adhere to the Coherent
Accelerator Interface Architecture (CAIA).
IBM refers to this as the Coherent Accelerator Processor Interface
or CAPI. In the kernel it's referred to by the name CXL to avoid
confusion with the ISDN CAPI subsystem.
Coherent in this context means that the accelerator and CPUs can
both access system memory directly and with the same effective
addresses.
Hardware overview
=================
          POWER8               FPGA
       +----------+        +---------+
       |          |        |         |
       |   CPU    |        |   AFU   |
       |          |        |         |
       |          |        |         |
       |          |        |         |
       +----------+        +---------+
       |   PHB    |        |         |
       |   +------+        |   PSL   |
       |   | CAPP |<------>|         |
       +---+------+  PCIE  +---------+
The POWER8 chip has a Coherently Attached Processor Proxy (CAPP)
unit which is part of the PCIe Host Bridge (PHB). This is managed
by Linux by calls into OPAL. Linux doesn't directly program the
CAPP.
The FPGA (or coherently attached device) consists of two parts:
the POWER Service Layer (PSL) and the Accelerator Function Unit
(AFU). The AFU is used to implement specific functionality behind
the PSL. The PSL, among other things, provides memory address
translation services to allow each AFU direct access to userspace
memory.
The AFU is the core part of the accelerator (e.g. the compression
or crypto function). The kernel has no knowledge of the function
of the AFU. Only userspace interacts directly with the AFU.
The PSL provides the translation and interrupt services that the
AFU needs. This is what the kernel interacts with. For example, if
the AFU needs to read a particular effective address, it sends
that address to the PSL, the PSL then translates it, fetches the
data from memory and returns it to the AFU. If the PSL has a
translation miss, it interrupts the kernel and the kernel services
the fault. The context in which this fault is serviced is based on
who owns that acceleration function.
AFU Modes
=========
There are two programming modes supported by the AFU: dedicated
and AFU directed. An AFU may support one or both modes.
When using dedicated mode, only one MMU context is supported. In
this mode, only one userspace process can use the accelerator at
a time.
When using AFU directed mode, up to 16K simultaneous contexts can
be supported. This means up to 16K simultaneous userspace
applications may use the accelerator (although specific AFUs may
support fewer). In this mode, the AFU sends a 16-bit context ID
with each of its requests. This tells the PSL which context is
associated with each operation. If the PSL can't translate an
operation, the ID can also be accessed by the kernel so it can
determine the userspace context associated with an operation.
MMIO space
==========
A portion of the accelerator MMIO space can be directly mapped
from the AFU to userspace. Either the whole space can be mapped or
just a per context portion. The hardware is self-describing, hence
the kernel can determine the offset and size of the per context
portion.
Interrupts
==========
AFUs may generate interrupts that are destined for userspace. These
are received by the kernel as hardware interrupts and passed on to
userspace by a read syscall documented below.
Data storage faults and error interrupts are handled by the kernel
driver.
Work Element Descriptor (WED)
=============================
The WED is a 64-bit parameter passed to the AFU when a context is
started. Its format is up to the AFU, hence the kernel has no
knowledge of what it represents. Typically it will be the
effective address of a work queue or status block where the AFU
and userspace can share control and status information.
User API
========
For AFUs operating in AFU directed mode, two character device
files will be created. /dev/cxl/afu0.0m will correspond to a
master context and /dev/cxl/afu0.0s will correspond to a slave
context. Master contexts have access to the full MMIO space an
AFU provides. Slave contexts have access to only the per process
MMIO space an AFU provides.
For AFUs operating in dedicated process mode, the driver will
only create a single character device per AFU called
/dev/cxl/afu0.0d. This will have access to the entire MMIO space
that the AFU provides (like master contexts in AFU directed).
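As a rough illustration of the naming scheme above, the following C
sketch builds the character device path for a given AFU. The names
afu_dev_path and enum afu_node_kind are hypothetical helpers invented
for this example, and reading the two numbers in afuX.Y as a card
index and an AFU index is an assumption; none of this is part of the
driver API.

```c
#include <stdio.h>

/* Hypothetical helper type for this sketch; not part of the driver API. */
enum afu_node_kind {
	AFU_NODE_MASTER    = 'm',	/* AFU directed, full MMIO space */
	AFU_NODE_SLAVE     = 's',	/* AFU directed, per process MMIO space */
	AFU_NODE_DEDICATED = 'd',	/* dedicated process mode */
};

/*
 * Format "/dev/cxl/afu<card>.<afu><suffix>" into buf.
 * Returns 0 on success, -1 if buf is too small or formatting fails.
 */
static int afu_dev_path(char *buf, size_t len, unsigned int card,
			unsigned int afu, enum afu_node_kind kind)
{
	int n = snprintf(buf, len, "/dev/cxl/afu%u.%u%c", card, afu,
			 (char)kind);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, afu_dev_path(path, sizeof(path), 0, 0, AFU_NODE_SLAVE)
fills path with /dev/cxl/afu0.0s, which can then be passed to open(2).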
The types described below are defined in include/uapi/misc/cxl.h.
The following file operations are supported on both slave and
master devices.
open
----
Opens the device and allocates a file descriptor to be used with
the rest of the API.
A dedicated mode AFU only has one context and only allows the
device to be opened once.
An AFU directed mode AFU can have many contexts; the device can be
opened once for each context that is available.
When all available contexts are allocated the open call will fail
and return -ENOSPC.
Note: IRQs need to be allocated for each context, which may limit
the number of contexts that can be created, and therefore
how many times the device can be opened. The POWER8 CAPP
supports 2040 IRQs and 3 are used by the kernel, so 2037 are
left. If 1 IRQ is needed per context, then only 2037
contexts can be allocated. If 4 IRQs are needed per context,
then only 2037/4 = 509 contexts can be allocated.
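The arithmetic in the note above can be written as a small C helper.
This is only a sketch: the macro and function names are made up for
illustration, and the result is an upper bound from the IRQ budget
alone, since an AFU may support fewer contexts for other reasons.

```c
/* Figures quoted in the note above for the POWER8 CAPP. */
#define CAPP_TOTAL_IRQS		2040u
#define CAPP_KERNEL_IRQS	3u

/*
 * Upper bound on the number of contexts imposed by the IRQ budget.
 * Returns 0 when irqs_per_context is 0, because the IRQ budget then
 * places no meaningful bound on the context count.
 */
static unsigned int max_contexts_for_irqs(unsigned int irqs_per_context)
{
	if (irqs_per_context == 0)
		return 0;
	/* Integer division rounds down: a partial context is no context. */
	return (CAPP_TOTAL_IRQS - CAPP_KERNEL_IRQS) / irqs_per_context;
}
```

With these figures, max_contexts_for_irqs(1) gives 2037 and
max_contexts_for_irqs(4) gives 509, matching the numbers in the note.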
ioctl
-----
CXL_IOCTL_START_WORK:
Starts the AFU context and associates it with the current
process. Once this ioctl is successfully executed, all memory
mapped into this process is accessible to this AFU context
using the same effective addresses. No additional calls are
required to map/unmap memory. The AFU memory context will be
updated as userspace allocates and frees memory. This ioctl
returns once the AFU context is started.
Takes a pointer to a struct cxl_ioctl_start_work:
        struct cxl_ioctl_start_work {
                __u64 flags;
                __u64 work_element_descriptor;
                __u64 amr;
                __s16 num_interrupts;
                __s16 reserved1;
                __s32 reserved2;
                __u64 reserved3;
                __u64 reserved4;
                __u64 reserved5;
                __u64 reserved6;
        };
flags:
Indicates which optional fields in the structure are
valid.
work_element_descriptor:
The Work Element Descriptor (WED) is a 64-bit argument
defined by the AFU. Typically this is an effective
address pointing to an AFU specific structure
describing what work to perform.
amr:
Authority Mask Register (AMR), same as the powerpc
AMR. This field is only used by the kernel when the
corresponding CXL_START_WORK_AMR value is specified in
flags. If not specified the kernel will use a default
value of 0.
num_interrupts:
Number of userspace interrupts to request. This field
is only used by the kernel when the corresponding
CXL_START_WORK_NUM_IRQS value is specified in flags.
If not specified the minimum number required by the
AFU will be allocated. The min and max number can be
obtained from sysfs.
reserved fields:
For ABI padding and future extensions
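The field rules above can be sketched in C as follows. To keep the
example compilable on its own it uses a local mirror of the structure
and its own flag macros; real code would include <misc/cxl.h> and use
struct cxl_ioctl_start_work, CXL_START_WORK_AMR and
CXL_START_WORK_NUM_IRQS instead. The numeric flag values and the
helper name prepare_start_work are assumptions for illustration and
should be checked against the header.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_ioctl_start_work, for this sketch only. */
struct start_work_sketch {
	uint64_t flags;
	uint64_t work_element_descriptor;
	uint64_t amr;
	int16_t  num_interrupts;
	int16_t  reserved1;
	int32_t  reserved2;
	uint64_t reserved3;
	uint64_t reserved4;
	uint64_t reserved5;
	uint64_t reserved6;
};

/* Assumed flag values; verify against include/uapi/misc/cxl.h. */
#define SKETCH_START_WORK_AMR		0x1ULL
#define SKETCH_START_WORK_NUM_IRQS	0x2ULL

/*
 * Fill in a start-work request. wed is whatever the AFU expects,
 * typically a pointer to a work queue. amr is only marked valid when
 * use_amr is non-zero. A negative num_irqs leaves num_interrupts
 * unset so that the kernel allocates the AFU's minimum.
 */
static void prepare_start_work(struct start_work_sketch *w, const void *wed,
			       int use_amr, uint64_t amr, int num_irqs)
{
	/* Zero everything, including the reserved fields. */
	memset(w, 0, sizeof(*w));
	w->work_element_descriptor = (uint64_t)(uintptr_t)wed;
	if (use_amr) {
		w->flags |= SKETCH_START_WORK_AMR;
		w->amr = amr;
	}
	if (num_irqs >= 0) {
		w->flags |= SKETCH_START_WORK_NUM_IRQS;
		w->num_interrupts = (int16_t)num_irqs;
	}
}
```

With the real header, the filled structure would then be handed to
the driver as ioctl(fd, CXL_IOCTL_START_WORK, &work).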
CXL_IOCTL_GET_PROCESS_ELEMENT:
Get the current context id, also known as the process element.
The value is returned from the kernel as a __u32.
mmap
----
An AFU may have an MMIO space to facilitate communication with the
AFU. If it does, the MMIO space can be accessed via mmap. The size
and contents of this area are specific to the particular AFU. The
size can be discovered via sysfs.
In AFU directed mode, master contexts are allowed to map all of
the MMIO space and slave contexts are allowed to only map the per
process MMIO space associated with the context. In dedicated
process mode the entire MMIO space can always be mapped.
This mmap call must be done after the START_WORK ioctl.
Care should be taken when accessing MMIO space. Only 32-bit and
64-bit accesses are supported by POWER8. Also, the AFU will be
designed with a specific endianness, so all MMIO accesses should
take endianness into account (the endian(3) helpers such as
le64toh() and be64toh() are recommended). These endian issues apply
equally to shared memory queues the WED may describe.
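A minimal sketch of endian-aware 64-bit register reads, assuming a
glibc-style <endian.h>. The function names and the idea of addressing
registers by byte offset are illustrative assumptions; which of the
two variants applies depends on how the particular AFU was designed.

```c
#define _DEFAULT_SOURCE		/* exposes be64toh()/le64toh() in glibc */
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read a 64-bit big endian register at byte offset off from a mapped
 * MMIO region. off must be a multiple of 8. The volatile access stops
 * the compiler from merging, splitting or dropping the load.
 */
static uint64_t mmio_read64_be(const volatile void *base, size_t off)
{
	const volatile uint64_t *reg =
		(const volatile uint64_t *)((const volatile char *)base + off);
	uint64_t raw = *reg;	/* exactly one 64-bit load */

	return be64toh(raw);
}

/* Same as above for an AFU that exposes little endian registers. */
static uint64_t mmio_read64_le(const volatile void *base, size_t off)
{
	const volatile uint64_t *reg =
		(const volatile uint64_t *)((const volatile char *)base + off);
	uint64_t raw = *reg;

	return le64toh(raw);
}
```

Here base would be the pointer returned by mmap() on the AFU file
descriptor. Loading the raw value first and converting afterwards
keeps the hardware access to a single 64-bit load regardless of how
the conversion macro is implemented.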
read
----
Reads events from the AFU. Blocks if no events are pending
(unless O_NONBLOCK is supplied). Returns -EIO in the case of an
unrecoverable error or if the card is removed.
read() will always return an integral number of events.
The buffer passed to read() must be at least 4K bytes.
The result of the read will be a buffer of one or more events.
Each event is of type struct cxl_event and of varying size.
        struct cxl_event {
                struct cxl_event_header header;
                union {
                        struct cxl_event_afu_interrupt irq;
                        struct cxl_event_data_storage fault;
                        struct cxl_event_afu_error afu_error;
                };
        };
The struct cxl_event_header is defined as:
        struct cxl_event_header {
                __u16 type;
                __u16 size;
                __u16 process_element;
                __u16 reserved1;
        };
type:
This defines the type of event. The type determines how
the rest of the event is structured. These types are
described below and defined by enum cxl_event_type.
size:
This is the size of the event in bytes including the
struct cxl_event_header. The start of the next event can
be found at this offset from the start of the current
event.
process_element:
Context ID of the event.
reserved field:
For future extensions and padding.
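Because events vary in size, a buffer returned by read() is walked
using the size field of each header. The sketch below counts the
events in such a buffer. It uses a local mirror of struct
cxl_event_header so that it compiles on its own; the function name
count_events and its error convention are assumptions made for this
example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_event_header, for this sketch only. */
struct event_header_sketch {
	uint16_t type;
	uint16_t size;
	uint16_t process_element;
	uint16_t reserved1;
};

/*
 * Count the events in the first len bytes of buf, as returned by
 * read(). Returns -1 if a header is truncated or carries a size that
 * is smaller than the header or runs past the end of the data.
 */
static int count_events(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int n = 0;

	while (off < len) {
		struct event_header_sketch hdr;

		if (len - off < sizeof(hdr))
			return -1;
		/* Copy out the header to avoid unaligned accesses. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.size < sizeof(hdr) || hdr.size > len - off)
			return -1;
		off += hdr.size;	/* the next event starts here */
		n++;
	}
	return n;
}
```

A real consumer would dispatch on hdr.type inside the loop instead of
only counting. The bounds checks are defensive: read() is documented
to return whole events, so they should never trigger in practice.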
If the event type is CXL_EVENT_AFU_INTERRUPT then the event
structure is defined as:
        struct cxl_event_afu_interrupt {
                __u16 flags;
                __u16 irq; /* Raised AFU interrupt number */
                __u32 reserved1;
        };
flags:
These flags indicate which optional fields are present
in this struct. Currently all fields are mandatory.
irq:
The IRQ number sent by the AFU.
reserved field:
For future extensions and padding.
If the event type is CXL_EVENT_DATA_STORAGE then the event
structure is defined as:
        struct cxl_event_data_storage {
                __u16 flags;
                __u16 reserved1;
                __u32 reserved2;
                __u64 addr;
                __u64 dsisr;
                __u64 reserved3;
        };
flags:
These flags indicate which optional fields are present in
this struct. Currently all fields are mandatory.
addr:
The address that the AFU unsuccessfully attempted to
access. Valid accesses will be handled transparently by the
kernel but invalid accesses will generate this event.
dsisr:
This field gives information on the type of fault. It is a
copy of the DSISR from the PSL hardware when the address
fault occurred. The form of the DSISR is as defined in the
CAIA.
reserved fields:
For future extensions
If the event type is CXL_EVENT_AFU_ERROR then the event structure
is defined as:
        struct cxl_event_afu_error {
                __u16 flags;
                __u16 reserved1;
                __u32 reserved2;
                __u64 error;
        };
flags:
These flags indicate which optional fields are present in
this struct. Currently all fields are mandatory.
error:
Error status from the AFU. Defined by the AFU.
reserved fields:
For future extensions and padding
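The three event layouts above can be told apart by the type field in
the header. The following sketch formats a one-line description of a
single event. It uses local mirrors of the structures so that it
compiles standalone, and the numeric values given for the event types
are an assumption about enum cxl_event_type that should be verified
against include/uapi/misc/cxl.h. The name describe_event is
hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local mirrors of the structures above, for this sketch only. */
struct hdr_sketch   { uint16_t type, size, process_element, reserved1; };
struct irq_sketch   { uint16_t flags, irq; uint32_t reserved1; };
struct fault_sketch { uint16_t flags, reserved1; uint32_t reserved2;
		      uint64_t addr, dsisr, reserved3; };
struct error_sketch { uint16_t flags, reserved1; uint32_t reserved2;
		      uint64_t error; };

/* Assumed numbering of enum cxl_event_type; check the uapi header. */
enum { EV_AFU_INTERRUPT = 1, EV_DATA_STORAGE = 2, EV_AFU_ERROR = 3 };

/*
 * Format a description of the event starting at ev into out.
 * Returns 0 on success, -1 for an unknown type or for a size too
 * small to hold the payload that the type implies.
 */
static int describe_event(const unsigned char *ev, char *out, size_t outlen)
{
	struct hdr_sketch h;
	const unsigned char *body = ev + sizeof(h);
	size_t body_len;

	memcpy(&h, ev, sizeof(h));
	if (h.size < sizeof(h))
		return -1;
	body_len = h.size - sizeof(h);

	if (h.type == EV_AFU_INTERRUPT && body_len >= sizeof(struct irq_sketch)) {
		struct irq_sketch i;

		memcpy(&i, body, sizeof(i));
		snprintf(out, outlen, "ctx %u: AFU interrupt %u",
			 (unsigned)h.process_element, (unsigned)i.irq);
	} else if (h.type == EV_DATA_STORAGE &&
		   body_len >= sizeof(struct fault_sketch)) {
		struct fault_sketch f;

		memcpy(&f, body, sizeof(f));
		snprintf(out, outlen,
			 "ctx %u: fault at 0x%" PRIx64 ", dsisr 0x%" PRIx64,
			 (unsigned)h.process_element, f.addr, f.dsisr);
	} else if (h.type == EV_AFU_ERROR &&
		   body_len >= sizeof(struct error_sketch)) {
		struct error_sketch e;

		memcpy(&e, body, sizeof(e));
		snprintf(out, outlen, "ctx %u: AFU error 0x%" PRIx64,
			 (unsigned)h.process_element, e.error);
	} else {
		return -1;
	}
	return 0;
}
```

Rejecting unknown types, rather than guessing at their layout, leaves
room for event types added in future; a caller can still skip such an
event using the size in its header.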
Sysfs Class
===========
A cxl sysfs class is added under /sys/class/cxl to facilitate
enumeration and tuning of the accelerators. Its layout is
described in Documentation/ABI/testing/sysfs-class-cxl.
Udev rules
==========
The following udev rules could be used to create a symlink to the
most logical chardev to use in any programming mode (afuX.Yd for
dedicated, afuX.Ys for AFU directed), since the API is virtually
identical for each:
        SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
        SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
                  KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b"