This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need
special handling by the unwinder.  It also says whether $27 is really
needed for computing the gp.  The `.mask M' pseudo op says which
registers are saved on the stack, and at what offset in the frame.

Cray code is very very different...

RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
   write-through, and a cache line is transferred from the store buffer to
   the off-chip L2 in as many as 15 cycles on most systems.  This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
   these cycles are pipelined.  Thus, multiply instructions can be issued at
   a rate of one every 21 cycles.

EV5

1. The memory bandwidth of this chip seems excellent, both for loads and
   stores.  Even when the working set is larger than the on-chip L1 and L2
   caches, performance remains almost unaffected.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a measured latency of 14 cycles and an issue rate of 1 each
   10th cycle.  But the exact timing is somewhat confusing.

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
   are memory operations.  This will take at least

	ceil(37/2) [dual issue] + 1 [taken branch] = 20 cycles

   We have 12 memory cycles, plus 4 after-store conflict cycles, for a
   total of 16 data cache cycles, which should be completely hidden in the
   20 issue cycles.  The computation is inherently serial, with these
   dependencies:

	     ldq  ldq
	      \   /\
	(or)  addq  |
	      |\  /  \ |
	      | addq  cmpult
	       \  |      |
	       cmpult    |
	           \    /
	            or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  We could replace the `or'
   with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
   but that might waste a cycle on EV4.  The total depth remains unaffected,
   since cmov has a latency of 2 cycles.

	      addq
	     /    \
	  addq    cmpult
	    |        \
	 cmpult  ->  cmovne
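
   To make the chain concrete, here is the same computation as a C sketch
   (the function and variable names are ours, not GMP's); the final `or'
   is the operation the cmov variant above would replace:

	#include <stdint.h>

	/* One limb of add-with-carry, mirroring the dependency diagram:
	   two addq, two cmpult, and an `or' combining the two partial
	   carries.  Three operations separate carry-in from carry-out.  */
	static uint64_t
	add_step (uint64_t a, uint64_t b, uint64_t cy, uint64_t *rp)
	{
	  uint64_t s = a + b;		/* addq   */
	  uint64_t c1 = s < a;		/* cmpult */
	  uint64_t r = s + cy;		/* addq   */
	  uint64_t c2 = r < cy;		/* cmpult */
	  *rp = r;
	  return c1 | c2;		/* or     */
	}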

   Montgomery has a slightly different way of computing carry that requires
   one less instruction, but has depth 4 (instead of the current 3).  Since
   the code is currently instruction issue bound, Montgomery's idea should
   save us 1/2 cycle per limb, bringing us down to a total of 18 cycles, or
   4.5 cycles/limb.  Unfortunately, this method will not be good for the EV6.

EV6

Here we have a truly parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  One integer multiply instruction can issue each
cycle.  To get optimal speed, we need to pretend we are vectorizing the
code, i.e., minimize the iterative dependencies.

There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling the loop, so
that addresses don't depend heavily on an index variable.  Avoiding
serializing because of carry propagation is trickier; the ultimate
performance of the code will be determined by the number of latency cycles
it takes from accepting carry-in at a vector point until we can generate
carry-out.
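
To illustrate the address arithmetic point, a sketch in C (names made up,
carry handling and the n % 4 tail omitted) of a 4-way unrolled loop: every
access is a constant displacement off a pointer that is bumped once per
iteration, so no address depends on the previous limb's.

	#include <stdint.h>
	#include <stddef.h>

	/* Sketch only: fixed offsets per access, one pointer update per
	   iteration, so address arithmetic never chains limb to limb.  */
	static void
	add_n_unrolled (uint64_t *rp, const uint64_t *up, const uint64_t *vp,
			size_t n)
	{
	  for (; n >= 4; n -= 4)
	    {
	      rp[0] = up[0] + vp[0];
	      rp[1] = up[1] + vp[1];
	      rp[2] = up[2] + vp[2];
	      rp[3] = up[3] + vp[3];
	      rp += 4;  up += 4;  vp += 4;
	    }
	}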

Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiplies only in U1.
CMOV instructions split into two internal instructions, CMOV1 and CMOV2,
but they execute efficiently.  However, CMOV splits the mapping process
(see pg 2-26 in cmpwrgd.pdf), suggesting that a CMOV should always be
placed as the last instruction of an aligned 4-instruction block (?).

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency
for consumers in the opposite cluster. Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.

The shift loops need (per limb):

	1 load     (Lx pipes)
	1 store    (Lx pipes)
	2 shift    (Ux pipes)
	1 iaddlog  (Lx pipes, Ux pipes)

Obviously, since the pipes are evenly loaded, we should get 4 insn/cycle,
or 5/4 = 1.25 cycles/limb.
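
In C, the per-limb work looks like this (a simplified mpn_lshift-style
sketch, assuming 64-bit limbs; the real assembly loop is structured
differently): one load, two shifts, one `or', and one store per limb,
matching the budget above.

	#include <stdint.h>
	#include <stddef.h>

	/* Shift {up,n} left by cnt bits (0 < cnt < 64, n >= 1), store the
	   result to {rp,n}, and return the bits shifted out at the top.  */
	static uint64_t
	lshift (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
	{
	  uint64_t high = up[n - 1];
	  uint64_t retval = high >> (64 - cnt);
	  for (size_t i = n - 1; i > 0; i--)
	    {
	      uint64_t low = up[i - 1];			   /* 1 load         */
	      rp[i] = (high << cnt) | (low >> (64 - cnt)); /* 2 shifts, 1 or */
	      high = low;				   /* 1 store above  */
	    }
	  rp[0] = high << cnt;
	  return retval;
	}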

For mpn_add_n, we currently have

	2 load     (Lx pipes)
	1 store    (Lx pipes)
	5 iaddlog  (Lx pipes, Ux pipes)

Again, we have a perfect balance and will be limited by carry propagation
delays, currently three cycles.  The superoptimizer indicates that there
might be sequences that--using a final cmov--have a carry propagation delay
of just two.  Montgomery's subtraction sequence could perhaps be used, by
complementing some operands.  All in all, we should get down to 2 cycles
without many problems.

For mpn_mul_1, we could do, just like for mpn_add_n:

	not	newlo,notnewlo
	addq	cylimb,newlo,newlo  ||  cmpult	notnewlo,cylimb,cyout
	addq	cyout,newhi,cylimb

and get 2-cycle carry propagation.  The instructions needed will be

	1 ld       (Lx pipes)
	1 st       (Lx pipes)
	2 mul      (U1 pipe)
	4 iaddlog  (Lx pipes, Ux pipes)

	issue1:  addq    not   mul  ld
	issue2:  cmpult  addq  mul  st

Conclusion: no cluster delays and 2-cycle carry delays will give us
2 cycles/limb!
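
The same trick, rendered as a C sketch (variable names follow the snippet
above, not GMP's sources; the function shape is illustrative): the carry
test uses only ~newlo and the incoming cylimb, neither of which depends on
the new sum, so it can issue alongside the addq.

	#include <stdint.h>

	/* Carry-out of cylimb + newlo equals (cylimb > ~newlo), a test
	   that does not depend on the sum itself.  */
	static uint64_t
	mul1_carry_step (uint64_t newlo, uint64_t newhi, uint64_t *cylimb)
	{
	  uint64_t notnewlo = ~newlo;           /* not    newlo,notnewlo        */
	  uint64_t cyout = notnewlo < *cylimb;  /* cmpult notnewlo,cylimb,cyout */
	  uint64_t lowlimb = *cylimb + newlo;   /* addq   cylimb,newlo,newlo    */
	  *cylimb = cyout + newhi;              /* addq   cyout,newhi,cylimb    */
	  return lowlimb;                       /* low limb, ready to store     */
	}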

Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3
cycles/limb, which would be absolutely awesome.
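
For reference, a portable C rendering of the whole addmul_1 loop (a sketch
only, not the actual GMP code; GCC's unsigned __int128 stands in for the
mulq/umulh pair):

	#include <stdint.h>
	#include <stddef.h>

	/* rp[] = rp[] + up[] * v, returning the final carry limb.  */
	static uint64_t
	addmul_1 (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
	{
	  uint64_t cy = 0;
	  for (size_t i = 0; i < n; i++)
	    {
	      unsigned __int128 p = (unsigned __int128) up[i] * v;
	      uint64_t plo = (uint64_t) p;		/* mulq  */
	      uint64_t phi = (uint64_t) (p >> 64);	/* umulh */
	      uint64_t lo = plo + cy;
	      uint64_t c1 = lo < plo;			/* carry from +cy    */
	      uint64_t s = rp[i] + lo;
	      uint64_t c2 = s < lo;			/* carry from +rp[i] */
	      rp[i] = s;
	      cy = phi + c1 + c2;			/* cannot overflow   */
	    }
	  return cy;
	}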

Old, perhaps obsolete addmul_1 dependency diagram (the original needs a
175-column wide screen; only its outline survives here): for each limb, a
ldq feeds a mulq/umulh pair, whose results flow into an
addq/cmpult/addq/cmpult/stq carry chain, with the carry (cy) forwarded
from one limb's chain into the next and the ldq/mulq/umulh for later
limbs issued several cycles ahead of their use.

As many as 6 consecutive points will be under execution simultaneously, or,
if we schedule loads even further away, maybe 7 or 8.  But the number of
live quantities is reasonable, and can easily be satisfied.