AMD K6 MPN SUBROUTINES
This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.
The mmx and k62mmx subdirectories have routines using MMX instructions.  All
K6s have MMX; the separate directories exist only so that ./configure can omit
them if the assembler doesn't support MMX.
STATUS
Times for the loops, with all code and data in L1 cache, are as follows.
                                   cycles/limb

        mpn_add_n/sub_n            3.25 normal, 2.75 in-place

        mpn_mul_1                  6.25
        mpn_add/submul_1           7.65-8.4  (varying with data values)

        mpn_mul_basecase           9.25 cycles/crossproduct (approx)
        mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                   or 9.2 cycles/triangleproduct (approx)

        mpn_divrem_1              20.0
        mpn_mod_1                 20.0
        mpn_divexact_by3          11.0

        mpn_l/rshift               3.0

        mpn_copyi/copyd            1.0

        mpn_com_n                  1.5-1.85  \
        mpn_and/andn/ior/xor_n     1.5-1.75  | varying with
        mpn_iorn/xnor_n            2.0-2.25  | data alignment
        mpn_nand/nior_n            2.0-2.25  /

        mpn_popcount              12.5
        mpn_hamdist               13.0
K6-2 and K6-3 have dual-issue MMX and get the following improvements.
        mpn_l/rshift               1.75

        mpn_copyi/copyd            0.56 or 1.0  \
                                                 |
        mpn_com_n                  1.0-1.2       | varying with
        mpn_and/andn/ior/xor_n     1.2-1.5       | data alignment
        mpn_iorn/xnor_n            1.5-2.0       |
        mpn_nand/nior_n            1.75-2.0      /

        mpn_popcount               9.0
        mpn_hamdist               11.5
Prefetching of sources hasn't yet given any joy. With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster. Results so far are inconsistent. The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.
NOTES
All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.
Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
execute them in both X and Y (and together).
Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).
Write-allocate L1 data cache means prefetching of destinations is unnecessary.
Store queue is 7 entries of 64 bits each.
Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.
OPTIMIZATIONS
Unrolled loops are used to reduce looping overhead. The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.
Sometimes computed jumps into the unrolling are used to handle sizes not a
multiple of the unrolling.  An attractive feature of this is that times
increase smoothly with operand size, but an indirect jump is about 6 cycles
and the setup about another 6, so whether a computed jump is worthwhile
depends on how much faster the unrolled code is than a simple loop.
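As a rough sketch of the non-PIC form of such a computed jump (the label
names, register choices and the 8-bytes-per-chunk code size are invented for
this illustration, not taken from any particular routine):

        C %eax = number of unrolled chunks to skip; each chunk is assumed
        C to be 8 bytes of code here (a made-up figure for the sketch).
        leal    L(entry)(,%eax,8), %edx
        jmp     *%edx                   C indirect jump, roughly 6 cycles

L(entry):
        C ... unrolled limb operations follow ...

In the real routines the entry offset is of course derived from the actual
size of one unrolled chunk.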
Position independent code is implemented using a call to get eip for the
computed jumps.  A ret is always done to balance that call, rather than an
addl $4,%esp or a popl, so the CPU's return address branch prediction stack
stays synchronised with the actual stack in memory.  Such a call still costs
4 to 7 cycles, however.
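A hedged sketch of that pattern, continuing the hypothetical example above
(again the labels and the 8-byte chunk size are assumptions made for
illustration):

        C PIC variant: a call fetches eip, the entry address is computed
        C relative to it, and a plain ret keeps the return-address
        C prediction stack in step with the real stack.
        C %eax = number of unrolled chunks to skip (set up earlier).

        call    L(pic_calc)
L(here):
        jmp     *%edx                   C computed entry point

L(pic_calc):
        leal    (,%eax,8), %edx                 C chunks to skip, 8 code bytes each (assumed)
        addl    $L(entry)-L(here), %edx         C entry label relative to L(here)
        addl    (%esp), %edx                    C plus the pushed eip of L(here)
        ret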
Branch prediction, in absence of any history, will guess forward jumps are
not taken and backward jumps are taken. Where possible it's arranged that
the less likely or less important case is under a taken forward jump.
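For instance (purely illustrative):

        C With no history a forward jump is predicted not taken, so the
        C common path falls through and only the rare carry-out case
        C pays the misprediction penalty.
        jc      L(carry_out)            C unlikely
        C ... common path continues here ...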
MMX
Putting emms or femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).
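Schematically, the end of a routine might then look as follows (an
illustrative sketch, not taken from a particular file; the real code uses
emms or femms according to which chip is targeted):

        movq    %mm0, (%edx)            C final MMX store, left to complete
        popl    %esi                    C ... in parallel with the pops
        popl    %ebx
        femms                           C as late as possible
        ret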
The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations. This isn't done because in gmp programs it's expected that x87
floating point won't be much used and that chances are an mpn routine won't
have been preceded by any x87 code.
CODING
Instructions in general code are shown paired if they can decode and execute
together, meaning two short decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load, and no more than one store.
K6 does some out of order execution, so the pairings aren't essential; they
just show what slots might be available.  When decoding is the limiting
factor, things can be scheduled that might not execute until later.
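For example, a pairing shown in that style (the instructions here are invented
for illustration, not taken from a particular routine):

        C Each group is a pair that can decode and execute together: both
        C short decode, the second not using the first's result, at most
        C one load and one store between them.

        movl    (%ebx), %eax            C load a limb
        movl    %ebp, %ecx              C independent register move

        addl    (%edx), %eax            C this pair's single load
        leal    4(%ebx), %ebx           C pointer update, independent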
NOTES
Code alignment
- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
short decode is inhibited. The cross.pl script detects this.
- loops and branch targets should be aligned to 16 bytes, or ensure at least
2 instructions before a 32 byte boundary. This makes use of the 16 byte
cache in the BTB.
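A typical arrangement in the .asm files, using the ALIGN macro from the asm
support files (the loop body is omitted here):

        ALIGN(16)                       C align the loop top for the BTB
L(top):
        C ... loop body ...
        decl    %ecx
        jnz     L(top)                  C backward jump, predicted taken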
Addressing modes
- (%esi) degrades decoding from short to vector. 0(%esi) doesn't have this
problem, and can be used as an equivalent, or easier is just to use a
different register, like %ebx.
- K6 and pre-CXT core K6-2 have the following problem. (K6-2 CXT and K6-3
have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).
If more than 3 bytes are needed to determine instruction length then
decoding degrades from direct to long, or from long to vector. This
happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
with mod=00 the sib determines whether there's a displacement.
This affects all MMX and 3DNow instructions, and others with an 0F prefix
like movzbl. The modes affected are anything with an index and no
displacement, or an index but no base, and this includes (%esp) which is
really (,%esp,1).
The cross.pl script detects problem cases. The workaround is to always
use a displacement, and to do this with Zdisp if it's zero so the
assembler doesn't discard it.
See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
13-14 and 36-37.
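For example, forcing the zero displacement with the Zdisp macro from the x86
asm support files (the operand form here is illustrative; Zdisp only accepts
the particular instruction forms defined in those macros):

        C Keep an explicit zero displacement on (%esi) so the instruction
        C decodes short; Zdisp emits the bytes itself so the assembler
        C can't optimize the zero away.
        Zdisp(  movl,   0,(%esi), %eax)

The same is done for MMX and other 0F-prefixed instructions that would
otherwise have an index and no displacement.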
Calls
- indirect jumps and calls are not branch predicted; they measure about 6
  cycles.
Various
- adcl         2 cycles of decode, maybe 2 cycles executing in the X pipe
- bsf          12-27 cycles
- emms         5 cycles
- femms        3 cycles
- jecxz        2 cycles taken, 13 not taken (optimization manual says 7 not taken)
- divl         20 cycles back-to-back
- imull        2 decode, 2 execute
- mull         2 decode, 3 execute (optimization manual decoding sample)
- prefetch     2 cycles
- rcll/rcrl    implicit by one bit: 2 cycles
               immediate or %cl count: 11 + 2 per bit for dword
                                       13 + 4 per bit for byte
- setCC        2 cycles
- xchgl        %eax,reg  1.5 cycles, back-to-back (strange)
               reg,reg   2 cycles, back-to-back
REFERENCES
"AMD-K6 Processor Code Optimization Application Note", AMD publication
number 21924, revision D amendment 0, January 2000. This describes K6-2 and
K6-3. Available on-line,
http://www.amd.com/K6/k6docs/pdf/21924.pdf
"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
publication number 21828, revision A amendment 0, August 1997. This is an
older edition of the above document, describing plain K6. Available
on-line,
http://www.amd.com/K6/k6docs/pdf/21828.pdf
"3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
This describes the femms and prefetch instructions, but nothing else from
3DNow has been used. Available on-line,
http://www.amd.com/K6/k6docs/pdf/21928.pdf
"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999. This has some notes on general K6 optimizations as well as
3DNow. Available on-line,
http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf
----------------
Local variables:
mode: text
fill-column: 76
End: