Copyright 1996, 1997, 1999, 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by the
Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
for more details.

You should have received a copy of the GNU Lesser General Public License along
with the GNU MP Library; see the file COPYING.LIB.  If not, write to the Free
Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.





This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need special
handling by the unwinder.  It also says whether $27 is really needed for
computing the gp.  The `.mask M' pseudo op says which registers are saved on
the stack, and at what offset in the frame.
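
For illustration only, here is a minimal sketch of what a non-leaf function's
prologue might look like; the function name, frame size, and register usage
are made up for this example and not taken from any GMP routine:

	.text
	.align	3
	.globl	foo
	.ent	foo
foo:
	.frame	$30, 16, $26
	.mask	0x04000000, -16	# bit 26 set: $26 (ra) is saved, at offset -16
	ldgp	$29, 0($27)	# gp is computed from $27, hence `.prologue 1'
	lda	$30, -16($30)	# allocate the frame
	stq	$26, 0($30)	# save the return address
	.prologue 1		# end of the instructions the unwinder must decode
	...
	ldq	$26, 0($30)
	lda	$30, 16($30)
	ret	$31, ($26), 1
	.end	foo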

Cray T3 code is very very different...


RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is write-
   through, and a cache line is transferred from the store buffer to the off-
   chip L2 in as much as 15 cycles on most systems.  This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
   cycles are pipelined.  Thus, multiply instructions can be issued at a rate
   of one every 21 cycles; a sketch of the scheduling consequence follows this
   list.
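
The following is an illustrative sketch only, not the actual GMP EV4 loop, and
the register numbers are arbitrary.  It shows the consequence of the 21-cycle
multiply issue rate: the next multiply and the first use of a product are
pushed as far from the mulq as possible, with independent loads and adds
filling the gap.

	mulq	$19, $2, $3	# low half of the limb product; long latency
	ldq	$4, 8($17)	# independent: load the next source limb
	addq	$5, $6, $7	# independent: carry/accumulator bookkeeping
	...			# more independent work to cover the latency
	umulh	$19, $2, $8	# high half; issues at best 21 cycles after the mulq
	...
	addq	$3, $7, $9	# first use of the mulq result, scheduled late
	stq	$9, 0($16)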

EV5

1. The memory bandwidth of this chip is good, both for loads and stores.  The
   L1 cache can handle two loads or one store per cycle, but two cycles after a
   store, no ld can issue.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
   are memory operations.  This will take at least
	ceil(37/2) [dual issue] + 1 [taken branch] = 20 cycles
   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 20 issue cycles.
   The computation is inherently serial, with these dependencies:

	       ldq  ldq
		 \  /\
	  (or)   addq |
	   |\   /   \ |
	   | addq  cmpult
	    \  |     |
	     cmpult  |
		 \  /
		  or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs (the per-limb recurrence is
   sketched in code after this list).  We could replace the `or' with a
   cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
   might waste a cycle on EV4.  The total depth remains unaffected, since cmov
   has a latency of 2 cycles.

     addq
     /   \
   addq  cmpult
     |      \
   cmpult -> cmovne

  Montgomery has a slightly different way of computing carry that requires one
  less instruction, but has depth 4 (instead of the current 3).  Since the code
  is currently instruction issue bound, Montgomery's idea should save us 1/2
  cycle per limb, or bring us down to a total of 18 cycles or 4.5 cycles/limb.
  Unfortunately, this method will not be good for the EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the single-
   limb operand into 21-bit chunks and the multi-limb operand into 32-bit
   chunks, and then using FP operations for every 2nd multiply and integer
   operations for the other multiplies.

   But it seems much better to split the single-limb operand into 16-bit
   chunks, since that saves many integer shifts and adds.  See powerpc64/README
   for some more details.
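
For reference from point 3 above, here is a rough per-limb sketch of the carry
recurrence.  Register assignments are symbolic and chosen for the example only;
this is not the scheduled 4-way unrolled mpn_add_n code.

	ldq	$0, 0($17)	# u = up[i]
	ldq	$1, 0($18)	# v = vp[i]
	addq	$0, $1, $2	# s = u + v
	cmpult	$2, $1, $3	# c1 = (s < v): carry out of u + v
	addq	$2, $20, $4	# add the incoming carry (0 or 1) held in $20
	cmpult	$4, $20, $5	# c2 = carry out of adding the incoming carry
	or	$3, $5, $20	# carry-out; addq/cmpult/or is the 3-op carry chain
	stq	$4, 0($16)	# rp[i] = s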

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  One integer multiply instruction can issue each cycle.
To get optimal speed, we need to pretend we are vectorizing the code, i.e.,
minimize the depth of recurrences.  In actual practice, it is never possible to
sustain more than 3.5 insns/cycle due to renaming register constraints.

There are two kinds of dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling the loop, so
that addresses don't depend heavily on an index variable.  Avoiding serializing
because of carry propagation is trickier; the ultimate performance of the code
will be determined by the number of latency cycles it takes from accepting
carry-in at a vector point until we can generate carry-out.
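
As an illustration of the address-arithmetic point (made-up register numbers,
not the actual GMP loop): with 4-way unrolling and fixed displacements, all
loads in an iteration hang off a single base register that is bumped once per
iteration, so no load waits on index arithmetic.

loop:	ldq	$0, 0($17)	# four loads off one base register, fixed displacements
	ldq	$1, 8($17)
	ldq	$2, 16($17)
	ldq	$3, 24($17)
	lda	$17, 32($17)	# single pointer update per unrolled iteration
	subq	$19, 4, $19	# single count update per unrolled iteration
	...			# arithmetic and stores go here
	bne	$19, loop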

Most integer instructions can execute in any of the L0, U0, L1, and U1
pipelines.  Shifts only execute in U0 and U1, and multiplies only in U1.

CMOV instructions are split into two internal instructions, CMOV1 and CMOV2.  A
CMOV splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting that a
CMOV should always be placed as the last instruction of an aligned
4-instruction block (?).

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency for
consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.