Copyright 2003, 2004 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.





			AMD64 MPN SUBROUTINES


This directory contains mpn functions for AMD64 chips.  The code might in
principle also be used for 64-bit Pentiums, but that chip's poor carry
handling makes good performance unlikely; it will probably need completely
separate code, in a subdirectory of its own.


		     RELEVANT OPTIMIZATION ISSUES

The only AMD64 core as of this writing is the AMD Hammer, sold under the
names Opteron and Athlon64.  The Hammer can sustain up to 3 instructions per
cycle, although in practice that rate is reachable only with integer
instructions.  Almost any combination of three integer instructions can
issue simultaneously, including any 3 ALU operations (shifts included).  Up
to two memory operations can issue each cycle.

Since the Hammer is a deeply out-of-order core, splitting load-use
instructions into separate load and use instructions for scheduling purposes
is rarely a win; it mainly costs extra decode resources.


		      STATUS OF INDIVIDUAL ROUTINES

mpn_addmul_1, mpn_submul_1, mpn_mul_1:

The critical mulq instruction can be issued at a rate of one every 2nd
cycle, but only if either the low or the high half of the product is
ignored.  As soon as the full product is used, every sequence we have tested
needs 3 cycles per mulq.

Considering this core limitation, the current multiplication code runs very
well (cycles/limb):

	mul_1:		3.0
	addmul_1:	3.25
	submul_1:	3.5
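
Purely as an illustration of what the routine computes (not the code in this
directory), here is a plain C sketch of addmul_1 semantics: rp[] += up[] * v,
returning the carry-out limb.  The function name is made up, and the use of
unsigned __int128 assumes GCC or clang on a 64-bit target.  The full 128-bit
product corresponds to the rdx:rax pair produced by mulq, and using both
halves is what forces the 3-cycle rate mentioned above.

  typedef unsigned long mp_limb_t;

  mp_limb_t
  ref_addmul_1 (mp_limb_t *rp, const mp_limb_t *up, long n, mp_limb_t v)
  {
    mp_limb_t cy = 0;
    for (long i = 0; i < n; i++)
      {
        /* full 64x64->128 product, like mulq's rdx:rax pair */
        unsigned __int128 prod = (unsigned __int128) up[i] * v;
        /* add the low half into rp[i] together with the incoming carry */
        unsigned __int128 sum =
          (unsigned __int128) rp[i] + (mp_limb_t) prod + cy;
        rp[i] = (mp_limb_t) sum;
        /* next carry: high half of the product plus the carry of the add */
        cy = (mp_limb_t) (prod >> 64) + (mp_limb_t) (sum >> 64);
      }
    return cy;
  }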


mpn_add_n, mpn_sub_n:

Currently: 1.75 cycles/limb, using 4-way unrolling.

It seems possible to reach the load/store bandwidth limit of 1.5
cycles/limb, perhaps even with only 2-way unrolling, by making use of
indexed addressing.
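
For reference, the operation itself is just a limb-wise addition with a
propagated carry.  A minimal C sketch (hypothetical name; __int128 assumes
GCC/clang on a 64-bit target):

  typedef unsigned long mp_limb_t;

  mp_limb_t
  ref_add_n (mp_limb_t *rp, const mp_limb_t *up, const mp_limb_t *vp, long n)
  {
    mp_limb_t cy = 0;                     /* the carry the asm keeps in CF */
    for (long i = 0; i < n; i++)
      {
        unsigned __int128 sum = (unsigned __int128) up[i] + vp[i] + cy;
        rp[i] = (mp_limb_t) sum;          /* low 64 bits */
        cy = (mp_limb_t) (sum >> 64);     /* carry out, 0 or 1 */
      }
    return cy;
  }

The assembly keeps cy in the processor carry flag across the whole loop
(adc), so the loop-control instructions have to be ones that leave that flag
alone.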


mpn_lshift, mpn_rshift:

Currently: 2.375 cycles/limb, using 4-way unrolling.

Using integer instructions, 2 cycles/limb seems to be the limit.  It might
be possible to use 128-bit SSE2 instructions for better performance, but
since the Hammer in practice executes only 1 SSE instruction/cycle, that
seems unlikely.

The mpn_lshift code should have a special case for count=1, since that could be
done at 1 cycle/limb.
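
One plausible way to reach 1 cycle/limb for count=1 (an assumption here, not
a description of existing code) is to note that shifting left by one bit is
the same as adding the operand to itself, so the shift can reuse an add/adc
style carry chain.  In C, ignoring the operand-overlap rules the real
mpn_lshift must honour:

  typedef unsigned long mp_limb_t;

  mp_limb_t
  ref_lshift1 (mp_limb_t *rp, const mp_limb_t *up, long n)
  {
    mp_limb_t cy = 0;                     /* bit shifted in at the bottom */
    for (long i = 0; i < n; i++)
      {
        mp_limb_t u = up[i];
        rp[i] = (u << 1) | cy;            /* in asm: adc %rax, %rax */
        cy = u >> 63;                     /* bit shifted out of this limb */
      }
    return cy;                            /* bit shifted out at the top */
  }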


mpn_copyi, mpn_copyd:

Currently: 1 cycle/limb, using 4-way unrolling.  Slow for 1-limb operations,
with somewhat variable performance for small blocks.  The non-unrolled part
of the code uses an unusual method and might need improvement.

Using integer instructions, 1 cycle/limb is the limit.  SSE 128-bit
instructions would surely be a win for 128-bit aligned operations, but one
would probably want to fall back to integer instructions for small or
unaligned operations.  SSE instructions might reach 0.5 cycles/limb.
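
The two routines differ only in the direction they walk the operands, which
is what makes each one safe for its own kind of overlap.  A trivial C
picture (hypothetical names):

  typedef unsigned long mp_limb_t;

  void
  ref_copyi (mp_limb_t *rp, const mp_limb_t *up, long n)
  {
    for (long i = 0; i < n; i++)          /* ascending; ok when rp <= up */
      rp[i] = up[i];
  }

  void
  ref_copyd (mp_limb_t *rp, const mp_limb_t *up, long n)
  {
    for (long i = n - 1; i >= 0; i--)     /* descending; ok when rp >= up */
      rp[i] = up[i];
  }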


logops:

All logops could be made to run at 1.5 cycles/limb using integer
instructions.  SSE operations could perhaps reach 0.75 cycles/limb, but only
if all 3 operands are 128-bit aligned.  Since each operand is 128-bit
aligned only about half the time, all three are aligned in only about 1/8 of
the calls, and that win doesn't seem worth the implementation effort.


mpn_divrem_1, mpn_mod_1, mpn_preinv_divrem_1, mpn_preinv_mod_1:

The current divrem_1.asm code needs 17 cycles/limb.  Experimental versions
need only 15 cycles/limb.

While this is the best divrem_1 performance of any chip so far, even better
performance seems attainable by using a two-limb inverse and developing two
quotient limbs per iteration.
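
As a baseline for what these routines compute, a straightforward C version
of the division (ignoring the fraction-limb argument of the real interface;
the name and the use of __int128 are inventions here) looks like the sketch
below.  The per-limb division is what the assembly avoids by multiplying
with a precomputed inverse of d, which is what the preinv entry points above
take as an argument; the two-limb inverse mentioned above would extend that
idea to produce two quotient limbs per iteration.

  typedef unsigned long mp_limb_t;

  mp_limb_t
  ref_divrem_1 (mp_limb_t *qp, const mp_limb_t *np, long n, mp_limb_t d)
  {
    mp_limb_t r = 0;                      /* running remainder, always < d */
    for (long i = n - 1; i >= 0; i--)     /* most significant limb first */
      {
        unsigned __int128 num = ((unsigned __int128) r << 64) | np[i];
        qp[i] = (mp_limb_t) (num / d);    /* one 128/64 division per limb */
        r = (mp_limb_t) (num % d);
      }
    return r;
  }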


REFERENCES

"System V Application Binary Interface AMD64 Architecture Processor
Supplement", draft version 0.90, April 2003.