author    tege <tege@gmplib.org>  2001-11-20 16:56:00 +0100
committer tege <tege@gmplib.org>  2001-11-20 16:56:00 +0100
commit    42751b120569c53be5ec14b74ded87462ea2a550 (patch)
tree      bc2410332de96f9b9dbdea6f9246b0d9fbe11f95 /mpn
parent    0a87d21d70bf2e5c61f64106594066909fae8bbe (diff)
download  gmp-42751b120569c53be5ec14b74ded87462ea2a550.tar.gz
*** empty log message ***
Diffstat (limited to 'mpn')
-rw-r--r--  mpn/x86/pentium4/README | 17
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/mpn/x86/pentium4/README b/mpn/x86/pentium4/README
index 72f037c74..e9f7b0f24 100644
--- a/mpn/x86/pentium4/README
+++ b/mpn/x86/pentium4/README
@@ -54,15 +54,20 @@ The shifts ought to be able to go at 1.5 c/l, but not much effort has been
applied to them yet.
In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
-calls, suffer from hardware slowdowns associated with write combining and
-movd reads and writes to the same or nearby locations. Software movq and
-splitting/combining seems to require too many extra instructions to help.
-Perhaps future chip steppings will be better.
+calls, suffer from pipeline anomalies associated with write combining and
+movd reads and writes to the same or nearby locations. The movq
+instructions do not trigger the same hardware problems. Unfortunately,
+using movq and splitting/combining seems to require too many extra
+instructions to help. Perhaps future chip steppings will be better.
NOTES
+The Pentium-4 "Netburst" pipeline provides quite a number of surprises.
+Many traditional x86 instructions run very slowly, requiring use of
+alternative instructions for acceptable performance.
+
adcl and sbbl are quite slow at 8 cycles for reg->reg. paddq of 32-bits
within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry is still a 4 cycle latency.
@@ -71,6 +76,10 @@ incl and decl should be avoided, instead use add $1 and sub $1. Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.
+shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
+integer instructions (addl, subl, orl, andl, and some more). shldl and
+shrdl seem to have around 13 cycle latency.
+
movq mmx -> mmx does have a 6 cycle latency, as noted in the documentation.
A pxor/por or similar combination at 2 cycle latency can be used instead.
The movq however executes in the float unit, thereby saving MMX execution