author    tege <tege@gmplib.org>  2001-11-20 16:56:00 +0100
committer tege <tege@gmplib.org>  2001-11-20 16:56:00 +0100
commit    42751b120569c53be5ec14b74ded87462ea2a550 (patch)
tree      bc2410332de96f9b9dbdea6f9246b0d9fbe11f95 /mpn
parent    0a87d21d70bf2e5c61f64106594066909fae8bbe (diff)
download  gmp-42751b120569c53be5ec14b74ded87462ea2a550.tar.gz
*** empty log message ***
Diffstat (limited to 'mpn')
-rw-r--r--  mpn/x86/pentium4/README | 17
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/mpn/x86/pentium4/README b/mpn/x86/pentium4/README
index 72f037c74..e9f7b0f24 100644
--- a/mpn/x86/pentium4/README
+++ b/mpn/x86/pentium4/README
@@ -54,15 +54,20 @@ The shifts ought to be able to go at 1.5 c/l, but not much effort has been
applied to them yet.
In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
-calls, suffer from hardware slowdowns associated with write combining and
-movd reads and writes to the same or nearby locations. Software movq and
-splitting/combining seems to require too many extra instructions to help.
-Perhaps future chip steppings will be better.
+calls, suffer from pipeline anomalies associated with write combining and
+movd reads and writes to the same or nearby locations. The movq
+instructions do not trigger the same hardware problems. Unfortunately,
+using movq and splitting/combining seems to require too many extra
+instructions to help. Perhaps future chip steppings will be better.
NOTES
+The Pentium-4 "Netburst" pipeline provides quite a number of surprises.
+Many traditional x86 instructions run very slowly, requiring use of
+alternative instructions for acceptable performance.
+
adcl and sbbl are quite slow at 8 cycles for reg->reg. paddq of 32-bits
within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry is still a 4 cycle latency.
@@ -71,6 +76,10 @@ incl and decl should be avoided, instead use add $1 and sub $1. Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.
+shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
+integer instructions (addl, subl, orl, andl, and some more). shldl and
+shrdl seem to have around 13 cycle latency.
+
movq mmx -> mmx does have a 6 cycle latency, as noted in the documentation.
A pxor/por or similar combination at 2 cycle latency can be used instead.
The movq however executes in the float unit, thereby saving MMX execution