author     tege <tege@gmplib.org>                    2001-11-20 16:56:00 +0100
committer  tege <tege@gmplib.org>                    2001-11-20 16:56:00 +0100
commit     42751b120569c53be5ec14b74ded87462ea2a550
tree       bc2410332de96f9b9dbdea6f9246b0d9fbe11f95 /mpn
parent     0a87d21d70bf2e5c61f64106594066909fae8bbe
download   gmp-42751b120569c53be5ec14b74ded87462ea2a550.tar.gz
*** empty log message ***
Diffstat (limited to 'mpn')
-rw-r--r--  mpn/x86/pentium4/README | 17
1 files changed, 13 insertions, 4 deletions
diff --git a/mpn/x86/pentium4/README b/mpn/x86/pentium4/README
index 72f037c74..e9f7b0f24 100644
--- a/mpn/x86/pentium4/README
+++ b/mpn/x86/pentium4/README
@@ -54,15 +54,20 @@ The shifts ought to be able to go at 1.5 c/l, but not much effort has been
 applied to them yet.
 
 In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
-calls, suffer from hardware slowdowns associated with write combining and
-movd reads and writes to the same or nearby locations.  Software movq and
-splitting/combining seems to require too many extra instructions to help.
-Perhaps future chip steppings will be better.
+calls, suffer from pipeline anomalies associated with write combining and
+movd reads and writes to the same or nearby locations.  The movq
+instructions do not trigger the same hardware problems.  Unfortunately,
+using movq and splitting/combining seems to require too many extra
+instructions to help.  Perhaps future chip steppings will be better.
 
 
 NOTES
 
+The Pentium-4 pipeline, "Netburst", provides quite a number of surprises.
+Many traditional x86 instructions run very slowly, requiring use of
+alternative instructions for acceptable performance.
+
 adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq of 32-bits
 within a 64-bit mmx register seems better, though the combination
 paddq/psrlq when propagating a carry is still a 4 cycle latency.
 
@@ -71,6 +76,10 @@ incl and decl should be avoided, instead use add $1 and sub $1.  Apparently
 the carry flag is not separately renamed, so incl and decl depend on all
 previous flags-setting instructions.
 
+shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
+integer instructions (addl, subl, orl, andl, and some more).  shldl and
+shrdl seem to have around 13 cycle latency.
+
 movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
 pxor/por or similar combination at 2 cycles latency can be used instead.
 The movq however executes in the float unit, thereby saving MMX execution
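For reference, the instruction-selection advice in the notes above can be sketched in AT&T syntax. This is an illustrative fragment only, not code from the GMP sources; the register assignments and surrounding loop structure are hypothetical. It contrasts an adcl-based limb addition with the paddq/psrlq pattern that keeps the carry in an MMX register, and uses sub $1 rather than decl for the counter, as recommended.

```asm
# Hypothetical limb-addition step, AT&T syntax (not from GMP).

# Slow on Pentium-4: adcl is about 8 cycles reg->reg, and the
# carry chain serializes on it.
	movl	(%esi), %eax
	adcl	(%edx), %eax
	movl	%eax, (%edi)

# Pattern suggested by the notes: add 32-bit limbs inside a 64-bit
# MMX register, propagating the carry with psrlq (~4 cycle latency).
	movd	(%esi), %mm1		# 32-bit limb, zero-extended to 64 bits
	movd	(%edx), %mm2
	paddq	%mm2, %mm1		# 32-bit operands cannot overflow 64 bits
	paddq	%mm1, %mm0		# add previous carry, held in %mm0
	movd	%mm0, (%edi)		# store low 32 bits of the sum
	psrlq	$32, %mm0		# shift the carry down for the next limb

# Loop control: sub $1 rather than decl, so the flags result does not
# depend on all previous flags-setting instructions.
	subl	$1, %ecx
```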