Diffstat (limited to 'rts/gmp/mpn/sparc64/README')
-rw-r--r-- | rts/gmp/mpn/sparc64/README | 48 |
1 files changed, 0 insertions, 48 deletions
diff --git a/rts/gmp/mpn/sparc64/README b/rts/gmp/mpn/sparc64/README
deleted file mode 100644
index 6923a133f3..0000000000
--- a/rts/gmp/mpn/sparc64/README
+++ /dev/null
@@ -1,48 +0,0 @@
-This directory contains mpn functions for 64-bit V9 SPARC
-
-RELEVANT OPTIMIZATION ISSUES
-
-The Ultra I/II pipeline executes up to two simple integer arithmetic operations
-per cycle. The 64-bit integer multiply instruction mulx takes from 5 cycles to
-35 cycles, depending on the position of the most significant bit of the 1st
-source operand. It cannot overlap with other instructions. For our use of
-mulx, it will take from 5 to 20 cycles.
-
-Integer conditional move instructions cannot dual-issue with other integer
-instructions. No conditional move can issue 1-5 cycles after a load. (Or
-something such bizzare.)
-
-Integer branches can issue with two integer arithmetic instructions. Likewise
-for integer loads. Four instructions may issue (arith, arith, ld/st, branch)
-but only if the branch is last.
-
-(The V9 architecture manual recommends that the 2nd operand of a multiply
-instruction be the smaller one. For UltraSPARC, they got things backwards and
-optimize for the wrong operand! Really helpful in the light of that multiply
-is incredibly slow on these CPUs!)
-
-STATUS
-
-There is new code in ~/prec/gmp-remote/sparc64. Not tested or completed, but
-the pipelines are worked out. Here are the timings:
-
-* lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb.
-
-* add_n, sub_n: add3.s currently runs at 6 cycles/limb. We use a bizarre
-  scheme of compares and branches (with some nops and fnops to align things)
-  and carefully stay away from the instructions intended for this application
-  (i.e., movcs and movcc).
-
-  Using movcc/movcs, even with deep unrolling, seems to get down to 7
-  cycles/limb.
-
-  The most promising approach is to split operands in 32-bit pieces using
-  srlx, then use two addccc, and finally compile the results with sllx+or.
-  The result could run at 5 cycles/limb, I think. It might be possible to
-  do without unrolling, or with minimal unrolling.
-
-* addmul_1/submul_1: Should optimize for when scalar operand < 2^32.
-* addmul_1/submul_1: Since mulx is horrendously slow on UltraSPARC I/II,
-  Karatsuba's method should save up to 16 cycles (i.e. > 20%).
-* mul_1 (and possibly the other multiply functions): Handle carry in the
-  same tricky way as add_n,sub_n.
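
Note on the add_n/sub_n item in the removed README: the "split operands in
32-bit pieces" idea can be illustrated in portable C. The sketch below is not
the GMP code and not the assembly the README refers to. The function name
add_n_split32 is hypothetical, and the mapping of each step to srlx, addccc
and sllx+or is an assumption based only on the README's wording. It shows the
arithmetic only (two 32-bit additions per limb with the carry passed between
them) and says nothing about the cycle counts quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an add_n-style routine:
   rp[0..n-1] = ap[0..n-1] + bp[0..n-1], returns the carry out (0 or 1).
   Limbs are 64 bits, least significant limb first. */
static uint64_t add_n_split32(uint64_t *rp, const uint64_t *ap,
                              const uint64_t *bp, size_t n)
{
    uint64_t carry = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t a = ap[i];
        uint64_t b = bp[i];

        /* Low 32-bit halves plus incoming carry. The sum needs at most
           33 bits, so bit 32 is the carry into the high half. */
        uint64_t lo = (a & 0xffffffffu) + (b & 0xffffffffu) + carry;

        /* High 32-bit halves (the "srlx" step) plus the carry from the
           low half. Again bit 32 is the carry out of this limb. */
        uint64_t hi = (a >> 32) + (b >> 32) + (lo >> 32);

        /* Recombine the two halves (the "sllx+or" step). Shifting hi
           left by 32 discards its carry bit. */
        rp[i] = (hi << 32) | (lo & 0xffffffffu);

        carry = hi >> 32;
    }
    return carry;
}
```

Each limb's carry is read from bit 32 of an ordinary 64-bit sum, so the loop
needs no compare, branch, or conditional move to recover it. That is the
property the README's note is after.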