summaryrefslogtreecommitdiff
path: root/rts/gmp/mpn/alpha/README
diff options
context:
space:
mode:
Diffstat (limited to 'rts/gmp/mpn/alpha/README')
-rw-r--r--rts/gmp/mpn/alpha/README224
1 files changed, 224 insertions, 0 deletions
diff --git a/rts/gmp/mpn/alpha/README b/rts/gmp/mpn/alpha/README
new file mode 100644
index 0000000000..744260c7c5
--- /dev/null
+++ b/rts/gmp/mpn/alpha/README
@@ -0,0 +1,224 @@
+This directory contains mpn functions optimized for DEC Alpha processors.
+
+ALPHA ASSEMBLY RULES AND REGULATIONS
+
+The `.prologue N' pseudo op marks the end of instruction that needs
+special handling by unwinding. It also says whether $27 is really
+needed for computing the gp. The `.mask M' pseudo op says which
+registers are saved on the stack, and at what offset in the frame.
+
+Cray code is very very different...
+
+
+RELEVANT OPTIMIZATION ISSUES
+
+EV4
+
+1. This chip has very limited store bandwidth. The on-chip L1 cache is
+ write-through, and a cache line is transfered from the store buffer to
+ the off-chip L2 in as much 15 cycles on most systems. This delay hurts
+ mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
+
+2. Pairing is possible between memory instructions and integer arithmetic
+ instructions.
+
+3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
+ these cycles are pipelined. Thus, multiply instructions can be issued at
+ a rate of one each 21st cycle.
+
+EV5
+
+1. The memory bandwidth of this chip seems excellent, both for loads and
+ stores. Even when the working set is larger than the on-chip L1 and L2
+ caches, the performance remain almost unaffected.
+
+2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
+ umulh has a measured latency of 14 cycles and an issue rate of 1 each
+ 10th cycle. But the exact timing is somewhat confusing.
+
+3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, whereof 12
+ are memory operations. This will take at least
+ ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
+ We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
+ cache cycles, which should be completely hidden in the 19 issue cycles.
+ The computation is inherently serial, with these dependencies:
+
+ ldq ldq
+ \ /\
+ (or) addq |
+ |\ / \ |
+ | addq cmpult
+ \ | |
+ cmpult |
+ \ /
+ or
+
+ I.e., 3 operations are needed between carry-in and carry-out, making 12
+ cycles the absolute minimum for the 4 limbs. We could replace the `or'
+ with a cmoveq/cmovne, which could issue one cycle earlier that the `or',
+ but that might waste a cycle on EV4. The total depth remain unaffected,
+ since cmov has a latency of 2 cycles.
+
+ addq
+ / \
+ addq cmpult
+ | \
+ cmpult -> cmovne
+
+Montgomery has a slightly different way of computing carry that requires one
+less instruction, but has depth 4 (instead of the current 3). Since the
+code is currently instruction issue bound, Montgomery's idea should save us
+1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
+cycles/limb. Unfortunately, this method will not be good for the EV6.
+
+EV6
+
+Here we have a really parallel pipeline, capable of issuing up to 4 integer
+instructions per cycle. One integer multiply instruction can issue each
+cycle. To get optimal speed, we need to pretend we are vectorizing the code,
+i.e., minimize the iterative dependencies.
+
+There are two dependencies to watch out for. 1) Address arithmetic
+dependencies, and 2) carry propagation dependencies.
+
+We can avoid serializing due to address arithmetic by unrolling the loop, so
+that addresses don't depend heavily on an index variable. Avoiding
+serializing because of carry propagation is trickier; the ultimate performance
+of the code will be determined of the number of latency cycles it takes from
+accepting carry-in to a vector point until we can generate carry-out.
+
+Most integer instructions can execute in either the L0, U0, L1, or U1
+pipelines. Shifts only execute in U0 and U1, and multiply only in U1.
+
+CMOV instructions split into two internal instructions, CMOV1 and CMOV2, but
+the execute efficiently. But CMOV split the mapping process (see pg 2-26 in
+cmpwrgd.pdf), suggesting the CMOV should always be placed as the last
+instruction of an aligned 4 instruction block (?).
+
+Perhaps the most important issue is the latency between the L0/U0 and L1/U1
+clusters; a result obtained on either cluster has an extra cycle of latency
+for consumers in the opposite cluster. Because of the dynamic nature of the
+implementation, it is hard to predict where an instruction will execute.
+
+The shift loops need (per limb):
+ 1 load (Lx pipes)
+ 1 store (Lx pipes)
+ 2 shift (Ux pipes)
+ 1 iaddlog (Lx pipes, Ux pipes)
+Obviously, since the pipes are very equally loaded, we should get 4 insn/cycle, or 1.25 cycles/limb.
+
+For mpn_add_n, we currently have
+ 2 load (Lx pipes)
+ 1 store (Lx pipes)
+ 5 iaddlog (Lx pipes, Ux pipes)
+
+Again, we have a perfect balance and will be limited by carry propagation
+delays, currently three cycles. The superoptimizer indicates that ther
+might be sequences that--using a final cmov--have a carry propagation delay
+of just two. Montgomery's subtraction sequence could perhaps be used, by
+complementing some operands. All in all, we should get down to 2 cycles
+without much problems.
+
+For mpn_mul_1, we could do, just like for mpn_add_n:
+ not newlo,notnewlo
+ addq cylimb,newlo,newlo || cmpult cylimb,notnewlo,cyout
+ addq cyout,newhi,cylimb
+and get 2-cycle carry propagation. The instructions needed will be
+ 1 ld (Lx pipes)
+ 1 st (Lx pipes)
+ 2 mul (U1 pipe)
+ 4 iaddlog (Lx pipes, Ux pipes)
+issue1: addq not mul ld
+issue2: cmpult addq mul st
+Conclusion: no cluster delays and 2-cycle carry delays will give us 2 cycles/limb!
+
+Last, we have mpn_addmul_1. Almost certainly, we will get down to 3
+cycles/limb, which would be absolutely awesome.
+
+Old, perhaps obsolete addmul_1 dependency diagram (needs 175 columns wide screen):
+
+ i
+ s
+ s i
+ u n
+ e s
+ d t
+ r
+ i u
+l n c
+i s t
+v t i
+e r o
+ u n
+v c
+a t t
+l i y
+u o p
+e n e
+s s s
+ issue
+ in
+ cycle
+ -1 ldq
+ / \
+ 0 | \
+ | \
+ 1 | |
+ | |
+ 2 | | ldq
+ | | / \
+ 3 | mulq | \
+ | \ | \
+ 4 umulh \ | |
+ | | | |
+ 5 | | | | ldq
+ | | | | / \
+ 4calm 6 | | ldq | mulq | \
+ | | / | \ | \
+ 4casm 7 | | / umulh \ | |
+6 | || | | | |
+ 3aal 8 | || | | | | ldq
+7 | || | | | | / \
+ 4calm 9 | || | | ldq | mulq | \
+9 | || | | / | \ | \
+ 4casm 10 | || | | / umulh \ | |
+9 | || | || | | | |
+ 3aal 11 | addq | || | | | | ldq
+9 | // \ | || | | | | / \
+ 4calm 12 \ cmpult addq<-cy | || | | ldq | mulq | \
+13 \ / // \ | || | | / | \ | \
+ 4casm 13 addq cmpult stq | || | | / umulh \ | |
+11 \ / | || | || | | | |
+ 3aal 14 addq | addq | || | | | | ldq
+10 \ | // \ | || | | | | / \
+ 4calm 15 cy ----> \ cmpult addq<-cy | || | | ldq | mulq | \
+13 \ / // \ | || | | / | \ | \
+ 4casm 16 addq cmpult stq | || | | / umulh \ | |
+11 \ / | || | || | | | |
+ 3aal 17 addq | addq | || | | | |
+10 \ | // \ | || | | | |
+ 4calm 18 cy ----> \ cmpult addq<-cy | || | | ldq | mulq
+13 \ / // \ | || | | / | \
+ 4casm 19 addq cmpult stq | || | | / umulh \
+11 \ / | || | || | |
+ 3aal 20 addq | addq | || | |
+10 \ | // \ | || | |
+ 4calm 21 cy ----> \ cmpult addq<-cy | || | | ldq
+ \ / // \ | || | | /
+ 22 addq cmpult stq | || | | /
+ \ / | || | ||
+ 23 addq | addq | ||
+ \ | // \ | ||
+ 24 cy ----> \ cmpult addq<-cy | ||
+ \ / // \ | ||
+ 25 addq cmpult stq | ||
+ \ / | ||
+ 26 addq | addq
+ \ | // \
+ 27 cy ----> \ cmpult addq<-cy
+ \ / // \
+ 28 addq cmpult stq
+ \ /
+As many as 6 consecutive points will be under execution simultaneously, or if we addq
+schedule loads even further away, maybe 7 or 8. But the number of live quantities \
+is reasonable, and can easily be satisfied. cy ---->