author | tege <tege@gmplib.org> | 2002-05-21 18:36:42 +0200 |
---|---|---|
committer | tege <tege@gmplib.org> | 2002-05-21 18:36:42 +0200 |
commit | 37e7a3c00c80dacaa487e0c74c36b93d1748fcc7 (patch) | |
tree | 4219e009fd7e7ab8822e51956b2a8367f6e5d4e6 | |
parent | 9ef9e2b4ec78a88f821da3fc1f875c3d9d87896c (diff) | |
*** empty log message ***
-rw-r--r-- | ChangeLog | 4 | ||||
-rw-r--r-- | mpn/alpha/README | 14 |
2 files changed, 11 insertions, 7 deletions
@@ -21,6 +21,10 @@ MA 02111-1307, USA.
 
 2002-05-21 Torbjorn Granlund <tege@swox.com>
 
+	* mpz/set_str.c: Nailify.
+
+	* randlc2x.c (gmp_randinit_lc_2exp): Nailify.
+
 	From Jakub Jelinek:
 	* longlong.h (add_ssaaaa,sub_ddmmss) [64-bit sparc]: Make it actually
 	work.
diff --git a/mpn/alpha/README b/mpn/alpha/README
index 67ed43220..b2bcd08f4 100644
--- a/mpn/alpha/README
+++ b/mpn/alpha/README
@@ -106,16 +106,16 @@ EV5
 EV6
 
 Here we have a really parallel pipeline, capable of issuing up to 4 integer
-instructions per cycle. One integer multiply instruction can issue each cycle.
-To get optimal speed, we need to pretend we are vectorizing the code, i.e.,
-minimize the depth of recurrences. In actual practice, it is never possible to
-sustain more than 3.5 insns/cycle due to renaming register constraints.
+instructions per cycle. In actual practice, it is never possible to sustain
+more than 3.5 integer insns/cycle due to rename register shortage. One integer
+multiply instruction can issue each cycle. To get optimal speed, we need to
+pretend we are vectorizing the code, i.e., minimize the depth of recurrences.
 
 There are two dependencies to watch out for. 1) Address arithmetic
 dependencies, and 2) carry propagation dependencies.
 
-We can avoid serializing due to address arithmetic by unrolling the loop, so
-that addresses don't depend heavily on an index variable. Avoiding serializing
+We can avoid serializing due to address arithmetic by unrolling loops, so that
+addresses don't depend heavily on an index variable. Avoiding serializing
 because of carry propagation is trickier; the ultimate performance of the code
 will be determined of the number of latency cycles it takes from accepting
 carry-in to a vector point until we can generate carry-out.
@@ -126,7 +126,7 @@ pipelines. Shifts only execute in U0 and U1, and multiply only in U1.
 CMOV instructions split into two internal instructions, CMOV1 and CMOV2. CMOV
 split the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting the CMOV
 should always be placed as the last instruction of an aligned 4 instruction
-block (?).
+block, or perhaps simply avoided.
 
 Perhaps the most important issue is the latency between the L0/U0 and L1/U1
 clusters; a result obtained on either cluster has an extra cycle of latency for