author    tege <tege@gmplib.org>    2002-05-21 18:36:42 +0200
committer tege <tege@gmplib.org>    2002-05-21 18:36:42 +0200
commit    37e7a3c00c80dacaa487e0c74c36b93d1748fcc7 (patch)
tree      4219e009fd7e7ab8822e51956b2a8367f6e5d4e6
parent    9ef9e2b4ec78a88f821da3fc1f875c3d9d87896c (diff)
download  gmp-37e7a3c00c80dacaa487e0c74c36b93d1748fcc7.tar.gz
*** empty log message ***
 ChangeLog        |  4 ++++
 mpn/alpha/README | 14 +++++++-------
 2 files changed, 11 insertions(+), 7 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 2c695a24f..ba5ec890c 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -21,6 +21,10 @@ MA 02111-1307, USA.
2002-05-21 Torbjorn Granlund <tege@swox.com>
+ * mpz/set_str.c: Nailify.
+
+ * randlc2x.c (gmp_randinit_lc_2exp): Nailify.
+
From Jakub Jelinek:
* longlong.h (add_ssaaaa,sub_ddmmss) [64-bit sparc]:
Make it actually work.
diff --git a/mpn/alpha/README b/mpn/alpha/README
index 67ed43220..b2bcd08f4 100644
--- a/mpn/alpha/README
+++ b/mpn/alpha/README
@@ -106,16 +106,16 @@ EV5
EV6
Here we have a really parallel pipeline, capable of issuing up to 4 integer
-instructions per cycle. One integer multiply instruction can issue each cycle.
-To get optimal speed, we need to pretend we are vectorizing the code, i.e.,
-minimize the depth of recurrences. In actual practice, it is never possible to
-sustain more than 3.5 insns/cycle due to renaming register constraints.
+instructions per cycle. In actual practice, it is never possible to sustain
+more than 3.5 integer insns/cycle due to rename register shortage. One integer
+multiply instruction can issue each cycle. To get optimal speed, we need to
+pretend we are vectorizing the code, i.e., minimize the depth of recurrences.
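As a toy C sketch of what "minimize the depth of recurrences" means (not
from this patch or from GMP; the function name and the 4-way split are
arbitrary): summing with four independent accumulators cuts the dependent
add chain from n additions to roughly n/4, exposing parallel work to the
EV6's integer pipes.

    unsigned long
    sum4 (const unsigned long *p, long n)
    {
      unsigned long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      long i;

      /* Four independent recurrences instead of one.  */
      for (i = 0; i + 4 <= n; i += 4)
        {
          s0 += p[i + 0];
          s1 += p[i + 1];
          s2 += p[i + 2];
          s3 += p[i + 3];
        }
      for (; i < n; i++)        /* leftover elements */
        s0 += p[i];

      /* Balanced final reduction keeps the tail shallow too.  */
      return (s0 + s1) + (s2 + s3);
    }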
There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.
-We can avoid serializing due to address arithmetic by unrolling the loop, so
-that addresses don't depend heavily on an index variable. Avoiding serializing
+We can avoid serializing due to address arithmetic by unrolling loops, so that
+addresses don't depend heavily on an index variable. Avoiding serializing
because of carry propagation is trickier; the ultimate performance of the code
will be determined by the number of latency cycles it takes from accepting
carry-in at a vector point until we can generate carry-out.
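A hedged sketch of the unrolling idea (plain C, not actual mpn code; the
4-way factor and names are illustrative): all four loads and stores use
constant offsets from the same pointers, so only one index update per four
limb additions feeds the address arithmetic.  Carry propagation is
deliberately omitted; handling it is the hard part described above.

    /* n assumed a positive multiple of 4; carries ignored.  */
    void
    vec_add4 (unsigned long *rp, const unsigned long *up,
              const unsigned long *vp, long n)
    {
      long i;

      for (i = 0; i < n; i += 4)
        {
          /* Constant offsets from the same bases; i is bumped once
             per four additions, so addresses barely serialize.  */
          rp[i + 0] = up[i + 0] + vp[i + 0];
          rp[i + 1] = up[i + 1] + vp[i + 1];
          rp[i + 2] = up[i + 2] + vp[i + 2];
          rp[i + 3] = up[i + 3] + vp[i + 3];
        }
    }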
@@ -126,7 +126,7 @@ pipelines. Shifts only execute in U0 and U1, and multiply only in U1.
CMOV instructions split into two internal instructions, CMOV1 and CMOV2. CMOV
splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting that CMOV
should always be placed as the last instruction of an aligned 4-instruction
-block (?).
+block, or perhaps simply avoided.
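One way to "simply avoid" CMOV, sketched in C rather than Alpha assembly
(illustrative only; whether it actually wins on EV6 is not claimed here),
is a branch-free select through a computed mask:

    /* Return a if cond is nonzero, else b, with no branch and no CMOV.  */
    unsigned long
    sel (unsigned long cond, unsigned long a, unsigned long b)
    {
      unsigned long mask = -(unsigned long) (cond != 0); /* all ones or zero */
      return (a & mask) | (b & ~mask);
    }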
Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency for