author Kevin Ryde <user42@zip.com.au> 2001-02-01 22:37:46 +0100
committer Kevin Ryde <user42@zip.com.au> 2001-02-01 22:37:46 +0100
commit efbed147941b2e807d2ab17708b4ca8684c9ee37
tree ff4e7eff4a944bb6f00ba7acc76be26cb48d15bd
parent f66effe817b63e940e56a94b1919d5d91160ecae
* tune/README: Misc updates.
-rw-r--r-- tune/README 109
1 files changed, 69 insertions, 40 deletions
diff --git a/tune/README b/tune/README
index e4dec1c15..181af7579 100644
--- a/tune/README
+++ b/tune/README
@@ -10,25 +10,49 @@ The programs here are tools, not ready to run solutions. Nothing is built
in a normal "make all", but various Makefile targets described below exist.
Relatively few systems and CPUs have been tested, so be sure to verify that
-results are sensible.
+results are sensible before relying on them.
MISCELLANEOUS NOTES
-Don't configure with --enable-assert when using the things here, since the
-extra code added by assertion checking may influence measurements.
+--enable-assert
-Some effort has been made to accommodate CPUs with direct mapped caches, but
-it will depend on TMP_ALLOC using a proper alloca, and even then it may or
-may not be enough.
+ Don't configure with --enable-assert, since the extra code added by
+ assertion checking may influence measurements.
-The sparc32/v9 addmul_1 code runs at noticeably different speeds on
-successive sizes, and this has a bad effect on the tune program's
-determinations of the multiply and square thresholds.
+Direct mapped caches
+ Some effort has been made to accommodate CPUs with direct mapped caches,
+ by putting data blocks more or less contiguously on the stack. But this
+ will depend on TMP_ALLOC using alloca, and even then it may or may not
+ be enough.
+sparc32/v9 (eg. ultrasparc under solaris 2.6)
+
+ The sparc32/v9 addmul_1 code runs at noticeably different speeds on
+ successive sizes (mod 4), and this has a bad effect on the tune program
+ determinations of multiply and square thresholds.
+
+FreeBSD 4.2 i486 getrusage
+
+ This getrusage seems to be a bit doubtful, it looks like ru_utime is
+ microsecond accurate, but sometimes ru_utime remains unchanged after a
+ time of many microseconds has elapsed. It'd be good to detect this in
+ the time.c initializations, but for now the suggestion is to pretend
+ getrusage doesn't exist.
+
+ ./configure ac_cv_func_getrusage=no
+
+NetBSD 1.4.1 m68k macintosh time base
+
+ On this system its been found getrusage often goes backwards, making it
+ unusable. And gettimeofday sometimes doesn't update atomically when it
+ crosses a 1 second boundary. Not sure what to do about this. Disable
+ getrusage with ac_cv_func_getrusage=no as per above to try gettimeofday,
+ but don't expect it to work properly.
+
@@ -42,12 +66,14 @@ into gmp-mparam.h. The program is built and run with
If the thresholds indicated are grossly different from the values in the
selected gmp-mparam.h then there may be a performance boost in relevant size
-ranges by changing gmp-mparam.h accordingly.
+ranges by changing gmp-mparam.h accordingly. Do a full reconfigure and
+rebuild to get them to take effect (a partial rebuild might be enough
+sometimes, but a fresh configure and make is certain to be correct).
If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
the mpn subdirectories then the values from "make tune" should be similar.
-Check though that the CPU target is right and there are no machine specific
-effects causing a difference.
+Check though that the configured CPU is right and there are no machine
+specific effects causing a difference.
It's hoped the compiler and options used won't have too much effect on
thresholds, since for most CPUs they ultimately come down to comparisons
@@ -55,10 +81,10 @@ between assembler subroutines. Missing out on the longlong.h macros by not
using gcc will probably have an effect.
Some thresholds produced by the tune program are merely single values chosen
-from what's actually a range of sizes where two algorithms are pretty much
-the same speed. When this happens the program is likely to give somewhat
-different values on successive runs. This is noticeable on the toom3
-thresholds for instance.
+from what's a range of sizes where two algorithms are pretty much the same
+speed. When this happens the program is likely to give somewhat different
+values on successive runs. This is noticeable on the toom3 thresholds for
+instance.
@@ -100,22 +126,27 @@ don't get this since it would upset gnuplot or other data viewers.
TIME BASE
The time measuring method is determined in time.c, based on what the
-configured target has available. A microsecond accurate gettimeofday() will
-work well, but there's code to use better methods, such as the cycle
-counters on various CPUs.
-
-Currently, all methods except possibly the alpha cycle counter depend on the
-machine being otherwise idle, or rather on other jobs not stealing CPU time
-from the measuring program. Short routines (those that complete within a
-timeslice) should work even on a busy machine. Some trouble is taken by
-speed_measure() in common.c to avoid the ill effects of sporadic interrupts,
-or other intermittent things (like cron waking up every minute). But
-generally an idle machine will be necessary to be certain of consistent
-results.
-
-The CPU frequency is needed if times in cycles are to be displayed, and it's
-always needed when using a cycle counter time base. freq.c knows how to get
-the frequency on some systems, but when that fails, or needs to be
+configured target has available. A cycle counter is preferred, possibly
+supplemented by another method if the counter has a limited range. A
+microsecond accurate getrusage() or gettimeofday() will work well.
+
+The cycle counters (except possibly on alpha) and gettimeofday() will depend
+on the machine being otherwise idle, or rather on other jobs not stealing
+CPU time from the measuring program. Short routines (those that complete
+within a timeslice) should work even on a busy machine.
+
+Some trouble is taken by speed_measure() in common.c to avoid ill effects
+from sporadic interrupts, or other intermittent things (like cron waking up
+every minute). But generally an idle machine will be necessary to be
+certain of consistent results.
+
+The CPU frequency is needed to convert between cycles and seconds, or for
+when a cycle counter is supplemented by getrusage() etc. The speed program
+will convert as necessary according to the output format requested. The
+tune program will work with either cycles or seconds.
+
+freq.c knows how to get the frequency on some systems, and can measure a
+cycle counter against gettimeofday(), but when that fails, or needs to be
overridden, an environment variable GMP_CPU_FREQUENCY can be used (in
Hertz). For example in "bash" on a 650 MHz machine,
@@ -124,12 +155,6 @@ Hertz). For example in "bash" on a 650 MHz machine,
A high precision time base makes it possible to get accurate measurements in
a shorter time. Support for systems and CPUs not already covered is wanted.
-When choosing or creating a method, be sure not to claim a higher accuracy
-than is really available. For example the default gettimeofday() code is
-set for microsecond accuracy, but if only 10ms or 55ms is available then
-inconsistent results can be expected.
-
-
@@ -334,6 +359,11 @@ Extensions should be fairly easy to make though. speed-ext.c is an example,
in a style that should suit one-off tests, or new code fragments under
development.
+many.pl is a script for generating a new speed program supplemented with
+alternate versions of the standard routines. It can be used for measuring
+experimental code, or for comparing different implementations that exist
+within a CPU family.
+
@@ -378,8 +408,7 @@ large measurements. Make it able to test each available method, including
perhaps the apparent resolution of each.
Add versions of the generic C mpn_divrem_1 using straight division versus a
-multiply by inverse, so the two can be compared. Include the branch-free
-version of multiply by inverse too.
+multiply by inverse, so the two can be compared.
Make an option in struct speed_parameters to specify operand overlap,
perhaps 0 for none, 1 for dst=src1, 2 for dst=src2, 3 for dst1=src1