author | Kevin Ryde <user42@zip.com.au> | 2001-02-01 22:37:46 +0100 |
---|---|---|
committer | Kevin Ryde <user42@zip.com.au> | 2001-02-01 22:37:46 +0100 |
commit | efbed147941b2e807d2ab17708b4ca8684c9ee37 (patch) | |
tree | ff4e7eff4a944bb6f00ba7acc76be26cb48d15bd | |
parent | f66effe817b63e940e56a94b1919d5d91160ecae (diff) | |
* tune/README: Misc updates.
-rw-r--r-- | tune/README | 109 |
1 files changed, 69 insertions, 40 deletions
diff --git a/tune/README b/tune/README
index e4dec1c15..181af7579 100644
--- a/tune/README
+++ b/tune/README
@@ -10,25 +10,49 @@ The programs here are tools, not ready to run solutions. Nothing is built
 in a normal "make all", but various Makefile targets described below exist.
 
 Relatively few systems and CPUs have been tested, so be sure to verify that
-results are sensible.
+results are sensible before relying on them.
 
 
 
 
 MISCELLANEOUS NOTES
 
-Don't configure with --enable-assert when using the things here, since the
-extra code added by assertion checking may influence measurements.
+--enable-assert
 
-Some effort has been made to accommodate CPUs with direct mapped caches, but
-it will depend on TMP_ALLOC using a proper alloca, and even then it may or
-may not be enough.
+    Don't configure with --enable-assert, since the extra code added by
+    assertion checking may influence measurements.
 
-The sparc32/v9 addmul_1 code runs at noticeably different speeds on
-successive sizes, and this has a bad effect on the tune program's
-determinations of the multiply and square thresholds.
+Direct mapped caches
 
+    Some effort has been made to accommodate CPUs with direct mapped caches,
+    by putting data blocks more or less contiguously on the stack. But this
+    will depend on TMP_ALLOC using alloca, and even then it may or may not
+    be enough.
 
+sparc32/v9 (eg. ultrasparc under solaris 2.6)
+
+    The sparc32/v9 addmul_1 code runs at noticeably different speeds on
+    successive sizes (mod 4), and this has a bad effect on the tune program
+    determinations of multiply and square thresholds.
+
+FreeBSD 4.2 i486 getrusage
+
+    This getrusage seems to be a bit doubtful, it looks like ru_utime is
+    microsecond accurate, but sometimes ru_utime remains unchanged after a
+    time of many microseconds has elapsed. It'd be good to detect this in
+    the time.c initializations, but for now the suggestion is to pretend
+    getrusage doesn't exist.
+
+        ./configure ac_cv_func_getrusage=no
+
+NetBSD 1.4.1 m68k macintosh time base
+
+    On this system its been found getrusage often goes backwards, making it
+    unusable. And gettimeofday sometimes doesn't update atomically when it
+    crosses a 1 second boundary. Not sure what to do about this. Disable
+    getrusage with ac_cv_func_getrusage=no as per above to try gettimeofday,
+    but don't expect it to work properly.
+
 
 
 
@@ -42,12 +66,14 @@ into gmp-mparam.h. The program is built and run with
 
 If the thresholds indicated are grossly different from the values in the
 selected gmp-mparam.h then there may be a performance boost in relevant size
-ranges by changing gmp-mparam.h accordingly.
+ranges by changing gmp-mparam.h accordingly. Do a full reconfigure and
+rebuild to get them to take effect (a partial rebuild might be enough
+sometimes, but a fresh configure and make is certain to be correct).
 
 If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
 the mpn subdirectories then the values from "make tune" should be similar.
-Check though that the CPU target is right and there are no machine specific
-effects causing a difference.
+Check though that the configured CPU is right and there are no machine
+specific effects causing a difference.
 
 It's hoped the compiler and options used won't have too much effect on
 thresholds, since for most CPUs they ultimately come down to comparisons
@@ -55,10 +81,10 @@ between assembler subroutines. Missing out on the longlong.h macros by not
 using gcc will probably have an effect.
 
 Some thresholds produced by the tune program are merely single values chosen
-from what's actually a range of sizes where two algorithms are pretty much
-the same speed. When this happens the program is likely to give somewhat
-different values on successive runs. This is noticeable on the toom3
-thresholds for instance.
+from what's a range of sizes where two algorithms are pretty much the same
+speed. When this happens the program is likely to give somewhat different
+values on successive runs. This is noticeable on the toom3 thresholds for
+instance.
 
 
 
@@ -100,22 +126,27 @@ don't get this since it would upset gnuplot or other data viewers.
 TIME BASE
 
 The time measuring method is determined in time.c, based on what the
-configured target has available. A microsecond accurate gettimeofday() will
-work well, but there's code to use better methods, such as the cycle
-counters on various CPUs.
-
-Currently, all methods except possibly the alpha cycle counter depend on the
-machine being otherwise idle, or rather on other jobs not stealing CPU time
-from the measuring program. Short routines (those that complete within a
-timeslice) should work even on a busy machine. Some trouble is taken by
-speed_measure() in common.c to avoid the ill effects of sporadic interrupts,
-or other intermittent things (like cron waking up every minute). But
-generally an idle machine will be necessary to be certain of consistent
-results.
-
-The CPU frequency is needed if times in cycles are to be displayed, and it's
-always needed when using a cycle counter time base. freq.c knows how to get
-the frequency on some systems, but when that fails, or needs to be
+configured target has available. A cycle counter is preferred, possibly
+supplemented by another method if the counter has a limited range. A
+microsecond accurate getrusage() or gettimeofday() will work well.
+
+The cycle counters (except possibly on alpha) and gettimeofday() will depend
+on the machine being otherwise idle, or rather on other jobs not stealing
+CPU time from the measuring program. Short routines (those that complete
+within a timeslice) should work even on a busy machine.
+
+Some trouble is taken by speed_measure() in common.c to avoid ill effects
+from sporadic interrupts, or other intermittent things (like cron waking up
+every minute). But generally an idle machine will be necessary to be
+certain of consistent results.
+
+The CPU frequency is needed to convert between cycles and seconds, or for
+when a cycle counter is supplemented by getrusage() etc. The speed program
+will convert as necessary according to the output format requested. The
+tune program will work with either cycles or seconds.
+
+freq.c knows how to get the frequency on some systems, and can measure a
+cycle counter against gettimeofday(), but when that fails, or needs to be
 overridden, an environment variable GMP_CPU_FREQUENCY can be used (in
 Hertz). For example in "bash" on a 650 MHz machine,
 
@@ -124,12 +155,6 @@ Hertz). For example in "bash" on a 650 MHz machine,
 A high precision time base makes it possible to get accurate measurements in
 a shorter time. Support for systems and CPUs not already covered is wanted.
 
-When choosing or creating a method, be sure not to claim a higher accuracy
-than is really available. For example the default gettimeofday() code is
-set for microsecond accuracy, but if only 10ms or 55ms is available then
-inconsistent results can be expected.
-
-
 
 
 
@@ -334,6 +359,11 @@ Extensions should be fairly easy to make though. speed-ext.c is an example,
 in a style that should suit one-off tests, or new code fragments under
 development.
 
+many.pl is a script for generating a new speed program supplemented with
+alternate versions of the standard routines. It can be used for measuring
+experimental code, or for comparing different implementations that exist
+within a CPU family.
+
 
 
 
@@ -378,8 +408,7 @@ large measurements. Make it able to test each available method, including
 perhaps the apparent resolution of each.
 
 Add versions of the generic C mpn_divrem_1 using straight division versus a
-multiply by inverse, so the two can be compared. Include the branch-free
-version of multiply by inverse too.
+multiply by inverse, so the two can be compared.
 
 Make an option in struct speed_parameters to specify operand overlap,
 perhaps 0 for none, 1 for dst=src1, 2 for dst=src2, 3 for dst1=src1