These are itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.
- mp_size_t data in several int variables. (It is also really slow. Perhaps just call mpn_divrem_1. See also optimization item below.)
- _mpz_realloc with a small (1 limb) size.
- mpz_XXX(a,a,a).
- For mpf_t numbers with exponents >2^53 on machines with a 64-bit mp_exp_t, the precision of __mp_bases[base].chars_per_bit_exactly is insufficient and mpf_get_str aborts. Detect and compensate. Alternately, think seriously about using some sort of fixed-point integer value. Avoiding unnecessary floating point is probably a good thing in general, and it might be faster on some CPUs.
- mpf_eq is not always correct when one operand is 1000000000... and the other operand is 0111111111..., i.e., extremely close. There is a special case in mpf_sub for this situation; put similar code in mpf_eq.
- mpf_eq doesn't implement what gmp.texi specifies. It should not use just whole limbs, but partial limbs.
- __builtin_constant_p available. Same problem with MacOS X.
- TMP_ALLOC is not reentrant when using stack-alloc.c. Perhaps make it reentrant by default, and leave an --enable-alloca=malloc-nonreentrant option with the current code. A reentrant version will probably be slower, since it can't share malloced blocks across function calls.
- _mpz_realloc will reallocate to less than the current size of an mpz. This is ok internally when a destination size is yet to be updated, but from user code it should probably refuse to go below the size of the current value. Could switch internal uses across to a new function, and tighten up _mpz_realloc.
- The count returned by count_leading_zeros is checked for zero in hundreds of places. Instead, check the most significant bit of the operand, and avoid invoking count_leading_zeros if the bit is set. This is an optimization on all machines, and significant on machines with a slow count_leading_zeros. A sketch of the idea follows.
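A minimal sketch of the idea (illustrative only, not GMP's actual code):

    /* Skip count_leading_zeros when the operand is already normalized,
       i.e. when its most significant bit is set. */
    if ((n >> (BITS_PER_MP_LIMB - 1)) != 0)
      count = 0;                        /* common, fast case */
    else
      count_leading_zeros (count, n);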
- count_trailing_zeros is used on more or less uniformly distributed numbers in a couple of places. For some CPUs count_trailing_zeros is slow, and it's probably worth handling the frequently occurring cases of 0 to 2 trailing zeros specially, as in the sketch below.
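A possible special-casing, as a sketch (not from GMP):

    /* For uniformly distributed n, there are 0 trailing zeros half the
       time, 1 a quarter of the time, and 2 an eighth of the time. */
    if (n & 1)
      count = 0;
    else if (n & 2)
      count = 1;
    else if (n & 4)
      count = 2;
    else
      count_trailing_zeros (count, n);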
- umul_ppmm to use floating-point for generating the most significant limb (if BITS_PER_MP_LIMB <= 52 bits). (Peter Montgomery has some ideas on this subject.)
- umul_ppmm code in longlong.h: Add partial products with fewer operations.
- mpn_get_str and mpn_set_str running in the sub-O(n^2) range, using some divide-and-conquer approach, preferably without using division.
- mpn_get_str should use a fast native mpn_divrem_1 when available (athlon, p6mmx), possibly via a new mpn_preinv_divrem_1 interface. New functions like mpn_divrem_1_norm or mpn_preinv_divrem_1_norm could exist for those targets where pre-shifting helps (p6 maybe).
- mpn_get_str to mpf/get_str. (Talk to Torbjörn about this.)
- mpz_size, mpz_set_ui, mpz_set_q, mpz_clear, mpz_init, mpz_get_ui, mpz_getlimbn, mpz_init_set_ui, mpz_perfect_square_p, mpz_popcount, mpf_size, mpf_get_prec, mpf_set_prec_raw, mpf_set_ui, mpf_init, mpf_init2, mpf_clear, mpf_set_si.
- extern inline in the style of the current mpn_add etc, since applications might be using them as function pointers. The GLIBC trick of first defining a prototype and then a macro looks like it won't work with the GMP __gmpz renamings.
- Implement mpz_neg as

    if (src != dst)
      mpz_set (dst, src);
    dst->_mp_size = - dst->_mp_size;

  so that an in-place mpz_neg(z,z) can be a simple negation of the size field. Likewise for mpz_abs, and for mpq and mpf too.
- mpz_powm and mpz_powm_ui aren't very fast on one or two limb moduli, due to a lot of function call overheads. These could perhaps be handled as special cases.
- mpz_powm and mpz_powm_ui want better algorithm selection, and the latter should use REDC. Both could change to use an mpn_powm and mpn_redc.
- mpz_powm REDC should do multiplications by g[] using the division method when they're small, since the REDC form of a small multiplier is normally a full size product. Probably would need a new tuned parameter to say what size multiplier is "small", as a function of the size of the modulus.
- mpz_powm REDC should handle even moduli if possible. Maybe this would mean for m=n*2^k doing mod n using REDC and an auxiliary calculation mod 2^k, then putting them together at the end, presumably CRT style: from r1 = a mod n and r2 = a mod 2^k, a mod m = r1 + n*((r2-r1)*n^-1 mod 2^k).
- mpn_gcd might be able to be sped up on small to moderate sizes by improving find_a, possibly just by providing an alternate implementation for CPUs with slowish count_leading_zeros.
- USE_MORE_MPN could use a low to high cache-localized evaluate and interpolate. The necessary mpn_divexact_by3c exists.
- mpn_mul_basecase on NxM with big N but small M could try for better cache locality by taking N piece by piece. The current code could be left available for CPUs without caching. Depending on how karatsuba etc is applied to unequal size operands, it might be possible to assume M is always smallish.
- mpn_perfect_square_p on small operands might be better off skipping the residue tests and just taking a square root.
- mpn_perfect_square_p could take a remainder mod 15 instead of mod 3 and mod 5 separately, and use a bit table indicating whether an N mod 15 is a quadratic residue mod both 3 and 5 (sketched below). More combining would work too, like say 3*11 and 5*7, in whatever good combinations, possibly aiming to keep under 64 or 128 bits.
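One way the mod-15 bit table could look (a sketch; the mask value is computed here, not taken from GMP):

    /* Bit i of SQUARE_MOD15_MASK is set iff i is a quadratic residue
       both mod 3 and mod 5, i.e. i is in {0, 1, 4, 6, 9, 10}. */
    #define SQUARE_MOD15_MASK  0x653

    /* Nonzero if n could be a perfect square, judging by r15 = n % 15. */
    #define POSSIBLE_SQUARE_MOD15(r15)  ((SQUARE_MOD15_MASK >> (r15)) & 1)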
- mpz_perfect_power_p could be improved in a number of ways. Test for Nth power residues modulo small primes like mpn_perfect_square_p does. Use p-adic arithmetic to find possible roots. Negatives should be handled, and used to know the power must be odd. Divisibility by powers of 2 should be tested with mpz_scan1 or similar. Divisibility by other primes should be tested by grouping into a limb like PP.
- Make PP/PP_INVERTED into an array of such pairs, listing several hundred primes. Perhaps actually make the products larger than one limb each.
- mpz_probab_prime_p, mpn_perfect_square_p and mpz_perfect_power_p could take a remainder mod 2^24-1 to quickly get remainders mod 3, 5, 7, 13 and 17 (factors of 2^24-1). A remainder mod 2^24-1 can be taken at adder throughput speed; see the folding sketch below.
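A sketch of how such a remainder can be folded with plain additions (an illustrative function, not part of GMP):

    /* Reduce x mod 2^24-1: since 2^24 == 1 (mod 2^24-1), the bits above
       the low 24 can simply be added back in. */
    unsigned long
    mod_2pow24_minus_1 (unsigned long x)
    {
      while (x >> 24)
        x = (x & 0xFFFFFF) + (x >> 24);
      return x == 0xFFFFFF ? 0 : x;
    }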
- mpf_set_str produces low zero limbs when a string has a fraction but is exactly representable, e.g. 0.5 in decimal. These could be stripped to save work in later operations.
- mpz_and, mpz_ior and mpz_xor should use mpn_and etc for the benefit of the small number of targets with native versions of those routines. Need to be careful not to pass size==0. Is some code sharing possible between the mpz routines?
- mpn_bdivmod should use a divide and conquer like the normal division. See "Exact Division with Karatsuba Complexity" by Jebelean for a (brief) description. This will benefit mpz_divexact immediately, and mpn_gcd on large unequal size operands. REDC should be able to use it too.
- restrict'ed pointers: Does the C99 definition of restrict (one writer many readers, or whatever it is) suit the GMP style "same or separate" function parameters? If so, judicious use might improve the code generated a bit. Do any compilers have their own flavour of restrict as "completely unaliased", and is that still usable?
- __attribute__ ((malloc)) could be used on __gmp_allocate_func and on the stack-alloc.c TMP_ALLOC. Don't know if it'd do much. Current pre-release gcc 3 doesn't like attaching function attributes to function pointers like __gmp_allocate_func (see "(gcc)Attribute Syntax"), so this has to wait for the future. A sketch of the attribute on a plain function follows.
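On a plain function (rather than a function pointer) the attribute looks like this (xmalloc is a hypothetical allocator, shown only for illustration):

    #include <stddef.h>

    /* Tells gcc the returned block is not aliased by any other pointer. */
    void *xmalloc (size_t) __attribute__ ((malloc));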
- Compiler-dependent threshold defaults could be selected along these lines:

    #if defined (__GNUC__)
    #if __GNUC__ == 2 && __GNUC_MINOR__ == 7
    ...
    #endif
    #if __GNUC__ == 2 && __GNUC_MINOR__ == 8
    ...
    #endif
    #ifndef KARATSUBA_MUL_THRESHOLD
    /* Default GNUC values */
    ...
    #endif
    #else /* system compiler */
    ...
    #endif
- mpn_addmul_1 and mpn_submul_1. Integrate new mpn_mul_1 code.
- mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21164. This should use both integer multiplies and floating-point multiplies. For the floating-point operations, the single-limb multiplier should be split into three 21-bit chunks, or perhaps even better into four 16-bit chunks. Probably possible to reach 9 cycles/limb.
- mpn_mul_1, mpn_addmul_1, and mpn_submul_1. Should use floating-point operations, and split the invariant single-limb multiplier into 16-bit chunks. Will give 14 cycles/limb, but with a deep pipeline.
- umul_ppmm. Using four "mulx"s either with an asm block or via the generic C code is about 90 cycles. Try using fp operations, and also try using karatsuba for just three "mulx"s.
- mpn_sqr_diagonal. Since we're squaring each limb, it seems best to split limbs into one 22-bit chunk (for the most significant part) and two 21-bit chunks. Things magically align to 64-bit borders with that splitting. Will need 19 memory operations, and won't therefore run faster than at about 20 cycles/limb. (The current umul_ppmm loop of mpn_sqr_basecase runs at around 60 cycles/limb.) If we use VIS fand for splitting operands, just 13 memory operations are needed.
- mpn_lshift, mpn_rshift. Will give 2 cycles/limb. Trivial modifications of mpn/sparc64 should do.
- mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current code runs at 11 cycles/limb. It should be possible to saturate the cache, which will happen at 8 cycles/limb (7.5 for mpn_mul_1). Write special loops for s2 < 2^32; it should be possible to make them run at about 5 cycles/limb.
- mpn_addmul_1, mpn_submul_1, and mpn_mul_1. Use both integer and floating-point operations, possibly two floating-point and one integer limb per loop. Split operands into four 16-bit chunks for fast fp operations. Should easily reach 9 cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb (using one int + two fp).
- mpn_mul_basecase and mpn_sqr_basecase for important machines. Helping the generic sqr_basecase.c with an mpn_sqr_diagonal might be enough for some of the RISCs.
- mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
- mpn_lshift for shifts by 1. (See Pentium code.)
- rep movs would upset GCC register allocation for the whole function. Is this still true in GCC 3? Examine the code for some simple and complex functions to find out. Inlining rep movs would be desirable; it'd be both smaller and faster.
- mpn_divrem_1 might be able to use a mul-by-inverse, hoping for maybe 30 c/l.
- mpn_add_n and mpn_sub_n should be able to go faster than the generic x86 code at 3.5 c/l. The athlon code for instance runs at about 2.7.
- mpn_lshift. The pipeline is now extremely deep, perhaps unnecessarily deep.
- mpn_rshift based on new mpn_lshift.
- mpn_add_n and mpn_sub_n. Should run at just 3.25 cycles/limb.
- mpn_mul_basecase and mpn_sqr_basecase. This should use a "vertical multiplication method" to avoid carry propagation, splitting one of the operands into 11-bit chunks.
- TMP_ALLOC to use them, or introduce a new scheme. Memory blocks wanted unconditionally are easy enough; those wanted only sometimes are a problem. Perhaps a special size calculation to ask for a dummy length 1 when unwanted, or perhaps an inlined subroutine duplicating code under each conditional. Don't really want to turn everything into a dog's dinner just because Cray doesn't offer an alloca.
- count_leading_zeros for 64-bit machines:

    if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
    if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
    ...

  (A completed version of this binary search is sketched below.)
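The completed binary search, as an illustrative sketch for 64-bit limbs (clz64 is not a GMP function):

    /* Count leading zeros of a nonzero 64-bit x by binary search. */
    int
    clz64 (unsigned long long x)
    {
      int cnt = 0;
      if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
      if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
      if ((x >> 56) == 0) { x <<= 8;  cnt += 8; }
      if ((x >> 60) == 0) { x <<= 4;  cnt += 4; }
      if ((x >> 62) == 0) { x <<= 2;  cnt += 2; }
      if ((x >> 63) == 0) { cnt += 1; }
      return cnt;
    }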
- mp?_out_raw and mp?_inp_raw.
- mpz_get_nth_ui. Return the nth word (not necessarily the nth limb).
- mpf_getlimbn similar to mpz_getlimbn, and accepting negative N for fraction limbs. Functions to get the range of limbs would be wanted before this would be useful.
- mpz_crr (Chinese Remainder Reconstruction).
- mpz_init and mpq_init could do lazy allocation. Set ALLOC(var) to 0, and have mpz_realloc special-handle that case. Would mean the size required for the first use is the first allocated, rather than allocating a small size and then reallocing it. Update functions that rely on a single limb, like mpz_set_ui, mpz_{t,f,c}div_{qr,r}_ui, and others. Would need the initial z->_mp_d to point to a dummy initial location, so it can be fetched from by mpz_odd_p and similar macros. A sketch follows.
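A minimal sketch of the lazy scheme (hypothetical; ALLOC/SIZ/PTR stand for the usual gmp-impl.h field accessors, and the shared dummy limb is an assumption):

    static mp_limb_t dummy_limb;      /* never written through _mp_d */

    void
    mpz_init_lazy (mpz_ptr x)         /* hypothetical name */
    {
      ALLOC (x) = 0;                  /* 0 marks "nothing allocated yet" */
      SIZ (x) = 0;
      PTR (x) = &dummy_limb;          /* keeps mpz_odd_p etc safe to read */
    }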
- mpf_out_raw and mpf_inp_raw. Make sure the format is portable between 32-bit and 64-bit machines, and between little-endian and big-endian machines.
- gmp_errno. It should be possible to detect errors at the start of documented entrypoints and invoke an exception handler with a nice clean state, giving it the chance to recover with longjmp or a C++ throw or whatever.
- Allow applications' allocation functions to longjmp or throw a C++ exception on being out of memory (or out of the memory they care to offer GMP). This would need GMP to ensure all variables are in a sensible state whenever attempting an alloc or realloc, and would need some way to record other temporary memory or temporary variables in use which should be freed. The latter might be difficult; perhaps the C++ destructor mechanism would be necessary.
- gmp_fprintf, gmp_sprintf, and gmp_snprintf. Think about some sort of wrapper around printf so it and its several variants don't have to be completely reimplemented. Variants like mpz_printf etc supporting only mpz etc could exist to keep down the amount of library dragged in.
- mpq input and output functions.
- mpn_and_n ... mpn_copyd: Make the mpn logops and copys available in gmp.h, either as library functions or inlines, with the availability of library functions instantiated in the generated gmp.h at build time.
- mpz_set_str etc variants taking string lengths rather than null-terminators.
- Threshold tests should use "<=" rather than "<", so a threshold can be set to MP_SIZE_T_MAX to get only the simpler code (the compiler will know size <= MP_SIZE_T_MAX is always true); see the sketch below. Alternately it looks like the ABOVE_THRESHOLD and BELOW_THRESHOLD macros can do this adequately, and also pick up cases where a threshold of zero should mean only the second algorithm.
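A sketch of the "<=" style (foo and its threshold are made up for illustration):

    extern void foo_basecase (mp_size_t);
    extern void foo_fancy (mp_size_t);

    #define FOO_THRESHOLD  MP_SIZE_T_MAX    /* disables foo_fancy entirely */

    void
    foo (mp_size_t size)
    {
      if (size <= FOO_THRESHOLD)    /* always true here, so the compiler
                                       can drop the else branch */
        foo_basecase (size);
      else
        foo_fancy (size);
    }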
- mpz_powm could handle negative exponents by powering the modular inverse of the base. But what to do if an inverse doesn't exist (base and modulus have common factors)? Throw a divide by zero maybe, or return a flag like mpz_invert does.
- mpz_nthprime.
- mpn_divexact_1 could be created and used for an mpz_divexact_ui and perhaps other purposes. Work on this is in progress.
- mpz_init2, initializing and making initial room for N bits. The actual size would be rounded up to a limb, and perhaps an extra limb added, since so many mpz routines need that on their destination. Usage might look like the sketch below.
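A usage sketch, assuming the proposed interface takes a bit count:

    mpz_t x;
    mpz_init2 (x, 512);   /* initialize x with room for 512 bits */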
- #ifdef mess in gmp-impl.h. This is simple when doing a native compile, but needs a reliable way to examine object files when cross-compiling. Falling back on a run-time test would be reasonable, if build time tests fail.
- _system_configuration.implementation for more accurate CPU determination.
- udiv should be selected according to the CPU target. Currently floating point ends up being used on all sparcs, which is probably not right for generic V7 and V8.
- m68040 etc should be detected; currently it just gives m68k for everything. In particular, detect m68000 or m68010, since they won't like the 68020 instructions we use under an m68k config.
- umul and udiv code not being used. Check all such for bit rot and then put umul and udiv in $gmp_mpn_functions_optional as "standard optional" objects.
- mpn_umul_ppmm and mpn_udiv_qrnnd have a different parameter order than those functions on other CPUs. It might avoid confusion to have them under different names, maybe mpn_umul_ppmm_r or some such. Prototypes then wouldn't be conditionalized, and the appropriate form could be selected with the HAVE_NATIVE scheme if/when the code switches to use a PROLOGUE style.
- mpz_div and mpz_divmod use rounding analogous to mpz_mod. Document, and list as an incompatibility.
- mpz_invert should call mpn_gcdext directly.
- mpz_urandomm should do something for n<=0 (but what?), and the use of mpz_mod looks doubtful (does it suffice to generate numbers of nbits until getting one <n?).
- random / mrand48.
- The test programs use the main libgmp. This establishes good cross-checks, but it might be better to use simple reference routines where possible. Where it's not possible, some attention could be paid to the order of the tests, so a main libgmp is only used to construct tests once it seems to be good.
- TMP_ALLOC could do a separate malloc for each allocated block, to help a redzoning malloc debugger. Perhaps a config option --enable-alloca=debug.
- ASSERTs at the start of each user-visible mpz/mpq/mpf function to check the validity of each mp?_t parameter, in particular to check they've been mp?_init'ed. This might catch elementary mistakes in user programs. Care would need to be taken over MPZ_TMP_INITed variables used internally. If nothing else, consistency checks like size<=alloc, ptr not NULL, and ptr+size not wrapping around the address space would be possible; one form such checks could take is sketched below.
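A sketch of such a consistency check (the macro itself is hypothetical; the ABSIZ/ALLOC/PTR accessors follow gmp-impl.h conventions):

    /* Sanity-check an mpz_t: data pointer not NULL and the current
       size within the allocated space. */
    #define MPZ_SANITY_CHECK(z)                 \
      do {                                      \
        ASSERT (PTR (z) != NULL);               \
        ASSERT (ABSIZ (z) <= ALLOC (z));        \
      } while (0)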
- Document that unsigned long int is used for bit counts/ranges, and that mp_size_t is used for limb counts.
- mpz_inp_str (etc) doesn't say when it stops reading digits.
- Single-limb division (divrem_1 or mod_1) could be done two limbs at a time, by following udiv_qrnnd_preinv but with two limbs everywhere one is used. This would mean 3 or 4 multiplies in each place 1 is currently done, but they'll be independent, so 2 limbs can be processed with the same latency as 1 would have been. This idea would apply to CPUs with long-latency but good-throughput multipliers. Clearly it can be extended to 3 limbs at a time, or 4, or however many, though each time more are used, more time will be taken creating the initial multi-limb inverse (invert_limb style), and the quadratically increasing number of cross-products will at some point see throughput unable to offer a saving over latency.
- libdir settings, or whatever.
- AC_PROG_F77 if such a file is found. Automake will generate some of what's needed in the makefiles, but libtool doesn't know fortran, and so rules like the current ".asm.lo" will be needed (the multi-language branch of libtool doesn't look like it's got fortran yet either).
- mpz_get_si returns 0x80000000 for -0x100000000, whereas it's sort of supposed to return the low 31 (or 63) bits. But this is undocumented, and perhaps not too important.
- mpz_*_ui division routines currently return abs(a%b). Perhaps make them return the real remainder instead? Return type would be signed long int. But this would be an incompatible change, so it might have to be under newly named functions.
- mpz_init_set* and mpz_realloc could allocate say an extra 16 limbs over what's needed, so as to reduce the chance of having to do a reallocate if the mpz_t grows a bit more. This could only be an option, since it'd badly bloat memory usage in applications using many small values.
Please send comments about this page to tege@swox.com. Copyright 1999, 2000, 2001 Torbjörn Granlund.