This file lists itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.
* _mpz_realloc with a small (1 limb) size.
* mpz_XXX(a,a,a).
* mp_exp_t, the precision of __mp_bases[base].chars_per_bit_exactly is insufficient and mpf_get_str aborts. Detect and compensate.
* mpz_*_ui division routines. Perhaps make them return the real remainder instead? Changes return type to signed long int.
* mpf_eq is not always correct, when one operand is 1000000000... and the other operand is 0111111111..., i.e., extremely close. There is a special case in mpf_sub for this situation; put similar code in mpf_eq.
* mpf_eq doesn't implement what gmp.texi specifies. It should not use just whole limbs, but partial limbs.
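  As a minimal illustration of the partial-limb comparison the above calls for (assuming the internal names mp_limb_t and BITS_PER_MP_LIMB from gmp-impl.h; the helper name is made up):

      /* Compare only the `bits' most significant bits of two limbs,
         with 0 < bits < BITS_PER_MP_LIMB.  Illustrative sketch only. */
      static int
      partial_limb_equal (mp_limb_t x, mp_limb_t y, unsigned bits)
      {
        mp_limb_t mask = ~(mp_limb_t) 0 << (BITS_PER_MP_LIMB - bits);
        return ((x ^ y) & mask) == 0;
      }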
* __builtin_constant_p is unavailable? Same problem with MacOS X.
* dump_abort in mp?/tests/*.c.
* mpz_get_si returns 0x80000000 for -0x100000000.
* TMP_ALLOC is not reentrant when using stack-alloc.c. Perhaps make it so by default and leave a --enable-alloca=malloc-nonreentrant with the current code. A reentrant version will probably be slower since it can't share malloced blocks across function calls.
* mpz_scan0 and mpz_scan1 only work in quite limited circumstances and could be improved. Supporting twos complement like the other bitwise functions would be good. A defined return value for no 0 or 1 found would be good too, perhaps ULONG_MAX.
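  A usage sketch of the suggested convention; note that ULONG_MAX as a "not found" value is only the proposal above, not documented behaviour at the time of writing:

      #include <limits.h>
      #include <gmp.h>

      /* Hypothetical: relies on mpz_scan1 returning ULONG_MAX when no
         1 bit is found, as proposed above. */
      int
      lowest_one_bit (const mpz_t x, unsigned long *bit)
      {
        *bit = mpz_scan1 (x, 0);
        return *bit != ULONG_MAX;   /* 0 means x has no 1 bits, i.e. x == 0 */
      }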
* The count returned by count_leading_zeros is checked for zero in hundreds of places. Instead, check the most significant bit of the operand, and avoid invoking count_leading_zeros if the bit is set. This is an optimization on all machines, and significant on machines with a slow count_leading_zeros.
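  A minimal sketch of the suggested fast path, assuming the internal names mp_limb_t, BITS_PER_MP_LIMB and the count_leading_zeros macro from gmp-impl.h / longlong.h are in scope (as they are inside GMP itself):

      static int
      leading_zeros_or_zero (mp_limb_t high)
      {
        int cnt;
        if (high & ((mp_limb_t) 1 << (BITS_PER_MP_LIMB - 1)))
          return 0;                        /* high bit set: skip the count */
        count_leading_zeros (cnt, high);   /* only reached when the count is nonzero */
        return cnt;
      }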
* count_trailing_zeros is used on more or less uniformly distributed numbers in a couple of places. For some CPUs count_trailing_zeros is slow and it's probably worth handling the frequently occurring 0 to 2 trailing zeros cases specially.
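  A minimal sketch of such special-casing, under the same assumptions about the internal macro names:

      static int
      trailing_zeros (mp_limb_t x)
      {
        int cnt;
        if (x & 1)
          return 0;                        /* odd: the most common case */
        if (x & 2)
          return 1;
        if (x & 4)
          return 2;
        count_trailing_zeros (cnt, x);     /* rare: 3 or more trailing zeros */
        return cnt;
      }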
* Change umul_ppmm to use floating-point for generating the most significant limb (if BITS_PER_MP_LIMB <= 52 bits). (Peter Montgomery has some ideas on this subject.)
* Improve the umul_ppmm code in longlong.h: add partial products with fewer operations.
* Get mpn_get_str and mpn_set_str running in the sub-O(n^2) range, using some divide-and-conquer approach, preferably without using division.
* mpn_get_str should use a fast native mpn_divrem_1 when available (athlon, p6mmx), possibly via a new mpn_divrem_1_preinv interface. New functions like mpn_divrem_1_norm or mpn_divrem_1_preinvnorm could exist for those targets where pre-shifting helps (p6 maybe).
* mpn_get_str to mpf/get_str. (Talk to Torbjörn about this.)
* mpz_size, mpz_set_ui, mpz_set_q, mpz_clear, mpz_init, mpz_get_ui, mpz_scan0, mpz_scan1, mpz_getlimbn, mpz_init_set_ui, mpz_perfect_square_p, mpz_popcount, mpf_size, mpf_get_prec, mpf_set_prec_raw, mpf_set_ui, mpf_init, mpf_init2, mpf_clear, mpf_set_si.
* mpz_powm and mpz_powm_ui aren't very fast on one or two limb moduli, due to a lot of function call overheads. These could perhaps be handled as special cases.
* mpz_powm and mpz_powm_ui want better algorithm selection, and the latter should use REDC. Both could change to use an mpn_powm and mpn_redc.
* mpz_powm could handle negative exponents by powering the modular inverse of the base. But what to do if an inverse doesn't exist (base and modulus have common factors)? Throw a divide by zero maybe, or return a flag like mpz_invert does.
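  A minimal sketch of the invert-then-power idea, using only documented calls; the function name and the flag-style return are just one of the options mentioned above:

      #include <gmp.h>

      /* Returns nonzero on success, 0 if base has no inverse mod mod. */
      static int
      powm_signed (mpz_t r, const mpz_t base, const mpz_t exp, const mpz_t mod)
      {
        mpz_t inv, e;
        int ok;

        if (mpz_sgn (exp) >= 0)
          {
            mpz_powm (r, base, exp, mod);
            return 1;
          }
        mpz_init (inv);
        mpz_init (e);
        ok = mpz_invert (inv, base, mod);  /* 0 when no inverse exists */
        if (ok)
          {
            mpz_neg (e, exp);              /* e = |exp| */
            mpz_powm (r, inv, e, mod);
          }
        mpz_clear (inv);
        mpz_clear (e);
        return ok;
      }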
* mpn_gcd might be able to be sped up on small to moderate sizes by improving find_a, possibly just by providing an alternate implementation for CPUs with slowish count_leading_zeros.
* USE_MORE_MPN could use a low to high cache-localized evaluate and interpolate. The necessary mpn_divexact_by3c exists.
* mpn_mul_basecase on NxM with big N but small M could try for better cache locality by taking N piece by piece. The current code could be left available for CPUs without caching. Depending on how karatsuba etc is applied to unequal size operands, it might be possible to assume M is always smallish.
* mpn_perfect_square_p on small operands might be better off skipping the residue tests and just taking a square root.
* mpz_perfect_power_p could be improved in a number of ways. Test for Nth power residues modulo small primes like mpn_perfect_square_p does. Use p-adic arithmetic to find possible roots. Negatives should be handled, and used to know the power must be odd. mpn_perfect_square_p should be called to test for square roots. Divisibility by powers of 2 should be tested with mpz_scan1 or similar. Divisibility by other primes should be tested by grouping into a limb like PP.
* mpz_probab_prime_p, mpn_perfect_square_p and mpz_perfect_power_p could take a remainder mod 2^24-1 to quickly get remainders mod 3, 5, 7, 13 and 17 (factors of 2^24-1).
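  A minimal sketch of the folding trick for a single word (illustrative only; a real version would fold limb by limb): since 2^24 == 1 mod 2^24-1 the 24-bit "digits" of a value can simply be added, and because 3, 5, 7, 13 and 17 all divide 2^24-1, ordinary % on the small result then gives the wanted remainders.

      /* Reduce a value of at most 48 bits modulo 2^24-1. */
      static unsigned long
      mod_2pow24m1 (unsigned long x)
      {
        x = (x & 0xFFFFFF) + (x >> 24);    /* add the 24-bit digits */
        x = (x & 0xFFFFFF) + (x >> 24);    /* absorb the carry */
        if (x == 0xFFFFFF)
          x = 0;                           /* canonical representative of 0 */
        return x;                          /* x % 3, x % 5, ... are now cheap */
      }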
* mpf_set_str produces low zero limbs when a string has a fraction but is exactly representable, e.g. 0.5 in decimal. These could be stripped to save work in later operations.
* mpz_and, mpz_ior and mpz_xor should use mpn_and etc for the benefit of the small number of targets with native versions of those routines. Need to be careful not to pass size==0. Is some code sharing possible between the mpz routines?
* mpn_bdivmod should use a divide-and-conquer algorithm like the normal division. See "Exact Division with Karatsuba Complexity" by Jebelean for a (brief) description. This will benefit mpz_divexact immediately, and mpn_gcd on large unequal size operands. REDC should be able to use it too.
* mpn_gcd_1, mpz_kronecker_ui etc should be able to do an exact division style reduction initially, rather than mpn_mod_1. This would be the same as mpn_gcd does using mpn_bdivmod. This is still two multiplies per limb, but is simpler and should be faster than mul-by-inverse division. Perhaps a function mpn_modexact_1, being the remainder part of an mpn_divexact_1. Creating the modular inverse will take a few cycles, so it might be only for say 4 limbs and up.
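  The pre-computation such an mpn_modexact_1 would need is the inverse of the (odd) divisor modulo 2^BITS_PER_MP_LIMB. A minimal sketch by Newton iteration, shown for 64-bit limbs (each step doubles the number of correct low bits; the function name is made up):

      static mp_limb_t
      binvert_limb_sketch (mp_limb_t d)    /* d must be odd */
      {
        mp_limb_t inv = d;                 /* d*d == 1 mod 8, so 3 bits correct */
        inv *= 2 - d * inv;                /* 6 bits */
        inv *= 2 - d * inv;                /* 12 bits */
        inv *= 2 - d * inv;                /* 24 bits */
        inv *= 2 - d * inv;                /* 48 bits */
        inv *= 2 - d * inv;                /* 96 >= 64 bits */
        return inv;                        /* d * inv == 1 mod 2^64 */
      }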
* mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21264. On 21264, they should run at 4, 3, and 3 cycles/limb respectively, if the code is unrolled properly. (Ask Torbjörn for his xm.s and xam.s skeleton files.)
* mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21164. This should use both integer multiplies and floating-point multiplies. For the floating-point operations, the single-limb multiplier should be split into three 21-bit chunks.
* mpn_addmul_1, mpn_submul_1, and mpn_mul_1. Should use floating-point operations, and split the invariant single-limb multiplier into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline will become very deep. (Torbjörn has C code that is useful as a starting point.)
* mpn_lshift and mpn_rshift. Should give 2 cycles/limb. (Torbjörn has code that just needs to be finished.)
* Investigate why the speed of mpn_addmul_1 and the other multiplies varies so much on successive sizes.
* mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current development code runs at 11 cycles/limb, which is already very good. But it should be possible to saturate the cache, which will happen at 7.5 cycles/limb.
* umul_ppmm. Important in particular for mpn_sqr_basecase. Using four "mulx"s either with an asm block or via the generic C code is about 90 cycles.
* mpn_mul_basecase and mpn_sqr_basecase for important machines. Helping the generic sqr_basecase.c with an mpn_sqr_diagonal might be enough for some of the RISCs.
* mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
* mpn_lshift for shifts by 1. (See Pentium code.)
* mpn_mod_1 can use a mul-by-inverse the same as on P-II, but without the shifts inside the loop and hence without MMX. The same might speed up the loop for P-II also, but maybe at the cost of one extra division step.
* mpn_divrem_1 might be able to use a mul-by-inverse, hoping for maybe 30 c/l.
* mpn_add_n and mpn_sub_n should be able to go faster than the generic x86 code at 3.5 c/l. The athlon code for instance runs at about 2.7.
* mpn_lshift. The pipeline is now extremely deep, perhaps unnecessarily deep.
* mpn_rshift based on new mpn_lshift.
* mpn_add_n and mpn_sub_n. Should run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
* mpn_mul_basecase and mpn_sqr_basecase. This should use a "vertical multiplication method" to avoid carry propagation, splitting one of the operands into 11-bit chunks.
* mpn_mul_basecase and mpn_sqr_basecase. Same comment applies to this as to the same functions for Fujitsu VPP.
* count_leading_zeros for 64-bit machines: if ((x >> 32) == 0) { x <<= 32; cnt += 32; } if ((x >> 48) == 0) { x <<= 16; cnt += 16; } ...
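  Filled out, that binary search might look as follows (a sketch for a 64-bit type, valid only for x != 0; the function name is illustrative):

      static int
      clz64 (unsigned long long x)         /* requires x != 0 */
      {
        int cnt = 0;
        if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
        if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
        if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
        if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
        if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
        if ((x >> 63) == 0) { cnt += 1; }
        return cnt;
      }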
* mpz_get_nth_ui. Return the nth word (not necessarily the nth limb).
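  A minimal sketch of one possible interpretation, built from documented calls (the name and the treatment of the sign are assumptions):

      #include <limits.h>
      #include <gmp.h>

      /* Return the nth unsigned-long-sized word of |z|; with 64-bit limbs
         and 32-bit longs this is not the same as the nth limb. */
      static unsigned long
      mpz_get_nth_ui_sketch (const mpz_t z, unsigned long n)
      {
        mpz_t t;
        unsigned long word;
        mpz_init (t);
        mpz_tdiv_q_2exp (t, z, n * (unsigned long) (CHAR_BIT * sizeof (unsigned long)));
        word = mpz_get_ui (t);             /* low word of the shifted value */
        mpz_clear (t);
        return word;
      }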
* mpz_crr (Chinese Remainder Reconstruction).
* Make mpz_init (and mpq_init) do lazy allocation. Set ALLOC(var) to 0, and have mpz_realloc special-handle that case. Update functions that rely on a single limb (like mpz_set_ui, mpz_{t,f,c}div_{qr,r}_ui, and others).
* mpf_out_raw and mpf_inp_raw. Make sure the format is portable between 32-bit and 64-bit machines, and between little-endian and big-endian machines.
* gmp_errno.
* gmp_fprintf, gmp_sprintf, and gmp_snprintf. Think about some sort of wrapper around printf so it and its several variants don't have to be completely reimplemented.
* mpq input and output functions.
* mpz_kronecker, leave mpz_jacobi for compatibility.
* mpz_set_str etc taking string lengths rather than null-terminators.
* Threshold comparisons should use "<=" rather than "<", so a threshold can be set to MP_SIZE_T_MAX to get only the simpler code (the compiler will know size <= MP_SIZE_T_MAX is always true).
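  A schematic sketch of the comparison style; basecase() and recurse() are placeholders and the threshold name is made up, the point is only the form of the test:

      #include <gmp.h>

      #define MUL_THRESHOLD  32            /* or MP_SIZE_T_MAX for basecase-only builds */

      void
      mul (mp_ptr wp, mp_srcptr up, mp_srcptr vp, mp_size_t size)
      {
        if (size <= MUL_THRESHOLD)         /* "<=", not "<" */
          basecase (wp, up, vp, size);     /* placeholder routine */
        else
          recurse (wp, up, vp, size);      /* placeholder routine */
      }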
* mpz_cdiv_q_2exp and mpz_cdiv_r_2exp could be implemented to match the corresponding tdiv and fdiv. Maybe some code sharing is possible.
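  As an illustration of the possible code sharing, the ceiling quotient can be expressed through the existing fdiv functions (a sketch with documented calls only; a real implementation would work at the mpn level):

      #include <gmp.h>

      /* q = ceil (n / 2^k): floor quotient plus one when a remainder exists. */
      static void
      cdiv_q_2exp_sketch (mpz_t q, const mpz_t n, unsigned long k)
      {
        mpz_t r;
        mpz_init (r);
        mpz_fdiv_r_2exp (r, n, k);         /* r = n mod 2^k, 0 <= r < 2^k */
        mpz_fdiv_q_2exp (q, n, k);         /* floor quotient */
        if (mpz_sgn (r) != 0)
          mpz_add_ui (q, q, 1);            /* round towards +infinity */
        mpz_clear (r);
      }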
* alloca detection could be improved a bit. One possible approach would be to bring the code block from AC_FUNC_ALLOCA into gmp-impl.h, then use gmp-impl.h in a test for a working alloca. The test would correspond to how the build will be done. gmp-impl.h would probably need to be told to use confdefs.h, not config.h, during the test.
* alloca, but cannot find any symbol to test for HPUX 10. Damn.
* udiv should be selected according to the CPU target. Currently floating point ends up being used on all sparcs, which is probably not right for generic V7 and V8.
* AC_PROG_F77 if such a file is found. Automake will generate some of what's needed in the makefiles, but libtool doesn't know Fortran and so rules like the current ".asm.lo" will be needed.
* umul and udiv code not being used. Check all such for bit rot and then put umul and udiv in $gmp_mpn_functions_optional as "standard optional" objects.
* mpn_umul_ppmm and mpn_udiv_qrnnd have a different parameter order than those functions on other CPUs. It might avoid confusion to have them under different names, maybe mpn_umul_ppmm_plast or some such. Prototypes then wouldn't be conditionalized, and the appropriate form could be selected with the HAVE_NATIVE scheme if/when the code switches to use a PROLOGUE style.
* In general, getting the exact right configuration, passing the exact right options to the compiler, etc, might mean that the GMP performance more than doubles.
* When testing, make sure to test at least the following for all our target machines: (1) Both gcc and cc (and c89). (2) Both 32-bit mode and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system `make' and GNU `make'. (4) With and without GNU binutils.
* Make mpz_div and mpz_divmod use rounding analogous to mpz_mod. Document, and list as an incompatibility.
* mpz_invert should call mpn_gcdext directly.
* TMP_ALLOC could do a separate malloc for each allocated block, to help a redzoning malloc debugger. Perhaps a config option --enable-alloca=debug.
* Add ASSERTs at the start of each user-visible mpz/mpq/mpf function to check the validity of each mp?_t parameter, in particular to check that they've been mp?_init'ed. This might catch elementary mistakes in user programs. Care would need to be taken over MPZ_TMP_INIT'ed variables used internally.
* mp_set_memory_functions and a check at program end to guard against memory leaks. This could also check that the sizes passed to allocate and free are the same.
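  A minimal sketch of such a check layered on the documented mp_set_memory_functions interface; a real version would track individual blocks and their sizes, this one only counts outstanding allocations:

      #include <stdio.h>
      #include <stdlib.h>
      #include <gmp.h>

      static long live_blocks = 0;

      static void *
      dbg_allocate (size_t size)
      {
        live_blocks++;
        return malloc (size);              /* no failure handling in this sketch */
      }

      static void *
      dbg_reallocate (void *ptr, size_t old_size, size_t new_size)
      {
        return realloc (ptr, new_size);    /* old_size could be cross-checked here */
      }

      static void
      dbg_free (void *ptr, size_t size)
      {
        live_blocks--;                     /* size could be cross-checked here */
        free (ptr);
      }

      static void
      report_leaks (void)
      {
        if (live_blocks != 0)
          fprintf (stderr, "gmp: %ld block(s) not freed\n", live_blocks);
      }

      int
      main (void)
      {
        mp_set_memory_functions (dbg_allocate, dbg_reallocate, dbg_free);
        atexit (report_leaks);
        /* ... program using GMP ... */
        return 0;
      }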
* Check that unsigned long int is used for bit counts/ranges, and that mp_size_t is used for limb counts.
* The documentation for mpz_inp_str (etc) doesn't say when it stops reading digits.
Please send comments about this page to tege@swox.com. Copyright 1999, 2000 Torbjörn Granlund.