This file lists itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.
* _mpz_realloc with a small (1 limb) size.
* mpz_XXX(a,a,a).
* mp_exp_t, the precision of __mp_bases[base].chars_per_bit_exactly is insufficient and mpf_get_str aborts. Detect and compensate.
* mpz_*_ui division routines. Perhaps make them return the real remainder instead? Changes return type to signed long int.
* mpf_eq is not always correct, when one operand is 1000000000... and the other operand is 0111111111..., i.e., extremely close. There is a special case in mpf_sub for this situation; put similar code in mpf_eq.
* mpf_eq doesn't implement what gmp.texi specifies. It should not use just whole limbs, but partial limbs.
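  As a minimal illustration of the partial-limb comparison the above calls for (assuming the internal names mp_limb_t and BITS_PER_MP_LIMB from gmp-impl.h; the helper name is made up):

      /* Compare only the `bits' most significant bits of two limbs,
         with 0 < bits < BITS_PER_MP_LIMB.  Illustrative sketch only. */
      static int
      partial_limb_equal (mp_limb_t x, mp_limb_t y, unsigned bits)
      {
        mp_limb_t mask = ~(mp_limb_t) 0 << (BITS_PER_MP_LIMB - bits);
        return ((x ^ y) & mask) == 0;
      }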
* __builtin_constant_p is unavailable? Same problem with MacOS X.
* dump_abort in mp?/tests/*.c.
* mpz_get_si returns 0x80000000 for -0x100000000.
* TMP_ALLOC is not reentrant when using stack-alloc.c. Perhaps make it so by default and leave a --enable-alloca=malloc-nonreentrant with the current code. A reentrant version will probably be slower since it can't share malloced blocks across function calls.
* mpz_scan0 and mpz_scan1 only work in quite limited circumstances and could be improved. Supporting twos complement like the other bitwise functions would be good. A defined return value for no 0 or 1 found would be good too, perhaps ULONG_MAX.
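  A usage sketch of the suggested convention; note that ULONG_MAX as a "not found" value is only the proposal above, not documented behaviour at the time of writing:

      #include <limits.h>
      #include <gmp.h>

      /* Hypothetical: relies on mpz_scan1 returning ULONG_MAX when no
         1 bit is found, as proposed above. */
      int
      lowest_one_bit (const mpz_t x, unsigned long *bit)
      {
        *bit = mpz_scan1 (x, 0);
        return *bit != ULONG_MAX;   /* 0 means x has no 1 bits, i.e. x == 0 */
      }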
* The count returned by count_leading_zeros is checked for zero in hundreds of places. Instead, check the most significant bit of the operand, and avoid invoking count_leading_zeros if the bit is set. This is an optimization on all machines, and significant on machines with a slow count_leading_zeros.
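  A minimal sketch of the suggested fast path, assuming the internal names mp_limb_t, BITS_PER_MP_LIMB and the count_leading_zeros macro from gmp-impl.h / longlong.h are in scope (as they are inside GMP itself):

      static int
      leading_zeros_or_zero (mp_limb_t high)
      {
        int cnt;
        if (high & ((mp_limb_t) 1 << (BITS_PER_MP_LIMB - 1)))
          return 0;                        /* high bit set: skip the count */
        count_leading_zeros (cnt, high);   /* only reached when the count is nonzero */
        return cnt;
      }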
* count_trailing_zeros is used on more or less uniformly distributed numbers in a couple of places. For some CPUs count_trailing_zeros is slow and it's probably worth handling the frequently occurring 0 to 2 trailing zeros cases specially.
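  A minimal sketch of such special-casing, under the same assumptions about the internal macro names:

      static int
      trailing_zeros (mp_limb_t x)
      {
        int cnt;
        if (x & 1)
          return 0;                        /* odd: the most common case */
        if (x & 2)
          return 1;
        if (x & 4)
          return 2;
        count_trailing_zeros (cnt, x);     /* rare: 3 or more trailing zeros */
        return cnt;
      }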
* Change umul_ppmm to use floating-point for generating the most significant limb (if BITS_PER_MP_LIMB <= 52 bits). (Peter Montgomery has some ideas on this subject.)
* Improve the umul_ppmm code in longlong.h: add partial products with fewer operations.
* Get mpn_get_str and mpn_set_str running in the sub-O(n^2) range, using some divide-and-conquer approach, preferably without using division.
* mpn_get_str should use a fast native mpn_divrem_1 when available (athlon, p6mmx), possibly via a new mpn_divrem_1_preinv interface. New functions like mpn_divrem_1_norm or mpn_divrem_1_preinvnorm could exist for those targets where pre-shifting helps (p6 maybe).
* mpn_get_str to mpf/get_str. (Talk to Torbjörn about this.)
* mpz_size, mpz_set_ui, mpz_set_q, mpz_clear, mpz_init, mpz_get_ui, mpz_scan0, mpz_scan1, mpz_getlimbn, mpz_init_set_ui, mpz_perfect_square_p, mpz_popcount, mpf_size, mpf_get_prec, mpf_set_prec_raw, mpf_set_ui, mpf_init, mpf_init2, mpf_clear, mpf_set_si.
* mpz_powm and mpz_powm_ui aren't very fast on one or two limb moduli, due to a lot of function call overheads. These could perhaps be handled as special cases.
* mpz_powm and mpz_powm_ui want better algorithm selection, and the latter should use REDC. Both could change to use an mpn_powm and mpn_redc.
* mpz_powm could handle negative exponents by powering the modular inverse of the base. But what to do if an inverse doesn't exist (base and modulus have common factors)? Throw a divide by zero maybe, or return a flag like mpz_invert does.
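  A minimal sketch of the invert-then-power idea, using only documented calls; the function name and the flag-style return are just one of the options mentioned above:

      #include <gmp.h>

      /* Returns nonzero on success, 0 if base has no inverse mod mod. */
      static int
      powm_signed (mpz_t r, const mpz_t base, const mpz_t exp, const mpz_t mod)
      {
        mpz_t inv, e;
        int ok;

        if (mpz_sgn (exp) >= 0)
          {
            mpz_powm (r, base, exp, mod);
            return 1;
          }
        mpz_init (inv);
        mpz_init (e);
        ok = mpz_invert (inv, base, mod);  /* 0 when no inverse exists */
        if (ok)
          {
            mpz_neg (e, exp);              /* e = |exp| */
            mpz_powm (r, inv, e, mod);
          }
        mpz_clear (inv);
        mpz_clear (e);
        return ok;
      }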
* mpn_gcd might be able to be sped up on small to moderate sizes by improving find_a, possibly just by providing an alternate implementation for CPUs with slowish count_leading_zeros.
* USE_MORE_MPN could use a low to high cache-localized evaluate and interpolate. The necessary mpn_divexact_by3c exists.
* mpn_mul_basecase on NxM with big N but small M could try for better cache locality by taking N piece by piece. The current code could be left available for CPUs without caching. Depending on how karatsuba etc is applied to unequal size operands, it might be possible to assume M is always smallish.
* mpn_perfect_square_p on small operands might be better off skipping the residue tests and just taking a square root.
* mpz_perfect_power_p could be improved in a number of ways. Test for Nth power residues modulo small primes like mpn_perfect_square_p does. Use p-adic arithmetic to find possible roots. Negatives should be handled, and used to know the power must be odd. mpn_perfect_square_p should be called to test for square roots. Divisibility by powers of 2 should be tested with mpz_scan1 or similar. Divisibility by other primes should be tested by grouping into a limb like PP.
* mpz_probab_prime_p, mpn_perfect_square_p and mpz_perfect_power_p could take a remainder mod 2^24-1 to quickly get remainders mod 3, 5, 7, 13 and 17 (factors of 2^24-1).
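  A minimal sketch of the folding trick for a single word (illustrative only; a real version would fold limb by limb): since 2^24 == 1 mod 2^24-1 the 24-bit "digits" of a value can simply be added, and because 3, 5, 7, 13 and 17 all divide 2^24-1, ordinary % on the small result then gives the wanted remainders.

      /* Reduce a value of at most 48 bits modulo 2^24-1. */
      static unsigned long
      mod_2pow24m1 (unsigned long x)
      {
        x = (x & 0xFFFFFF) + (x >> 24);    /* add the 24-bit digits */
        x = (x & 0xFFFFFF) + (x >> 24);    /* absorb the carry */
        if (x == 0xFFFFFF)
          x = 0;                           /* canonical representative of 0 */
        return x;                          /* x % 3, x % 5, ... are now cheap */
      }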
* mpf_set_str produces low zero limbs when a string has a fraction but is exactly representable, e.g. 0.5 in decimal. These could be stripped to save work in later operations.
* mpz_and, mpz_ior and mpz_xor should use mpn_and etc for the benefit of the small number of targets with native versions of those routines. Need to be careful not to pass size==0. Is some code sharing possible between the mpz routines?
* mpn_bdivmod should use a divide-and-conquer algorithm like the normal division. See "Exact Division with Karatsuba Complexity" by Jebelean for a (brief) description. This will benefit mpz_divexact immediately, and mpn_gcd on large unequal size operands. REDC should be able to use it too.
* mpn_gcd_1, mpz_kronecker_ui etc should be able to do an exact division style reduction initially, rather than mpn_mod_1. This would be the same as mpn_gcd does using mpn_bdivmod. This is still two multiplies per limb, but is simpler and should be faster than mul-by-inverse division. Perhaps a function mpn_modexact_1, being the remainder part of an mpn_divexact_1. Creating the modular inverse will take a few cycles, so it might be only for say 4 limbs and up.
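  The pre-computation such an mpn_modexact_1 would need is the inverse of the (odd) divisor modulo 2^BITS_PER_MP_LIMB. A minimal sketch by Newton iteration, shown for 64-bit limbs (each step doubles the number of correct low bits; the function name is made up):

      static mp_limb_t
      binvert_limb_sketch (mp_limb_t d)    /* d must be odd */
      {
        mp_limb_t inv = d;                 /* d*d == 1 mod 8, so 3 bits correct */
        inv *= 2 - d * inv;                /* 6 bits */
        inv *= 2 - d * inv;                /* 12 bits */
        inv *= 2 - d * inv;                /* 24 bits */
        inv *= 2 - d * inv;                /* 48 bits */
        inv *= 2 - d * inv;                /* 96 >= 64 bits */
        return inv;                        /* d * inv == 1 mod 2^64 */
      }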
* mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21264. On 21264, they should run at 4, 3, and 3 cycles/limb respectively, if the code is unrolled properly. (Ask Torbjörn for his xm.s and xam.s skeleton files.)
* mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21164. This should use both integer multiplies and floating-point multiplies. For the floating-point operations, the single-limb multiplier should be split into three 21-bit chunks.
* mpn_addmul_1, mpn_submul_1, and mpn_mul_1. Should use floating-point operations, and split the invariant single-limb multiplier into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline will become very deep. (Torbjörn has C code that is useful as a starting point.)
* mpn_lshift and mpn_rshift. Should give 2 cycles/limb. (Torbjörn has code that just needs to be finished.)
* Investigate why the speed of mpn_addmul_1 and the other multiplies varies so much on successive sizes.
* mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current development code runs at 11 cycles/limb, which is already very good. But it should be possible to saturate the cache, which will happen at 7.5 cycles/limb.
* umul_ppmm. Important in particular for mpn_sqr_basecase. Using four "mulx"s either with an asm block or via the generic C code is about 90 cycles.
* mpn_mul_basecase and mpn_sqr_basecase for important machines. Helping the generic sqr_basecase.c with an mpn_sqr_diagonal might be enough for some of the RISCs.
* mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
* mpn_lshift for shifts by 1. (See Pentium code.)
* mpn_mod_1 can use a mul-by-inverse the same as on P-II, but without the shifts inside the loop and hence without MMX. The same might speed up the loop for P-II also, but maybe at the cost of one extra division step.
* mpn_divrem_1 might be able to use a mul-by-inverse, hoping for maybe 30 c/l.
* mpn_add_n and mpn_sub_n should be able to go faster than the generic x86 code at 3.5 c/l. The athlon code for instance runs at about 2.7.
* mpn_lshift. The pipeline is now extremely deep, perhaps unnecessarily deep.
* mpn_rshift based on new mpn_lshift.
* mpn_add_n and mpn_sub_n. Should run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
* mpn_mul_basecase and mpn_sqr_basecase. This should use a "vertical multiplication method" to avoid carry propagation, splitting one of the operands into 11-bit chunks.
* mpn_mul_basecase and mpn_sqr_basecase. Same comment applies to this as to the same functions for Fujitsu VPP.
* count_leading_zeros for 64-bit machines: if ((x >> 32) == 0) { x <<= 32; cnt += 32; } if ((x >> 48) == 0) { x <<= 16; cnt += 16; } ...
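  Filled out, that binary search might look as follows (a sketch for a 64-bit type, valid only for x != 0; the function name is illustrative):

      static int
      clz64 (unsigned long long x)         /* requires x != 0 */
      {
        int cnt = 0;
        if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
        if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
        if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
        if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
        if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
        if ((x >> 63) == 0) { cnt += 1; }
        return cnt;
      }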
* mpz_get_nth_ui. Return the nth word (not necessarily the nth limb).
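  A minimal sketch of one possible interpretation, built from documented calls (the name and the treatment of the sign are assumptions):

      #include <limits.h>
      #include <gmp.h>

      /* Return the nth unsigned-long-sized word of |z|; with 64-bit limbs
         and 32-bit longs this is not the same as the nth limb. */
      static unsigned long
      mpz_get_nth_ui_sketch (const mpz_t z, unsigned long n)
      {
        mpz_t t;
        unsigned long word;
        mpz_init (t);
        mpz_tdiv_q_2exp (t, z, n * (unsigned long) (CHAR_BIT * sizeof (unsigned long)));
        word = mpz_get_ui (t);             /* low word of the shifted value */
        mpz_clear (t);
        return word;
      }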
* mpz_crr (Chinese Remainder Reconstruction).
* Make mpz_init (and mpq_init) do lazy allocation. Set ALLOC(var) to 0, and have mpz_realloc special-handle that case. Update functions that rely on a single limb (like mpz_set_ui, mpz_{t,f,c}div_{qr,r}_ui, and others).
* mpf_out_raw and mpf_inp_raw. Make sure the format is portable between 32-bit and 64-bit machines, and between little-endian and big-endian machines.
* gmp_errno.
* gmp_fprintf, gmp_sprintf, and gmp_snprintf. Think about some sort of wrapper around printf so it and its several variants don't have to be completely reimplemented.
* mpq input and output functions.
* mpz_kronecker, leave mpz_jacobi for compatibility.
* mpz_set_str etc taking string lengths rather than null-terminators.
* Threshold comparisons should use "<=" rather than "<", so a threshold can be set to MP_SIZE_T_MAX to get only the simpler code (the compiler will know size <= MP_SIZE_T_MAX is always true).
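  A schematic sketch of the comparison style; basecase() and recurse() are placeholders and the threshold name is made up, the point is only the form of the test:

      #include <gmp.h>

      #define MUL_THRESHOLD  32            /* or MP_SIZE_T_MAX for basecase-only builds */

      void
      mul (mp_ptr wp, mp_srcptr up, mp_srcptr vp, mp_size_t size)
      {
        if (size <= MUL_THRESHOLD)         /* "<=", not "<" */
          basecase (wp, up, vp, size);     /* placeholder routine */
        else
          recurse (wp, up, vp, size);      /* placeholder routine */
      }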
* mpz_cdiv_q_2exp and mpz_cdiv_r_2exp could be implemented to match the corresponding tdiv and fdiv. Maybe some code sharing is possible.
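  As an illustration of the possible code sharing, the ceiling quotient can be expressed through the existing fdiv functions (a sketch with documented calls only; a real implementation would work at the mpn level):

      #include <gmp.h>

      /* q = ceil (n / 2^k): floor quotient plus one when a remainder exists. */
      static void
      cdiv_q_2exp_sketch (mpz_t q, const mpz_t n, unsigned long k)
      {
        mpz_t r;
        mpz_init (r);
        mpz_fdiv_r_2exp (r, n, k);         /* r = n mod 2^k, 0 <= r < 2^k */
        mpz_fdiv_q_2exp (q, n, k);         /* floor quotient */
        if (mpz_sgn (r) != 0)
          mpz_add_ui (q, q, 1);            /* round towards +infinity */
        mpz_clear (r);
      }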
* alloca detection could be improved a bit. One possible approach would be to bring the code block from AC_FUNC_ALLOCA into gmp-impl.h, then use gmp-impl.h in a test for a working alloca. The test would correspond to how the build will be done. gmp-impl.h would probably need to be told to use confdefs.h, not config.h, during the test.
* alloca, but cannot find any symbol to test for HPUX 10. Damn.
* udiv should be selected according to the CPU target. Currently floating point ends up being used on all sparcs, which is probably not right for generic V7 and V8.
* AC_PROG_F77 if such a file is found. Automake will generate some of what's needed in the makefiles, but libtool doesn't know Fortran and so rules like the current ".asm.lo" will be needed.
* umul and udiv code not being used. Check all such for bit rot and then put umul and udiv in $gmp_mpn_functions_optional as "standard optional" objects.
* mpn_umul_ppmm and mpn_udiv_qrnnd have a different parameter order than those functions on other CPUs. It might avoid confusion to have them under different names, maybe mpn_umul_ppmm_plast or some such. Prototypes then wouldn't be conditionalized, and the appropriate form could be selected with the HAVE_NATIVE scheme if/when the code switches to use a PROLOGUE style.
* In general, getting the exact right configuration, passing the exact right options to the compiler, etc, might mean that the GMP performance more than doubles.
* When testing, make sure to test at least the following for all our target machines: (1) Both gcc and cc (and c89). (2) Both 32-bit mode and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system `make' and GNU `make'. (4) With and without GNU binutils.
* Make mpz_div and mpz_divmod use rounding analogous to mpz_mod. Document, and list as an incompatibility.
* mpz_invert should call mpn_gcdext directly.
* TMP_ALLOC could do a separate malloc for each allocated block, to help a redzoning malloc debugger. Perhaps a config option --enable-alloca=debug.
* Add ASSERTs at the start of each user-visible mpz/mpq/mpf function to check the validity of each mp?_t parameter, in particular to check that they've been mp?_init'ed. This might catch elementary mistakes in user programs. Care would need to be taken over MPZ_TMP_INIT'ed variables used internally.
* mp_set_memory_functions and a check at program end to guard against memory leaks. This could also check that the sizes passed to allocate and free are the same.
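  A minimal sketch of such a check layered on the documented mp_set_memory_functions interface; a real version would track individual blocks and their sizes, this one only counts outstanding allocations:

      #include <stdio.h>
      #include <stdlib.h>
      #include <gmp.h>

      static long live_blocks = 0;

      static void *
      dbg_allocate (size_t size)
      {
        live_blocks++;
        return malloc (size);              /* no failure handling in this sketch */
      }

      static void *
      dbg_reallocate (void *ptr, size_t old_size, size_t new_size)
      {
        return realloc (ptr, new_size);    /* old_size could be cross-checked here */
      }

      static void
      dbg_free (void *ptr, size_t size)
      {
        live_blocks--;                     /* size could be cross-checked here */
        free (ptr);
      }

      static void
      report_leaks (void)
      {
        if (live_blocks != 0)
          fprintf (stderr, "gmp: %ld block(s) not freed\n", live_blocks);
      }

      int
      main (void)
      {
        mp_set_memory_functions (dbg_allocate, dbg_reallocate, dbg_free);
        atexit (report_leaks);
        /* ... program using GMP ... */
        return 0;
      }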
* Check that unsigned long int is used for bit counts/ranges, and that mp_size_t is used for limb counts.
* The documentation for mpz_inp_str (etc) doesn't say when it stops reading digits.
Please send comments about this page to tege@swox.com. Copyright 1999, 2000 Torbjörn Granlund.