These are itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.
- mp_size_t data in several int variables. (It is also really slow. Perhaps just call mpn_divrem_1. See also optimization item below.)
- _mpz_realloc with a small (1 limb) size.
- mpz_XXX(a,a,a).
- For mpf_t numbers with exponents >2^53 on machines with a 64-bit mp_exp_t, the precision of __mp_bases[base].chars_per_bit_exactly is insufficient and mpf_get_str aborts. Detect and compensate. Alternately, think seriously about using some sort of fixed-point integer value. Avoiding unnecessary floating point is probably a good thing in general, and it might be faster on some CPUs.
- mpf_eq is not always correct when one operand is 1000000000... and the other operand is 0111111111..., i.e., extremely close. There is a special case in mpf_sub for this situation; put similar code in mpf_eq.
- mpf_eq doesn't implement what gmp.texi specifies. It should not use just whole limbs, but partial limbs.
- __builtin_constant_p available. Same problem with MacOS X.
- TMP_ALLOC is not reentrant when using stack-alloc.c. Perhaps make it reentrant by default, and leave an --enable-alloca=malloc-nonreentrant option with the current code. A reentrant version will probably be slower, since it can't share malloced blocks across function calls.
- _mpz_realloc will reallocate to less than the current size of an mpz. This is ok internally when a destination size is yet to be updated, but from user code it should probably refuse to go below the size of the current value. Could switch internal uses across to a new function, and tighten up _mpz_realloc.
- The count returned by count_leading_zeros is checked for zero in hundreds of places. Instead, check the most significant bit of the operand, and avoid invoking count_leading_zeros if the bit is set. This is an optimization on all machines, and significant on machines with a slow count_leading_zeros. A sketch of the idea follows.
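A minimal sketch of the idea (illustrative only, not GMP's actual code):

    /* Skip count_leading_zeros when the operand is already normalized,
       i.e. when its most significant bit is set. */
    if ((n >> (BITS_PER_MP_LIMB - 1)) != 0)
      count = 0;                        /* common, fast case */
    else
      count_leading_zeros (count, n);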
- count_trailing_zeros is used on more or less uniformly distributed numbers in a couple of places. For some CPUs count_trailing_zeros is slow, and it's probably worth handling the frequently occurring cases of 0 to 2 trailing zeros specially, as in the sketch below.
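A possible special-casing, as a sketch (not from GMP):

    /* For uniformly distributed n, there are 0 trailing zeros half the
       time, 1 a quarter of the time, and 2 an eighth of the time. */
    if (n & 1)
      count = 0;
    else if (n & 2)
      count = 1;
    else if (n & 4)
      count = 2;
    else
      count_trailing_zeros (count, n);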
- umul_ppmm to use floating-point for generating the most significant limb (if BITS_PER_MP_LIMB <= 52 bits). (Peter Montgomery has some ideas on this subject.)
- umul_ppmm code in longlong.h: Add partial products with fewer operations.
- mpn_get_str and mpn_set_str running in the sub-O(n^2) range, using some divide-and-conquer approach, preferably without using division.
- mpn_get_str should use a fast native mpn_divrem_1 when available (athlon, p6mmx), possibly via a new mpn_preinv_divrem_1 interface. New functions like mpn_divrem_1_norm or mpn_preinv_divrem_1_norm could exist for those targets where pre-shifting helps (p6 maybe).
- mpn_get_str to mpf/get_str. (Talk to Torbjörn about this.)
- mpz_size, mpz_set_ui, mpz_set_q, mpz_clear, mpz_init, mpz_get_ui, mpz_getlimbn, mpz_init_set_ui, mpz_perfect_square_p, mpz_popcount, mpf_size, mpf_get_prec, mpf_set_prec_raw, mpf_set_ui, mpf_init, mpf_init2, mpf_clear, mpf_set_si.
- extern inline in the style of the current mpn_add etc, since applications might be using them as function pointers. The GLIBC trick of first defining a prototype and then a macro looks like it won't work with the GMP __gmpz renamings.
- Implement mpz_neg as

    if (src != dst)
      mpz_set (dst, src);
    dst->_mp_size = - dst->_mp_size;

  so that an in-place mpz_neg(z,z) can be a simple negation of the size field. Likewise for mpz_abs, and for mpq and mpf too.
- mpz_powm and mpz_powm_ui aren't very fast on one or two limb moduli, due to a lot of function call overheads. These could perhaps be handled as special cases.
- mpz_powm and mpz_powm_ui want better algorithm selection, and the latter should use REDC. Both could change to use an mpn_powm and mpn_redc.
- mpz_powm REDC should do multiplications by g[] using the division method when they're small, since the REDC form of a small multiplier is normally a full size product. Probably would need a new tuned parameter to say what size multiplier is "small", as a function of the size of the modulus.
- mpz_powm REDC should handle even moduli if possible. Maybe this would mean for m=n*2^k doing mod n using REDC and an auxiliary calculation mod 2^k, then putting them together at the end, presumably CRT style: from r1 = a mod n and r2 = a mod 2^k, a mod m = r1 + n*((r2-r1)*n^-1 mod 2^k).
- mpn_gcd might be able to be sped up on small to moderate sizes by improving find_a, possibly just by providing an alternate implementation for CPUs with slowish count_leading_zeros.
- USE_MORE_MPN could use a low to high cache-localized evaluate and interpolate. The necessary mpn_divexact_by3c exists.
- mpn_mul_basecase on NxM with big N but small M could try for better cache locality by taking N piece by piece. The current code could be left available for CPUs without caching. Depending on how karatsuba etc is applied to unequal size operands, it might be possible to assume M is always smallish.
- mpn_perfect_square_p on small operands might be better off skipping the residue tests and just taking a square root.
- mpn_perfect_square_p could take a remainder mod 15 instead of mod 3 and mod 5 separately, and use a bit table indicating whether an N mod 15 is a quadratic residue mod both 3 and 5 (sketched below). More combining would work too, like say 3*11 and 5*7, in whatever good combinations, possibly aiming to keep under 64 or 128 bits.
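One way the mod-15 bit table could look (a sketch; the mask value is computed here, not taken from GMP):

    /* Bit i of SQUARE_MOD15_MASK is set iff i is a quadratic residue
       both mod 3 and mod 5, i.e. i is in {0, 1, 4, 6, 9, 10}. */
    #define SQUARE_MOD15_MASK  0x653

    /* Nonzero if n could be a perfect square, judging by r15 = n % 15. */
    #define POSSIBLE_SQUARE_MOD15(r15)  ((SQUARE_MOD15_MASK >> (r15)) & 1)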
- mpz_perfect_power_p could be improved in a number of ways. Test for Nth power residues modulo small primes like mpn_perfect_square_p does. Use p-adic arithmetic to find possible roots. Negatives should be handled, and used to know the power must be odd. Divisibility by powers of 2 should be tested with mpz_scan1 or similar. Divisibility by other primes should be tested by grouping into a limb like PP.
- Make PP/PP_INVERTED into an array of such pairs, listing several hundred primes. Perhaps actually make the products larger than one limb each.
- mpz_probab_prime_p, mpn_perfect_square_p and mpz_perfect_power_p could take a remainder mod 2^24-1 to quickly get remainders mod 3, 5, 7, 13 and 17 (factors of 2^24-1). A remainder mod 2^24-1 can be taken at adder throughput speed; see the folding sketch below.
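A sketch of how such a remainder can be folded with plain additions (an illustrative function, not part of GMP):

    /* Reduce x mod 2^24-1: since 2^24 == 1 (mod 2^24-1), the bits above
       the low 24 can simply be added back in. */
    unsigned long
    mod_2pow24_minus_1 (unsigned long x)
    {
      while (x >> 24)
        x = (x & 0xFFFFFF) + (x >> 24);
      return x == 0xFFFFFF ? 0 : x;
    }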
- mpf_set_str produces low zero limbs when a string has a fraction but is exactly representable, e.g. 0.5 in decimal. These could be stripped to save work in later operations.
- mpz_and, mpz_ior and mpz_xor should use mpn_and etc for the benefit of the small number of targets with native versions of those routines. Need to be careful not to pass size==0. Is some code sharing possible between the mpz routines?
- mpn_bdivmod should use a divide and conquer like the normal division. See "Exact Division with Karatsuba Complexity" by Jebelean for a (brief) description. This will benefit mpz_divexact immediately, and mpn_gcd on large unequal size operands. REDC should be able to use it too.
- restrict'ed pointers: Does the C99 definition of restrict (one writer many readers, or whatever it is) suit the GMP style "same or separate" function parameters? If so, judicious use might improve the code generated a bit. Do any compilers have their own flavour of restrict as "completely unaliased", and is that still usable?
- __attribute__ ((malloc)) could be used on __gmp_allocate_func and on the stack-alloc.c TMP_ALLOC. Don't know if it'd do much. Current pre-release gcc 3 doesn't like attaching function attributes to function pointers like __gmp_allocate_func (see "(gcc)Attribute Syntax"), so this has to wait for the future. A sketch of the attribute on a plain function follows.
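On a plain function (rather than a function pointer) the attribute looks like this (xmalloc is a hypothetical allocator, shown only for illustration):

    #include <stddef.h>

    /* Tells gcc the returned block is not aliased by any other pointer. */
    void *xmalloc (size_t) __attribute__ ((malloc));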
- Compiler-dependent threshold defaults could be selected along these lines:

    #if defined (__GNUC__)
    #if __GNUC__ == 2 && __GNUC_MINOR__ == 7
    ...
    #endif
    #if __GNUC__ == 2 && __GNUC_MINOR__ == 8
    ...
    #endif
    #ifndef KARATSUBA_MUL_THRESHOLD
    /* Default GNUC values */
    ...
    #endif
    #else /* system compiler */
    ...
    #endif
- mpn_addmul_1 and mpn_submul_1. Integrate new mpn_mul_1 code.
- mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21164. This should use both integer multiplies and floating-point multiplies. For the floating-point operations, the single-limb multiplier should be split into three 21-bit chunks, or perhaps even better into four 16-bit chunks. Probably possible to reach 9 cycles/limb.
- mpn_mul_1, mpn_addmul_1, and mpn_submul_1. Should use floating-point operations, and split the invariant single-limb multiplier into 16-bit chunks. Will give 14 cycles/limb, but with a deep pipeline.
- umul_ppmm. Using four "mulx"s either with an asm block or via the generic C code is about 90 cycles. Try using fp operations, and also try using karatsuba for just three "mulx"s.
- mpn_sqr_diagonal. Since we're squaring each limb, it seems best to split limbs into one 22-bit chunk (for the most significant part) and two 21-bit chunks. Things magically align to 64-bit borders with that splitting. Will need 19 memory operations, and won't therefore run faster than at about 20 cycles/limb. (The current umul_ppmm loop of mpn_sqr_basecase runs at around 60 cycles/limb.) If we use VIS fand for splitting operands, just 13 memory operations are needed.
- mpn_lshift, mpn_rshift. Will give 2 cycles/limb. Trivial modifications of mpn/sparc64 should do.
- mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current code runs at 11 cycles/limb. It should be possible to saturate the cache, which will happen at 8 cycles/limb (7.5 for mpn_mul_1). Write special loops for s2 < 2^32; it should be possible to make them run at about 5 cycles/limb.
- mpn_addmul_1, mpn_submul_1, and mpn_mul_1. Use both integer and floating-point operations, possibly two floating-point and one integer limb per loop. Split operands into four 16-bit chunks for fast fp operations. Should easily reach 9 cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb (using one int + two fp).
- mpn_mul_basecase and mpn_sqr_basecase for important machines. Helping the generic sqr_basecase.c with an mpn_sqr_diagonal might be enough for some of the RISCs.
- mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
- mpn_lshift for shifts by 1. (See Pentium code.)
- rep movs would upset GCC register allocation for the whole function. Is this still true in GCC 3? Examine the code for some simple and complex functions to find out. Inlining rep movs would be desirable; it'd be both smaller and faster.
- mpn_divrem_1 might be able to use a mul-by-inverse, hoping for maybe 30 c/l.
- mpn_add_n and mpn_sub_n should be able to go faster than the generic x86 code at 3.5 c/l. The athlon code for instance runs at about 2.7.
- mpn_lshift. The pipeline is now extremely deep, perhaps unnecessarily deep.
- mpn_rshift based on new mpn_lshift.
- mpn_add_n and mpn_sub_n. Should run at just 3.25 cycles/limb.
- mpn_mul_basecase and mpn_sqr_basecase. This should use a "vertical multiplication method" to avoid carry propagation, splitting one of the operands into 11-bit chunks.
- TMP_ALLOC to use them, or introduce a new scheme. Memory blocks wanted unconditionally are easy enough; those wanted only sometimes are a problem. Perhaps a special size calculation to ask for a dummy length 1 when unwanted, or perhaps an inlined subroutine duplicating code under each conditional. Don't really want to turn everything into a dog's dinner just because Cray doesn't offer an alloca.
- count_leading_zeros for 64-bit machines:

    if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
    if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
    ...

  (A completed version of this binary search is sketched below.)
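The completed binary search, as an illustrative sketch for 64-bit limbs (clz64 is not a GMP function):

    /* Count leading zeros of a nonzero 64-bit x by binary search. */
    int
    clz64 (unsigned long long x)
    {
      int cnt = 0;
      if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
      if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
      if ((x >> 56) == 0) { x <<= 8;  cnt += 8; }
      if ((x >> 60) == 0) { x <<= 4;  cnt += 4; }
      if ((x >> 62) == 0) { x <<= 2;  cnt += 2; }
      if ((x >> 63) == 0) { cnt += 1; }
      return cnt;
    }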
- mp?_out_raw and mp?_inp_raw.
- mpz_get_nth_ui. Return the nth word (not necessarily the nth limb).
- mpf_getlimbn similar to mpz_getlimbn, and accepting negative N for fraction limbs. Functions to get the range of limbs would be wanted before this would be useful.
- mpz_crr (Chinese Remainder Reconstruction).
- mpz_init and mpq_init could do lazy allocation. Set ALLOC(var) to 0, and have mpz_realloc special-handle that case. Would mean the size required for the first use is the first allocated, rather than allocating a small size and then reallocing it. Update functions that rely on a single limb, like mpz_set_ui, mpz_{t,f,c}div_{qr,r}_ui, and others. Would need the initial z->_mp_d to point to a dummy initial location, so it can be fetched from by mpz_odd_p and similar macros. A sketch follows.
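A minimal sketch of the lazy scheme (hypothetical; ALLOC/SIZ/PTR stand for the usual gmp-impl.h field accessors, and the shared dummy limb is an assumption):

    static mp_limb_t dummy_limb;      /* never written through _mp_d */

    void
    mpz_init_lazy (mpz_ptr x)         /* hypothetical name */
    {
      ALLOC (x) = 0;                  /* 0 marks "nothing allocated yet" */
      SIZ (x) = 0;
      PTR (x) = &dummy_limb;          /* keeps mpz_odd_p etc safe to read */
    }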
- mpf_out_raw and mpf_inp_raw. Make sure the format is portable between 32-bit and 64-bit machines, and between little-endian and big-endian machines.
- gmp_errno. It should be possible to detect errors at the start of documented entrypoints and invoke an exception handler with a nice clean state, giving it the chance to recover with longjmp or a C++ throw or whatever.
- Allow applications' allocation functions to longjmp or throw a C++ exception on being out of memory (or out of the memory they care to offer GMP). This would need GMP to ensure all variables are in a sensible state whenever attempting an alloc or realloc, and would need some way to record other temporary memory or temporary variables in use which should be freed. The latter might be difficult; perhaps the C++ destructor mechanism would be necessary.
- gmp_fprintf, gmp_sprintf, and gmp_snprintf. Think about some sort of wrapper around printf so it and its several variants don't have to be completely reimplemented. Variants like mpz_printf etc supporting only mpz etc could exist to keep down the amount of library dragged in.
- mpq input and output functions.
- mpn_and_n ... mpn_copyd: Make the mpn logops and copys available in gmp.h, either as library functions or inlines, with the availability of library functions instantiated in the generated gmp.h at build time.
- mpz_set_str etc variants taking string lengths rather than null-terminators.
- Threshold tests should use "<=" rather than "<", so a threshold can be set to MP_SIZE_T_MAX to get only the simpler code (the compiler will know size <= MP_SIZE_T_MAX is always true); see the sketch below. Alternately it looks like the ABOVE_THRESHOLD and BELOW_THRESHOLD macros can do this adequately, and also pick up cases where a threshold of zero should mean only the second algorithm.
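A sketch of the "<=" style (foo and its threshold are made up for illustration):

    extern void foo_basecase (mp_size_t);
    extern void foo_fancy (mp_size_t);

    #define FOO_THRESHOLD  MP_SIZE_T_MAX    /* disables foo_fancy entirely */

    void
    foo (mp_size_t size)
    {
      if (size <= FOO_THRESHOLD)    /* always true here, so the compiler
                                       can drop the else branch */
        foo_basecase (size);
      else
        foo_fancy (size);
    }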
- mpz_powm could handle negative exponents by powering the modular inverse of the base. But what to do if an inverse doesn't exist (base and modulus have common factors)? Throw a divide by zero maybe, or return a flag like mpz_invert does.
- mpz_nthprime.
- mpn_divexact_1 could be created and used for an mpz_divexact_ui and perhaps other purposes. Work on this is in progress.
- mpz_init2, initializing and making initial room for N bits. The actual size would be rounded up to a limb, and perhaps an extra limb added, since so many mpz routines need that on their destination. Usage might look like the sketch below.
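A usage sketch, assuming the proposed interface takes a bit count:

    mpz_t x;
    mpz_init2 (x, 512);   /* initialize x with room for 512 bits */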
- #ifdef mess in gmp-impl.h. This is simple when doing a native compile, but needs a reliable way to examine object files when cross-compiling. Falling back on a run-time test would be reasonable, if build time tests fail.
- _system_configuration.implementation for more accurate CPU determination.
- udiv should be selected according to the CPU target. Currently floating point ends up being used on all sparcs, which is probably not right for generic V7 and V8.
- m68040 etc should be detected; currently it just gives m68k for everything. In particular, detect m68000 or m68010, since they won't like the 68020 instructions we use under an m68k config.
- umul and udiv code not being used. Check all such for bit rot and then put umul and udiv in $gmp_mpn_functions_optional as "standard optional" objects.
- mpn_umul_ppmm and mpn_udiv_qrnnd have a different parameter order than those functions on other CPUs. It might avoid confusion to have them under different names, maybe mpn_umul_ppmm_r or some such. Prototypes then wouldn't be conditionalized, and the appropriate form could be selected with the HAVE_NATIVE scheme if/when the code switches to use a PROLOGUE style.
- mpz_div and mpz_divmod use rounding analogous to mpz_mod. Document, and list as an incompatibility.
- mpz_invert should call mpn_gcdext directly.
- mpz_urandomm should do something for n<=0 (but what?), and the use of mpz_mod looks doubtful (does it suffice to generate numbers of nbits until getting one <n?).
- random / mrand48.
- The test programs use the main libgmp. This establishes good cross-checks, but it might be better to use simple reference routines where possible. Where it's not possible, some attention could be paid to the order of the tests, so a main libgmp is only used to construct tests once it seems to be good.
- TMP_ALLOC could do a separate malloc for each allocated block, to help a redzoning malloc debugger. Perhaps a config option --enable-alloca=debug.
- ASSERTs at the start of each user-visible mpz/mpq/mpf function to check the validity of each mp?_t parameter, in particular to check they've been mp?_init'ed. This might catch elementary mistakes in user programs. Care would need to be taken over MPZ_TMP_INITed variables used internally. If nothing else, consistency checks like size<=alloc, ptr not NULL, and ptr+size not wrapping around the address space would be possible; one form such checks could take is sketched below.
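A sketch of such a consistency check (the macro itself is hypothetical; the ABSIZ/ALLOC/PTR accessors follow gmp-impl.h conventions):

    /* Sanity-check an mpz_t: data pointer not NULL and the current
       size within the allocated space. */
    #define MPZ_SANITY_CHECK(z)                 \
      do {                                      \
        ASSERT (PTR (z) != NULL);               \
        ASSERT (ABSIZ (z) <= ALLOC (z));        \
      } while (0)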
- Document that unsigned long int is used for bit counts/ranges, and that mp_size_t is used for limb counts.
- mpz_inp_str (etc) doesn't say when it stops reading digits.
- Single-limb division (divrem_1 or mod_1) could be done two limbs at a time, by following udiv_qrnnd_preinv but with two limbs everywhere one is used. This would mean 3 or 4 multiplies in each place 1 is currently done, but they'll be independent, so 2 limbs can be processed with the same latency as 1 would have been. This idea would apply to CPUs with long-latency but good-throughput multipliers. Clearly it can be extended to 3 limbs at a time, or 4, or however many, though each time more are used, more time will be taken creating the initial multi-limb inverse (invert_limb style), and the quadratically increasing number of cross-products will at some point see throughput unable to offer a saving over latency.
- libdir settings, or whatever.
- AC_PROG_F77 if such a file is found. Automake will generate some of what's needed in the makefiles, but libtool doesn't know fortran, and so rules like the current ".asm.lo" will be needed (the multi-language branch of libtool doesn't look like it's got fortran yet either).
- mpz_get_si returns 0x80000000 for -0x100000000, whereas it's sort of supposed to return the low 31 (or 63) bits. But this is undocumented, and perhaps not too important.
- mpz_*_ui division routines currently return abs(a%b). Perhaps make them return the real remainder instead? Return type would be signed long int. But this would be an incompatible change, so it might have to be under newly named functions.
- mpz_init_set* and mpz_realloc could allocate say an extra 16 limbs over what's needed, so as to reduce the chance of having to do a reallocate if the mpz_t grows a bit more. This could only be an option, since it'd badly bloat memory usage in applications using many small values.
Please send comments about this page to tege@swox.com. Copyright 1999, 2000, 2001 Torbjörn Granlund.