These are itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.
- Check _mpz_realloc with a small (1 limb) size.
- Check mpz_XXX(a,a,a), i.e. a routine called with all operands the same.
- For mpf_t numbers with exponents >2^53 on machines with 64-bit mp_exp_t, the precision of __mp_bases[base].chars_per_bit_exactly is insufficient and mpf_get_str aborts. Detect and compensate. Alternately, think seriously about using some sort of fixed-point integer value. Avoiding unnecessary floating point is probably a good thing in general, and it might be faster on some CPUs.
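  For base 10 the fixed-point idea might look like the following sketch (the constant ceil(2^32*log10(2)) = 1292913987 and the helper name are illustrative, not actual GMP internals):

      #include <stdio.h>
      #include <stdint.h>

      /* ceil(2^32 * log10(2)) = 1292913987, rounded up so the estimate
         never falls short.  */
      #define DIGITS_PER_BIT_10  UINT64_C(1292913987)

      /* Upper bound on base-10 digits for a 'bits'-bit number, in pure
         integer arithmetic; valid while bits*DIGITS_PER_BIT_10 fits in
         64 bits, i.e. up to around 2^33 bits.  */
      static uint64_t digits_for_bits_10 (uint64_t bits)
      {
        return ((bits * DIGITS_PER_BIT_10) >> 32) + 1;
      }

      int main (void)
      {
        printf ("%llu\n", (unsigned long long) digits_for_bits_10 (64)); /* 20 */
        return 0;
      }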
- mpf_eq is not always correct, when one operand is 1000000000... and the other operand is 0111111111..., i.e., extremely close. There is a special case in mpf_sub for this situation; put similar code in mpf_eq.
- mpf_eq doesn't implement what gmp.texi specifies. It should not use just whole limbs, but partial limbs.
- _mpz_realloc will reallocate to less than the current size of an mpz. This is ok internally when a destination size is yet to be updated, but from user code it should probably refuse to go below the size of the current value. Could switch internal uses across to a new function, and tighten up _mpz_realloc.
- mpf_set_str doesn't validate its exponent; for instance garbage like 123.456eX789X is accepted (and an exponent 0 used), and overflow of a long is not detected.
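  The exponent parse could be hardened along these lines (a hypothetical helper, decimal exponents only, not the actual mpf_set_str code):

      #include <errno.h>
      #include <stdlib.h>

      /* Parse an exponent, rejecting both garbage and long overflow.
         Returns 0 on success, with *end past the digits consumed.  */
      static int parse_exponent (const char *s, const char **end, long *exp)
      {
        char *e;
        errno = 0;
        *exp = strtol (s, &e, 10);
        if (e == s)              /* no digits at all, eg. "eX789X" */
          return -1;
        if (errno == ERANGE)     /* exponent overflowed a long */
          return -1;
        *end = e;
        return 0;
      }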
- mpf_add doesn't check for a carry from truncated portions of the inputs, and in that respect doesn't implement the "infinite precision followed by truncate" specified in the manual.
- mpf_div of x/x doesn't always give 1, reported by Peter Moulder. Perhaps it suffices to put +1 on the effective divisor prec, so that data bits rather than zeros are shifted in when normalizing. Would prefer to switch to mpn_tdiv_qr, where all shifting should disappear.
- Tables of function pointers to mpz_add etc don't work when those routines are coming from a DLL (because they're effectively function pointer global variables themselves). Need to rearrange, perhaps to a set of calls to a test function rather than iterating over an array.
- Automatic variables in main might be clobbered by the longjmp.
- mpn_gcdext, mpz_get_d, mpf_get_str: Don't test count_leading_zeros for zero; instead check the high bit of the operand and avoid invoking count_leading_zeros. This is an optimization on all machines, and significant on machines with slow count_leading_zeros, though it's possible an already normalized operand might not be encountered very often.
- Rewrite umul_ppmm to use floating-point for generating the most significant limb (if BITS_PER_MP_LIMB <= 52 bits). (Peter Montgomery has some ideas on this subject.)
- Improve the generic umul_ppmm code in longlong.h: add partial products with fewer operations.
- mpz_set_ui could be inlined. This would be both small and fast, especially for compile-time constants, but would make application binaries depend on having 1 limb allocated to an mpz_t, preventing the "lazy" allocation scheme below.
- mpz_[cft]div_ui and maybe mpz_[cft]div_r_ui could be inlined too. A __gmp_divide_by_zero would be needed for the divide by zero test, unless that could be left to mpn_mod_1 (not sure currently whether all the risc chips provoke the right exception there if using mul-by-inverse).
- mpz_fits_s*_p could be inlined as well. The setups for LONG_MAX etc would need to go into gmp.h, and on Cray it might, unfortunately, be necessary to forcibly include <limits.h> since there's no apparent way to get SHRT_MAX with an expression (since short and unsigned short can be different sizes).
- mpz_powm and mpz_powm_ui aren't very fast on one or two limb moduli, due to a lot of function call overheads. These could perhaps be handled as special cases.
- mpz_powm and mpz_powm_ui want better algorithm selection, and the latter should use REDC. Both could change to use an mpn_powm and mpn_redc.
- mpz_powm REDC should do multiplications by g[] using the division method when they're small, since the REDC form of a small multiplier is normally a full size product. Probably would need a new tuned parameter to say what size multiplier is "small", as a function of the size of the modulus.
- mpz_powm REDC should handle even moduli if possible. Maybe this would mean for m=n*2^k doing mod n using REDC and an auxiliary calculation mod 2^k, then putting them together at the end.
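  The recombination at the end would be a one-off CRT step; a single-limb toy version, with the usual Newton iteration for the inverse of the odd part mod 2^k (illustrative code, not GMP's):

      #include <stdio.h>
      #include <stdint.h>

      /* Combine r1 = a mod n (n odd) and r2 = a mod 2^k into a mod (n*2^k). */
      static uint32_t crt_combine (uint32_t r1, uint32_t r2,
                                   uint32_t n, unsigned k)
      {
        uint32_t mask = ((uint32_t) 1 << k) - 1;
        uint32_t inv = n;        /* inverse of n mod 2^k: correct to 3 bits */
        inv *= 2 - n * inv;      /* 6 bits */
        inv *= 2 - n * inv;      /* 12 bits */
        inv *= 2 - n * inv;      /* 24 bits */
        inv *= 2 - n * inv;      /* 48 >= 32 bits */
        uint32_t t = ((r2 - r1) * inv) & mask;
        return r1 + n * t;       /* < n*2^k, congruent to r1 mod n, r2 mod 2^k */
      }

      int main (void)
      {
        uint32_t a = 123456789, n = 1000003;  unsigned k = 6;
        uint32_t m = n << k;
        printf ("%u %u\n", crt_combine (a % n, a & ((1u << k) - 1), n, k), a % m);
        return 0;
      }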
- mpn_gcd might be able to be sped up on small to moderate sizes by improving find_a, possibly just by providing an alternate implementation for CPUs with slowish count_leading_zeros.
- The USE_MORE_MPN code could use a low to high cache localized evaluate and interpolate. The necessary mpn_divexact_by3c exists.
- mpn_mul_basecase on NxM with big N but small M could try for better cache locality by taking N piece by piece. The current code could be left available for CPUs without caching. Depending how karatsuba etc is applied to unequal size operands it might be possible to assume M is always smallish.
- mpn_perfect_square_p on small operands might be better off skipping the residue tests and just taking a square root.
- mpz_perfect_power_p could be improved in a number of ways. Test for Nth power residues modulo small primes like mpn_perfect_square_p does. Use p-adic arithmetic to find possible roots. Divisibility by other primes should be tested by grouping into a limb like PP.
- mpz_perfect_power_p might like to use mpn_gcd_1 instead of a private GCD routine. The use it's put to isn't time-critical, and it might help ensure correctness to use the main GCD routine.
- mpz_perfect_power_p could use mpz_divisible_ui_p instead of mpz_tdiv_ui for divisibility testing; the former is faster on a number of systems. (But all that prime test stuff is going to be rewritten some time.)
- Change PP/PP_INVERTED into an array of such pairs, listing several hundred primes. Perhaps actually make the products larger than one limb each.
- PP can have factors of 2 introduced in order to get the high bit set and therefore a PP_INVERTED existing. The factors of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used; further remainders r%x, r%y, etc, are the same since x, y, etc are odd. The advantage of this is that mpn_preinv_mod_1 can then be used if it's faster than plain mpn_mod_1. This would be a change only for 16-bit limbs; all the rest already have PP in the right form.
- PP could have extra factors of 3 or 5 or whatever introduced if they fit, and final remainders mod 9 or 25 or whatever used, thereby making more efficient use of the mpn_mod_1 done. On a 16-bit limb it looks like PP could take an extra factor of 3.
- mpz_probab_prime_p, mpn_perfect_square_p and mpz_perfect_power_p could use mpn_mod_34lsub1 to take a remainder mod 2^24-1 or 2^48-1 and quickly get remainders mod 3, 5, 7, 13 and 17 (factors of 2^24-1). This could either replace the PP division currently done, or allow PP to do larger primes, depending how many residue tests seem worthwhile before launching into full root extractions or Miller-Rabin etc.
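  A small check of the idea, with plain % standing in for mpn_mod_34lsub1 (2^24-1 = 3^2 * 5 * 7 * 13 * 17 * 241, so one remainder yields all the small ones):

      #include <stdio.h>
      #include <stdint.h>

      int main (void)
      {
        uint64_t x = 987654321098765ULL;
        uint32_t r = x % 0xFFFFFF;    /* stand-in for mpn_mod_34lsub1 */
        printf ("%u %u\n", (unsigned) (r % 7),  (unsigned) (x % 7));   /* equal */
        printf ("%u %u\n", (unsigned) (r % 13), (unsigned) (x % 13));  /* equal */
        return 0;
      }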
- mpz_probab_prime_p (and maybe others) could code the divisibility tests like n%7 == 0 in the form

      #define MP_LIMB_DIVISIBLE_7_P(n) \
        ((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)

  This would help compilers which don't know how to optimize divisions by constants, and would help current gcc (3.0) too, since gcc forms a whole remainder rather than using a modular inverse and comparing. This technique works for any odd modulus, and with some tweaks for even moduli too. See Granlund and Montgomery, "Division by Invariant Integers using Multiplication", section 9.
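  A quick self-check of the trick on 32-bit limbs (constants worked out here rather than taken from GMP: 7 * 0xB6DB6DB7 = 5*2^32 + 1, so 0xB6DB6DB7 is the inverse of 7 mod 2^32):

      #include <stdio.h>
      #include <stdint.h>

      #define MODLIMB_INVERSE_7  0xB6DB6DB7u
      #define DIVISIBLE_7_P(n)   ((uint32_t) ((n) * MODLIMB_INVERSE_7) \
                                  <= 0xFFFFFFFFu / 7)
      int main (void)
      {
        uint32_t n;
        for (n = 0; n < 100000; n++)
          if (DIVISIBLE_7_P (n) != (n % 7 == 0))
            printf ("fail at %u\n", n);   /* never triggers */
        return 0;
      }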
- mpz_probab_prime_p and mpz_nextprime could offer certainty for primes up to 2^32 by using a one limb Miller-Rabin test to base 2, combined with a table of actual strong pseudoprimes in that range (2314 of them). If that table is too big then both base 2 and base 3 tests could be done, leaving a table of 104. The test could use REDC and therefore be a modlimb_invert, a remainder (maybe), then two multiplies per bit (successively dependent). Processors with pipelined multipliers could do base 2 and 3 in parallel. Vector systems could do a whole bunch of bases in parallel, and perhaps offer near certainty up to 64-bits (certainty might depend on an exhaustive search of pseudoprimes up to that limit). Obviously 2^32 is not a big number, but an efficient and certain calculation is attractive. It might find other uses internally, and could even be offered as a one limb prime test, mpn_probab_prime_1_p or gmp_probab_prime_ui_p perhaps.
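  As a baseline, a plain (non-REDC) strong-probable-prime test to base 2 for one 32-bit limb, runnable with ordinary 64-bit arithmetic (a sketch, not GMP code):

      #include <stdio.h>
      #include <stdint.h>

      static int sprp2 (uint32_t n)
      {
        if (n < 4) return (n == 2 || n == 3);
        if ((n & 1) == 0) return 0;
        uint32_t d = n - 1; unsigned s = 0;
        while ((d & 1) == 0) { d >>= 1; s++; }   /* n-1 = d * 2^s */
        uint64_t x = 1, b = 2;
        for (uint32_t e = d; e; e >>= 1)         /* x = 2^d mod n */
          {
            if (e & 1) x = x * b % n;
            b = b * b % n;
          }
        if (x == 1 || x == n - 1) return 1;
        while (--s)
          {
            x = x * x % n;
            if (x == n - 1) return 1;
          }
        return 0;
      }

      int main (void)
      {
        /* 2^31-1 is prime; 2047 = 23*89 is a base-2 strong pseudoprime,
           illustrating why the table is needed.  Prints "1 1".  */
        printf ("%d %d\n", sprp2 (2147483647u), sprp2 (2047));
        return 0;
      }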
- mpz_probab_prime_p doesn't need to make a copy of n when the input is negative; it can set up an mpz_t alias, same data pointer but a positive size. With no need to clear before returning, the recursive function call could be dispensed with too.
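  The alias might be set up like this, using the public field names from gmp.h (read-only use assumed; illustrative, not the actual implementation):

      #include <gmp.h>
      #include <stdlib.h>

      static void with_abs_alias (const mpz_t n)
      {
        mpz_t u;
        u->_mp_alloc = n->_mp_alloc;
        u->_mp_size  = abs (n->_mp_size);  /* force positive sign */
        u->_mp_d     = n->_mp_d;           /* share the data, no copy */
        /* ... use u read-only; no mpz_clear needed ... */
      }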
- mpf_set_str produces low zero limbs when a string has a fraction but is exactly representable, eg. 0.5 in decimal. These could be stripped to save work in later operations.
- mpz_and, mpz_ior and mpz_xor should use mpn_and_n etc for the benefit of the small number of targets with native versions of those routines. Need to be careful not to pass size==0. Is some code sharing possible between the mpz routines?
- mpf_add: Don't do a copy to avoid overlapping operands unless it's really necessary (currently only sizes are tested, not whether r really is u or v).
- mpf_add: Under the check for v having no effect on the result, perhaps test for r==u and do nothing in that case; currently it looks like an MPN_COPY_INCR will be done to reduce prec+1 limbs to prec.
- mpn_divrem_2 could usefully accept unnormalized divisors and shift the dividend on-the-fly, since this should cost nothing on superscalar processors and avoid the need for temporary copying in mpn_tdiv_qr.
- mpf_sqrt_ui calculates prec+1 limbs, whereas just prec would satisfy the application requested precision. It should suffice to simply reduce the rsize temporary to 2*prec-1 limbs. mpf_sqrt might be similar.
- invert_limb generic C: The division could use dividend b*(b-d)-1, which is high:low of (b-1-d):(b-1), instead of the current (b-d):0, where b=2^BITS_PER_MP_LIMB and d=divisor. The former is per the original paper and is used in the x86 code; the advantage is that the current special case for 0x80..00 could be dropped. The two should be equivalent, but a little check of that would be wanted.
- mpq_cmp_ui could form the num1*den2 and num2*den1 products limb-by-limb from high to low and look at each step for values differing by more than the possible carry bit from the uncalculated portion.
- mpq_cmp could do the same high-to-low progressive multiply and compare. The benefits of karatsuba and higher multiplication algorithms are lost, but if it's assumed only a few high limbs will be needed to determine an order then that's fine.
- mpn_add_1, mpn_sub_1, mpn_add, mpn_sub: Internally use __GMPN_ADD_1 etc instead of the functions, so they get inlined on all compilers, not just gcc and others with inline recognised in gmp.h. __GMPN_ADD_1 etc are meant mostly to support application inline mpn_add_1 etc, and if they don't come out good for internal uses then special forms can be introduced, for instance many internal uses are in-place. Sometimes a block of code is executed based on the carry-out, rather than using it arithmetically, and those places might want to do their own loops entirely.
- __gmp_extract_double on 64-bit systems could use just one bitfield for the mantissa extraction, not two, when endianness permits. Might depend on the compiler allowing long long bit fields when that's the only actual 64-bit type.
- mpf_get_d could be more like mpz_get_d and do more in integers, giving the float conversion as such a chance to round in its preferred direction. Some code sharing ought to be possible. Or if nothing else then for consistency the two ought to give identical results on integer operands (not clear if this is so right now).
- usqr_ppmm or some such could do a widening square in the style of umul_ppmm. This would help 68000, and be a small improvement for the generic C (which is used on UltraSPARC/64 for instance). GCC recognises that the generic C ul*vh and vl*uh are identical, but does two separate additions to the rest of the result.
- TMP_FREE currently releases all memory, so there's an allocate and free every time a top-level function using TMP is called; some memory could be kept cached instead. Would need mp_set_memory_functions to tell tal-notreent.c to release any cached memory when changing allocation functions though.
- __gmp_tmp_alloc from tal-notreent.c could be partially inlined. If the current chunk has enough room then a couple of pointers can be updated. Only if more space is required would a call to some sort of __gmp_tmp_increase be needed. The requirement that TMP_ALLOC is an expression might make the implementation a bit ugly and/or a bit sub-optimal.

      #define TMP_ALLOC(n)                                   \
        ((ROUND_UP(n) > current->end - current->point        \
            ? __gmp_tmp_increase (ROUND_UP (n)) : 0),        \
         current->point += ROUND_UP (n),                     \
         current->point - ROUND_UP (n))
- __mp_bases has a lot of data for bases which are pretty much never used. Perhaps the table should just go up to base 16, with code to generate data above that if and when required. Naturally this assumes the code would be smaller than the data saved.
- The __mp_bases field big_base_inverted is only used if USE_PREINV_DIVREM_1 is true, and could be omitted otherwise, to save space.
- Make mpf_get_str and mpf_set_str call the corresponding, much faster, mpn functions.
- mpn_mod_1 could pre-calculate values of R mod N, R^2 mod N, R^3 mod N, etc, with R=2^BITS_PER_MP_LIMB, and use them to process multiple limbs at each step by multiplying. Suggested by Peter L. Montgomery.
- mpz_get_str, mtox: For power-of-2 bases, which are of course fast, it seems a little silly to make a second pass over the mpn_get_str output to convert to ASCII. Perhaps combine that with the bit extractions.
- mpz_gcdext: If the caller requests only the S cofactor (of A), and A<B, then the code ends up generating the cofactor T (of B) and deriving S from that. Perhaps it'd be possible to arrange to get S in the first place by calling mpn_gcdext with A+B,B. This might only be an advantage if A and B are about the same size.
- mpn_toom3_mul_n, mpn_toom3_sqr_n: Temporaries B and D are adjacent in memory and at the final coefficient addition look like they could use a single mpn_add_n of l4 limbs rather than two of l2 limbs.
- udiv_qrnnd_preinv2norm, the branch-free version of udiv_qrnnd_preinv, might be faster on various pipelined chips. In particular the first if (_xh != 0) in udiv_qrnnd_preinv might be roughly a 50/50 chance and might branch predict poorly. (The second test is probably almost always false.) Measuring with the tuneup program would be possible, but perhaps a bit messy. In any case maybe the default should be the branch-free version.
- The udiv_qrnnd_preinv2norm implementation assumes a right shift will sign extend, which is not guaranteed by the C standards, and doesn't happen on Cray vector systems.
- The tuned threshold values could be selected according to the compiler and its version, along the lines of

      #if defined (__GNUC__)
      #if __GNUC__ == 2 && __GNUC_MINOR__ == 7
      ...
      #endif
      #if __GNUC__ == 2 && __GNUC_MINOR__ == 8
      ...
      #endif
      #ifndef MUL_KARATSUBA_THRESHOLD
      /* Default GNUC values */
      ...
      #endif
      #else /* system compiler */
      ...
      #endif
- invert_limb on various processors might benefit from the little Newton iteration done for alpha and ia64.
- Improve mpn_mul_1, mpn_addmul_1, and mpn_submul_1.
- Rewrite mpn_mul_1, mpn_addmul_1, and mpn_submul_1 for the 21164, using both integer multiplies and floating-point multiplies. For the floating-point operations, the single-limb multiplier should be split into three 21-bit chunks, or perhaps even better into four 16-bit chunks. Probably possible to reach 9 cycles/limb.
- Use ctlz and cttz for count_leading_zeros and count_trailing_zeros. Use inline for gcc; probably want asm files for elsewhere.
- umul_ppmm could call __umulsidi3 in libgcc. The code could be copied straight across, but perhaps ought to be tested.
- The clz instruction can be used for count_leading_zeros.
- mpn_divexact_by3 isn't particularly important, but the generic C runs at about 27 c/l, whereas with the multiplies off the dependent chain about 3 c/l ought to be possible.
- An mpn_hamdist could be put together based on the current mpn_popcount.
- popc_limb in gmp-impl.h could use the popcnt insn.
- mpn_submul_1 is not implemented directly, only via a combination of mpn_mul_1 and mpn_sub_n.
- Special loops for mpn_mul_1 and mpn_addmul_1 with s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk). Not sure how much this can improve the speed, though, since the symmetry that we rely on is lost. Perhaps we can just gain cycles when s2 < 2^16, or more accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
- Write mpn_submul_1, analogous to mpn_addmul_1.
- Improve umul_ppmm. Using four "mulx"s either with an asm block or via the generic C code is about 90 cycles. Try using fp operations, and also try using karatsuba for just three "mulx"s.
- mpn_divrem_1, mpn_mod_1, mpn_divexact_1 and mpn_modexact_1_odd could process 32 bits at a time when the divisor fits in 32 bits. This would need only 4 mulx's per limb instead of 8 in the general case.
- mpn_lshift, mpn_rshift: will give 2 cycles/limb. Trivial modifications of mpn/sparc64 should do.
- Use mulx for umul_ppmm if possible (see the commented-out code in longlong.h). This is unlikely to save more than a couple of cycles, so perhaps isn't worth bothering with.
- gcc doesn't define __sparc_v9__ or anything to indicate V9 support when -mcpu=v9 is selected. See gcc/config/sol2-sld-64.h. Will need to pass something through from ./configure to select the right code in longlong.h. (Currently nothing is lost because mulx for multiplying is commented out.)
- modlimb_invert might save a few cycles by masking down to just the useful bits at each point in the calculation, since mulx speed depends on the highest bit set. Either explicit masks or small types like short and int ought to work.
- mpn_popcount and mpn_hamdist could use the popc currently commented out in gmp-impl.h. This chip reputedly implements popc properly (see gcc sparc.md); would need to recognise the chip as sparchalr1 or something in configure / config.sub / config.guess.
- Improve mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current code runs at 11 cycles/limb. It should be possible to saturate the cache, which will happen at 8 cycles/limb (7.5 for mpn_mul_1). Write special loops for s2 < 2^32; it should be possible to make them run at about 5 cycles/limb.
- Rewrite mpn_addmul_1, mpn_submul_1, and mpn_mul_1 to use both integer and floating-point operations, possibly two floating-point and one integer limb per loop. Split operands into four 16-bit chunks for fast fp operations. Should easily reach 9 cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb (using one int + two fp).
- mpn_rshift could do the same sort of unrolled loop as mpn_lshift. Some judicious use of m4 might let the two share source code, or with a register to control the loop direction perhaps even share object code.
- mpn_rshift should do the same sort of unrolled loop as mpn_lshift.
- Write mpn_mul_basecase and mpn_sqr_basecase for important machines. Helping the generic sqr_basecase.c with an mpn_sqr_diagonal might be enough for some of the RISCs.
- Improve mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
- Special code for mpn_lshift shifts by 1. (See the Pentium code.)
- An inlined rep movs would reportedly upset GCC register allocation for the whole function. Is this still true in GCC 3? It uses rep movs itself for __builtin_memcpy. Examine the code for some simple and complex functions to find out. Inlining rep movs would be desirable; it'd be both smaller and faster.
- mpn_lshift and mpn_rshift can come down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after shrdl and shldl; see mpn/x86/pentium/README.
- mpn_divexact_1 and mpn_modexact_1_odd on 16-bit divisors could use MMX multiplies and run at around 16 cycles (down from 23).
- mpn_lshift and mpn_rshift might benefit from some destination prefetching.
- mpn_divrem_1 might be able to use a mul-by-inverse, hoping for maybe 30 c/l.
- mpn_add_n and mpn_sub_n should be able to go faster than the generic x86 code at 3.5 c/l. The athlon code for instance runs at about 2.7.
- mpn_lshift and mpn_rshift might be able to do something branch-free for unaligned startups, and shaving one insn from the loop with alternative indexing might save a cycle.
- mpn_lshift: the pipeline is now extremely deep, perhaps unnecessarily deep.
- Write mpn_rshift based on the new mpn_lshift.
- Rewrite mpn_add_n and mpn_sub_n. Should run at just 3.25 cycles/limb.
- Write mpn_mul_basecase and mpn_sqr_basecase, using a "vertical multiplication method" to avoid carry propagation, splitting one of the operands into 11-bit chunks.
- mpn_lshift by 31 should use the special rshift by 1 code, and vice versa mpn_rshift by 31 should use the special lshift by 1. This would be best as a jump across to the other routine; both could live in lshift.asm, omitting rshift.asm, on finding mpn_rshift already provided.
- mpn_com_n and mpn_and_n etc very probably want a pragma like MPN_COPY_INCR.
- mpn_lshift, mpn_rshift, mpn_popcount and mpn_hamdist are nice and small and could be inlined to avoid function calls.
- TMP_ALLOC could be converted to use automatic variable-length arrays, or a new scheme introduced. Memory blocks wanted unconditionally are easy enough; those wanted only sometimes are a problem. Perhaps a special size calculation to ask for a dummy length 1 when unwanted, or perhaps an inlined subroutine duplicating code under each conditional. Don't really want to turn everything into a dog's dinner just because Cray doesn't offer an alloca.
- mpn_get_str on power-of-2 bases ought to vectorize. Does it? bits_per_digit and the inner loop over bits in a limb might prevent it. Perhaps special cases for binary, octal and hex would be worthwhile (very possibly for all processors too).
- popc_limb could use the Cray _popc intrinsic. That would help mpz_hamdist, and might make the generic C versions of mpn_popcount and mpn_hamdist suffice for Cray (if it vectorizes, or can be given a hint to do so).
- mpn_mul_1 could check for a 16-bit multiplier and use two multiplies per limb, not four. Ditto mpn_addmul_1 and mpn_submul_1.
- mpn_lshift and mpn_rshift could use a roll and mask instead of lsrl and lsll. This promises to be a speedup, effectively trading a 6+2*n shift for one or two 4 cycle masks. Suggested by Jean-Charles Meyrignac.
- Improve count_leading_zeros for 64-bit machines:

      if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
      if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
      ...
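  Filled out, the binary search might read as follows (a sketch; the real macro would follow the count_leading_zeros conventions in longlong.h):

      #include <stdint.h>

      static int clz64 (uint64_t x)   /* assumes x != 0 */
      {
        int cnt = 0;
        if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
        if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
        if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
        if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
        if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
        if ((x >> 63) == 0) { cnt += 1; }
        return cnt;
      }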
- Some compilers have an __inline which could perhaps be used in __GMP_EXTERN_INLINE. What would be the right way to identify suitable versions of such a compiler?
- VAX D and G format double floats are straightforward and could perhaps be handled directly in __gmp_extract_double and maybe in mpz_get_d, rather than falling back on the generic code. GCC defines __GFLOAT when -mg has selected G format (which would be possible via a user CFLAGS).
- mpn_get_str final divisions by the base with udiv_qrnd_unnorm could use some sort of multiply-by-inverse on suitable machines. This ends up happening for decimal by presenting the compiler with a run-time constant, but the same for other bases would be good. Perhaps use could be made of the fact base<256.
- A full family of mp?_out_raw and mp?_inp_raw functions.
- mpz_get_nth_ui: return the nth word (not necessarily the nth limb).
- mpz_crr (Chinese Remainder Reconstruction).
- mpz_init and mpq_init could do lazy allocation. Set ALLOC(var) to 0 to indicate nothing allocated, and let _mpz_realloc do the initial alloc. Set z->_mp_d to a dummy that mpz_get_ui and similar can unconditionally fetch from. Niels Möller has had a go at this. The advantages of this scheme would be:
  - Avoiding the typical pattern of mpz_init allocating and then more or less immediately reallocating.
  - mpz_init would only store magic values in the mpz_t fields, and could be inlined.
  - A fixed initializer could even be used by applications, like mpz_t z = MPZ_INITIALIZER;, which might be convenient for globals.
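  A hypothetical MPZ_INITIALIZER, relying on the __mpz_struct field order in gmp.h (_mp_alloc, _mp_size, _mp_d) and an invented dummy limb:

      #include <gmp.h>

      static mp_limb_t __gmp_dummy_limb = 0;   /* invented name */
      #define MPZ_INITIALIZER  {{ 0, 0, &__gmp_dummy_limb }}

      mpz_t z = MPZ_INITIALIZER;   /* reads as 0 without mpz_init */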
- mpz_set_ui and other similar routines needn't check the size allocated and can just store unconditionally.
- mpz_set_ui and perhaps others like mpz_tdiv_r_ui and a prospective mpz_set_ull could be inlined.
- Add mpf_out_raw and mpf_inp_raw. Make sure the format is portable between 32-bit and 64-bit machines, and between little-endian and big-endian machines.
- mpn_and_n ... mpn_copyd: Perhaps make the mpn logops and copys available in gmp.h, either as library functions or inlines, with the availability of library functions instantiated in the generated gmp.h at build time.
- mpz_set_str etc variants taking string lengths rather than null-terminators.
- Use "<=" rather than "<" in the threshold tests, so a threshold can be set to MP_SIZE_T_MAX to get only the simpler code (the compiler will know size <= MP_SIZE_T_MAX is always true). Alternately it looks like the ABOVE_THRESHOLD and BELOW_THRESHOLD macros can do this adequately, and also pick up cases where a threshold of zero should mean only the second algorithm.
- mpz_nthprime.
- Add mpz_init2, initializing and making initial room for N bits. The actual size would be rounded up to a limb, and perhaps an extra limb added since so many mpz routines need that on their destination.
- mpz_andn, mpz_iorn, mpz_nand, mpz_nior, mpz_xnor might be useful additions, if they could share code with the current such functions (which should be possible).
- mpz_and_ui etc might be of use sometimes. Suggested by Niels Möller.
- mpf_set_str and mpf_inp_str could usefully accept 0x, 0b etc when base==0. Perhaps the exponent could default to decimal in this case, with a further 0x, 0b etc allowed there. Eg. 0xFFAA@0x5A. A leading "0" for octal would match the integers, but probably something like "0.123" ought not mean octal.
- GMP_LONG_LONG_LIMB or some such could become a documented feature of gmp.h, so applications could know whether to printf a limb using %lu or %Lu.
- PRIdMP_LIMB and similar defines following C99 <inttypes.h> might be of use to applications printing limbs. Perhaps they should be defined only if specifically requested, the way <inttypes.h> does. But if GMP_LONG_LONG_LIMB or whatever is added then perhaps this can easily enough be left to applications.
- mpf_get_ld and mpf_set_ld converting mpf_t to and from long double. Other long double routines would be desirable too, but these would be a start. Often long double is the same as double, which is easy but pretty pointless. Should recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on sparc, hppa and power. Might like an ABI sub-option or something when it's a compiler option for 64-bit or 128-bit long double.
- gmp_printf could usefully accept an arbitrary base, for both integer and float conversions. Either a number in the format string or a * to take a parameter should be allowed. Maybe &13b (b for base) or something like that.
- gmp_printf could perhaps have a type code for an mp_limb_t. That would save an application from having to worry whether it's a long or a long long.
- gmp_printf could perhaps accept mpq_t for float conversions, eg. "%.4Qf". This would be merely for convenience, but still might be useful. Rounding would be the same as for an mpf_t (ie. currently round-to-nearest, but not actually documented). Alternately, perhaps a separate mpq_get_str_point or some such might be more use. Suggested by Pedro Gimeno.
- gmp_printf could usefully accept a flag to control the rounding of float conversions. This wouldn't do much for mpf_t, but would be good if mpfr_t were supported in the future, or perhaps for mpq_t. Something like &*r (r for rounding, with an mpfr style GMP_RND parameter).
- mpz_togglebit or mpz_chgbit or some such might be a good companion to mpz_setbit and mpz_clrbit. Suggested by Niels Möller.
- mpz_scan0_reverse or mpz_scan0low or some such searching towards the low end of an integer might match mpz_scan0 nicely. Likewise for scan1. Suggested by Roberto Bagnara.
- mpz_bit_subset or some such to test whether one integer is a bitwise subset of another might be of use. Some sort of return value indicating whether it's a proper or non-proper subset would be good and wouldn't cost anything in the implementation. Suggested by Roberto Bagnara.
- gmp_randinit_r and maybe gmp_randstate_set to init-and-copy or to just copy a gmp_randstate_t. Suggested by Pedro Gimeno.
- mpf_get_ld, mpf_set_ld: Conversions between mpf_t and long double, suggested by Dan Christensen. There'd be some work to be done by configure to recognise the format in use. xlc on aix for instance apparently has an option for either plain double 64-bit or quad 128-bit precision. This might mean library contents vary with the compiler used to build, which is undesirable. It might be possible to detect the mode the application is compiling with, and try to avoid mismatch problems.
- mpz_sqrt_if_perfect_square: When mpz_perfect_square_p does its tests it calculates a square root and then discards it. For some applications it might be useful to return that root. Suggested by Jason Moxham.
- mpz_get_ull, mpz_set_ull, mpz_get_sll, mpz_set_sll: Conversions for long long. These would aid interoperability, though a mixture of GMP and long long would probably not be too common. Disadvantages of using long long in libgmp.a would be:
  - gmp.h would need an #ifdef block to decide if the application compiler could take the long long prototypes. A LIBGMP_HAS_LONGLONG would be wanted to indicate whether the functions are available. (Applications using autoconf could probe the library too.)
  - Alternately, the use of long long could be deferred to application compile time, by having something like mpz_set_2ui called with two halves of a long long (long long is normally passed as two halves anyway). Under such a scheme mpz_get_ull would be a rather big inline, or would have to be two function calls, and mpz_get_sll would be a worse inline, and would put the treatment of -0x10..00 into applications (see mpz_get_si correctness above).
  Plain use of long long is probably the lesser evil, if only because it makes best use of gcc.
- mpz_strtoz parsing the same as strtol. Suggested by Alexander Kruppa.
- Detect the floating point format at build time, to get rid of the #ifdef mess in gmp-impl.h. This is simple when doing a native compile, but needs a reliable way to examine object files when cross-compiling. Falling back on a run-time test would be reasonable, if build time tests fail.
- The arm umul_ppmm in longlong.h always uses umull, but is that available only for M series chips or some such? Perhaps it should be configured in some way.
- ia64-*-hpux*: does GMP need to know anything about that?
- Determine the endianness at configure time, for a possible mpz_inp_raw and mpz_out_raw optimization. It's hoped AC_C_BIGENDIAN can be relied on to interrogate the compiler.
- Some setups are currently keyed merely to *-*-aix*. It might be more reliable to do some sort of feature test or to examine the compiler output. It might also be nice to merge the aix.m4 files into powerpc-defs.m4.
- Some sparc chips have umul but not udiv, or something like that. Check the mpn/sparc32/v8 code is suitable, and add -mcpu= options for gcc.
- The right udiv should be selected according to the CPU target. Currently floating point ends up being used on all sparcs, which is probably not right for generic V7 and V8.
- Using -xtarget=native with cc is incorrect when cross-compiling; the target should be set according to the configured $host CPU.
- Use .balignw <n>,0x4e7f to get nops, if ALIGN is going to be used for anything that's executed across.
- Some CPUs have umul and udiv code not being used. Check all such for bit rot and then put umul and udiv in $gmp_mpn_functions_optional as "standard optional" objects.
- Some mpn_umul_ppmm and mpn_udiv_qrnnd implementations have a different parameter order than those functions on other CPUs. It might avoid confusion to have them under different names, maybe mpn_umul_ppmm_r or some such. Prototypes then wouldn't be conditionalized, and the appropriate form could be selected with the HAVE_NATIVE scheme if/when the code switches to use a PROLOGUE style.
- DItype: The setup in gmp-impl.h for non-GCC could use an autoconf test to determine whether long long is available.
- AC_OUTPUT would work, but it might upset "make" to have things like L$ get into the Makefiles through AC_SUBST. AC_CONFIG_COMMANDS would be the alternative. With some careful m4 quoting the changequote calls might not be needed, which might free up the order in which things had to be output.
- make distclean: Only the mpn directory links which were created are removed, but perhaps all possible links should be removed, in case someone runs configure a second time without a distclean in between. The only tricky part would be making sure all possible extra_functions are covered.
- gcc's -mno-cygwin: for --host=*-*-mingw32* it might be convenient to automatically use that option, if it works. Needs someone with a dual cygwin/mingw setup to test.
- Automake has a CCAS, CCASFLAGS scheme. Though we probably wouldn't be using its assembler support, we could use those variables in compatible ways.
- _gmp_rand is not particularly fast on the linear congruential algorithm and could stand various improvements:
  - Data could be kept within gmp_randstate_t (or _mp_algdata rather) to save some copying.
  - Make a special case for a single limb 2exp modulus, to avoid mpn_mul calls. Perhaps the same for two limbs.
  - Inline the lc code, to avoid a function call and TMP_ALLOC for every chunk.
  - The special case for seedn==0 will be very rarely used, and on that basis seems unnecessary.
  - The 2exp and general LC cases should be split, for clarity (if the general case is retained).
- Add gmp_randinit_mm (named after Mitchell and Moore) for the 55-element delayed Fibonacci generator from Knuth vol 2. Being additive it should be fast, and might be random enough for GMP test program purposes, if nothing else. Niels Möller has started on this.
- gmp_randinit_lc: Finish or remove. Doing a division for every step won't be very fast, so check whether the usefulness of this algorithm can be justified.
- A separate gmp_randinit_bbs would be wanted, not the currently commented out case in gmp_randinit.
- _gmp_rand could be done as a function pointer within gmp_randstate_t (or rather in the _mp_algdata part), instead of switching on a gmp_randalg_t. Likewise gmp_randclear, and perhaps gmp_randseed if it became algorithm-specific. This would be more modular, and would ensure only code for the desired algorithms is dragged into the link.
- mpz_urandomm should do something for n<=0, but what?
- The mpz_urandomm implementation looks like it could be improved. Perhaps it's enough to calculate nbits as ceil(log2(n)) and call _gmp_rand until a value <n is obtained.
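  In outline, with the public API standing in for the internal _gmp_rand (a hypothetical helper; mpz_sizeinbase(n,2) is ceil(log2(n)) except when n is a power of 2, where it's one more, which only halves the acceptance rate):

      #include <gmp.h>

      static void urandomm_sketch (mpz_t rop, gmp_randstate_t state,
                                   const mpz_t n)
      {
        unsigned long nbits = mpz_sizeinbase (n, 2);
        do
          mpz_urandomb (rop, state, nbits);
        while (mpz_cmp (rop, n) >= 0);   /* expected fewer than 2 tries */
      }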
- gmp_randstate_t used for parameters perhaps should become gmp_randstate_ptr, the same as other types.
- Make mpz_div and mpz_divmod use rounding analogous to mpz_mod. Document, and list as an incompatibility.
- mpz_gcdext and mpn_gcdext ought to document what range of values the generated cofactors can take, and preferably ensure the definition uniquely specifies the cofactors for given inputs. A basic extended Euclidean algorithm or multi-step variant leads to |x|<|b| and |y|<|a| or something like that, but there are probably two solutions under just those restrictions.
- mpz_invert should call mpn_gcdext directly.
- random / mrand48.
- Trial division could use mpz_divisible_ui_p rather than mpz_tdiv_qr_ui. (Of course dividing multiple primes at a time would be better still.)
- The test programs mostly use the main libgmp. This establishes good cross-checks, but it might be better to use simple reference routines where possible. Where it's not possible some attention could be paid to the order of the tests, so a libgmp routine is only used for tests once it seems to be good.
- mpf_set_q is very similar to mpf_div; it'd be good for the two to share code. Perhaps mpf_set_q should make some mpf_t aliases for its numerator and denominator and just call mpf_div. Both would be simplified a good deal by switching to mpn_tdiv_qr, perhaps making them small enough not to bother with sharing (especially since mpf_set_q wouldn't need to watch out for overlaps).
- The PowerPC time base registers (mftb and mftbu) could be used for the speed and tune programs. Would need to know the time base frequency though; for instance it seemed to be 25 MHz on a couple of Apples (compared to the CPU speed of 350 or 450 MHz).
- MUL_FFT_THRESHOLD etc: the FFT thresholds should allow a return to a previous k at certain sizes. This arises basically due to the step effect caused by size multiples effectively used for each k. Looking at a graph makes it fairly clear.
- __gmp_doprnt_mpf does a rather unattractive round-to-nearest on the string returned by mpf_get_str. Perhaps some variant of mpf_get_str could be made which would better suit.
- Add ASSERTs at the start of each user-visible mpz/mpq/mpf function to check the validity of each mp?_t parameter, in particular to check they've been mp?_init'ed. This might catch elementary mistakes in user programs. Care would need to be taken over MPZ_TMP_INITed variables used internally. If nothing else then consistency checks like size<=alloc, ptr not NULL and ptr+size not wrapping around the address space would be possible. A more sophisticated scheme could track _mp_d pointers and ensure only a valid one is used. Such a scheme probably wouldn't be reentrant, not without some help from the system.
- Check whether getrusage and gettimeofday are reliable. Currently we pretend in configure that the dodgy m68k netbsd 1.4.1 getrusage doesn't exist. If a test might take a long time to run then perhaps cache the result in a file somewhere.
- Document that unsigned long int is used for bit counts/ranges, and that mp_size_t is used for limb counts.
- mpz_inp_str (etc) doesn't say when it stops reading digits.
- mpn_umul_ppmm is only needed when umul_ppmm has no native version, and the corresponding umul.asm file could be included in libgmp only in that case, the same as is effectively done for __clz_tab. Likewise udiv.asm and perhaps cntlz.asm. This would only be a very small space saving, so perhaps not worth the complexity.
- mpz_get_si returns 0x80000000 for -0x100000000, whereas it's sort of supposed to return the low 31 (or 63) bits. But this is undocumented, and perhaps not too important.
- The mpz_*_ui division routines currently return abs(a%b). Perhaps make them return the real remainder instead? The return type would be signed long int. But this would be an incompatible change, so it might have to be under newly named functions.
- mpz_init_set* and mpz_realloc could allocate say an extra 16 limbs over what's needed, so as to reduce the chance of having to do a reallocate if the mpz_t grows a bit more. This could only be an option, since it'd badly bloat memory usage in applications using many small values.
- mpq functions could perhaps check for numerator or denominator equal to 1, on the assumption that integers or denominator-only values might be expected to occur reasonably often.
- count_trailing_zeros is used on more or less uniformly distributed numbers in a couple of places. For some CPUs count_trailing_zeros is slow and it's probably worth handling the frequently occurring 0 to 2 trailing zeros cases specially.
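  The special handling could be as simple as the following fragment (count_trailing_zeros being the longlong.h macro); on uniformly distributed inputs this settles 7/8 of calls without the full count:

      if (n & 1)
        c = 0;
      else if (n & 2)
        c = 1;
      else if (n & 4)
        c = 2;
      else
        count_trailing_zeros (c, n);   /* the remaining 1-in-8 cases */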
- mpf_t might like to let the exponent be undefined when size==0, instead of requiring it to be 0 as now. It should be possible to do size==0 tests before paying attention to the exponent. The advantage is not needing to set exp in the various places a zero result can arise, which avoids some tedium but is otherwise perhaps not too important. Currently mpz_set_f and mpf_cmp_ui depend on exp==0, maybe elsewhere too.
- __gmp_allocate_func: Could use GCC __attribute__ ((malloc)) on this, though don't know if it'd do much. GCC 3.0 allows that attribute on functions, but not function pointers (see info node "Attribute Syntax"), so would need a new autoconf test. This can wait until there's a GCC that supports it.
- mpz_add_ui contains two __GMPN_COPYs, one from mpn_add_1 and one from mpn_sub_1. If those two routines were opened up a bit maybe that code could be shared. When a copy needs to be done there's no carry to append for the add, and if the copy is non-empty no high zero for the sub. The same applies to mpz_add when one operand is longer than the other, and to mpz_com since it's just -(x+1).
- restrict'ed pointers: Does the C99 definition of restrict (one writer many readers, or whatever it is) suit the GMP style "same or separate" function parameters? If so, judicious use might improve the code generated a bit. Do any compilers have their own flavour of restrict as "completely unaliased", and is that still usable?
- A short-limb build could become an ABI option, selecting _SHORT_LIMB in gmp.h. Naturally a new set of asm subroutines would be necessary. Would need new mpz_set_ui etc since the current code assumes limb>=long, but 2-limb operand forms would find a use for long long on other processors too.
- A remainder mod m could be taken by calculating an array of powers p[i] = (2^BITS_PER_MP_LIMB)^i mod m, then for the input limbs x calculating an inner product "sum p[i]*x[i]", and a final 3x1 limb remainder mod m. If those powers take roughly N divide steps to calculate then there'd be an advantage any time the same m is used three or more times. Suggested by Victor Shoup in connection with chinese-remainder style decompositions, but perhaps with other uses.
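  A toy version of the inner-product idea, with 32-bit limbs and a gcc-style __int128 cross-check (illustrative only):

      #include <stdio.h>
      #include <stdint.h>

      int main (void)
      {
        uint32_t m = 1000000007;
        uint32_t x[4] = { 0x89ABCDEF, 0x01234567, 0xDEADBEEF, 0x12345678 };

        uint64_t b_mod_m = (1ULL << 32) % m;     /* b = 2^32 */
        uint64_t p = 1, sum = 0;                 /* p = b^i mod m */
        for (int i = 0; i < 4; i++)
          {
            sum += x[i] % m * p % m;             /* each term < m, sum < 4m */
            p = p * b_mod_m % m;
          }

        unsigned __int128 v = 0;                 /* reference value */
        for (int i = 3; i >= 0; i--)
          v = (v << 32) | x[i];
        printf ("%u %u\n", (uint32_t) (sum % m), (uint32_t) (v % m)); /* equal */
        return 0;
      }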