<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
  <title>
    GMP Itemized Development Tasks
  </title>
</head>
<body bgcolor=lightgreen>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<!-- NB. timestamp updated automatically by emacs -->
<comment>
  This file current as of 13 May 2001.  An up-to-date version is available at
  <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
</comment>

<p> These are itemized GMP development tasks.  Not all the tasks
    listed here are suitable for volunteers, but many of them are.
    Please see the <a href="projects.html">projects file</a> for more
    sizeable projects.

<h4>Correctness and Completeness</h4>
<ul>
<li> mpn/generic/get_str.c stores <code>mp_size_t</code> data in several int
     variables.  (It is also really slow.  Perhaps just call
     <code>mpn_divrem_1</code>.  See also optimization item below.)
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c:
     <code>mpz_XXX(a,a,a)</code>.  (A minimal sketch appears after this list.)
<li> When printing <code>mpf_t</code> numbers with exponents &gt;2^53 on
     machines with 64-bit <code>mp_exp_t</code>, the precision of
     <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
     <code>mpf_get_str</code> aborts.  Detect and compensate.  Alternately,
     think seriously about using some sort of fixed-point integer value.
     Avoiding unnecessary floating point is probably a good thing in general,
     and it might be faster on some CPUs.
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified (zero).
<li> <code>mpf_eq</code> is not always correct when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.
<li> NeXT might not have <code>__builtin_constant_p</code> available.  Same
     problem with MacOS X.
<li> <code>TMP_ALLOC</code> is not reentrant when using stack-alloc.c.  Perhaps
     make it so by default and leave a --enable-alloca=malloc-nonreentrant with
     the current code.  A reentrant version will probably be slower since it
     can't share malloced blocks across function calls.
<li> <code>_mpz_realloc</code> will reallocate to less than the current size
     of an mpz.  This is ok internally when a destination size is yet to be
     updated, but from user code it should probably refuse to go below the
     size of the current value.  Could switch internal uses across to a new
     function, and tighten up <code>_mpz_realloc</code>.
<li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
     123.456eX789X is accepted (and an exponent of 0 used).
<li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
     accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
     decimal in this case, with a further 0x, 0b etc allowed there.
     Eg. 0xFFAA@0x5A.
<li> <code>DIVIDE_BY_ZERO</code> on powerpc does nothing, because division by
     zero in the basic "divu" instruction isn't an exception.  Does falling
     through to subsequent code in the gmp routines upset anything?
</ul>
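<p> A minimal sketch of the reuse testing above; the test harness details
    here are illustrative rather than current reuse.c code:
<pre>
	/* force reallocation, then exercise the full-overlap case mpz_XXX(a,a,a) */
	mpz_t a;
	mpz_init_set_ui (a, 123456789L);
	_mpz_realloc (a, 1);     /* shrink to 1 limb so later operations must grow */
	mpz_mul (a, a, a);       /* all three operands aliased */
	mpz_clear (a);
</pre>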



<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpn_gcdext</code>, <code>mpz_get_d</code>: Don't test
     <code>count_leading_zeros</code> for zero, instead check the high bit of
     the operand and avoid invoking <code>count_leading_zeros</code>.  This is
     an optimization on all machines, and significant on machines with slow
     <code>count_leading_zeros</code>, though it's possible an already
     normalized operand might not be encountered very often.  (A minimal
     sketch appears after this list.)
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>BITS_PER_MP_LIMB</code> &lt;= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
     products with fewer operations.
<li> Write new <code>mpn_get_str</code> and <code>mpn_set_str</code> running in
     the sub O(n^2) range, using some divide-and-conquer approach, preferably
     without using division.
<li> <code>mpn_get_str</code> should use a native <code>mpn_divrem_1</code>
     when available (athlon, p6mmx).  A new <code>mpn_preinv_divrem_1</code>
     entrypoint would suit, since the inverse and shift are known from
     <code>mp_bases</code>.  Don't do an <code>mpn_lshift</code> in
     <code>mpn_get_str</code>, leave it up to <code>mpn_divrem_1</code> to
     either make that call or do it on-the-fly.
<li> Copy tricky code for converting a limb from development version of
     <code>mpn_get_str</code> to mpf/get_str.  (Talk to Torbjörn about this.)
<li> Consider inlining these functions:
     <code>mpz_init</code>, <code>mpz_clear</code>,
     <code>mpz_set_ui</code>,
     <code>mpz_init_set_ui</code>,
     <code>mpf_get_prec</code>, <code>mpf_get_default_prec</code>,
     <code>mpf_set_prec_raw</code>,
     <code>mpf_init</code>, <code>mpf_init2</code>, <code>mpf_clear</code>.
     <br>
     These will want to be done with <code>extern inline</code> in the style
     of the current <code>mpn_add</code> etc, since applications might be
     using them as function pointers.  The GLIBC trick of first defining a
     prototype then a macro looks like it won't work with the GMP
     <code>__gmpz</code> renamings.
<li> Consider inlining: <code>mpz_[cft]div_ui</code> and maybe
     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
     would be needed for the divide by zero test, unless that could be left to
     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
     since there's no apparent way to get <code>SHRT_MAX</code> with an
     expression (since <code>short</code> and <code>unsigned short</code> can
     be different sizes).
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
     fast on one or two limb moduli, due to a lot of function call
     overheads.  These could perhaps be handled as special cases.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
     algorithm selection, and the latter should use REDC.  Both could
     change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
<li> <code>mpz_powm</code> REDC should handle even moduli if possible.  Maybe
     this would mean for m=n*2^k doing mod n using REDC and an auxiliary
     calculation mod 2^k, then putting them together at the end.
<li> <code>mpn_gcd</code> might be able to be sped up on small to
     moderate sizes by improving <code>find_a</code>, possibly just by
     providing an alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> Toom3 <code>USE_MORE_MPN</code> could use a low to high cache localized
     evaluate and interpolate.  The necessary <code>mpn_divexact_by3c</code>
     exists.
<li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
     better cache locality by taking N piece by piece.  The current code could
     be left available for CPUs without caching.  Depending how karatsuba etc
     is applied to unequal size operands it might be possible to assume M is
     always smallish.
<li> <code>mpn_perfect_square_p</code> on small operands might be better off
     skipping the residue tests and just taking a square root.
<li> <code>mpn_perfect_square_p</code> could take a remainder mod 15 instead
     of 3 and 5 separately, and use a bit table which indicates whether an N
     mod 15 is a quadratic residue mod both 3 and 5.  More combining would
     work too, like say 3*11 and 5*7, in whatever good combinations, possibly
     aiming to keep under 64 or 128 bits.  (A mod 15 bit table is sketched
     after this list.)
<li> <code>mpz_perfect_power_p</code> could be improved in a number of ways.
     Test for Nth power residues modulo small primes like
     <code>mpn_perfect_square_p</code> does.  Use p-adic arithmetic to find
     possible roots.  Negatives should be handled, and used to know the power
     must be odd.  Divisibility by powers of 2 should be tested with
     <code>mpz_scan1</code> or similar.  Divisibility by other primes should
     be tested by grouping into a limb like <code>PP</code>.
<li> Change <code>PP</code>/<code>PP_INVERTED</code> into an array of such
     pairs, listing several hundred primes.  Perhaps actually make the
     products larger than one limb each.
<li> <code>mpz_probab_prime_p</code>, <code>mpn_perfect_square_p</code> and
     <code>mpz_perfect_power_p</code> could take a remainder mod 2^24-1 to
     quickly get remainders mod 3, 5, 7, 13 and 17 (factors of 2^24-1).  A
     remainder mod 2^24-1 can be taken at adder throughput speed.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
     fraction but is exactly representable, eg. 0.5 in decimal.  These could be
     stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
     use <code>mpn_and</code> etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the <code>mpz</code>
     routines?
<li> <code>mpn_bdivmod</code> should use a divide and conquer like the normal
     division.  See "Exact Division with Karatsuba Complexity" by Jebelean for
     a (brief) description.  This will benefit <code>mpz_divexact</code>
     immediately, and <code>mpn_gcd</code> on large unequal size operands.
     REDC should be able to use it too.
<li> <code>restrict</code>'ed pointers: Does the C99 definition of restrict
     (one writer many readers, or whatever it is) suit the GMP style "same or
     separate" function parameters?  If so, judicious use might improve the
     code generated a bit.  Do any compilers have their own flavour of
     restrict as "completely unaliased", and is that still usable?
<li> GCC <code>__attribute__ ((malloc))</code> could be used on
     <code>__gmp_allocate_func</code> and on the stack-alloc.c
     <code>TMP_ALLOC</code>.  Don't know if it'd do much.  Current pre-release
     gcc 3 doesn't like attaching function attributes to function pointers
     like <code>__gmp_allocate_func</code> (see "(gcc)Attribute Syntax"), so
     this has to wait for the future.
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case; currently it
     looks like an <code>MPN_COPY_INCR</code> will be done to reduce prec+1
     limbs to prec.
</ul>
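<p> A minimal sketch of the <code>count_leading_zeros</code> avoidance
    mentioned in the first item of the list above; the variable names are
    illustrative:
<pre>
	/* skip count_leading_zeros when the operand is already normalized */
	if (xhigh &amp; ((mp_limb_t) 1 &lt;&lt; (BITS_PER_MP_LIMB - 1)))
	  cnt = 0;
	else
	  count_leading_zeros (cnt, xhigh);
</pre>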
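<p> For the mod 15 quadratic residue test suggested above, the bit table
    might look as follows; the squares mod 15 are 0, 1, 4, 6, 9 and 10, and
    the macro names are illustrative only:
<pre>
	/* bit n set iff n is a square mod 15, i.e. a residue mod both 3 and 5 */
	#define SQ_RES_15  ((1&lt;&lt;0)|(1&lt;&lt;1)|(1&lt;&lt;4)|(1&lt;&lt;6)|(1&lt;&lt;9)|(1&lt;&lt;10))
	#define MAYBE_SQUARE_15(r)  ((SQ_RES_15 &gt;&gt; (r)) &amp; 1)   /* r = n mod 15 */
</pre>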


<h4>Machine Dependent Optimization</h4>
<ul>
<li> Run the `tune' utility for more compiler/CPU combinations.  We would like
     to have gmp-mparam.h files in practically every implementation specific
     mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
     compiler.  See the `tune' top-level directory for more information.
	<pre>
	#if defined (__GNUC__)
	#if __GNUC__ == 2 && __GNUC_MINOR__ == 7
	...
	#endif
	#if __GNUC__ == 2 && __GNUC_MINOR__ == 8
	...
	#endif
	#ifndef KARATSUBA_MUL_THRESHOLD
	/* Default GNUC values */
	...
	#endif
	#else /* system compiler */
	...
	#endif	</pre>
<li> Alpha 21264: Improve feed-in code for <code>mpn_addmul_1</code> and
     <code>mpn_submul_1</code>.  Integrate new <code>mpn_mul_1</code> code.
<li> Alpha 21164: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>,
     and <code>mpn_mul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better into four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
<li> UltraSPARC/64: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     and <code>mpn_submul_1</code>.  Should use floating-point operations, and
     split the invariant single-limb multiplier into 16-bit chunks.  Will give
     14 cycles/limb, but with a deep pipeline.
<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
     "<code>mulx</code>"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "<code>mulx</code>"s.
<li> UltraSPARC/64: Write <code>mpn_sqr_diagonal</code>.  Since we're
     squaring each limb, it seems best to split limbs into one 22-bit chunk
     (for the most significant part) and two 21-bit chunks.  Things magically
     align to 64-bit boundaries with that splitting.  Will need 19 memory
     operations, and won't therefore run faster than at about 20 cycles/limb.
     (The current umul_ppmm loop of mpn_sqr_basecase runs at around 60
     cycles/limb.)  If we use VIS <code>fand</code> for splitting operands,
     just 13 memory operations are needed.
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 &lt; 2^16.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 &lt; 2^32;
     it should be possible to make them run at about 5 cycles/limb.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
     Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
     movs</code> would upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  Examine the code for some simple and
     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable; it'd be both smaller and faster.
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
<li> P6: <code>mpn_add_n</code> and <code>mpn_sub_n</code> should be able to go
     faster than the generic x86 code at 3.5 c/l.  The athlon code for instance
     runs at about 2.7.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
<li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.  Should
     run at just 3.25 cycles/limb.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method" to avoid carry propagation, splitting one of the operands into
     11-bit chunks.
<li> 68k, Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine, could let both live in lshift.asm and omit rshift.asm on finding
     <code>mpn_rshift</code> already provided.
<li> Cray T3E: Experiment with optimization options.  In particular,
     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
<li> Cray: Variable length arrays seem to be faster than the stack-alloc.c
     scheme.  Not sure why, maybe they merely give the compiler more
     information about aliasing (or the lack thereof).  Would like to modify
     <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
     blocks wanted unconditionally are easy enough, those wanted only
     sometimes are a problem.  Perhaps a special size calculation to ask for a
     dummy length 1 when unwanted, or perhaps an inlined subroutine
     duplicating code under each conditional.  Don't really want to turn
     everything into a dog's dinner just because Cray don't offer an
     <code>alloca</code>.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines (a complete
     generic version is sketched after this list):
  <pre>
	   if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
	   if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
	   ... </pre>

</ul>
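<p> A complete generic version of the 64-bit <code>count_leading_zeros</code>
    binary search shown in the last item above might look like this sketch,
    which assumes a nonzero operand as the existing macro does:
<pre>
	/* x is a nonzero 64-bit limb; cnt receives the number of leading zeros */
	cnt = 0;
	if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
	if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
	if ((x &gt;&gt; 56) == 0) { x &lt;&lt;= 8;  cnt += 8; }
	if ((x &gt;&gt; 60) == 0) { x &lt;&lt;= 4;  cnt += 4; }
	if ((x &gt;&gt; 62) == 0) { x &lt;&lt;= 2;  cnt += 2; }
	if ((x &gt;&gt; 63) == 0) { cnt += 1; }
</pre>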

<h4>New Functionality</h4>
<ul>
<li> Add in-memory versions of <code>mp?_out_raw</code> and
     <code>mp?_inp_raw</code>.
<li> <code>mpz_get_nth_ui</code>.  Return the nth word (not necessarily the
     nth limb).
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
<li> Let `0b' and `0B' mean binary input everywhere.
<li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
     Set <code>ALLOC(var)</code> to 0, and have <code>mpz_realloc</code>
     special-handle that case.  Would mean the size required for the first use
     is the first allocated, rather than allocating a small size and then
     reallocing it.  Update functions that rely on a single limb like
     <code>mpz_set_ui</code>, <code>mpz_{t,f,c}div_{qr,r}_ui</code>, and
     others.  Would need the initial <code>z->_mp_d</code> to point to a dummy
     initial location, so it can be fetched from by <code>mpz_odd_p</code> and
     similar macros.
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.
<li> Handle numeric exceptions: Call an error handler, and/or set
     <code>gmp_errno</code>.  It should be possible to detect errors at the
     start of documented entrypoints and invoke an exception handler with a
     nice clean state, giving it the chance to recover with
     <code>longjmp</code> or a C++ throw or whatever.
<li> Handle out-of-memory exceptions: It'd be good to offer the custom
     allocation functions a way to <code>longjmp</code> or throw a C++
     exception on being out of memory (or out of the memory they care to offer
     GMP).  This would need GMP to ensure all variables are in a sensible
     state whenever attempting an alloc or realloc, and would need some way to
     record other temporary memory or temporary variables in use which should
     be freed.  The latter might be difficult, perhaps the C++ destructor
     mechanism would be necessary.
<li> <code>gmp_fprintf</code>, <code>gmp_sprintf</code>, and
     <code>gmp_snprintf</code>.  Think about some sort of wrapper around
     <code>printf</code> so it and its several variants don't have to be
     completely reimplemented.  Variants like <code>mpz_printf</code> etc
     supporting only <code>mpz</code> etc could exist to keep down the amount
     of library dragged in.
<li> <code>mpq</code> input and output functions.
<li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Make the mpn logops
     and copys available in gmp.h, either as library functions or inlines,
     with the availability of library functions instantiated in the generated
     gmp.h at build time.
<li> <code>mpz_set_str</code> etc variants taking string lengths rather than
     null-terminators.
<li> Consider changing the thresholds to apply the simpler algorithm when
     "<code>&lt;=</code>" rather than "<code>&lt;</code>", so a threshold can
     be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
     compiler will know <code>size &lt;= MP_SIZE_T_MAX</code> is always true).
     Alternately it looks like the <code>ABOVE_THRESHOLD</code> and
     <code>BELOW_THRESHOLD</code> macros can do this adequately, and also pick
     up cases where a threshold of zero should mean only the second algorithm.
<li> <code>mpz_powm</code> could handle negative exponents by powering the
     modular inverse of the base.  But what to do if an inverse doesn't exist
     (base and modulus have common factors)?  Throw a divide by zero maybe, or
     return a flag like <code>mpz_invert</code> does.  (A rough sketch follows
     this list.)
<li> <code>mpz_nthprime</code>.
<li> Perhaps <code>mpz_init2</code>, initializing and making initial room for
     N bits.  The actual size would be rounded up to a limb, and perhaps an
     extra limb added since so many <code>mpz</code> routines need that on
     their destination.
</ul>
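<p> A rough sketch of the negative exponent handling suggested above,
    returning a flag the way <code>mpz_invert</code> does; the function name
    is hypothetical:
<pre>
	/* r = b^e mod m for e &lt; 0; returns 0 if b has no inverse mod m */
	int
	mpz_powm_neg (mpz_t r, mpz_t b, mpz_t e, mpz_t m)
	{
	  mpz_t binv, eabs;
	  int ok;
	  mpz_init (binv);
	  mpz_init (eabs);
	  ok = mpz_invert (binv, b, m);   /* fails when gcd(b,m) != 1 */
	  if (ok)
	    {
	      mpz_abs (eabs, e);
	      mpz_powm (r, binv, eabs, m);
	    }
	  mpz_clear (binv);
	  mpz_clear (eabs);
	  return ok;
	}
</pre>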


<h4>Configuration</h4>

<ul>
<li> Floating-point format: Determine this with a feature test.  Get rid of
     the <code>#ifdef</code> mess in gmp-impl.h.  This is simple when doing a
     native compile, but needs a reliable way to examine object files when
     cross-compiling.  Falling back on a run-time test would be reasonable if
     build-time tests fail.  (A run-time probe is sketched after this list.)
<li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
<li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
     "hinv -c processor" gives lots of information on Irix.  Standard
     config.guess etc append "el" to indicate endianness, but GMP probably
     doesn't care about endianness and could forget about that.
<li> Power: config.guess might be able to use AIX 4
     <code>_system_configuration.implementation</code> for more accurate CPU
     determination.
<li> Sparc: config.guess should say supersparc, microsparc, ultrasparc1,
     ultrasparc2, etc.  "prtconf -vp" gives lots of information about a
     Solaris system.
<li> Sparc32: floating point or integer <code>udiv</code> should be selected
     according to the CPU target.  Currently floating point ends up being
     used on all sparcs, which is probably not right for generic V7 and V8.
<li> m68k: config.guess can detect 68000, cpu32 and 68020, but relies on
     system information for 030, 040 and 060.  Can they be identified by
     running some code?
<li> demos/pexpr.c is becoming the usual *nix mess of nested ifdefs.  Instead
     use the results of configure tests.  pexpr.c could be generated from a
     pexpr.in with a set of #undef's.
<li> Some CPUs have <code>umul</code> and <code>udiv</code> code not being
     used.  Check all such for bit rot and then put umul and udiv in
     <code>$gmp_mpn_functions_optional</code> as "standard optional" objects.
     <br> In particular Sparc and SparcV8 on non-gcc should benefit from
     umul.asm enabled; the generic umul is suspected to be making sqr_basecase
     slower than mul_basecase.
<li> HPPA <code>mpn_umul_ppmm</code> and <code>mpn_udiv_qrnnd</code> have a
     different parameter order than those functions on other CPUs.  It might
     avoid confusion to have them under different names, maybe
     <code>mpn_umul_ppmm_r</code> or some such.  Prototypes then wouldn't
     be conditionalized, and the appropriate form could be selected with the
     <code>HAVE_NATIVE</code> scheme if/when the code switches to use a
     <code>PROLOGUE</code> style.
<li> Combine the pa64 and pa64w directories.
<li> Rename `mips2' =&gt; `mips32' and `mips3' =&gt; `mips64'.
<li> <code>SItype</code> etc typedefs in gmp-impl.h could use an autoconf test
     to determine whether <code>__attribute__ (mode)</code> is available, and
     also whether <code>long long</code> is available for <code>DItype</code>
     on non-gcc compilers.
</ul>
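<p> For the floating-point format item above, a fallback run-time probe
    could examine the byte pattern of a known value; the <code>format</code>
    variable and its labels here are purely illustrative:
<pre>
	/* IEEE double 1.0 has the byte pattern 3F F0 00 ... 00 */
	double d = 1.0;
	unsigned char b[sizeof (double)];
	memcpy (b, &amp;d, sizeof (double));          /* needs &lt;string.h&gt; */
	if (b[0] == 0x3F &amp;&amp; b[1] == 0xF0)
	  format = IEEE_BIG_ENDIAN;
	else if (b[sizeof (double) - 1] == 0x3F)
	  format = IEEE_LITTLE_ENDIAN;
	else
	  format = FORMAT_UNKNOWN;
</pre>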


<h4>Miscellaneous</h4>
<ul>
<li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
     analogous to <code>mpz_mod</code>.  Document, and list as an
     incompatibility.
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
<li> Merge mpn/pa64 and pa64w.
<li> <code>mpz_urandomm</code> should do something for n&lt;=0 (but what?),
     and the use of <code>mpz_mod</code> looks doubtful (does it suffice to
     generate numbers of <code>nbits</code> until getting one &lt;n?)
<li> demos/factorize.c should use the GMP random functions when restarting
     Pollard-Rho, not <code>random</code> / <code>mrand48</code>.  (A sketch
     appears after this list.)
<li> The various test programs use quite a bit of the main
     <code>libgmp</code>.  This establishes good cross-checks, but it might be
     better to use simple reference routines where possible.  Where it's not
     possible some attention could be paid to the order of the tests, so a
     main <code>libgmp</code> is only used to construct tests once it seems to
     be good.
<li> <code>mpf_set_q</code> is very similar to <code>mpf_div</code>; it'd be
     good for the two to share code.  Perhaps <code>mpf_set_q</code> should
     make some <code>mpf_t</code> aliases for its numerator and denominator
     and just call <code>mpf_div</code>.
</ul>
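<p> For the demos/factorize.c item above, a Pollard-Rho restart value could
    be drawn from GMP's own random functions roughly as follows; the
    particular randinit call to use is an assumption here:
<pre>
	gmp_randstate_t rstate;
	mpz_t a;
	gmp_randinit_default (rstate);
	mpz_init (a);
	mpz_urandomm (a, rstate, n);   /* random restart value in [0, n) */
</pre>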


<h4>Aids to Debugging</h4>
<ul>
<li> <code>TMP_ALLOC</code> could do a separate <code>malloc</code> for each
     allocated block, to help a redzoning malloc debugger.  Perhaps a config
     option --enable-alloca=debug.
<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
     function to check the validity of each <code>mp?_t</code> parameter, in
     particular to check they've been <code>mp?_init</code>ed.  This might
     catch elementary mistakes in user programs.  Care would need to be taken
     over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
     else then consistency checks like size&lt;=alloc, ptr not
     <code>NULL</code> and ptr+size not wrapping around the address space,
     would be possible.  (A sketch appears after this list.)
</ul>
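<p> The consistency checks mentioned above might, for an <code>mpz_t</code>
    <code>z</code>, amount to something like this sketch using the gmp-impl.h
    accessor macros:
<pre>
	ASSERT (ALLOC (z) &gt;= 1);
	ASSERT (ABSIZ (z) &lt;= ALLOC (z));
	ASSERT (PTR (z) != NULL);
</pre>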


<h4>Documentation</h4>
<ul>
<li> Document conventions, like that <code>unsigned long int</code> is used for
     bit counts/ranges, and that <code>mp_size_t</code> is used for limb counts.
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
</ul>


<h4>Bright Ideas</h4>

The following may or may not be feasible, and aren't likely to get done in the
near future, but are at least worth thinking about.

<ul>
<li> Reorganize longlong.h so that we can inline the operations even for the
     system compiler.  When there is no such compiler feature, make calls to
     stub functions.  Write such stub functions for as many machines as
     possible.
<li> Nx1 division (<code>divrem_1</code> or <code>mod_1</code>) could be done
     two limbs at a time, by following <code>udiv_qrnnd_preinv</code> but with
     two limbs everywhere one is used.  This would mean 3 or 4 multiplies each
     place 1 is currently done, but they'll be independent and so 2 limbs can
     be processed with the same latency as 1 would have been.  This idea would
     apply to CPUs with long latency but good throughput multipliers.  Clearly
     it can be extended to 3 limbs at a time, or 4, or however many, though
     each time more are used then more time will be taken creating an initial
     multi-limb inverse (<code>invert_limb</code> style), and the
     quadratically increasing number of cross-products will at some point see
     throughput unable to offer a saving over latency.
<li> Build multiple variants of the library under certain systems.  An example
     is -n32 and -64 on Irix.  Autoconf/automake/libtool give pretty much no
     help for this, and currently the best that can be done is to recommend
     users rebuild in the various possible or desired configurations, perhaps
     installing with different <code>libdir</code> settings, or whatever.
<li> Enable support for FORTRAN versions of mpn files (eg. for
     mpn/cray/mulww.f).  Add "f" to MPN_SUFFIXES, run <code>AC_PROG_F77</code>
     if such a file is found.  Automake will generate some of what's needed in
     the makefiles, but libtool doesn't know fortran and so rules like the
     current ".asm.lo" will be needed (the multi-language branch of libtool
     doesn't look like it's got fortran yet either).
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
     sort of supposed to return the low 31 (or 63) bits.  But this is
     undocumented, and perhaps not too important.
<li> <code>mpz_*_ui</code> division routines currently return abs(a%b).
     Perhaps make them return the real remainder instead?  Return type would
     be <code>signed long int</code>.  But this would be an incompatible
     change, so it might have to be under newly named functions.
<li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
     say an extra 16 limbs over what's needed, so as to reduce the chance of
     having to do a reallocate if the <code>mpz_t</code> grows a bit more.
     This could only be an option, since it'd badly bloat memory usage in
     applications using many small values.
<li> DLLs on cygwin, mingw and os2 would be possible, but the couple of global
     variables (<code>gmp_version</code>, <code>__gmp_allocate_func</code>
     used by GMP++, etc) would require the <code>__declspec</code> nonsense
     described in the Goat book.  Or maybe since <code>gmp_version</code> is a
     constant it can avoid that stuff.
<li> m68k: configure could accept <code>m68020fp</code> or similar to select
     68881 floating point.  config.guess could try to detect that too.  This
     would only be to add -m68881 to gcc, there's no gmp asm code using float,
     so perhaps it's just as easily left to the user to set
     <code>CFLAGS</code>.
<li> <code>mpq</code> functions could perhaps check for numerator or
     denominator equal to 1, on the assumption that integers or
     denominator-only values might be expected to occur reasonably often.
<li> <code>count_trailing_zeros</code> is used on more or less uniformly
     distributed numbers in a couple of places.  For some CPUs
     <code>count_trailing_zeros</code> is slow and it's probably worth handling
     the frequently occurring 0 to 2 trailing zeros cases specially (as
     sketched after this list).
<li> <code>mpf_t</code> might like to let the exponent be undefined when
     size==0, instead of requiring it to be 0 as now.  It should be possible
     to do
     size==0 tests before paying attention to the exponent.  The advantage is
     not needing to set exp in the various places a zero result can arise,
     which avoids some tedium but is otherwise perhaps not too important.
     Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
     exp==0, maybe elsewhere too.
</ul>
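<p> The <code>count_trailing_zeros</code> special-casing suggested above
    might look roughly like this; the variable names are illustrative:
<pre>
	/* handle the common 0, 1 or 2 trailing zeros without a full count */
	if (x &amp; 1)
	  cnt = 0;
	else if (x &amp; 2)
	  cnt = 1;
	else if (x &amp; 4)
	  cnt = 2;
	else
	  count_trailing_zeros (cnt, x);
</pre>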


<hr>

<table width="100%">
  <tr>
    <td>
      <font size=2>
      Please send comments about this page to
      <a href="mailto:tege@swox.com">tege@swox.com</a>.<br>
      Copyright 1999, 2000, 2001 Torbjörn Granlund.
      </font>
    </td>
    <td align=right>
    </td>
  </tr>
</table>

</body>
</html>

<!--
Local variables:
eval: (add-hook 'write-file-hooks 'time-stamp)
time-stamp-start: "This file current as of "
time-stamp-format: "%:d %3b %:y"
time-stamp-end: "\\."
time-stamp-line-limit: 50
End:
-->