<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
  <title>
    GMP Itemized Development Tasks
  </title>
</head>
<body bgcolor=lightgreen>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<font size=-1>
Copyright 2000, 2001, 2002 Free Software Foundation, Inc. <br><br>
This file is part of the GNU MP Library. <br><br>
The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published
by the Free Software Foundation; either version 2.1 of the License, or (at
your option) any later version. <br><br>
The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details. <br><br>
You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
MA 02111-1307, USA.
</font>

<hr>
<!-- NB. timestamp updated automatically by emacs -->
<comment>
  This file current as of 7 May 2002.  An up-to-date version is available at
  <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
  Please send comments about this page to
  <a href="mailto:bug-gmp@gnu.org">bug-gmp@gnu.org</a>.
</comment>

<p> These are itemized GMP development tasks.  Not all the tasks
    listed here are suitable for volunteers, but many of them are.
    Please see the <a href="projects.html">projects file</a> for more
    sizeable projects.

<h4>Correctness and Completeness</h4>
<ul>
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c:
     <code>mpz_XXX(a,a,a)</code>.
<li> When printing <code>mpf_t</code> numbers with exponents &gt;2^53 on
     machines with 64-bit <code>mp_exp_t</code>, the precision of
     <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
     <code>mpf_get_str</code> aborts.  Detect and compensate.  Alternately,
     think seriously about using some sort of fixed-point integer value.
     Avoiding unnecessary floating point is probably a good thing in general,
     and it might be faster on some CPUs.
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified (zero).
<li> <code>mpf_eq</code> is not always correct, when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.
<li> <code>_mpz_realloc</code> will reallocate to less than the current size
     of an mpz.  This is ok internally when a destination size is yet to be
     updated, but from user code it should probably refuse to go below the
     size of the current value.  Could switch internal uses across to a new
     function, and tighten up <code>_mpz_realloc</code>.
<li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
     garbage like 123.456eX789X is accepted (and an exponent of 0 used), and
     overflow of a <code>long</code> is not detected.
<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
     the inputs, and in that respect doesn't implement the "infinite precision
     followed by truncate" specified in the manual.
<li> <code>mpf_div</code> of x/x doesn't always give 1, reported by Peter
     Moulder.  Perhaps it suffices to put +1 on the effective divisor prec, so
     that data bits rather than zeros are shifted in when normalizing.  Would
     prefer to switch to <code>mpn_tdiv_qr</code>, where all shifting should
     disappear.
<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
     variables with pointers to <code>mpz_add</code> etc, which doesn't work
     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Need to rearrange perhaps
     to a set of calls to a test function rather than iterating over an array.
<li> demos/pexpr.c: The local variables in <code>main</code> might be
     clobbered by the <code>longjmp</code>.
</ul>



<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpn_gcdext</code>, <code>mpz_get_d</code>,
     <code>mpf_get_str</code>: Don't test <code>count_leading_zeros</code> for
     zero, instead check the high bit of the operand and avoid invoking
     <code>count_leading_zeros</code>.  This is an optimization on all
     machines, and significant on machines with slow
     <code>count_leading_zeros</code>, though it's possible an already
     normalized operand might not be encountered very often.
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>BITS_PER_MP_LIMB</code> &lt;= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
     products with fewer operations.
<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
     fast, especially for compile-time constants, but would make application
     binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
     preventing the "lazy" allocation scheme below.
<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
     would be needed for the divide by zero test, unless that could be left to
     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
     since there's no apparent way to get <code>SHRT_MAX</code> with an
     expression (since <code>short</code> and <code>unsigned short</code> can
     be different sizes).
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
     fast on one or two limb moduli, due to a lot of function call
     overheads.  These could perhaps be handled as special cases.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
     algorithm selection, and the latter should use REDC.  Both could
     change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
<li> <code>mpz_powm</code> REDC should handle even moduli if possible.  Maybe
     this would mean for m=n*2^k doing mod n using REDC and an auxiliary
     calculation mod 2^k, then putting them together at the end.
<li> <code>mpn_gcd</code> might be able to be sped up on small to
     moderate sizes by improving <code>find_a</code>, possibly just by
     providing an alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> Toom3 <code>USE_MORE_MPN</code> could use a low to high cache localized
     evaluate and interpolate.  The necessary <code>mpn_divexact_by3c</code>
     exists.
<li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
     better cache locality by taking N piece by piece.  The current code could
     be left available for CPUs without caching.  Depending how karatsuba etc
     is applied to unequal size operands it might be possible to assume M is
     always smallish.
<li> <code>mpn_perfect_square_p</code> on small operands might be better off
     skipping the residue tests and just taking a square root.
<li> <code>mpz_perfect_power_p</code> could be improved in a number of ways.
     Test for Nth power residues modulo small primes like
     <code>mpn_perfect_square_p</code> does.  Use p-adic arithmetic to find
     possible roots.  Divisibility by other primes should be tested by
     grouping into a limb like <code>PP</code>.
<li> <code>mpz_perfect_power_p</code> might like to use <code>mpn_gcd_1</code>
     instead of a private GCD routine.  The use it's put to isn't
     time-critical, and using the main GCD routine might help ensure
     correctness.
<li> <code>mpz_perfect_power_p</code> could use
     <code>mpz_divisible_ui_p</code> instead of <code>mpz_tdiv_ui</code> for
     divisibility testing, the former is faster on a number of systems.  (But
     all that prime test stuff is going to be rewritten some time.)
<li> Change <code>PP</code>/<code>PP_INVERTED</code> into an array of such
     pairs, listing several hundred primes.  Perhaps actually make the
     products larger than one limb each.
<li> <code>PP</code> can have factors of 2 introduced in order to get the high
     bit set and therefore a <code>PP_INVERTED</code> existing.  The factors
     of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used,
     further remainders r%x, r%y, etc, are the same since x, y, etc are odd.
     The advantage of this is that <code>mpn_preinv_mod_1</code> can then be
     used if it's faster than plain <code>mpn_mod_1</code>.  This would be a
     change only for 16-bit limbs, all the rest already have <code>PP</code>
     in the right form.
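     As a small check of that claim, a remainder taken mod an even multiple
     of odd x, y and z still yields the correct sub-remainders.  A minimal
     demonstration in plain C, with illustrative values rather than the real
     <code>PP</code>:
<pre>
#include &lt;assert.h&gt;

int
main (void)
{
  unsigned long a = 123456789UL, x = 3, y = 5, z = 7;
  unsigned long r = a % ((x * y * z) &lt;&lt; 2);  /* modulus with 2s introduced */

  /* x, y and z are odd and divide the modulus, so each sub-remainder holds */
  assert (r % x == a % x);
  assert (r % y == a % y);
  assert (r % z == a % z);
  return 0;
}
</pre>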
<li> <code>PP</code> could have extra factors of 3 or 5 or whatever introduced
     if they fit, and final remainders mod 9 or 25 or whatever used, thereby
     making more efficient use of the <code>mpn_mod_1</code> done.  On a
     16-bit limb it looks like <code>PP</code> could take an extra factor of
     3.
<li> <code>mpz_probab_prime_p</code>, <code>mpn_perfect_square_p</code> and
     <code>mpz_perfect_power_p</code> could use <code>mpn_mod_34lsub1</code>
     to take a remainder mod 2^24-1 or 2^48-1 and quickly get remainders mod
     3, 5, 7, 13 and 17 (factors of 2^24-1).  This could either replace the
     <code>PP</code> division currently done, or allow <code>PP</code> to do
     larger primes, depending how many residue tests seem worthwhile before
     launching into full root extractions or Miller-Rabin etc.
<li> <code>mpz_probab_prime_p</code> (and maybe others) could code the
     divisibility tests like <code>n%7 == 0</code> in the form
<pre>
#define MP_LIMB_DIVISIBLE_7_P(n) \
  ((n) * MODLIMB_INVERSE_7 &lt;= MP_LIMB_T_MAX/7)
</pre>
     This would help compilers which don't know how to optimize divisions by
     constants, and would help current gcc (3.0) too since gcc forms a whole
     remainder rather than using a modular inverse and comparing.  The test
     works because multiplying by the inverse maps the multiples of 7
     bijectively onto 0 to <code>MP_LIMB_T_MAX</code>/7, and all other values
     above that range.  This technique works for any odd modulus, and with
     some tweaks for even moduli too.  See Granlund and Montgomery "Division
     By Invariant Integers" section 9.
<li> <code>mpz_probab_prime_p</code> and <code>mpz_nextprime</code> could
     offer certainty for primes up to 2^32 by using a one limb miller-rabin
     test to base 2, combined with a table of actual strong pseudoprimes in
     that range (2314 of them).  If that table is too big then both base 2 and
     base 3 tests could be done, leaving a table of 104.  The test could use
     REDC and therefore be a <code>modlimb_invert</code>, a remainder (maybe),
     then two multiplies per bit (successively dependent).  Processors with
     pipelined multipliers could do base 2 and 3 in parallel.  Vector systems
     could do a whole bunch of bases in parallel, and perhaps offer near
     certainty up to 64-bits (certainty might depend on an exhaustive search
     of pseudoprimes up to that limit).  Obviously 2^32 is not a big number,
     but an efficient and certain calculation is attractive.  It might find
     other uses internally, and could even be offered as a one limb prime test
     <code>mpn_probab_prime_1_p</code> or <code>gmp_probab_prime_ui_p</code>
     perhaps.
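     For reference, the core of a one-limb strong pseudoprime test to base 2
     might look like the following sketch, done with plain 64-bit arithmetic
     rather than REDC, and without the pseudoprime table that would turn it
     into a certainty below 2^32 (the function name is illustrative):
<pre>
#include &lt;stdint.h&gt;

static int
strong_prp_2 (uint32_t n)   /* n odd, n &gt; 2 */
{
  uint32_t d = n - 1;
  int s = 0;
  uint64_t x = 1, b = 2;

  while ((d &amp; 1) == 0)      /* n-1 = d * 2^s with d odd */
    d &gt;&gt;= 1, s++;

  for (uint32_t e = d; e != 0; e &gt;&gt;= 1)   /* x = 2^d mod n */
    {
      if (e &amp; 1)
        x = x * b % n;
      b = b * b % n;
    }

  if (x == 1 || x == n - 1)
    return 1;
  while (--s &gt; 0)
    {
      x = x * x % n;
      if (x == n - 1)
        return 1;
    }
  return 0;
}
</pre>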
<li> <code>mpz_probab_prime_p</code> doesn't need to make a copy of
     <code>n</code> when the input is negative, it can setup an
     <code>mpz_t</code> alias, same data pointer but a positive size.  With no
     need to clear before returning, the recursive function call could be
     dispensed with too.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
     fraction but is exactly representable, eg. 0.5 in decimal.  These could be
     stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
     use <code>mpn_and_n</code> etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the <code>mpz</code>
     routines?
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case, rather than
     currently it looks like an <code>MPN_COPY_INCR</code> will be done to
     reduce prec+1 limbs to prec.
<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
     shift the dividend on-the-fly, since this should cost nothing on
     superscalar processors and avoid the need for temporary copying in
     <code>mpn_tdiv_qr</code>.
<li> <code>mpf_sqrt_ui</code> calculates prec+1 limbs, whereas just prec would
     satisfy the application requested precision.  It should suffice to simply
     reduce the rsize temporary to 2*prec-1 limbs.  <code>mpf_sqrt</code>
     might be similar.
<li> <code>invert_limb</code> generic C: The division could use dividend
     b*(b-d)-1 which is high:low of (b-1-d):(b-1), instead of the current
     (b-d):0, where b=2^<code>BITS_PER_MP_LIMB</code> and d=divisor.  The
     former is per the original paper and is used in the x86 code, the
     advantage is that the current special case for 0x80..00 could be dropped.
     The two should be equivalent, but a little check of that would be wanted.
<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
     <code>num2*den1</code> products limb-by-limb from high to low and look at
     each step for values differing by more than the possible carry bit from
     the uncalculated portion.
<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
     and compare.  The benefits of karatsuba and higher multiplication
     algorithms are lost, but if it's assumed only a few high limbs will be
     needed to determine an order then that's fine.
<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
     <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
     instead of the functions, so they get inlined on all compilers, not just
     gcc and others with <code>inline</code> recognised in gmp.h.
     <code>__GMPN_ADD_1</code> etc are meant mostly to support application
     inline <code>mpn_add_1</code> etc and if they don't come out good for
     internal uses then special forms can be introduced, for instance many
     internal uses are in-place.  Sometimes a block of code is executed based
     on the carry-out, rather than using it arithmetically, and those places
     might want to do their own loops entirely.
<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
     bitfield for the mantissa extraction, not two, when endianness permits.
     Might depend on the compiler allowing <code>long long</code> bit fields
     when that's the only actual 64-bit type.
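     A sketch of the single-bitfield layout, assuming IEEE format, a
     little-endian machine, and a compiler that accepts 64-bit bit-fields
     (the union name is hypothetical):
<pre>
union gmp_ieee_double
{
  double d;
  struct
  {
    unsigned long long mant : 52;   /* whole mantissa in one field */
    unsigned long long exp  : 11;
    unsigned long long sign : 1;
  } s;
};
</pre>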
<li> <code>mpf_get_d</code> could be more like <code>mpz_get_d</code> and do
     more in integers and give the float conversion as such a chance to round
     in its preferred direction.  Some code sharing ought to be possible.  Or
     if nothing else then for consistency the two ought to give identical
     results on integer operands (not clear if this is so right now).
<li> <code>usqr_ppm</code> or some such could do a widening square in the
     style of <code>umul_ppmm</code>.  This would help 68000, and be a small
     improvement for the generic C (which is used on UltraSPARC/64 for
     instance).  GCC recognises the generic C ul*vh and vl*uh are identical,
     but does two separate additions to the rest of the result.
<li> tal-notreent.c could keep a block of memory permanently allocated.
     Currently the last nested <code>TMP_FREE</code> releases all memory, so
     there's an allocate and free every time a top-level function using
     <code>TMP</code> is called.  Would need
     <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
     any cached memory when changing allocation functions though.
<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
     inlined.  If the current chunk has enough room then a couple of pointers
     can be updated.  Only if more space is required then a call to some sort
     of <code>__gmp_tmp_increase</code> would be needed.  The requirement that
     <code>TMP_ALLOC</code> is an expression might make the implementation a
     bit ugly and/or a bit sub-optimal.
<pre>
#define TMP_ALLOC(n)                                     \
  ((ROUND_UP(n) &gt; current-&gt;end - current-&gt;point ? \
     __gmp_tmp_increase (ROUND_UP (n)) : 0),             \
   current-&gt;point += ROUND_UP (n),                     \
   current-&gt;point - ROUND_UP (n))
</pre>
<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
     never used.  Perhaps the table should just go up to base 16, and have
     code to generate data above that, if and when required.  Naturally this
     assumes the code would be smaller than the data saved.
<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
     if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
     otherwise, to save space.
<li> Make <code>mpf_get_str</code> and <code>mpf_set_str</code> call the
     corresponding, much faster, mpn functions.
<li> <code>mpn_mod_1</code> could pre-calculate values of R mod N, R^2 mod N,
     R^3 mod N, etc, with R=2^<code>BITS_PER_MP_LIMB</code>, and use them to
     process multiple limbs at each step by multiplying.  Suggested by Peter
     L. Montgomery.
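     A two-limb illustration of the idea on 32-bit "limbs" in plain C: with
     R mod n precomputed, a multiply and add replace a division per limb,
     leaving a single division at the end.
<pre>
#include &lt;stdint.h&gt;

/* x = x1*R + x0 with R = 2^32, so x mod n == (x1*(R mod n) + x0) mod n */
static uint32_t
mod_fold (uint32_t x1, uint32_t x0, uint32_t n, uint32_t r1 /* R mod n */)
{
  uint64_t t = (uint64_t) x1 * r1 + x0;   /* fits in 64 bits */
  return (uint32_t) (t % n);              /* one final division */
}
</pre>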
<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
     are of course fast, it seems a little silly to make a second pass over
     the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
     that with the bit extractions.
<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
     A), and A&lt;B, then the code ends up generating the cofactor T (of B) and
     deriving S from that.  Perhaps it'd be possible to arrange to get S in
     the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
     might only be an advantage if A and B are about the same size.
<li> <code>mpn_toom3_mul_n</code>, <code>mpn_toom3_sqr_n</code>: Temporaries
     <code>B</code> and <code>D</code> are adjacent in memory and at the final
     coefficient addition look like they could use a single
     <code>mpn_add_n</code> of <code>l4</code> limbs rather than two of
     <code>l2</code> limbs.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> <code>udiv_qrnnd_preinv2norm</code>, the branch-free version of
     <code>udiv_qrnnd_preinv</code>, might be faster on various pipelined
     chips.  In particular the first <code>if (_xh != 0)</code> in
     <code>udiv_qrnnd_preinv</code> might be roughly a 50/50 chance and might
     branch predict poorly.  (The second test is probably almost always
     false.)  Measuring with the tuneup program would be possible, but perhaps
     a bit messy.  In any case maybe the default should be the branch-free
     version.
     <br>
     Note that the current <code>udiv_qrnnd_preinv2norm</code> implementation
     assumes a right shift will sign extend, which is not guaranteed by the C
     standards, and doesn't happen on Cray vector systems.
<li> Run the `tune' utility for more compiler/CPU combinations.  We would like
     to have gmp-mparam.h files in practically every implementation specific
     mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
     compiler.  See the `tune' top-level directory for more information.
	<pre>
	#ifdef __GNUC__
	#if __GNUC__ == 2 && __GNUC_MINOR__ == 7
	...
	#endif
	#if __GNUC__ == 2 && __GNUC_MINOR__ == 8
	...
	#endif
	#ifndef MUL_KARATSUBA_THRESHOLD
	/* Default GNUC values */
	...
	#endif
	#else /* system compiler */
	...
	#endif	</pre>
<li> <code>invert_limb</code> on various processors might benefit from the
     little Newton iteration done for alpha and ia64.
<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
     <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     and <code>mpn_submul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better in four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
<li> Alpha 21264 ev67: Use <code>ctlz</code> and <code>cttz</code> for
     <code>count_leading_zeros</code> and <code>count_trailing_zeros</code>.
     Use inline for gcc, probably want asm files for elsewhere.
<li> ARC: gcc longlong.h sets up <code>umul_ppmm</code> to call
     <code>__umulsidi3</code> in libgcc.  Could be copied straight across, but
     perhaps ought to be tested.
<li> ARM: On v5 cpus see if the <code>clz</code> instruction can be used for
     <code>count_leading_zeros</code>.
<li> Itanium: <code>mpn_divexact_by3</code> isn't particularly important, but
     the generic C runs at about 27 c/l, whereas with the multiplies off the
     dependent chain about 3 c/l ought to be possible.
<li> Itanium: <code>mpn_hamdist</code> could be put together based on the
     current <code>mpn_popcount</code>.
<li> Itanium: <code>popc_limb</code> in gmp-impl.h could use the
     <code>popcnt</code> insn.
<li> Itanium: <code>mpn_submul_1</code> is not implemented directly, only via
     a combination of <code>mpn_mul_1</code> and <code>mpn_sub_n</code>.
<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     for s2 &lt; 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
     much this can improve the speed, though, since the symmetry that we rely
     on is lost.  Perhaps we can just gain cycles when s2 &lt; 2^16, or more
     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
     <code>mpn_addmul_1</code>.
<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
     "<code>mulx</code>"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "<code>mulx</code>"s.
<li> UltraSPARC/64: <code>mpn_divrem_1</code>, <code>mpn_mod_1</code>,
     <code>mpn_divexact_1</code> and <code>mpn_modexact_1_odd</code> could
     process 32 bits at a time when the divisor fits 32-bits.  This will need
     only 4 <code>mulx</code>'s per limb instead of 8 in the general case.
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 &lt; 2^16.
<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
     possible (see commented out code in longlong.h).  This is unlikely to
     save more than a couple of cycles, so perhaps isn't worth bothering with.
<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
     or anything to indicate V9 support when -mcpu=v9 is selected.  See
     gcc/config/sol2-sld-64.h.  Will need to pass something through from
     ./configure to select the right code in longlong.h.  (Currently nothing
     is lost because <code>mulx</code> for multiplying is commented out.)
<li> UltraSPARC: <code>modlimb_invert</code> might save a few cycles from
     masking down to just the useful bits at each point in the calculation,
     since <code>mulx</code> speed depends on the highest bit set.  Either
     explicit masks or small types like <code>short</code> and
     <code>int</code> ought to work.
<li> Sparc64 HAL R1: <code>mpn_popcount</code> and <code>mpn_hamdist</code>
     could use <code>popc</code> currently commented out in gmp-impl.h.  This
     chip reputedly implements <code>popc</code> properly (see gcc sparc.md),
     would need to recognise the chip as <code>sparchalr1</code> or something
     in configure / config.sub / config.guess.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 &lt; 2^32;
     it should be possible to make them run at about 5 cycles/limb.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
<li> PowerPC-32: <code>mpn_rshift</code> should do the same sort of unrolled
     loop as <code>mpn_lshift</code>.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
     Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
     movs</code> would upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
     <code>__builtin_memcpy</code>.  Examine the code for some simple and
     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable, it'd be both smaller and faster.
<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
     <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
<li> Pentium P55 MMX: <code>mpn_divexact_1</code> and
     <code>mpn_modexact_1_odd</code> on 16-bit divisors could use MMX
     multiplies and run at around 16 cycles (down from 23).
<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
     might benefit from some destination prefetching.
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
<li> P6: <code>mpn_add_n</code> and <code>mpn_sub_n</code> should be able to go
     faster than the generic x86 code at 3.5 c/l.  The athlon code for instance
     runs at about 2.7.
<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
     do something branch-free for unaligned startups, and shaving one insn
     from the loop with alternative indexing might save a cycle.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
<li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.  Should
     run at just 3.25 cycles/limb.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical
     multiplication method" to avoid carry propagation, splitting one of the
     operands into 11-bit chunks.
<li> 68k, Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine, could let both live in lshift.asm and omit rshift.asm on finding
     <code>mpn_rshift</code> already provided.
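     The underlying identity, with 32-bit limbs, is that both directions use
     the same limb-combining step, just offset by one limb.  A sketch:
<pre>
#include &lt;stdint.h&gt;

/* r[i] = (u[i] &lt;&lt; 31) | (u[i-1] &gt;&gt; 1) is exactly the
   rshift-by-1 combine shifted along one limb, hence the possible sharing */
static void
lshift31 (uint32_t *rp, const uint32_t *up, int n)
{
  int i;
  rp[0] = up[0] &lt;&lt; 31;
  for (i = 1; i &lt; n; i++)
    rp[i] = (up[i] &lt;&lt; 31) | (up[i - 1] &gt;&gt; 1);
  /* bits shifted out the top: up[n-1] &gt;&gt; 1 */
}
</pre>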
<li> Cray T3E: Experiment with optimization options.  In particular,
     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
<li> Cray: <code>mpn_com_n</code> and <code>mpn_and_n</code> etc very probably
     wants a pragma like <code>MPN_COPY_INCR</code>.
<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
     <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
     and could be inlined to avoid function calls.
<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
     scheme.  Not sure why, maybe they merely give the compiler more
     information about aliasing (or the lack thereof).  Would like to modify
     <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
     blocks wanted unconditionally are easy enough, those wanted only
     sometimes are a problem.  Perhaps a special size calculation to ask for a
     dummy length 1 when unwanted, or perhaps an inlined subroutine
     duplicating code under each conditional.  Don't really want to turn
     everything into a dog's dinner just because Cray don't offer an
     <code>alloca</code>.
<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
     Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
     limb might prevent it.  Perhaps special cases for binary, octal and hex
     would be worthwhile (very possibly for all processors too).
<li> Cray: <code>popc_limb</code> could use the Cray <code>_popc</code>
     intrinsic.  That would help <code>mpz_hamdist</code> and might make the
     generic C versions of <code>mpn_popcount</code> and
     <code>mpn_hamdist</code> suffice for Cray (if it vectorizes, or can be
     given a hint to do so).
<li> 68000: <code>mpn_mul_1</code> could check for a 16-bit multiplier and use
     two multiplies per limb, not four.  Ditto <code>mpn_addmul_1</code> and
     <code>mpn_submul_1</code>.
<li> 68000: <code>mpn_lshift</code> and <code>mpn_rshift</code> could use a
     <code>roll</code> and mask instead of <code>lsrl</code> and
     <code>lsll</code>.  This promises to be a speedup, effectively trading a
     6+2*n shift for one or two 4 cycle masks.  Suggested by Jean-Charles
     Meyrignac.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
  <pre>
	   if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
	   if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
	   ... </pre>
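     Filled out, the whole binary search might look like this (a sketch
     assuming 64-bit limbs and a non-zero operand):
<pre>
#include &lt;stdint.h&gt;

static int
clz64 (uint64_t x)   /* x != 0 */
{
  int cnt = 0;
  if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
  if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
  if ((x &gt;&gt; 56) == 0) { x &lt;&lt;= 8;  cnt += 8; }
  if ((x &gt;&gt; 60) == 0) { x &lt;&lt;= 4;  cnt += 4; }
  if ((x &gt;&gt; 62) == 0) { x &lt;&lt;= 2;  cnt += 2; }
  if ((x &gt;&gt; 63) == 0) { cnt += 1; }
  return cnt;
}
</pre>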
<li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
     be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
     to identify suitable versions of that compiler?
<li> VAX D and G format <code>double</code> floats are straightforward and
     could perhaps be handled directly in <code>__gmp_extract_double</code>
     and maybe in <code>mpz_get_d</code>, rather than falling back on the
     generic code.  GCC defines <code>__GFLOAT</code> when -mg has selected G
     format (which would be possible via a user <code>CFLAGS</code>).
<li> <code>mpn_get_str</code> final divisions by the base with
     <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
     on suitable machines.  This ends up happening for decimal by presenting
     the compiler with a run-time constant, but the same for other bases would
     be good.  Perhaps use could be made of the fact base&lt;256.
</ul>

<h4>New Functionality</h4>
<ul>
<li> Add in-memory versions of <code>mp?_out_raw</code> and
     <code>mp?_inp_raw</code>.
<li> <code>mpz_get_nth_ui</code>.  Return the nth word (not necessarily the
     nth limb).
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
<li> Let `0b' and `0B' mean binary input everywhere.
<li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
     Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
     <code>_mpz_realloc</code> do the initial alloc.  Set
     <code>z-&gt;_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
     similar can unconditionally fetch from.  Niels Möller has had a go at
     this.
     <br>
     The advantages of the lazy scheme would be:
     <ul>
     <li> Initial allocate would be the size required for the first value
          stored, rather than getting 1 limb in <code>mpz_init</code> and then
          more or less immediately reallocating.
     <li> <code>mpz_init</code> would only store magic values in the
          <code>mpz_t</code> fields, and could be inlined.
     <li> A fixed initializer could even be used by applications, like
          <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
          for globals.
     </ul>
     The advantages of the current scheme are:
     <ul>
     <li> <code>mpz_set_ui</code> and other similar routines needn't check the
          size allocated and can just store unconditionally.
     <li> <code>mpz_set_ui</code> and perhaps others like
          <code>mpz_tdiv_r_ui</code> and a prospective
          <code>mpz_set_ull</code> could be inlined.
     </ul>
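     A minimal sketch of the lazy scheme, using the gmp-impl.h style
     <code>ALLOC</code>/<code>SIZ</code>/<code>PTR</code> accessors; the
     dummy limb and the initializer are hypothetical names:
<pre>
/* gmp.h and gmp-impl.h internals assumed */
static mp_limb_t __gmp_dummy_limb = 0;             /* hypothetical */
#define MPZ_INITIALIZER  {{ 0, 0, &amp;__gmp_dummy_limb }}

void
mpz_init_lazy (mpz_t z)            /* small enough to inline */
{
  ALLOC (z) = 0;                   /* nothing allocated yet */
  SIZ (z) = 0;
  PTR (z) = &amp;__gmp_dummy_limb;   /* mpz_get_ui etc can fetch blindly */
}
</pre>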
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.
<li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
     logops and copys available in gmp.h, either as library functions or
     inlines, with the availability of library functions instantiated in the
     generated gmp.h at build time.
<li> <code>mpz_set_str</code> etc variants taking string lengths rather than
     null-terminators.
<li> Consider changing the thresholds to apply the simpler algorithm when
     "<code>&lt;=</code>" rather than "<code>&lt;</code>", so a threshold can
     be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
     compiler will know <code>size &lt;= MP_SIZE_T_MAX</code> is always true).
     Alternately it looks like the <code>ABOVE_THRESHOLD</code> and
     <code>BELOW_THRESHOLD</code> macros can do this adequately, and also pick
     up cases where a threshold of zero should mean only the second algorithm.
<li> <code>mpz_nthprime</code>.
<li> Perhaps <code>mpz_init2</code>, initializing and making initial room for
     N bits.  The actual size would be rounded up to a limb, and perhaps an
     extra limb added since so many <code>mpz</code> routines need that on
     their destination.
<li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
     <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
     if they could share code with the current such functions (which should be
     possible).
<li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
     Niels Möller.
<li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
     accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
     decimal in this case, with a further 0x, 0b etc allowed there.
     Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
     probably something like "0.123" ought not mean octal.
<li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
     feature of gmp.h, so applications could know whether to
     <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
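     An application might then print a limb portably with something like the
     following (treating <code>GMP_LONG_LONG_LIMB</code> as the proposed, not
     yet existing, define):
<pre>
#include &lt;stdio.h&gt;
#include &lt;gmp.h&gt;

void
print_limb (mp_limb_t limb)
{
#ifdef GMP_LONG_LONG_LIMB          /* proposed define */
  printf ("%llu", (unsigned long long) limb);
#else
  printf ("%lu", (unsigned long) limb);
#endif
}
</pre>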
<li> <code>PRIdMP_LIMB</code> and similar defines following C99
     &lt;inttypes.h&gt; might be of use to applications printing limbs.
     Perhaps they should be defined only if specifically requested, the way
     &lt;inttypes.h&gt; does.  But if <code>GMP_LONG_LONG_LIMB</code> or
     whatever is added then perhaps this can easily enough be left to
     applications.
<li> <code>mpf_get_ld</code> and <code>mpf_set_ld</code> converting
     <code>mpf_t</code> to and from <code>long double</code>.  Other
     <code>long double</code> routines would be desirable too, but these would
     be a start.  Often <code>long double</code> is the same as
     <code>double</code>, which is easy but pretty pointless.  Should
     recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on
     sparc, hppa and power.  Might like an ABI sub-option or something when
     it's a compiler option for 64-bit or 128-bit <code>long double</code>.
<li> <code>gmp_printf</code> could usefully accept an arbitrary base, for both
     integer and float conversions.  Either a number in the format string or a
     <code>*</code> to take a parameter should be allowed.  Maybe
     <code>&amp;13b</code> (b for base) or something like that.
<li> <code>gmp_printf</code> could perhaps have a type code for an
     <code>mp_limb_t</code>.  That would save an application from having to
     worry whether it's a long or a long long.
<li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
     conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
     convenience, but still might be useful.  Rounding would be the same as
     for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
     actually documented).  Alternately, perhaps a separate
     <code>mpq_get_str_point</code> or some such might be more use.  Suggested
     by Pedro Gimeno.
<li> <code>gmp_printf</code> could usefully accept a flag to control the
     rounding of float conversions.  This wouldn't do much for
     <code>mpf_t</code>, but would be good if <code>mpfr_t</code> was
     supported in the future, or perhaps for <code>mpq_t</code>.  Something
     like <code>&amp;*r</code> (r for rounding, and mpfr style
     <code>GMP_RND</code> parameter).
<li> <code>mpz_togglebit</code> or <code>mpz_chgbit</code> or some such might
     be a good companion to <code>mpz_setbit</code> and
     <code>mpz_clrbit</code>.  Suggested by Niels Möller.
<li> <code>mpz_scan0_reverse</code> or <code>mpz_scan0low</code> or some such
     searching towards the low end of an integer might match
     <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
     Suggested by Roberto Bagnara.
<li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
     bitwise subset of another might be of use.  Some sort of return value
     indicating whether it's a proper or non-proper subset would be good and
     wouldn't cost anything in the implementation.  Suggested by Roberto
     Bagnara.
<li> <code>gmp_randinit_r</code> and maybe <code>gmp_randstate_set</code> to
     init-and-copy or to just copy a <code>gmp_randstate_t</code>.  Suggested
     by Pedro Gimeno.
<li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
     <code>mpf_t</code> and <code>long double</code>, suggested by Dan
     Christensen.  There'd be some work to be done by <code>configure</code>
     to recognise the format in use.  xlc on aix for instance apparently has
     an option for either plain double 64-bit or quad 128-bit precision.  This
     might mean library contents vary with the compiler used to build, which
     is undesirable.  It might be possible to detect the mode the application
     is compiling with, and try to avoid mismatch problems.
<li> <code>mpz_sqrt_if_perfect_square</code>: When
     <code>mpz_perfect_square_p</code> does its tests it calculates a square
     root and then discards it.  For some applications it might be useful to
     return that root.  Suggested by Jason Moxham.
<li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
     <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
     <code>long long</code>.  These would aid interoperability, though a
     mixture of GMP and <code>long long</code> would probably not be too
     common.  Disadvantages of using <code>long long</code> in libgmp.a would
     be
     <ul>
     <li> Library contents vary according to the build compiler.
     <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
          application compiler could take the <code>long long</code>
          prototypes.
     <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> would be wanted to
          indicate whether the functions are available.  (Applications using
          autoconf could probe the library too.)
     </ul>
     It'd be possible to defer the need for <code>long long</code> to
     application compile time, by having something like
     <code>mpz_set_2ui</code> called with two halves of a <code>long
     long</code>.  Disadvantages of this would be,
     <ul>
     <li> Bigger code in the application, though perhaps not if a <code>long
          long</code> is normally passed as two halves anyway.
     <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
          to be two function calls.
     <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
          treatment of <code>-0x10..00</code> into applications (see
          <code>mpz_get_si</code> correctness above).
     <li> Although having libgmp.a independent of the build compiler is nice,
          it sort of sacrifices the capabilities of a good compiler to
          uniformity with inferior ones.
     </ul>
     Plain use of <code>long long</code> is probably the lesser evil, if only
     because it makes best use of gcc.
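     The two-halves approach might look like this sketch, built here from
     existing mpz calls rather than a real <code>mpz_set_2ui</code>, and
     assuming a 32-bit <code>unsigned long</code>:
<pre>
#include &lt;gmp.h&gt;

/* hypothetical helper, deferring long long to application compile time */
void
mpz_set_ull_sketch (mpz_t z, unsigned long long v)
{
  mpz_set_ui (z, (unsigned long) (v &gt;&gt; 32));
  mpz_mul_2exp (z, z, 32);
  mpz_add_ui (z, z, (unsigned long) (v &amp; 0xffffffff));
}
</pre>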
<li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
     Suggested by Alexander Kruppa.
</ul>


<h4>Configuration</h4>

<ul>
<li> Floating-point format: Determine this with a feature test.  Get rid of
     the <code>#ifdef</code> mess in gmp-impl.h.  This is simple when doing a
     native compile, but needs a reliable way to examine object files when
     cross-compiling.  Falling back on a run-time test would be reasonable, if
     build time tests fail.
<li> a29k: umul.s and udiv.s exist but don't get used.
<li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
     but is that available only for M series chips or some such?  Perhaps it
     should be configured in some way.
<li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
<li> HPPA 2.0w: gcc is rumoured to support 2.0w as of version 3, though
     perhaps just as a build-time choice.  In any case, figure out how to
     identify a suitable gcc or put it in the right mode, for the GMP compiler
     choices.
<li> IA64: Latest libtool has some nonsense to detect ELF-32 or ELF-64 on
     <code>ia64-*-hpux*</code>.  Does GMP need to know anything about that?
<li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
     "hinv -c processor" gives lots of information on Irix.  Standard
     config.guess appends "el" to indicate endianness.  GMP currently only
     cares about that for a small <code>mpz_inp_raw</code> and
     <code>mpz_out_raw</code> optimization.  It's hoped
     <code>AC_C_BIGENDIAN</code> can be relied on to interrogate the compiler.
<li> PowerPC-32: gmp-mparam.h comes out quite different for a 750 than a 604e,
     it'd be good to select the right one, probably by having CPU types
     powerpc604, powerpc750 etc.
<li> PowerPC: The crazy explicit TOC setups for AIX are currently driven by
     <code>*-*-aix*</code>.  It might be more reliable to do some sort of
     feature test or to examine the compiler output.  It might also be nice to
     merge the aix.m4 files into powerpc-defs.m4.
<li> Sparc: config.guess should say supersparc, microsparc, ultrasparc1,
     ultrasparc2, etc.  "prtconf -vp" gives lots of information about a
     Solaris system.
<li> Sparc: recognise sparclite and sparclet (which configfsf.sub accepts).
     These have <code>umul</code> but not <code>udiv</code>, or something like
     that.  Check the mpn/sparc32/v8 code is suitable, and add -mcpu= options
     for gcc.
<li> Sparc32: floating point or integer <code>udiv</code> should be selected
     according to the CPU target.  Currently floating point ends up being
     used on all sparcs, which is probably not right for generic V7 and V8.
<li> Sparc: The use of <code>-xtarget=native</code> with <code>cc</code> is
     incorrect when cross-compiling, the target should be set according to the
     configured <code>$host</code> CPU.
<li> m68k: config.guess can detect 68000, 68010, CPU32 and 68020, but relies
     on system information for 030, 040 and 060.  Can they be identified by
     running some code?
<li> m68k: gas 2.11.90.0.1 pads with zero bytes in text segments, which is not
     valid code.  Probably need <code>.balignw &lt;n&gt;,0x4e7f</code> to get
     nops, if <code>ALIGN</code> is going to be used for anything that's
     executed across.
<li> Some CPUs have <code>umul</code> and <code>udiv</code> code not being
     used.  Check all such for bit rot and then put umul and udiv in
     <code>$gmp_mpn_functions_optional</code> as "standard optional" objects.
     <br> In particular Sparc and SparcV8 on non-gcc should benefit from
     umul.asm enabled; the generic umul is suspected to be making sqr_basecase
     slower than mul_basecase.
<li> HPPA <code>mpn_umul_ppmm</code> and <code>mpn_udiv_qrnnd</code> have a
     different parameter order than those functions on other CPUs.  It might
     avoid confusion to have them under different names, maybe
     <code>mpn_umul_ppmm_r</code> or some such.  Prototypes then wouldn't
     be conditionalized, and the appropriate form could be selected with the
     <code>HAVE_NATIVE</code> scheme if/when the code switches to use a
     <code>PROLOGUE</code> style.
<li> <code>DItype</code>: The setup in gmp-impl.h for non-GCC could use an
     autoconf test to determine whether <code>long long</code> is available.
<li> m88k: Make the assembler code work on non-underscore systems.  Conversion
     to .asm would be desirable.  Ought to be easy, but would want to be
     tested.
<li> z8k: The use of a 32-bit limb in mpn/z8000x as opposed to 16-bits in
     mpn/z8000 could be an ABI choice.  But this chip is obsolete and nothing
     is likely to be done unless someone is actively using it.
<li> config.m4 is generated only by the configure script, it won't be
     regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
     would work, but it might upset "make" to have things like <code>L$</code>
     get into the Makefiles through <code>AC_SUBST</code>.
     <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
     careful m4 quoting the <code>changequote</code> calls might not be
     needed, which might free up the order in which things had to be output.
<li> <code>make distclean</code>: Only the mpn directory links which were
     created are removed, but perhaps all possible links should be removed, in
     case someone runs configure a second time without a
     <code>distclean</code> in between.  The only tricky part would be making
     sure all possible <code>extra_functions</code> are covered.
<li> MinGW: Apparently a Cygwin version of gcc can be used by passing
     <code>-mno-cygwin</code>.  For <code>--host=*-*-mingw32*</code> it might
     be convenient to automatically use that option, if it works.  Needs
     someone with a dual cygwin/mingw setup to test.
<li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
     scheme.  Though we probably wouldn't be using its assembler support, we
     could use those variables in compatible ways.
</ul>


<h4>Random Numbers</h4>
<ul>
<li> <code>_gmp_rand</code> is not particularly fast on the linear
     congruential algorithm and could stand various improvements.
     <ul>
     <li> Make a second seed area within <code>gmp_randstate_t</code> (or
          <code>_mp_algdata</code> rather) to save some copying.
     <li> Make a special case for a single limb <code>2exp</code> modulus, to
          avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
     <li> Inline the <code>lc</code> code, to avoid a function call and
          <code>TMP_ALLOC</code> for every chunk.
     <li> The special case for <code>seedn==0</code> will be very rarely used,
          and on that basis seems unnecessary.
     <li> Perhaps the <code>2exp</code> and general LC cases should be split,
          for clarity (if the general case is retained).
     </ul>
<li> <code>gmp_randinit_mm</code> (named after Mitchell and Moore) for the
     55-element delayed Fibonacci generator from Knuth vol 2.  Being additive
     it should be fast, and might be random enough for GMP test program
     purposes, if nothing else.  Niels Möller has started on this.
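     The recurrence from Knuth is X[n] = (X[n-24] + X[n-55]) mod 2^e.  A
     sketch of the kernel with a 55-entry ring buffer, seeding omitted and
     names illustrative:
<pre>
#include &lt;stdint.h&gt;

static uint64_t ring[55];    /* seeded elsewhere; must not be all even */
static int j = 31, k = 0;    /* ring[j] = X[n-24], ring[k] = X[n-55] */

static uint64_t
mm_next (void)
{
  uint64_t x = ring[j] + ring[k];   /* addition mod 2^64 is free */
  ring[k] = x;
  j = (j + 1) % 55;
  k = (k + 1) % 55;
  return x;
}
</pre>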
<li> <code>gmp_randinit_lc</code>: Finish or remove.  Doing a division for
     every step won't be very fast, so check whether the usefulness of
     this algorithm can be justified.
<li> Blum-Blum-Shub: Finish or remove.  A separate
     <code>gmp_randinit_bbs</code> would be wanted, not the currently
     commented out case in <code>gmp_randinit</code>.
<li> <code>_gmp_rand</code> could be done as a function pointer within
     <code>gmp_randstate_t</code> (or rather in the <code>_mp_algdata</code>
     part), instead of switching on a <code>gmp_randalg_t</code>.  Likewise
     <code>gmp_randclear</code>, and perhaps <code>gmp_randseed</code> if it
     became algorithm-specific.  This would be more modular, and would ensure
     only code for the desired algorithms is dragged into the link.
<li> <code>mpz_urandomm</code> should do something for n&lt;=0, but what?
<li> <code>mpz_urandomm</code> implementation looks like it could be improved.
     Perhaps it's enough to calculate <code>nbits</code> as ceil(log2(n)) and
     call <code>_gmp_rand</code> until a value <code>&lt;n</code> is obtained.
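     Something along those lines as a sketch using existing GMP calls,
     assuming n &gt; 0 (the function name is illustrative):
<pre>
#include &lt;gmp.h&gt;

void
urandomm_sketch (mpz_t r, gmp_randstate_t s, const mpz_t n)
{
  unsigned long nbits = mpz_sizeinbase (n, 2);   /* floor(log2(n)) + 1 */
  do
    mpz_urandomb (r, s, nbits);
  while (mpz_cmp (r, n) &gt;= 0);   /* accepted with probability &gt;= 1/2 */
}
</pre>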
<li> <code>gmp_randstate_t</code> used for parameters perhaps should become
     <code>gmp_randstate_ptr</code> the same as other types.
<li> Some of the empirical randomness tests could be included in a "make
     check".  They ought to work everywhere, for a given seed at least.
</ul>


<h4>Miscellaneous</h4>
<ul>
<li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
     analogous to <code>mpz_mod</code>.  Document, and list as an
     incompatibility.
<li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
     what range of values the generated cofactors can take, and preferably
     ensure the definition uniquely specifies the cofactors for given inputs.
     A basic extended Euclidean algorithm or multi-step variant leads to
     |x|&lt;|b| and |y|&lt;|a| or something like that, but there's probably
     two solutions under just those restrictions.
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
<li> demos/factorize.c should use the GMP random functions when restarting
     Pollard-Rho, not <code>random</code> / <code>mrand48</code>.
<li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
     <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
     time would be better still.)
<li> The various test programs use quite a bit of the main
     <code>libgmp</code>.  This establishes good cross-checks, but it might be
     better to use simple reference routines where possible.  Where it's not
     possible some attention could be paid to the order of the tests, so a
     <code>libgmp</code> routine is only used for tests once it seems to be
     good.
<li> <code>mpf_set_q</code> is very similar to <code>mpf_div</code>; it'd be
     good for the two to share code.  Perhaps <code>mpf_set_q</code> should
     make some <code>mpf_t</code> aliases for its numerator and denominator
     and just call <code>mpf_div</code>.  Both would be simplified a good
     deal by switching to <code>mpn_tdiv_qr</code>, perhaps making them small
     enough not to bother with sharing (especially since
     <code>mpf_set_q</code> wouldn't need to watch out for overlaps).
<li> PowerPC: The cpu time base registers (per <code>mftb</code> and
     <code>mftbu</code>) could be used for the speed and tune programs.  Would
     need to know its frequency though, for instance it seemed to be 25 MHz on
     a couple of Apples (compared to the CPU speed of 350 or 450 MHz).
<li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
     return to a previous k at certain sizes.  This arises basically due to
     the step effect caused by size multiples effectively used for each k.
     Looking at a graph makes it fairly clear.
<li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
     on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
     of <code>mpf_get_str</code> could be made to suit it better.
</ul>
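
On the demos/factorize.c item above, a sketch of trial division with
<code>mpz_divisible_ui_p</code>, doing the division only once a test has
succeeded (the function name is illustrative; recording the factors found
is omitted).

<pre>
#include &lt;gmp.h&gt;

void
remove_small_factors (mpz_t n, const unsigned long *primes, int nprimes)
{
  int i;
  for (i = 0; i &lt; nprimes; i++)
    while (mpz_divisible_ui_p (n, primes[i]))
      mpz_divexact_ui (n, n, primes[i]);  /* division known exact, so cheap */
}
</pre>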


<h4>Aids to Development</h4>
<ul>
<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
     function to check the validity of each <code>mp?_t</code> parameter, in
     particular to check they've been <code>mp?_init</code>ed.  This might
     catch elementary mistakes in user programs.  Care would need to be taken
     over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
     else then consistency checks like size&lt;=alloc, ptr not
     <code>NULL</code> and ptr+size not wrapping around the address space,
     would be possible (a sketch of such checks appears after this list).  A
     more sophisticated scheme could track <code>_mp_d</code> pointers and
     ensure only a valid one is used.  Such a scheme probably wouldn't be
     reentrant, not without some help from the system.
<li> tune/time.c could try to determine at runtime whether
     <code>getrusage</code> and <code>gettimeofday</code> are reliable.
     Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
     <code>getrusage</code> doesn't exist.  If a test might take a long time
     to run then perhaps cache the result in a file somewhere.
</ul>
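
On the <code>ASSERT</code> item above, a possible shape for the basic
consistency checks on an <code>mpz_t</code>.  This is only an illustration
in terms of the internal field names, not the actual GMP assertion
machinery, and the alloc&gt;=1 condition would need care for
<code>MPZ_TMP_INIT</code>ed variables.

<pre>
/* assumes the ASSERT and ABS macros of gmp-impl.h */
#define ASSERT_MPZ_OK(z)                                        \
  do {                                                          \
    ASSERT ((z)-&gt;_mp_d != NULL);                              \
    ASSERT ((z)-&gt;_mp_alloc &gt;= 1);                           \
    ASSERT (ABS ((z)-&gt;_mp_size) &lt;= (z)-&gt;_mp_alloc);      \
    /* normalized: no high zero limb */                         \
    ASSERT ((z)-&gt;_mp_size == 0                                \
            || (z)-&gt;_mp_d[ABS ((z)-&gt;_mp_size) - 1] != 0);  \
  } while (0)
</pre>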


<h4>Documentation</h4>
<ul>
<li> Document conventions, like that <code>unsigned long int</code> is used
     for bit counts/ranges, and that <code>mp_size_t</code> is used for limb
     counts.
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
</ul>


<h4>Bright Ideas</h4>

The following may or may not be feasible, and aren't likely to get done in the
near future, but are at least worth thinking about.

<ul>
<li> Reorganize longlong.h so that we can inline the operations even for the
     system compiler.  When there is no such compiler feature, make calls to
     stub functions.  Write such stub functions for as many machines as
     possible.
<li> longlong.h could declare when it's using, or would like to use,
     <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
     included in libgmp only in that case, the same as is effectively done for
     <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
     would only be a very small space saving, so perhaps not worth the
     complexity.
<li> longlong.h could be built at configure time by concatenating or
     #including fragments from each directory in the mpn path.  This would
     select CPU specific macros the same way as CPU specific assembler code.
     Code used would no longer depend on cpp predefines, and the current
     nested conditionals could be flattened out.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
     sort of supposed to return the low 31 (or 63) bits.  But this is
     undocumented, and perhaps not too important.
<li> <code>mpz_*_ui</code> division routines currently return abs(a%b).
     Perhaps make them return the real remainder instead?  Return type would
     be <code>signed long int</code>.  But this would be an incompatible
     change, so it might have to be under newly named functions.
<li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
     say an extra 16 limbs over what's needed, so as to reduce the chance of
     having to do a reallocate if the <code>mpz_t</code> grows a bit more.
     This could only be an option, since it'd badly bloat memory usage in
     applications using many small values.
<li> <code>mpq</code> functions could perhaps check for numerator or
     denominator equal to 1, on the assumption that integers or
     denominator-only values might be expected to occur reasonably often.
<li> <code>count_trailing_zeros</code> is used on more or less uniformly
     distributed numbers in a couple of places.  For some CPUs
     <code>count_trailing_zeros</code> is slow and it's probably worth
     handling the frequently occurring cases of 0 to 2 trailing zeros
     specially (see the first sketch after this list).
<li> <code>mpf_t</code> might like to let the exponent be undefined when
     size==0, instead of requiring it to be 0 as now.  It should be possible
     to do size==0 tests before paying attention to the exponent.  The
     advantage is not needing to set exp in the various places a zero result
     can arise, which avoids some tedium but is otherwise perhaps not too
     important.  Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code>
     depend on exp==0, maybe elsewhere too.
<li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
     ((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
     allows that attribute on functions, but not function pointers (see info
     node "Attribute Syntax"), so would need a new autoconf test.  This can
     wait until there's a GCC that supports it.
<li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
     <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
     routines were opened up a bit, maybe that code could be shared.  When a
     copy needs to be done there's no carry to append for the add, and if the
     copy is non-empty no high zero for the sub. <br> An alternative would be
     to do a copy at the start and then an in-place add or sub (sketched
     after this list).  Obviously that duplicates the fetches and stores for
     carry propagation, but that's normally only one or two limbs.  The same
     applies to <code>mpz_add</code> when one operand is longer than the
     other, and to <code>mpz_com</code> since it's just -(x+1).
<li> <code>restrict</code>'ed pointers: Does the C99 definition of restrict
     (one writer many readers, or whatever it is) suit the GMP style "same or
     separate" function parameters?  If so, judicious use might improve the
     code generated a bit.  Do any compilers have their own flavour of
     restrict as "completely unaliased", and is that still usable?
<li> 68000: A 16-bit limb might suit 68000 better than 32-bits, since the
     native multiply is only 16x16.  Could have this as an <code>ABI</code>
     option, selecting <code>_SHORT_LIMB</code> in gmp.h.  Naturally a new set
     of asm subroutines would be necessary.  Would need new
     <code>mpz_set_ui</code> etc since the current code assumes limb&gt;=long,
     but 2-limb operand forms would find a use for <code>long long</code> on
     other processors too.
<li> Nx1 remainders can be taken at multiplier throughput speed by
     pre-calculating an array "p[i] = 2^(i*<code>BITS_PER_MP_LIMB</code>) mod
     m", then for the input limbs x calculating an inner product "sum
     p[i]*x[i]", and a final 3x1 limb remainder mod m.  If those powers take
     roughly N divide steps to calculate then there'd be an advantage any
     time the same m is used three or more times.  Suggested by Victor Shoup
     in connection with Chinese-remainder style decompositions, but perhaps
     with other uses.  A sketch appears after this list.
</ul>
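
On the <code>count_trailing_zeros</code> item above, a sketch of the
special-casing; <code>count_trailing_zeros</code> is the longlong.h macro,
returning its result through the first argument, and for uniformly
distributed non-zero n the fallback runs only 1 time in 8.

<pre>
/* assumes gmp-impl.h / longlong.h context; n must be non-zero */
static inline int
ctz_fast (mp_limb_t n)
{
  int c;
  if (n &amp; 1) return 0;
  if (n &amp; 2) return 1;
  if (n &amp; 4) return 2;
  count_trailing_zeros (c, n);   /* 3 or more trailing zeros */
  return c;
}
</pre>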
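
On the <code>mpz_add_ui</code> item, a sketch of the copy-first alternative
for u&gt;0, in terms of the gmp-impl.h macros and assuming limb&gt;=long;
the plain copy needs no carry appended, and <code>mpn_add_1</code> then
normally disturbs only a limb or two.

<pre>
/* assumes gmp-impl.h for SIZ, PTR, MPZ_REALLOC, MPN_COPY */
void
add_ui_sketch (mpz_t w, const mpz_t u, unsigned long v)
{
  mp_size_t n = SIZ (u);                /* assume u &gt; 0 */
  mp_ptr    wp = MPZ_REALLOC (w, n + 1);
  if (wp != PTR (u))
    MPN_COPY (wp, PTR (u), n);          /* plain copy, no carry */
  wp[n] = mpn_add_1 (wp, wp, n, (mp_limb_t) v);
  SIZ (w) = n + (wp[n] != 0);
}
</pre>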
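
And on the Nx1 remainder item, a sketch of the precomputed powers scheme
using mpz calls for clarity (hypothetical function names; a real version
would accumulate the roughly 3-limb sum in limbs directly, and this one
assumes a limb fits an unsigned long for <code>mpz_addmul_ui</code>).

<pre>
#include &lt;gmp.h&gt;

/* p[i] = B^i mod m for B = 2^mp_bits_per_limb, once per modulus m &gt; 1 */
void
mod_precomp_init (mpz_t *p, size_t n, const mpz_t m)
{
  size_t i;
  mpz_init_set_ui (p[0], 1);
  for (i = 1; i &lt; n; i++)
    {
      mpz_init (p[i]);
      mpz_mul_2exp (p[i], p[i-1], mp_bits_per_limb);
      mpz_mod (p[i], p[i], m);
    }
}

/* r = x mod m via the inner product sum p[i]*x[i]; r distinct from x,
   and p[] must cover at least mpz_size(x) entries */
void
mod_precomp (mpz_t r, const mpz_t x, const mpz_t m, mpz_t *p)
{
  size_t i, n = mpz_size (x);
  mpz_set_ui (r, 0);
  for (i = 0; i &lt; n; i++)
    mpz_addmul_ui (r, p[i], (unsigned long) mpz_getlimbn (x, i));
  mpz_mod (r, r, m);                    /* short final reduction */
}
</pre>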
<hr>

</body>
</html>

<!--
Local variables:
eval: (add-hook 'write-file-hooks 'time-stamp)
time-stamp-start: "This file current as of "
time-stamp-format: "%:d %3b %:y"
time-stamp-end: "\\."
time-stamp-line-limit: 50
End:
-->