1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>
GMP Development Projects
</title>
</head>
<body bgcolor=pink>
<center>
<h1>
GMP Development Projects
</h1>
</center>
<font size=-1>
Copyright 2000, 2001, 2002, 2003 Free Software Foundation, Inc. <br><br>
This file is part of the GNU MP Library. <br><br>
The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published
by the Free Software Foundation; either version 2.1 of the License, or (at
your option) any later version. <br><br>
The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details. <br><br>
You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB. If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
MA 02111-1307, USA.
</font>
<hr>
<!-- NB. timestamp updated automatically by emacs -->
<comment>
This file current as of 16 Sep 2003. An up-to-date version is available at
<a href="http://swox.com/gmp/projects.html">http://swox.com/gmp/projects.html</a>.
Please send comments about this page to
<a href="mailto:gmp-devel@swox.com">gmp-devel@swox.com</a>.
</comment>
<p> This file lists projects suitable for volunteers. Please see the
<a href="tasks.html">tasks file</a> for smaller tasks.
<p> If you want to work on any of the projects below, please let <a
href="mailto:gmp-devel@swox.com">gmp-devel@swox.com</a> know. If you want
to help with a project that already somebody else is working on, you will
get in touch through gmp-devel@swox.com. (There are no email addresses of
volunteers below, due to spamming problems.)
<ul>
<li> <strong>Faster multiplication</strong>
<p> The current multiplication code uses Karatsuba, 3-way Toom, and Fermat
FFT. Several new developments are desirable:
<ol>
<li> Handle multiplication of operands with different digit count better
than today. We now split the operands in a very inefficient way, see
mpn/generic/mul.c.
<li> Implement an FFT variant computing the coefficients mod m different
limb size primes of the form l*2^k+1. i.e., compute m separate FFTs.
The wanted coefficients will at the end be found by lifting with CRT
(Chinese Remainder Theorem). If we let m = 3, i.e., use 3 primes, we
can split the operands into coefficients at limb boundaries, and if
our machine uses b-bit limbs, we can multiply numbers with close to
2^b limbs without coefficient overflow. For smaller multiplication,
we might perhaps let m = 1, and instead of splitting our operands at
limb boundaries, split them in much smaller pieces. We might also use
4 or more primes, and split operands into bigger than b-bit chunks.
By using more primes, the gain in shorter transform length, but lose
in having to do more FFTs, but that is a slight total save. We then
lose in more expensive CRT.
<li> Perhaps consider N-way Toom. See Knuth's Seminumerical Algorithms for
details on the method. Code implementing it exists. This is
asymptotically inferior to FFTs, but is finer grained. A toom-4 might
fit in between toom-3 and the FFTs (or it might not).
<li> It's possible CPU dependent effects like cache locality will have a
greater impact on speed than algorithmic improvements.
<li> Add support for partial products, either a given number of low limbs
or high limbs of the result. A high partial product can be used by
<code>mpf_mul</code> now, a low half partial product might be of use
in a future sub-quadratic REDC. On small sizes a partial product
will be faster simply through fewer cross-products, similar to the
way squaring is faster. But work by Thom Mulders shows that for
Karatsuba and higher order algorithms the advantage is progressively
lost, so for large sizes partial products turn out to be no faster.
</ol>
<p> Another possibility would be an optimized cube. In the basecase that
should definitely be able to save cross-products in a similar fashion to
squaring, but some investigation might be needed for how best to adapt
the higher-order algorithms. Not sure whether cubing or further small
powers have any particularly important uses though.
<p> <li> <strong>Assembly routines</strong>
<p> Write new and improve existing assembly routines. The tests/devel
programs and the tune/speed.c and tune/many.pl programs are useful for
testing and timing the routines you write. See the README files in those
directories for more information.
<p> Please make sure your new routines are fast for these three situations:
<ol>
<li> Operands that fit into the cache.
<li> Small operands of less than, say, 10 limbs.
<li> Huge operands that does not fit into the cache.
</ol>
<p> The most important routines are mpn_addmul_1, mpn_mul_basecase and
mpn_sqr_basecase. The latter two don't exist for all machines, while
mpn_addmul_1 exists for almost all machines.
<p> Standard techniques for these routines are unrolling, software
pipelining, and specialization for common operand values. For machines with
poor integer multiplication, it is often possible to improve the performance
using floating-point operations, or SIMD operations such as MMX or Sun's VIS.
<p> Using floating-point operations is interesting but somewhat tricky.
Since IEEE double has 53 bit of mantissa, one has to split the operands in
small prices, so that no result is greater than 2^53. For 32-bit computers,
splitting one operand into 16-bit pieces works. For 64-bit machines, one
operand can be split into 21-bit pieces and the other into 32-bit pieces. (A
64-bit operand can be split into just three 21-bit pieces if one allows the
split operands to be negative!)
<p> <li> <strong>Faster extended GCD</strong>
<p> We currently have extended GCD based on Lehmer's method.
But the binary method can quite easily be improved for bignums
in a similar way Lehmer improved Euclid's algorithm. The result should
beat Lehmer's method by about a factor of 2. Furthermore, the multiprecision
step of Lehmer's method and the binary method will be identical, so the
current code can mostly be reused. It should be possible to share code
between GCD and GCDEXT, and probably with Jacobi symbols too.
<p> Paul Zimmermann has worked on sub-quadratic GCD and GCDEXT, but it seems
that the most likely algorithm doesn't kick in until about 3000 limbs.
<p> <li> <strong>Math functions for the mpf layer</strong>
<p> Implement the functions of math.h for the GMP mpf layer! Check the book
"Pi and the AGM" by Borwein and Borwein for ideas how to do this. These
functions are desirable: acos, acosh, asin, asinh, atan, atanh, atan2, cos,
cosh, exp, log, log10, pow, sin, sinh, tan, tanh.
<p> These are implemented in mpfr, redoing them in mpf might not be worth
bothering with, if the long term plan is to bring mpfr in as the new mpf.
<p> <li> <strong>Faster sqrt</strong>
<p> The current code uses divisions, which are reasonably fast, but it'd be
possible to use only multiplications by computing 1/sqrt(A) using this
formula:
<pre>
2
x = x (3 - A x )/2.
i+1 i i </pre>
And the final result
<pre>
sqrt(A) = A x .
n </pre>
That final multiply might be the full size of the input (though it might
only need the high half of that), so there may or may not be any speedup
overall.
<p> <li> <strong>Nth root</strong>
<p> Improve mpn_rootrem. The current code is really naive, using full
precision from the first iteration. Also, calling mpn_pow_1 isn't very
clever, as only 1/n of the result bits will be used; truncation after each
multiplication would be better. Avoiding division might also be possible.
<p> If the routine becomes fast enough, perhaps square roots could be computed
using this function.
<p> <li> <strong>Quotient-Only Division</strong>
<p> Some work can be saved when only the quotient is required in a division,
basically the necessary correction -0, -1 or -2 to the estimated
quotient can almost always be determined from only a few limbs of
multiply and subtract, rather than forming a complete remainder. The
greatest savings are when the quotient is small compared to the dividend
and divisor.
<p> Some code along these lines can be found in the current
<code>mpn_tdiv_qr</code>, though perhaps calculating bigger chunks of
remainder than might be strictly necessary. That function in its
current form actually then always goes on to calculate a full remainder.
Burnikel and Zeigler describe a similar approach for the divide and
conquer case.
<p> <li> <strong>Sub-Quadratic REDC and Exact Division</strong>
<p> <code>mpn_bdivmod</code> and the <code>redc</code> in
<code>mpz_powm</code> should use some sort of divide and conquer
algorithm. This would benefit <code>mpz_divexact</code>, and
<code>mpn_gcd</code> on large unequal size operands. See "Exact
Division with Karatsuba Complexity" by Jebelean for a (brief)
description.
<p> Failing that, some sort of <code>DIVEXACT_THRESHOLD</code> could be
added to control whether <code>mpz_divexact</code> uses
<code>mpn_bdivmod</code> or <code>mpn_tdiv_qr</code>, since the latter
is faster on large divisors.
<p> For the REDC, basically all that's needed is Montgomery's algorithm done
in multi-limb integers. R is multiple limbs, and the inverse and
operands are multi-precision.
<p> For exact division the time to calculate a multi-limb inverse is not
amortized across many modular operations, but instead will probably
create a threshold below which the current style
<code>mpn_bdivmod</code> is best. There's also Krandick and Jebelean,
"Bidirectional Exact Integer Division" to basically use a low to high
exact division for the low half quotient, and a quotient-only division
for the high half.
<p> It will be noted that low-half and high-half multiplies, and a low-half
square, can be used. These ought to each take as little as half the
time of a full multiply, or square, though work by Thom Mulders shows
the advantage is progressively lost as Karatsuba and higher algorithms
are applied.
<p> <li> <strong>Exceptions</strong>
<p> Some sort of scheme for exceptions handling would be desirable.
Presently the only thing documented is that divide by zero in GMP
functions provokes a deliberate machine divide by zero (on those systems
where such a thing exists at least). The global <code>gmp_errno</code>
is not actually documented, except for the old <code>gmp_randinit</code>
function. Being currently just a plain global means it's not
thread-safe.
<p> The basic choices for exceptions are returning an error code or having
a handler function to be called. The disadvantage of error returns is
they have to be checked, leading to tedious and rarely executed code,
and strictly speaking such a scheme wouldn't be source or binary
compatible. The disadvantage of a handler function is that a
<code>longjmp</code> or similar recovery from it may be difficult. A
combination would be possible, for instance by allowing the handler to
return an error code.
<p> Divide-by-zero, sqrt-of-negative, and similar operand range errors can
normally be detected at the start of functions, so exception handling
would have a clean state. What's worth considering though is that the
GMP function detecting the exception may have been called via some third
party library or self contained application module, and hence have
various bits of state to be cleaned up above it. It'd be highly
desirable for an exceptions scheme to allow for such cleanups.
<p> The C++ destructor mechanism could help with cleanups both internally
and externally, but being a plain C library we don't want to depend on
that.
<p> A C++ <code>throw</code> might be a good optional extra exceptions
mechanism, perhaps under a build option. For GCC
<code>-fexceptions</code> will add the necessary frame information to
plain C code, or GMP could be compiled as C++.
<p> Out-of-memory exceptions are expected to be handled by the
<code>mp_set_memory_functions</code> routines, rather than being a
prospective part of divide-by-zero etc. Some similar considerations
apply but what differs is that out-of-memory can arise deep within GMP
internals. Even fundamental routines like <code>mpn_add_n</code> and
<code>mpn_addmul_1</code> can use temporary memory (for instance on Cray
vector systems). Allowing for an error code return would require an
awful lot of checking internally. Perhaps it'd still be worthwhile, but
it'd be a lot of changes and the extra code would probably be rather
rarely executed in normal usages.
<p> A <code>longjmp</code> recovery for out-of-memory will currently, in
general, lead to memory leaks and may leave GMP variables operated on in
inconsistent states. Maybe it'd be possible to record recovery
information for use by the relevant allocate or reallocate function, but
that too would be a lot of changes.
<p> One scheme for out-of-memory would be to note that all GMP allocations
go through the <code>mp_set_memory_functions</code> routines. So if the
application has an intended <code>setjmp</code> recovery point it can
record memory activity by GMP and abandon space allocated and variables
initialized after that point. This might be as simple as directing the
allocation functions to a separate pool, but in general would have the
disadvantage of needing application-level bookkeeping on top of the
normal system <code>malloc</code>. An advantage however is that it
needs nothing from GMP itself and on that basis doesn't burden
applications not needing recovery. Note that there's probably some
details to be worked out here about reallocs of existing variables, and
perhaps about copying or swapping between "permanent" and "temporary"
variables.
<p> Applications desiring a fine-grained error control, for instance a
language interpreter, would very possibly not be well served by a scheme
requiring <code>longjmp</code>. Wrapping every GMP function call with a
<code>setjmp</code> would be very inconvenient.
<p> Another option would be to let <code>mpz_t</code> etc hold a sort of
NaN, a special value indicating an out-of-memory or other failure. This
would be similar to NaNs in MPFR. Unfortunately such a scheme could
only be used by programs prepared to handle such special values, since
for instance a program waiting for some condition to be satisfied could
become an infinite loop if it wasn't also watching for NaNs. The work
to implement this would be significant too, lots of checking of inputs
and intermediate results. And if <code>mpn</code> routines were to
participate in this (which they would have to internally) a lot of new
return values would need to be added, since of course there's no
<code>mpz_t</code> etc structure for them to indicate failure in.
<p> Stack overflow is another possible exception, but perhaps not one that
can be easily detected in general. On i386 GNU/Linux for instance GCC
normally doesn't generate stack probes for an <code>alloca</code>, but
merely adjusts <code>%esp</code>. A big enough <code>alloca</code> can
miss the stack redzone and hit arbitrary data. GMP stack usage is
normally a function of operand size, which might be enough for some
applications to know they'll be safe. Otherwise a fixed maximum usage
can probably be obtained by building with
<code>--enable-alloca=malloc-reentrant</code> (or
<code>notreentrant</code>). Arranging the default to be
<code>alloca</code> only on blocks up to a certain size and
<code>malloc</code> thereafter might be a better approach and would have
the advantage of not having calculations limited by available stack.
<p> Actually recovering from stack overflow is of course another problem.
It might be possible to catch a <code>SIGSEGV</code> in the stack
redzone and do something in a <code>sigaltstack</code>, on systems which
have that, but recovery might otherwise not be possible. This is worth
bearing in mind because there's no point worrying about tight and
careful out-of-memory recovery if an out-of-stack is fatal.
<p> Operand overflow is another exception to be addressed. It's easy for
instance to ask <code>mpz_pow_ui</code> for a result bigger than an
<code>mpz_t</code> can possibly represent. Currently overflows in limb
or byte count calculations will go undetected. Often they'll still end
up asking the memory functions for blocks bigger than available memory,
but that's by no means certain and results are unpredictable in general.
It'd be desirable to tighten up such size calculations. Probably only
selected routines would need checks, if it's assumed say that no input
will be more than half of all memory and hence size additions like say
<code>mpz_mul</code> won't overflow.
<p> <li> <strong>Test Suite</strong>
<p> Add a test suite for old bugs. These tests wouldn't loop or use
random numbers, but just test for specific bugs found in the
past.
<p> More importantly, improve the random number controls for test
collections:
<ol>
<li> Use the new random number framework.
<li> Have every test accept a seed parameter.
<li> Allow `make check' to set the seed parameter.
<li> Add more tests for important, now untested functions.
</ol>
<p> With this in place, it should be possible to rid GMP of
practically all bugs by having some dedicated GMP test machines
running "make check" with ever increasing seeds (and
periodically updating to the latest GMP).
<p> If a few more ASSERTs were sprinkled through the code, running
some tests with --enable-assert might be useful, though not a
substitute for tests on the normal build.
<p> An important feature of any random tests will be to record the
seeds used, and perhaps to snapshot operands before performing
each test, so any problem exposed can be reproduced.
<p> <li> <strong>Performance Tool</strong>
<p> It'd be nice to have some sort of tool for getting an overview of
performance. Clearly a great many things could be done, but some
primary uses would be,
<ol>
<li> Checking speed variations between compilers.
<li> Checking relative performance between systems or CPUs.
</ol>
<p> A combination of measuring some fundamental routines and some
representative application routines might satisfy these.
<p> The tune/time.c routines would be the easiest way to get good
accurate measurements on lots of different systems. The high level
<code>speed_measure</code> may or may not suit, but the basic
<code>speed_starttime</code> and <code>speed_endtime</code> would cover
lots of portability and accuracy questions.
<p> <li> <strong><code>restrict</code></strong>
<p> There might be some value in judicious use of C99 style
<code>restrict</code> on various pointers, but this would need some
careful thought about what it implies for the various operand overlaps
permitted in GMP.
<p> Rumour has it some pre-C99 compilers had <code>restrict</code>, but
expressing tighter (or perhaps looser) requirements. Might be worth
investigating that before using <code>restrict</code> unconditionally.
<p> Loops are presumably where the greatest benefit would be had, by
allowing the compiler to advance reads ahead of writes, perhaps as part
of loop unrolling. However critical loops are generally coded in
assembler, so there might not be very much to gain. And on Cray systems
the explicit use of <code>_Pragma</code> gives an equivalent effect.
<p> One thing to note is that Microsoft C headers (on ia64 at least) contain
<code>__declspec(restrict)</code>, so a <code>#define</code> of
<code>restrict</code> should be avoided. It might be wisest to setup a
<code>gmp_restrict</code>.
<p> <li> <strong>Nx1 Division</strong>
<p> The limb-by-limb dependencies in the existing Nx1 division (and
remainder) code means that chips with multiple execution units or
pipelined multipliers are not fully utilized.
<p> One possibility is to follow the current preinv method but taking two
limbs at a time. That means a 2x2->4 and a 2x1->2 multiply for
each two limbs processed, and because the 2x2 and 2x1 can each be done
in parallel the latency will be not much more than 2 multiplies for two
limbs, whereas the single limb method has a 2 multiply latency for just
one limb. A version of <code>mpn_divrem_1</code> doing this has been
written in C, but not yet tested on likely chips. Clearly this scheme
would extend to 3x3->9 and 3x1->3 etc, though with diminishing
returns.
<p> For <code>mpn_mod_1</code>, Peter L. Montgomery proposes the following
scheme. For a limb R=2^</code>bits_per_mp_limb</code>, pre-calculate
values R mod N, R^2 mod N, R^3 mod N, R^4 mod N. Then take dividend
limbs and multiply them by those values, thereby reducing them (moving
them down) by the corresponding factor. The products can be added to
produce an intermediate remainder of 2 or 3 limbs to be similarly
included in the next step. The point is that such multiplies can be
done in parallel, meaning as little as 1 multiply worth of latency for 4
limbs. If the modulus N is less than R/4 (or is it R/5?) the summed
products will fit in 2 limbs, otherwise 3 will be required, but with the
high only being small. Clearly this extends to as many factors of R as
a chip can efficiently apply.
<p> The logical conculsion for powers R^i is a whole array "p[i] = R^i mod
N" for i up to k, the size of the dividend. This could then be applied
at multiplier throughput speed like an inner product. If the powers
took roughly k divide steps to calculate then there'd be an advantage
any time the same N was used three or more times. Suggested by Victor
Shoup in connection with chinese-remainder style decompositions, but
perhaps with other uses.
<p> <code>mpn_modexact_1_odd</code> calculates an x in the range 0<=x<d
satisfying a = q*d + x*b^n, where b=2^bits_per_limb. The factor b^n
needed to get the true remainder r could be calculated by a powering
algorithm, allowing <code>mpn_modexact_1_odd</code> to be pressed into
service for an <code>mpn_mod_1</code>. <code>modexact_1</code> is
simpler and on some chips can run noticably faster than plain
<code>mod_1</code>, on Athlon for instance 11 cycles/limb instead of 17.
Such a difference could soon overcome the time to calculate b^n. The
requirement for an odd divisor in <code>modexact</code> can be handled
by some shifting on-the-fly, or perhaps by an extra partial-limb step at
the end.
<p> <li> <strong>Factorial</strong>
<p> The removal of twos in the current code could be extended to factors of
3 or 5. Taking this to its logical conclusion would be a
complete decomposition into powers of primes. The power for a prime p
is of course floor(n/p)+floor(n/p^2)+... Conrad Curry found this is
quite fast (using simultaneous powering as per Handbook of Applied
Cryptography algorithm 14.88).
<p> A difficulty with using all primes is that quite large n can be
calculated on a system with enough memory, larger than we'd probably
want for a table of primes, so some sort of sieving would be wanted.
Perhaps just taking out the factors of 3 and 5 would give most of the
speedup that a prime decomposition can offer.
<p> <li> <strong>Binomial Coefficients</strong>
<p> An obvious improvement to the current code would be to strip factors of
2 from each multiplier and divisor and count them separately, to be
applied with a bit shift at the end. Factors of 3 and perhaps 5 could
even be handled similarly.
<p> Conrad Curry reports a big speedup for binomial coefficients using a
prime powering scheme, at least for k near n/2. Of course this is only
practical for moderate size n since again it requires primes up to n.
<p> When k is small the current (n-k+1)...n/1...k will be fastest. Some
sort of rule would be needed for when to use this or when to use prime
powering. Such a rule will be a function of both n and k. Some
investigation is needed to see what sort of shape the crossover line
will have, the usual parameter tuning can of course find machine
dependent constants to fill in where necessary.
<p> An easier possibility also reported by Conrad Curry is that it may be
faster not to divide out the denominator (1...k) one-limb at a time, but
do one big division at the end. Is this because a big divisor in
<code>mpn_bdivmod</code> trades the latency of
<code>mpn_divexact_1</code> for the throughput of
<code>mpn_submul_1</code>? Overheads must hurt though.
<p> Another reason a big divisor might help is that
<code>mpn_divexact_1</code> won't be getting a full limb in
<code>mpz_bin_uiui</code>. It's called when the n accumulator is full
but the k may be far from full. Perhaps the two could be decoupled so k
is applied when full. It'd be necessary to delay consideration of k
terms until the corresponding n terms had been applied though, since
otherwise the division won't be exact.
<p> <li> <strong>Perfect Power Testing</strong>
<p> <code>mpz_perfect_power_p</code> could be improved in a number of ways,
for instance p-adic arithmetic to find possible roots.
<p> Non-powers can be quickly identified by checking for Nth power residues
modulo small primes, like <code>mpn_perfect_square_p</code> does for
squares. The residues to each power N for a given remainder could be
grouped into a bit mask, the masks for the remainders to each divisor
would then be "and"ed together to hopefully leave only a few candidate
powers. Need to think about how wide to make such masks, ie. how many
powers to examine in this way.
<p> Any zero remainders found in residue testing reveal factors which can be
divided out, with the multiplicity restricting the powers that need to
be considered, as per the current code. Further prime dividing should
be grouped into limbs like <code>PP</code>. Need to think about how
much dividing to do like that, probably more for bigger inputs, less for
smaller inputs.
<p> <code>mpn_gcd_1</code> would probably be better than the current private
GCD routine. The use it's put to isn't time-critical, and it might help
ensure correctness to just use the main GCD routine.
<p> <li> <strong>Prime Testing</strong>
<p> GMP is not really a number theory library and probably shouldn't have
large amounts of code dedicated to sophisticated prime testing
algorithms, but basic things well-implemented would suit. Tests
offering certainty are probably all too big or too slow (or both!) to
justify inclusion in the main library. Demo programs showing some
possibilities would be good though.
<p> The present "repetitions" argument to <code>mpz_probab_prime_p</code> is
rather specific to the Miller-Rabin tests of the current implementation.
Better would be some sort of parameter asking perhaps for a maximum
chance 1/2^x of a probable prime in fact being composite. If
applications follow the advice that the present reps gives 1/4^reps
chance then perhaps such a change is unnecessary, but an explicitly
described 1/2^x would allow for changes in the implementation or even
for new proofs about the theory.
<p> <code>mpz_probab_prime_p</code> always initializes a new
<code>gmp_randstate_t</code> for randomized tests, which unfortunately
means it's not really very random and in particular always runs the same
tests for a given input. Perhaps a new interface could accept an rstate
to use, so successive tests could increase confidence in the result.
<p> <code>mpn_mod_34lsub1</code> is an obvious and easy improvement to the
trial divisions. And since the various prime factors are constants, the
remainder can be tested with something like
<pre>
#define MP_LIMB_DIVISIBLE_7_P(n) \
((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)
</pre>
Which would help compilers that don't know how to optimize divisions by
constants, and is even an improvement on current gcc 3.2 code. This
technique works for any modulus, see Granlund and Montgomery "Division
by Invariant Integers" section 9.
<p> The trial divisions are done with primes generated and grouped at
runtime. This could instead be a table of data, with pre-calculated
inverses too. Storing deltas, ie. amounts to add, rather than actual
primes would save space. <code>udiv_qrnnd_preinv</code> style inverses
can be made to exist by adding dummy factors of 2 if necessary. Some
thought needs to be given as to how big such a table should be, based on
how much dividing would be profitable for what sort of size inputs. The
data could be shared by the perfect power testing.
<p> Jason Moxham points out that if a sqrt(-1) mod N exists then any factor
of N must be == 1 mod 4, saving half the work in trial dividing. (If
x^2==-1 mod N then for a prime factor p we have x^2==-1 mod p and so the
jacobi symbol (-1/p)=1. But also (-1/p)=(-1)^((p-1)/2), hence must have
p==1 mod 4.) But knowing whether sqrt(-1) mod N exists is not too easy.
A strong pseudoprime test can reveal one, so perhaps such a test could
be inserted part way though the dividing.
<p> Jon Grantham "Frobenius Pseudoprimes" (www.pseudoprime.com) describes a
quadratic pseudoprime test taking about 3x longer than a plain test, but
with only a 1/7710 chance of error (whereas 3 plain Miller-Rabin tests
would offer only (1/4)^3 == 1/64). Such a test needs completely random
parameters to satisfy the theory, though single-limb values would run
faster. It's probably best to do at least one plain Miller-Rabin before
any quadratic tests, since that can identify composites in less total
time.
<p> Some thought needs to be given to the structure of which tests (trial
division, Miller-Rabin, quadratic) and how many are done, based on what
sort of inputs we expect, with a view to minimizing average time.
<p> It might be a good idea to break out subroutines for the various tests,
so that an application can combine them in ways it prefers, if sensible
defaults in <code>mpz_probab_prime_p</code> don't suit. In particular
this would let applications skip tests it knew would be unprofitable,
like trial dividing when an input is already known to have no small
factors.
<p> For small inputs, combinations of theory and explicit search make it
relatively easy to offer certainty. For instance numbers up to 2^32
could be handled with a strong pseudoprime test and table lookup. But
it's rather doubtful whether a smallnum prime test belongs in a bignum
library. Perhaps if it had other internal uses.
<p> An <code>mpz_nthprime</code> might be cute, but is almost certainly
impractical for anything but small n.
<p> <li> <strong>Intra-Library Calls</strong>
<p> On various systems, calls within libgmp still go through the PLT, TOC or
other mechanism, which makes the code bigger and slower than it needs to
be.
<p> The theory would be to have all GMP intra-library calls resolved
directly to the routines in the library. An application wouldn't be
able to replace a routine, the way it can normally, but there seems no
good reason to do that, in normal circumstances.
<p> The <code>visibility</code> attribute in recent gcc is good for this,
because it lets gcc omit unnecessary GOT pointer setups or whatever if
it finds all calls are local and there's no global data references.
Documented entrypoints would be <code>protected</code>, and purely
internal things not wanted by test programs or anything can be
<code>internal</code>.
<p> Unfortunately, on i386 it seems <code>protected</code> ends up causing
text segment relocations within libgmp.so, meaning the library code
can't be shared between processes, defeating the purpose of a shared
library. Perhaps this is just a gremlin in binutils (debian packaged
2.13.90.0.16-1).
<p> The linker can be told directly (with a link script, or options) to do
the same sort of thing. This doesn't change the code emitted by gcc of
course, but it does mean calls are resolved directly to their targets,
avoiding a PLT entry.
<p> Keeping symbols private to libgmp.so is probably a good thing in general
too, to stop anyone even attempting to access them. But some
undocumented things will need or want to be kept visibible, for use by
mpfr, or the test and tune programs. Libtool has a standard option for
selecting public symbols (used now for libmp).
</ul>
<hr>
</body>
</html>
<!--
Local variables:
eval: (add-hook 'write-file-hooks 'time-stamp)
time-stamp-start: "This file current as of "
time-stamp-format: "%:d %3b %:y"
time-stamp-end: "\\."
time-stamp-line-limit: 50
End:
-->
|