diff options
author | Kevin Ryde <user42@zip.com.au> | 2000-10-21 01:16:40 +0200 |
---|---|---|
committer | Kevin Ryde <user42@zip.com.au> | 2000-10-21 01:16:40 +0200 |
commit | f0cb79dd8cc4dfdac3d299a5dae3de874ba20222 (patch) | |
tree | 9db3b5fdf70965bd7c5807c68cc0e16565d1a303 /doc | |
parent | d5a9059c5798bf912a7bca2fa8002c62edc576b9 (diff) | |
download | gmp-f0cb79dd8cc4dfdac3d299a5dae3de874ba20222.tar.gz |
* doc/multiplication: Remove file, now in the manual.
* doc/assembly_code: Ditto.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/assembly_code | 57 | ||||
-rw-r--r-- | doc/multiplication | 416 |
2 files changed, 0 insertions, 473 deletions
diff --git a/doc/assembly_code b/doc/assembly_code deleted file mode 100644 index 3b371a560..000000000 --- a/doc/assembly_code +++ /dev/null @@ -1,57 +0,0 @@ -Most mpn subdirectories contain machine-dependent code, written in -assembly or C. The `generic' subdirectory contains default code, used -when there is no machine-dependent replacement for a particular -machine. - -There is one subdirectory for each ISA family. Note that e.g., 32-bit SPARC -and 64-bit SPARC are very different ISA's, and thus cannot share any code. - -A particular compile will only use code from one subdirectory, and the -`generic' subdirectory. The ISA-specific subdirectories contain hierarchies of -directories for various architecture variants and implementations; the -top-most level contains code that runs correctly on all variants. - -HOW TO WRITE FAST ASSEMBLY CODE FOR GMP - -[This should ultimately be made into a chapter of the GMP manual.] - -The most basic techniques are software pipelining and loop unrolling. - -Software pipelining is the technique of scheduling instructions around -the branch point in a loop, so that consecutive iterations overlap. -It is very much like juggling. - -Unrolling is useful when software pipelining does not get us close -enough to the peek performance of a processor's pipeline. Unrolling -decreases the loop overhead, but also often allows a more even load on -a processor's functional units. - -For processors with very few registers, software pipelining is not -feasible as it increases register pressure. - -For superscalar machines, it is often the case that all available -execution capabilities are not used. Scheduling some instructions -for these otherwise unused resources will never cost us anything. - -Try to determine the alternative instructions that can be used for a -particular processor. For GMP, the problem that presents most -challenges is propagating carry from one iteration to the next. -Explore the different possibilities for doing that with the available -instructions! - -For wide superscalar processors, the performance might be completely -determined by the number of dependent instruction required from -accepting carry-in from the previous iteration until producing -carry-out for the next iteration. This is particularly true for -simple operations like mpn_add_n and mpn_sub_n. Some carry -propagation schemes require 4 instructions, translating to at least -four cycles per iterations. Other schemes can propagate carry in two -cycles or even just one cycle. - -Therefore, for wide superscalar processors, finding methods with -"shallow" carry propagation given an instruction set is often the -central problem we need to address. The rest is just is hard coding -work. - -[Describe: First find issue maps with desired performance - Then schedule for latency] diff --git a/doc/multiplication b/doc/multiplication deleted file mode 100644 index f405370fc..000000000 --- a/doc/multiplication +++ /dev/null @@ -1,416 +0,0 @@ - - GMP MULTIPLICATION - - -This file describes briefly the multiplication and squaring used in GMP. -The code is likely to be hard to understand without knowing something about -the algorithms. - -GMP does NxN limb multiplications and squares using one of four algorithms, -according to the size N. - - Algorithm Sizes - - basecase < KARATSUBA_MUL_THRESHOLD - karatsuba >= KARATSUBA_MUL_THRESHOLD - toom3 >= TOOM3_MUL_THRESHOLD - fft >= FFT_MUL_THRESHOLD - -Similarly for squaring, with the SQR thresholds. Note though that the FFT -is only used if GMP is configured with --enable-fft. - -MxN multiplications of operands with different sizes are currently done by -splitting it up into various NxN pieces while above KARATSUBA_MUL_THRESHOLD. -The karatsuba and toom3 routines then operate only on equal size operands. -This is rather inefficient, and is slated for improvement in the future. - - - -BASECASE MULTIPLY - -This a straightforward rectangular set of cross-products, the same as long -multiplication done by hand and for that reason sometimes known as the -schoolbook or grammar school method. - -See Knuth (reference in gmp.texi) volume 2 section 4.3.1 algorithm M, or the -mpn/generic/mul_basecase.c code. - -Assembler implementations of mul_basecase are essentially the same as the -generic C version, but have all the usual assembler tricks and obscurities -introduced for speed. - - - -BASECASE SQUARE - -A square can be done in roughly half the time of a multiply, by using the -fact that the cross products above and below the diagonal are the same. In -practice squaring isn't 2x faster than multiplying, but it's always faster -by a decent factor. - - u0 u1 u2 u3 u4 - +---+---+---+---+---+ - u0 | x | | | | | - +---+---+---+---+---+ - u1 | | x | | | | - +---+---+---+---+---+ - u2 | | | x | | | - +---+---+---+---+---+ - u3 | | | | x | | - +---+---+---+---+---+ - u4 | | | | | x | - +---+---+---+---+---+ - -The basic algorithm is to calculate a triangle of products below the -diagonal, double it (left shift by one bit), and add in the products on the -diagonal. This can be seen in mpn/generic/sqr_basecase.c. Again the -assembler implementations take essentially this same approach. - - - - -KARATSUBA MULTIPLY - -The Karatsuba multiplication algorithm is described in Knuth volume 2 -section 4.3.3 part A, and in other texts like Geddes et al. section 4.3 -(reference below). A brief description is given here. - -The Karatsuba algorithm treats its inputs x and y as each split in two parts -of equal length (or the most significant part one limb shorter if N is odd). - - high low - +----------+----------+ - | x1 | x0 | - +----------+----------+ - - +----------+----------+ - | y1 | y0 | - +----------+----------+ - -Let b be the power of 2 where the split occurs, ie. if x0 is k limbs (y0 the -same) then b=2^(k*mp_bits_per_limb). Then x=x1*b+x0 and y=y1*b+y0, and the -following holds, - - x*y = (b^2+b)*x1*y1 + b*(x1-x0)*(y1-y0) + (b+1)*x0*y0 - -This formula means doing only three multiplies of (N/2)x(N/2) limbs, whereas -a basecase multiply of NxN limbs is roughly equivalent to four multiplies of -(N/2)x(N/2). - -The factors (b^2+b) etc represent the positions where the three products -must be added in. - - high low - +--------+--------+ +--------+--------+ - | x1*y1 | | x0*y0 | - +--------+--------+ +--------+--------+ - +--------+--------+ - | x1*y1 | - +--------+--------+ - +--------+--------+ - | x0*y0 | - +--------+--------+ - +--------+--------+ - | (x1-x0)*(y1-y0) | - +--------+--------+ - -The term (x1-x0)*(y1-y0) can be negative, but the final result is of course -always positive. Notice that high(x0*y0)+low(x1*y1) occurs twice, so it's -possible to do 5*k limb additions, rather than 6*k, however in GMP extra -function call overheads outweigh the saving. - -The use of three multiplies of N/2 limbs each leads to an asymptotic speed -O(N^1.585). (The exponent is log(3)/log(2).) This is a big improvement -over the basecase multiply at O(N^2) and the algorithmic advantage soon -overcomes the extra additions Karatsuba must perform. - - - - -KARATSUBA SQUARE - -A square is very similar to a multiply, but with x==y the formula reduces to -an equivalent with three squares, - - x^2 = (b^2+b)*x1^2 + b*(x1-x0)^2 + (b+1)*x0^2 - -The final result is accumulated from those three squares the same way as for -the three multiplies above. The middle term (x1-x0)^2 however is now always -positive. - - - - -TOOM-COOK 3-WAY MULTIPLY - -The Karatsuba formula is part of a general approach to splitting inputs -leading to both Toom-Cook and FFT algorithms. A description of Toom-Cook -can be found in Knuth volume 2 section 4.3.3, with an example 3-way -calculation after Theorem A. - -Toom-Cook 3-way treats the operands as split into 3 pieces of equal size (or -the most significant part 1 or 2 limbs shorter than the others). - - high low - +----------+----------+----------+ - | x2 | x1 | x0 | - +----------+----------+----------+ - - +----------+----------+----------+ - | y2 | y1 | y0 | - +----------+----------+----------+ - -These parts are treated as the coefficients of two polynomials - - X(t) = x2*t^2 + x1*t + x0 - Y(t) = y2*t^2 + y1*t + y0 - -Again let b equal the power of 2 which is the size of the x0, x1, y0 and y1 -pieces, ie. if they're k limbs each then b=2^(k*mp_bits_per_limb). With -this then x=X(b) and y=Y(b). - -Let a polynomial W(t)=X(t)*Y(t) and suppose it's coefficients are - - W(t) = w4*t^4 + w3*t^3 + w2*t^2 + w1*t + w0 - -The w[i] are going to be determined, and when they are they'll give the -final result using w=W(b), since x*y=X(b)*Y(b)=W(b). The coefficients will -each be roughly b^2, so the final W(b) will be an addition like, - - high low - +-------+-------+ - | w4 | - +-------+-------+ - +--------+-------+ - | w3 | - +--------+-------+ - +--------+-------+ - | w2 | - +--------+-------+ - +--------+-------+ - | w1 | - +--------+-------+ - +-------+-------+ - | w0 | - +-------+-------+ - ------------------------------------------------- - - -The w[i] coefficients could be formed by a simple set of cross products, -like w4=x2*x2, w3=x2*y1+x1*y2, w2=x2*y0+x1*y1+x0*y2 etc, but this would need -all nine x[i]*y[j] for i,j=0,1,2, and would be equivalent merely to a -basecase multiply. Instead the following approach is used. - -X(t) and Y(t) are evaluated and multiplied at 5 points, giving values of -W(t) at those points. The points used can be chosen in various ways, but in -GMP the following are used - - t=0 meaning x0*y0, which gives w0 immediately - t=2 meaning (4*x2+2*x1*x0)*(4*y2+2*y1+y0) - t=1 meaning (x2+x1+x0)*(y2+y1+y0) - t=1/2 meaning (x2+2*x1+4*x0)*(y2+2*y1+4*y0) - t=inf meaning x2*y2, which gives w4 immediately - -At t=1/2 the value calculated is actually 4*X(1/2)*4*Y(1/2), giving a value -for 16*W(1/2), and this is always an integer. At t=inf the value is -actually X(t)*Y(t)/t^2 in the limit as t approaches infinity, but it's much -easier to think of it as simply x2*y2 giving w4 immediately (much like at -t=0 x0*y0 gives w0 immediately). - -Now each of the points substituted into W(t)=w4*t^4+...+w0 gives a linear -combination of the w[i] coefficients, and the value of those combinations -has just been calculated. - - W(0) = w0 - 16*W(1/2) = w4 + 2*w3 + 4*w2 + 8*w1 + 16*w0 - W(1) = w4 + w3 + w2 + w1 + w0 - W(2) = 16*w4 + 8*w3 + 4*w2 + 2*w1 + w0 - W(inf) = w4 - -This is a set of five equations in five unknowns, and some elementary linear -algebra quickly isolates each w[i], by subtracting multiples of one equation -from another. - -In the code the set of five values W(0),...,W(inf) will represent those -certain linear combinations. By adding or subtracting one from another as -necessary, values which are each w[i] alone are arrived at. This involves -only a few subtractions of small multiples (some of which are powers of 2), -and so is very fast. A couple of divisions remain by powers of 2 and one -division by 3 (or by 6 rather), and that last uses the special fast -mpn_divexact_by3. - -In the code the values w4, w2 and w0 are formed in the destination, and w3 -and w1 are added to them. There's an extra word at the high end of w3, w2 -and w1 that are handled separately. With A=w0,B=w1,...,E=w4, the additions -are as follows. - - high low - +-------+-------+-------+-------+-------+-------+ - | E | C | A | - +-------+-------+-------+-------+-------+-------+ - +------+-------++------+-------+ - | D || B | - +------+-------++------+-------+ - -- -- -- - |tD| |tC| |tB| - -- -- -- - ------------------------------------------------- - - -The conversion of W(t) values to the coefficients is called interpolation. -A polynomial of degree 4 like W(t) is uniquely determined by values known at -5 different points. The points can be chosen to make the linear equations -come out with a convenient set of steps for isolating the w[i]s. - -In mpn/generic/mul_n.c the interpolate3() routine performs the -interpolation. The open-coded one-pass version may be a bit hard to -understand, the steps performed can be better seen in the USE_MORE_MPN -version. - -The use of five multiplies of N/3 limbs each leads to an asymptotic speed -O(N^1.465). (The exponent is log(5)/log(3).) This is an improvement over -Karatsuba at O(N^1.585), though Toom-Cook does more work in the evaluation -and interpolation and so it's only above a certain size that Toom-Cook -realizes its advantage. - -The formula given above for the Karatsuba algorithm has an equivalent for -Toom-Cook 3-way, involving only five multiplies, but this would be -complicated and unenlightening. - -An alternate view of Toom-Cook 3-way can be found in Zuras (reference -below). He uses a vector to represent the x and y splits and a matrix -multiplication for the evaluation and interpolation stages. The matrix -inverses are not meant to be actually used, and they have elements with -values much greater than in fact arise in the interpolation steps. The -diagram shown for the 3-way is attractive, but again it doesn't have to be -implemented that way and for example with a bit of rearrangement just one -division by 6 (not two of them) can be done. - - - - -TOOM-COOK 3-WAY SQUARE - -Toom-Cook squaring follows the same procedure as multiplication, but there's -only one X(t) and it's evaluated at 5 points, and those values squared to -give values of W(t). The interpolation is then identical, and in fact the -same interpolate3() subroutine is used for both squaring and multiplying. - - - - -FFT MULTIPLY - -At large to very large sizes a Fermat style FFT is used, meaning an FFT in a -ring of integers modulo 2^M+1. This is asymptotically fast, but overheads -mean it's only worthwhile for large products. - -Some brief notes are given here. Full explanations can be found in various -texts. Knuth section 4.3.3 part C describes the method, but using complex -numbers. In the references below Schonhage and Strassen is the original -paper, and Pollard gives some of the mathematics for a finite field as used -here. - -A Fermat FFT does its multiplication modulo 2^N+1, but by choosing -N>=bits(A)+bits(B), a full product A*B is obtained. The algorithm splits -the inputs into 2^k pieces, for a chosen k, and will recursively perform 2^k -pointwise multiplications modulo 2^M+1, where M=N/2^k. N must be a multiple -of 2^k. Those multiplications are either done by recursing into a further -FFT, or by a plain toom3 etc multiplication, whichever is optimal at the -resultant size. Note that in the current implementation M is always a -multiple of the limb size, and hence N is too. - -The steps leading to the pointwise products are like the evaluation and -interpolation stages of the Karatsuba and Toom-Cook algorithms, but use a -convolution, which can be efficiently calculated because 2^(2M) (which is -2^(2N/2^k)) is a 2^k'th root of unity modulo 2^N+1. - -As operand sizes gets bigger, bigger splits are used. Each time a bigger k -is used some multiplying is effectively swapped for some shifts, adds and -overheads. A table of thresholds gives the points where a k+1 FFT is faster -than a k FFT. A separate threshold gives the point where a mod 2^N+1 FFT -first becomes faster than a toom3 multiply of that size, and this normally -happens in the k=4 range. A further threshold gives the point where an -N/2xN/2 multiply done with an FFT mod 2^N+1 is faster than a toom3 of that -size, and this is normally in the k=7 or k=8 range. - - - -FFT SQUARE - -FFT squaring follows the same procedure as multiplication, but there's only -one operand which is split, and the pointwise multiplications become -squares. The convolutions performed are the same, both at the evaluate and -interpolate stages, except of course there's only one operand at the -evaluate stage. - - - - -TOOM-COOK N-WAY (NOT USED) - -The 3-way Toom-Cook procedure described above generalizes to split into an -arbitrary number of pieces, as per Knuth volume 2 section 4.3.3 algorithm C. -Some code implementing this for GMP exists, but is not present since it has -yet to prove worthwhile. The notes here are merely for interest. - -In general a split into r+1 pieces will be made, and evaluations and -pointwise multiplications done at 2*r+1 points. So a 4-way split does 7 -pointwise multiplies, 5-way does 9, etc. - -Asymptotically an r+1-way algorithm is O(n^(log(2*r+1)/log(r+1)). Only the -pointwise multiplications count towards big O() complexity, but the time -spent in the evaluate and interpolate stages grows with r and has a -significant practical impact, with the asymptotic advantage of each r -realized only at bigger and bigger sizes. - -Knuth algorithm C presents a version evaluating at points 0,1,2,...,2*r, but -exercise 4 uses -r,...,0,...,r and the latter saves some small multiplies in -the evaluate stage (or rather trades them for additions), and has a further -saving of nearly half the interpolate steps. The answer to the exercise -doesn't give full details of the interpolate, but essentially the idea is to -separate odd and even final coefficients and then perform algorithm C steps -C7 and C8 on them separately. The multipliers and divisors at step C7 then -become j^2 and 2*t*j-j*j respectively. - -In the references below, Zuras presents 3-way and 4-way Toom-Cook methods, -and compares them to small order FFTs. Hollerbach presents an N-way -algorithm from first principles. Bernstein presents the N-way algorithm in -a style that facilitates comparing its mathematics to other multiplication -algorithms. - - - - -REFERENCES - -"Algorithms for Computer Algebra", Keith O. Geddes, Stephen R. Czapor, -George Labahn, Kluwer Academic Publishers, 1992, ISBN 0-7923-9259-0. - -"Schnelle Multiplikation grosser Zahlen", by Arnold Schonhage and Volker -Strassen, Computing 7, p. 281-292, 1971. - -"The Fast Fourier Transform in a Finite Field", J.M. Pollard, Mathematics of -Computation, vol 25, num 114, April 1971. - -"On Squaring and Multiplying Large Integers", Dan Zuras, ARITH-11: IEEE -Symposium on Computer Arithmetic, 1993, pages 260 to 271. And reprinted as -"More on Multiplying and Squaring Large Integers", IEEE Transactions on -Computers, August 1994. - -"Fast Multiplication & Division of Very Large Numbers", Uwe Hollerbach, post -to sci.math.research, Jan 1996, archived at Swarthmore, -http://forum.swarthmore.edu/epigone/sci.math.research/zhouyimpzimp/x1ybdbxz5w4v@forum.swarthmore.edu - -"Multidigit Multiplication for Mathematicians", Daniel J. Bernstein, -preprint available at http://koobera.math.uic.edu/www/papers. (Every known -multiplication technique, many references.) - - - - ----------------- -Local variables: -mode: text -fill-column: 76 -End: |