* doc/multiplication: Remove file, now in the manual.

* doc/assembly_code: Ditto.
author: Kevin Ryde <user42@zip.com.au> 2000-10-21 01:16:40 +0200
committer: Kevin Ryde <user42@zip.com.au> 2000-10-21 01:16:40 +0200
commit: f0cb79dd8cc4dfdac3d299a5dae3de874ba20222 (patch)
tree: 9db3b5fdf70965bd7c5807c68cc0e16565d1a303 /doc
parent: d5a9059c5798bf912a7bca2fa8002c62edc576b9 (diff)
download: gmp-f0cb79dd8cc4dfdac3d299a5dae3de874ba20222.tar.gz
2 files changed, 0 insertions, 473 deletions
diff --git a/doc/assembly_code b/doc/assembly_code
deleted file mode 100644
index 3b371a560..000000000
--- a/doc/assembly_code
+++ /dev/null
@@ -1,57 +0,0 @@
-Most mpn subdirectories contain machine-dependent code, written in
-assembly or C.  The `generic' subdirectory contains default code, used
-when there is no machine-dependent replacement for a particular
-machine.
-
-There is one subdirectory for each ISA family.  Note that e.g., 32-bit SPARC
-and 64-bit SPARC are very different ISA's, and thus cannot share any code.
-
-A particular compile will only use code from one subdirectory, and the
-`generic' subdirectory.  The ISA-specific subdirectories contain hierarchies of
-directories for various architecture variants and implementations; the
-top-most level contains code that runs correctly on all variants.
-
-HOW TO WRITE FAST ASSEMBLY CODE FOR GMP
-
-[This should ultimately be made into a chapter of the GMP manual.]
-
-The most basic techniques are software pipelining and loop unrolling.
-
-Software pipelining is the technique of scheduling instructions around
-the branch point in a loop, so that consecutive iterations overlap.
-It is very much like juggling.
-
-Unrolling is useful when software pipelining does not get us close
-enough to the peek performance of a processor's pipeline.  Unrolling
-decreases the loop overhead, but also often allows a more even load on
-a processor's functional units.
-
-For processors with very few registers, software pipelining is not
-feasible as it increases register pressure.
-
-For superscalar machines, it is often the case that all available
-execution capabilities are not used.  Scheduling some instructions
-for these otherwise unused resources will never cost us anything.
-
-Try to determine the alternative instructions that can be used for a
-particular processor.  For GMP, the problem that presents most
-challenges is propagating carry from one iteration to the next.
-Explore the different possibilities for doing that with the available
-instructions!
-
-For wide superscalar processors, the performance might be completely
-determined by the number of dependent instruction required from
-accepting carry-in from the previous iteration until producing
-carry-out for the next iteration.  This is particularly true for
-simple operations like mpn_add_n and mpn_sub_n.  Some carry
-propagation schemes require 4 instructions, translating to at least
-four cycles per iterations.  Other schemes can propagate carry in two
-cycles or even just one cycle.
-
-Therefore, for wide superscalar processors, finding methods with
-"shallow" carry propagation given an instruction set is often the
-central problem we need to address.  The rest is just is hard coding
-work.
-
-[Describe: First find issue maps with desired performance
-	   Then schedule for latency]
diff --git a/doc/multiplication b/doc/multiplication
deleted file mode 100644
index f405370fc..000000000
--- a/doc/multiplication
+++ /dev/null
@@ -1,416 +0,0 @@
-
-                         GMP MULTIPLICATION
-
-
-This file describes briefly the multiplication and squaring used in GMP.
-The code is likely to be hard to understand without knowing something about
-the algorithms.
-
-GMP does NxN limb multiplications and squares using one of four algorithms,
-according to the size N.
-
-         Algorithm      Sizes
-
-         basecase    < KARATSUBA_MUL_THRESHOLD
-	 karatsuba   >= KARATSUBA_MUL_THRESHOLD
-	 toom3       >= TOOM3_MUL_THRESHOLD
-	 fft         >= FFT_MUL_THRESHOLD
-
-Similarly for squaring, with the SQR thresholds.  Note though that the FFT
-is only used if GMP is configured with --enable-fft.
-
-MxN multiplications of operands with different sizes are currently done by
-splitting it up into various NxN pieces while above KARATSUBA_MUL_THRESHOLD.
-The karatsuba and toom3 routines then operate only on equal size operands.
-This is rather inefficient, and is slated for improvement in the future.
-
-
-
-BASECASE MULTIPLY
-
-This a straightforward rectangular set of cross-products, the same as long
-multiplication done by hand and for that reason sometimes known as the
-schoolbook or grammar school method.
-
-See Knuth (reference in gmp.texi) volume 2 section 4.3.1 algorithm M, or the
-mpn/generic/mul_basecase.c code.
-
-Assembler implementations of mul_basecase are essentially the same as the
-generic C version, but have all the usual assembler tricks and obscurities
-introduced for speed.
-
-
-
-BASECASE SQUARE
-
-A square can be done in roughly half the time of a multiply, by using the
-fact that the cross products above and below the diagonal are the same.  In
-practice squaring isn't 2x faster than multiplying, but it's always faster
-by a decent factor.
-
-             u0  u1  u2  u3  u4
-           +---+---+---+---+---+
-        u0 | x |   |   |   |   |
-           +---+---+---+---+---+
-        u1 |   | x |   |   |   |
-           +---+---+---+---+---+
-        u2 |   |   | x |   |   |
-           +---+---+---+---+---+
-        u3 |   |   |   | x |   |
-           +---+---+---+---+---+
-        u4 |   |   |   |   | x |
-           +---+---+---+---+---+
-
-The basic algorithm is to calculate a triangle of products below the
-diagonal, double it (left shift by one bit), and add in the products on the
-diagonal.  This can be seen in mpn/generic/sqr_basecase.c.  Again the
-assembler implementations take essentially this same approach.
-
-
-
-
-KARATSUBA MULTIPLY
-
-The Karatsuba multiplication algorithm is described in Knuth volume 2
-section 4.3.3 part A, and in other texts like Geddes et al. section 4.3
-(reference below).  A brief description is given here.
-
-The Karatsuba algorithm treats its inputs x and y as each split in two parts
-of equal length (or the most significant part one limb shorter if N is odd).
-
-         high              low
-        +----------+----------+
-        |    x1    |    x0    |
-        +----------+----------+
-
-        +----------+----------+
-        |    y1    |    y0    |
-        +----------+----------+
-
-Let b be the power of 2 where the split occurs, ie. if x0 is k limbs (y0 the
-same) then b=2^(k*mp_bits_per_limb).  Then x=x1*b+x0 and y=y1*b+y0, and the
-following holds,
-
-	x*y = (b^2+b)*x1*y1 + b*(x1-x0)*(y1-y0) + (b+1)*x0*y0
-
-This formula means doing only three multiplies of (N/2)x(N/2) limbs, whereas
-a basecase multiply of NxN limbs is roughly equivalent to four multiplies of
-(N/2)x(N/2).
-
-The factors (b^2+b) etc represent the positions where the three products
-must be added in.
-
-         high                              low
-        +--------+--------+ +--------+--------+
-        |      x1*y1      | |      x0*y0      |
-        +--------+--------+ +--------+--------+
-                  +--------+--------+
-                  |      x1*y1      |
-                  +--------+--------+
-                  +--------+--------+
-                  |      x0*y0      |
-                  +--------+--------+
-                  +--------+--------+
-                  | (x1-x0)*(y1-y0) |
-                  +--------+--------+
-
-The term (x1-x0)*(y1-y0) can be negative, but the final result is of course
-always positive.  Notice that high(x0*y0)+low(x1*y1) occurs twice, so it's
-possible to do 5*k limb additions, rather than 6*k, however in GMP extra
-function call overheads outweigh the saving.
-
-The use of three multiplies of N/2 limbs each leads to an asymptotic speed
-O(N^1.585).  (The exponent is log(3)/log(2).)  This is a big improvement
-over the basecase multiply at O(N^2) and the algorithmic advantage soon
-overcomes the extra additions Karatsuba must perform.
-
-
-
-
-KARATSUBA SQUARE
-
-A square is very similar to a multiply, but with x==y the formula reduces to
-an equivalent with three squares,
-
-        x^2 = (b^2+b)*x1^2 + b*(x1-x0)^2 + (b+1)*x0^2
- 
-The final result is accumulated from those three squares the same way as for
-the three multiplies above.  The middle term (x1-x0)^2 however is now always
-positive.
-
-
-
-
-TOOM-COOK 3-WAY MULTIPLY
-
-The Karatsuba formula is part of a general approach to splitting inputs
-leading to both Toom-Cook and FFT algorithms.  A description of Toom-Cook
-can be found in Knuth volume 2 section 4.3.3, with an example 3-way
-calculation after Theorem A.
-
-Toom-Cook 3-way treats the operands as split into 3 pieces of equal size (or
-the most significant part 1 or 2 limbs shorter than the others).
-
-         high                         low
-        +----------+----------+----------+
-        |    x2    |    x1    |    x0    |
-        +----------+----------+----------+
-
-        +----------+----------+----------+
-        |    y2    |    y1    |    y0    |
-        +----------+----------+----------+
-
-These parts are treated as the coefficients of two polynomials
-
-	X(t) = x2*t^2 + x1*t + x0
-	Y(t) = y2*t^2 + y1*t + y0
-
-Again let b equal the power of 2 which is the size of the x0, x1, y0 and y1
-pieces, ie. if they're k limbs each then b=2^(k*mp_bits_per_limb).  With
-this then x=X(b) and y=Y(b).
-
-Let a polynomial W(t)=X(t)*Y(t) and suppose it's coefficients are
-
-	W(t) = w4*t^4 + w3*t^3 + w2*t^2 + w1*t + w0
-
-The w[i] are going to be determined, and when they are they'll give the
-final result using w=W(b), since x*y=X(b)*Y(b)=W(b).  The coefficients will
-each be roughly b^2, so the final W(b) will be an addition like,
-
-         high                                        low
-        +-------+-------+
-        |       w4      |
-        +-------+-------+
-               +--------+-------+
-               |        w3      |
-               +--------+-------+
-                       +--------+-------+
-                       |        w2      |
-                       +--------+-------+
-                               +--------+-------+
-                               |        w1      |
-                               +--------+-------+
-                                        +-------+-------+
-                                        |       w0      |
-                                        +-------+-------+
-        -------------------------------------------------
-
-
-The w[i] coefficients could be formed by a simple set of cross products,
-like w4=x2*x2, w3=x2*y1+x1*y2, w2=x2*y0+x1*y1+x0*y2 etc, but this would need
-all nine x[i]*y[j] for i,j=0,1,2, and would be equivalent merely to a
-basecase multiply.  Instead the following approach is used.
-
-X(t) and Y(t) are evaluated and multiplied at 5 points, giving values of
-W(t) at those points.  The points used can be chosen in various ways, but in
-GMP the following are used
-  
-	t=0    meaning x0*y0, which gives w0 immediately
-        t=2    meaning (4*x2+2*x1*x0)*(4*y2+2*y1+y0)
-        t=1    meaning (x2+x1+x0)*(y2+y1+y0)
-        t=1/2  meaning (x2+2*x1+4*x0)*(y2+2*y1+4*y0)
-        t=inf  meaning x2*y2, which gives w4 immediately
-
-At t=1/2 the value calculated is actually 4*X(1/2)*4*Y(1/2), giving a value
-for 16*W(1/2), and this is always an integer.  At t=inf the value is
-actually X(t)*Y(t)/t^2 in the limit as t approaches infinity, but it's much
-easier to think of it as simply x2*y2 giving w4 immediately (much like at
-t=0 x0*y0 gives w0 immediately).
-
-Now each of the points substituted into W(t)=w4*t^4+...+w0 gives a linear
-combination of the w[i] coefficients, and the value of those combinations
-has just been calculated.
-
-           W(0)   =                                 w0
-        16*W(1/2) =    w4 + 2*w3 + 4*w2 + 8*w1 + 16*w0
-           W(1)   =    w4 +   w3 +   w2 +   w1 +    w0
-           W(2)   = 16*w4 + 8*w3 + 4*w2 + 2*w1 +    w0
-           W(inf) =    w4
-
-This is a set of five equations in five unknowns, and some elementary linear
-algebra quickly isolates each w[i], by subtracting multiples of one equation
-from another.
-
-In the code the set of five values W(0),...,W(inf) will represent those
-certain linear combinations.  By adding or subtracting one from another as
-necessary, values which are each w[i] alone are arrived at.  This involves
-only a few subtractions of small multiples (some of which are powers of 2),
-and so is very fast.  A couple of divisions remain by powers of 2 and one
-division by 3 (or by 6 rather), and that last uses the special fast
-mpn_divexact_by3.
-
-In the code the values w4, w2 and w0 are formed in the destination, and w3
-and w1 are added to them.  There's an extra word at the high end of w3, w2
-and w1 that are handled separately.  With A=w0,B=w1,...,E=w4, the additions
-are as follows.
-
-         high                                        low
-        +-------+-------+-------+-------+-------+-------+
-        |       E       |       C       |       A       |
-        +-------+-------+-------+-------+-------+-------+
-                 +------+-------++------+-------+
-                 |      D       ||      B       |
-                 +------+-------++------+-------+
-              --      --      --
-             |tD|    |tC|    |tB|        
-              --      --      --
-        -------------------------------------------------
-
-
-The conversion of W(t) values to the coefficients is called interpolation.
-A polynomial of degree 4 like W(t) is uniquely determined by values known at
-5 different points.  The points can be chosen to make the linear equations
-come out with a convenient set of steps for isolating the w[i]s.
-
-In mpn/generic/mul_n.c the interpolate3() routine performs the
-interpolation.  The open-coded one-pass version may be a bit hard to
-understand, the steps performed can be better seen in the USE_MORE_MPN
-version.
-
-The use of five multiplies of N/3 limbs each leads to an asymptotic speed
-O(N^1.465).  (The exponent is log(5)/log(3).)  This is an improvement over
-Karatsuba at O(N^1.585), though Toom-Cook does more work in the evaluation
-and interpolation and so it's only above a certain size that Toom-Cook
-realizes its advantage.
-
-The formula given above for the Karatsuba algorithm has an equivalent for
-Toom-Cook 3-way, involving only five multiplies, but this would be
-complicated and unenlightening.
-
-An alternate view of Toom-Cook 3-way can be found in Zuras (reference
-below).  He uses a vector to represent the x and y splits and a matrix
-multiplication for the evaluation and interpolation stages.  The matrix
-inverses are not meant to be actually used, and they have elements with
-values much greater than in fact arise in the interpolation steps.  The
-diagram shown for the 3-way is attractive, but again it doesn't have to be
-implemented that way and for example with a bit of rearrangement just one
-division by 6 (not two of them) can be done.
-
-
-
-
-TOOM-COOK 3-WAY SQUARE
-
-Toom-Cook squaring follows the same procedure as multiplication, but there's
-only one X(t) and it's evaluated at 5 points, and those values squared to
-give values of W(t).  The interpolation is then identical, and in fact the
-same interpolate3() subroutine is used for both squaring and multiplying.
-
-
-
-
-FFT MULTIPLY
-
-At large to very large sizes a Fermat style FFT is used, meaning an FFT in a
-ring of integers modulo 2^M+1.  This is asymptotically fast, but overheads
-mean it's only worthwhile for large products.
-
-Some brief notes are given here.  Full explanations can be found in various
-texts.  Knuth section 4.3.3 part C describes the method, but using complex
-numbers.  In the references below Schonhage and Strassen is the original
-paper, and Pollard gives some of the mathematics for a finite field as used
-here.
-
-A Fermat FFT does its multiplication modulo 2^N+1, but by choosing
-N>=bits(A)+bits(B), a full product A*B is obtained.  The algorithm splits
-the inputs into 2^k pieces, for a chosen k, and will recursively perform 2^k
-pointwise multiplications modulo 2^M+1, where M=N/2^k.  N must be a multiple
-of 2^k.  Those multiplications are either done by recursing into a further
-FFT, or by a plain toom3 etc multiplication, whichever is optimal at the
-resultant size.  Note that in the current implementation M is always a
-multiple of the limb size, and hence N is too.
-
-The steps leading to the pointwise products are like the evaluation and
-interpolation stages of the Karatsuba and Toom-Cook algorithms, but use a
-convolution, which can be efficiently calculated because 2^(2M) (which is
-2^(2N/2^k)) is a 2^k'th root of unity modulo 2^N+1.
-
-As operand sizes gets bigger, bigger splits are used.  Each time a bigger k
-is used some multiplying is effectively swapped for some shifts, adds and
-overheads.  A table of thresholds gives the points where a k+1 FFT is faster
-than a k FFT.  A separate threshold gives the point where a mod 2^N+1 FFT
-first becomes faster than a toom3 multiply of that size, and this normally
-happens in the k=4 range.  A further threshold gives the point where an
-N/2xN/2 multiply done with an FFT mod 2^N+1 is faster than a toom3 of that
-size, and this is normally in the k=7 or k=8 range.
-
-
-
-FFT SQUARE
-
-FFT squaring follows the same procedure as multiplication, but there's only
-one operand which is split, and the pointwise multiplications become
-squares.  The convolutions performed are the same, both at the evaluate and
-interpolate stages, except of course there's only one operand at the
-evaluate stage.
-
-
-
-
-TOOM-COOK N-WAY (NOT USED)
-
-The 3-way Toom-Cook procedure described above generalizes to split into an
-arbitrary number of pieces, as per Knuth volume 2 section 4.3.3 algorithm C.
-Some code implementing this for GMP exists, but is not present since it has
-yet to prove worthwhile.  The notes here are merely for interest.
-
-In general a split into r+1 pieces will be made, and evaluations and
-pointwise multiplications done at 2*r+1 points.  So a 4-way split does 7
-pointwise multiplies, 5-way does 9, etc.
-
-Asymptotically an r+1-way algorithm is O(n^(log(2*r+1)/log(r+1)).  Only the
-pointwise multiplications count towards big O() complexity, but the time
-spent in the evaluate and interpolate stages grows with r and has a
-significant practical impact, with the asymptotic advantage of each r
-realized only at bigger and bigger sizes.
-
-Knuth algorithm C presents a version evaluating at points 0,1,2,...,2*r, but
-exercise 4 uses -r,...,0,...,r and the latter saves some small multiplies in
-the evaluate stage (or rather trades them for additions), and has a further
-saving of nearly half the interpolate steps.  The answer to the exercise
-doesn't give full details of the interpolate, but essentially the idea is to
-separate odd and even final coefficients and then perform algorithm C steps
-C7 and C8 on them separately.  The multipliers and divisors at step C7 then
-become j^2 and 2*t*j-j*j respectively.
-
-In the references below, Zuras presents 3-way and 4-way Toom-Cook methods,
-and compares them to small order FFTs.  Hollerbach presents an N-way
-algorithm from first principles.  Bernstein presents the N-way algorithm in
-a style that facilitates comparing its mathematics to other multiplication
-algorithms.
-
-
-
-
-REFERENCES
-
-"Algorithms for Computer Algebra", Keith O. Geddes, Stephen R. Czapor,
-George Labahn, Kluwer Academic Publishers, 1992, ISBN 0-7923-9259-0.
-
-"Schnelle Multiplikation grosser Zahlen", by Arnold Schonhage and Volker
-Strassen, Computing 7, p. 281-292, 1971.
-
-"The Fast Fourier Transform in a Finite Field", J.M. Pollard, Mathematics of
-Computation, vol 25, num 114, April 1971.
-
-"On Squaring and Multiplying Large Integers", Dan Zuras, ARITH-11: IEEE
-Symposium on Computer Arithmetic, 1993, pages 260 to 271.  And reprinted as
-"More on Multiplying and Squaring Large Integers", IEEE Transactions on
-Computers, August 1994.
-
-"Fast Multiplication & Division of Very Large Numbers", Uwe Hollerbach, post
-to sci.math.research, Jan 1996, archived at Swarthmore,
-http://forum.swarthmore.edu/epigone/sci.math.research/zhouyimpzimp/x1ybdbxz5w4v@forum.swarthmore.edu
-
-"Multidigit Multiplication for Mathematicians", Daniel J. Bernstein,
-preprint available at http://koobera.math.uic.edu/www/papers.  (Every known
-multiplication technique, many references.)
-
-
-
-
-----------------
-Local variables:
-mode: text
-fill-column: 76
-End:
author	Kevin Ryde <user42@zip.com.au>	2000-10-21 01:16:40 +0200
committer	Kevin Ryde <user42@zip.com.au>	2000-10-21 01:16:40 +0200
commit	f0cb79dd8cc4dfdac3d299a5dae3de874ba20222 (patch)
tree	9db3b5fdf70965bd7c5807c68cc0e16565d1a303 /doc
parent	d5a9059c5798bf912a7bca2fa8002c62edc576b9 (diff)
download	gmp-f0cb79dd8cc4dfdac3d299a5dae3de874ba20222.tar.gz