Implementation of mpfr_sum: sum of n mpfr_t values
══════════════════════════════════════════════════
Introduction
────────────
The original mpfr_sum algorithm, used up to GNU MPFR 3.1.x, was very
inefficient on some inputs and could take all the memory, making the
whole program crash (and possibly other programs on the system too,
due to the lack of memory at some critical point). It was based on
the article:
James Demmel and Yozo Hida, Fast and accurate floating-point summation
with application to computational geometry, Numerical Algorithms,
volume 37, number 1-4, pages 101--112, 2004.
whose method consists of adding the inputs one by one in order of
decreasing magnitude.
But in MPFR, each number has its own precision and the exponent range is
large, so that this algorithm doesn't work well. The computation of the
sum was done with mpfr_add's in some fixed precision, in a conventional
Ziv loop. The inefficiency (with possible crash) was seen when a large
precision was needed due to the Table Maker's Dilemma. A huge precision
could be needed in case of internal cancellations between numbers of very
different orders of magnitude. This problem could not easily be fixed to
handle all the possible cases.
Moreover, it was decided to drop Demmel's algorithm because it is based
on floating-point operations while the basic blocks of MPFR are integer
operations, and they could be used directly to compute a sum. Another
drawback of Demmel's algorithm for the particular case of MPFR is that
there is no good way to avoid the Ziv loop:
* One could consider the largest precision among all the numbers
(each input and the output), but this could mean computations in
an internal precision much larger than actually necessary.
Note: This is what was done in the trunk in r8699, before this
rewrite of mpfr_sum. MPFR 3.1.3 and earlier are buggy:
https://sympa.inria.fr/sympa/arc/mpfr/2013-10/msg00015.html
https://sympa.inria.fr/sympa/arc/mpfr/2013-11/msg00001.html
* Split the inputs so that all the values have the same precision
(suggestion from Demmel), but, in particular in small precision,
this introduces an important overhead. Moreover, the lack of
subnormals makes this tricky as this could generate numbers that
are not representable due to a too small exponent; this might be
solved with internal versions of raw addition and subtraction
that support a wider exponent range (they are considered for a
correct fmma implementation).
This allows one to get an accurate result, but not correct rounding.
A Ziv loop as done with the previous mpfr_sum implementation would be
too costly due to the possibility of having an exact result extremely
close to a breakpoint (which is either a machine number or a midpoint
between two consecutive machine numbers, depending on the considered
rounding mode). A simple solution for this problem could be to create
a new array with the opposite of this breakpoint as an additional
input and redo the sum entirely in order to get an approximation to
the corresponding error term, thus the sign of this error term. But
the error term may not be representable due to the lack of subnormals
in MPFR (in this case, one would get underflows, with no information
on the sign of the error term), unless raw addition and subtraction
with a wider exponent range are implemented (as mentioned above). And
even when underflows do not occur, this would have poor performance
as all the computations need to be redone.
The time/memory complexity must no longer depend on the value of the
exponents of the inputs, i.e., their order of magnitude.
Instead of being high level (based on mpfr_add), the algorithm and
implementation should be low level, based on integer operations,
equivalently seen as fixed-point operations. They must be efficient
both for the "general case" and for particular cases, with global
cancellations (the result being much smaller than some of the inputs)
and/or the Table Maker's Dilemma. So, we will use functions from the
mpn layer of GMP only, involving virtual splitting of MPFR numbers.
As a bonus, this will also solve overflow, underflow and normalization
issues, since everything is done in fixed point and the output exponent
will be considered only at the end (early overflow detection could also
be done, but this would probably not be very useful in practice).
The idea is the following. After handling special cases (NaN, infinities,
only zeros, and fewer than 3 regular inputs), we apply the generic case,
which more or less consists in a fixed-point accumulation by blocks: we
take into account the bits of the inputs whose weight is in some window
⟦minexp,maxexp⟦, and if this is not sufficient due to cancellation, we
reiterate with some less significant window. Once we have obtained an
accurate sum, if one still cannot round correctly because the result is
too close to a rounding boundary (i.e., a machine number or the middle
of two consecutive machine numbers), which is the problem known as the
Table Maker's Dilemma (TMD), then this problem is solved by determining
the sign of the remainder by using the same method in a low precision.
Specification
─────────────
The prototype:
int
mpfr_sum (mpfr_ptr sum, mpfr_ptr *const x, unsigned long n, mpfr_rnd_t rnd)
where sum will contain the result, x is an array of pointers to the
inputs, n is the length of this array, and rnd is the rounding mode.
The return value of type int will be the usual ternary value.
If n = 0, then the result is +0, whatever the rounding mode. This is
equivalent to mpfr_set_ui and mpfr_set_si on the integer 0, and this
choice is consistent with IEEE 754-2008's sum reduction operation of
vector length 0 (this result +0 for n = 0 is the only requirement of
IEEE 754-2008 for the sum reduction operation).
Otherwise the result must be the same as the one that would have been
obtained with
* if n = 1, mpfr_set;
* if n > 1, a succession of additions (mpfr_add) done in infinite
precision, then rounded (the order of these additions doesn't
matter).
This is equivalent to applying the following steps:
1. If an input is NaN, then the result is NaN.
2. If there are at least one +inf and one −inf, then the result is NaN.
3. If there is at least one infinity (in which case all the infinities
have the same sign), then the result is this infinity.
4. If the result is an exact zero:
* if all the inputs have the same sign (i.e., all +0 or all −0),
then the result has the same sign as the inputs;
* otherwise, either because all inputs are zeros with at least
a +0 and a −0, or because some inputs are nonzero (but they
globally cancel), the result is +0, except for the MPFR_RNDD
rounding mode, where it is −0.
5. Otherwise the exact result is a nonzero real number, and the
conventional rounding function is applied.
Concerning the exceptions, one generates a NaN exception in cases (1)
and (2). Case (5) potentially yields inexact / underflow / overflow
exceptions, as specified by the rounding rules.
See also:
https://sympa.inria.fr/sympa/arc/mpfr/2014-03/msg00000.html
Note: The final standard IEEE 1788-2015 specifies in §12.12.12:
An exact zero shall be returned as +0 in all rounding directions,
except for roundTowardNegative, where −0 shall be returned.
However, such a choice for MPFR would not have been consistent with
the extension of the usual IEEE 754 and MPFR rules (see above).
Moreover, IEEE 1788-2015 is already incompatible with IEEE 754-2008
on the case n = 0 and roundTowardNegative.
Preliminaries About the Representation of MPFR Numbers
──────────────────────────────────────────────────────
In MPFR, exponents are stored in a signed integer type mpfr_exp_t.
If this type has N value bits, i.e., the maximum value is 2^N−1, then
the maximum exponent range is defined so that any valid exponent fits
in N−1 bits (sign bit excluded), i.e., it is ⟦1−2^(N−1),2^(N−1)−1⟧.
This implies a huge gap between the minimum value MPFR_EXP_MIN of
the type and the minimum valid exponent MPFR_EMIN_MIN.
This allows the following algorithm & implementation to be valid in
practical cases. Assertion failures could occur in cases involving
extremely huge precisions (detected mainly for security reasons).
In practice, these failures are not possible with a 64-bit ABI due
to memory limits. With a 32-bit ABI, users would probably reach other
system limits first (mainly due to the limit on the address space);
the best solution would be to switch to a 64-bit ABI for such computations.
The MPFR code of some other functions has similar requirements, which are
often not documented.
TODO: Give more details (e.g. a bound on the precision, but this also
depends on the exponents of the numbers).
Algorithm and Implementation: Preliminary Steps
───────────────────────────────────────────────
We start by detecting the special cases. The mpfr_sum function does the
following.
1. If n ≤ 2, we can use existing MPFR functions and macros, mainly for
better efficiency since the algorithm described below can work with
any number of inputs (only minor changes would be needed).
* If n = 0, return +0 (by using MPFR macros).
* If n = 1, use mpfr_set.
* If n = 2, use mpfr_add.
2. Iterate over the n input numbers to:
(A) detect singular values (NaN, infinity, zero);
(B) among the regular values, get the maximum exponent.
Such information can be retrieved very quickly and this does not need
to look at the significand. Moreover, in the current internal number
representation, the kind of a singular value is represented as special
values of the exponent field, so that (B) does not need to fetch more
data in memory after doing (A).
In detail, during this iteration, 4 variables will be set, but the
loop will terminate earlier if one can determine that the result
will be NaN, either because of a NaN input or because of infinity
inputs of opposite signs:
* maxexp, which will contain the maximum exponent of the inputs.
Thus it is initialized to MPFR_EXP_MIN.
* rn, which will contain the number n' of regular inputs, i.e.,
those which are nonzero finite numbers.
* sign_inf, which will contain the sign of an infinity result. It
is initialized to 0, meaning no infinities yet. When the first
infinity is encountered, this value is changed to the sign of
this infinity (+1 or −1). When a new infinity is encountered,
either it has the same sign as sign_inf, in which case nothing
changes, or it has the opposite sign, in which case the loop
terminates immediately and a NaN result is returned.
* sign_zero, which will contain the sign of the zero result in
the case where all the inputs are zeros. Thanks to the IEEE 754
rules, this can be tracked with this variable alone: There is a
weak sign (−1, except for MPFR_RNDD, where it is +1), which can
be obtained only when all the inputs are zeros of this sign, and
a strong sign (+1, except for MPFR_RNDD, where it is −1), which
is obtained in all the other cases, i.e., when there is at least
a zero of this sign. One could have initialized the value of
sign_zero to the weak sign. But we have chosen to initialize it
to 0, which means that the sign is currently unknown, and do an
additional test in the loop. In practice, one should not see the
difference; this second solution was chosen just because it was
implemented first, and on a test, it appears that it makes the
code slightly shorter.
When the loop has terminated "normally", the result cannot be NaN.
One does in the following order:
1. If sign_inf ≠ 0, then the result is an infinity of this sign.
2. If rn = 0, then all the inputs are zeros, so that the result
is a zero whose sign is given by sign_zero.
3. If rn ≤ 2, then one can use mpfr_set (if rn = 1) or mpfr_add
(if rn = 2) as an optimization, similarly to what was done for
n ≤ 2. We reiterate in order to find the concerned input(s).
4. Otherwise one calls a function sum_aux implementing the generic
case, more precisely:
sum_aux (sum, x, n, rnd, maxexp, rn);
i.e., in addition to the parameters of mpfr_sum, we pass:
* the maximum exponent;
* the number of regular inputs, i.e., the number of nonzero
inputs. This number will be used instead of n to determine
bounds on the sum (to avoid internal overflows) and error
bounds.
Algorithm and Implementation: The Generic Case
──────────────────────────────────────────────
Let us define logn = ceil(log2(n')).
The basic idea is to compute a truncated sum in the two's complement
representation, by using a fixed-point accumulator stored in a fixed
memory area.
Two's complement is preferred to the sign + magnitude representation
because the signs of the temporary (and final) results are not known
in advance, and the computations (additions and subtractions) in
two's complement are more natural here; the usual sign + magnitude
representation for the accumulator would have been possible with about
the same number of elementary operations (even in case of cancellation),
but the code would have been more complex, with more subcases; it is not
obvious which choice would have led to faster code, and this may depend
on many factors (the input data, the processor, the GMP version, the
compiler, etc.), but in any case, the difference probably remains small.
There will be a conversion to sign + magnitude (i.e., the representation
used by MPFR numbers) at the end, but in general, this should not take
much time compared to the other calculations. Two's complement will also
make some internal parts of the algorithm simpler: we will use the fact
that a number starting with a sequence of 1's is close to 0 (secondary
term when the TMD occurs) and that each bit except the first one has a
positive value (for the error bound).
The precision of the accumulator needs to be a bit larger than the
output precision, denoted sq, for two reasons:
* We need some additional bits on the side of the most significant
part due to the accumulation of n' values, which can make the sum
grow and overflow without enough extra bits. The absolute value of
the sum is less than n' * 2^maxexp, taking up to logn extra bits,
and one needs one more bit to be able to determine the sign due to
two's complement, so that a total of cq = logn + 1 extra bits will
be necessary.
* We need some additional bits on the side of the least significant
part to take into account the accumulation of the truncation errors.
The choice of this number dq of bits is quite arbitrary: the larger
this value is, the longer an iteration will take, but conversely,
the less likely a costly new iteration (due to cancellations and/or
the Table Maker's Dilemma) will be needed. In order to make the
implementation simpler, the precision of the accumulator will be a
multiple of the word size (GMP_NUMB_BITS). Moreover, the algorithm
will need at least 4 bits. The final choice should be done after
testing various applications. In the current implementation, we
choose the smallest value ≥ logn + 2 such that the precision of the
accumulator is a multiple of GMP_NUMB_BITS. Since logn ≥ 2, we have
dq ≥ 4 as wanted.
Accumulator precision: wq = cq + maxexp − minexp = cq + sq + dq.
             ┌─────────────────── wq ────────────────────┐
Accumulator: [--------]-----------------------------------]
                cq    └─ maxexp      sq + dq     minexp ─┘
This idea is not sufficient because if the Table Maker's Dilemma occurs,
the needed precision for the truncated sum will grow. In particular, one
can easily reach a huge precision with a few small-precision inputs. For
instance, in directed rounding modes, sum(2^E,2^F) with F much smaller
than E. We want to avoid increasing the precision of the accumulator.
This will be done by detecting the Table Maker's Dilemma, and when it
occurs, solving it consists in determining the sign of some error term.
This will be done by computing an approximation to the error term in a
small precision. The algorithm to compute this error term is the same
as the one to compute an approximation to the sum, the main difference
being that we just need a small precision here. In the implementation,
we will use a function sum_raw for both computations.
Algorithm and Implementation: The sum_raw Function
──────────────────────────────────────────────────
In summary:
The sum_raw function will work in a fixed-point accumulator, having a
fixed precision (= size in limbs * GMP_NUMB_BITS). It will accumulate
the bits of the inputs whose exponents are in ⟦minexp,maxexp⟦, where
maxexp − minexp is less than the precision of the accumulator: as said
above, we need some additional bits in order to avoid overflows during
the accumulation. On the entry, the accumulator may already contain a
value from previous computations (it is the caller that clears it if
need be).
During the accumulation, the bits of an input x[i] whose exponents are
less than minexp form the "tail" of this input. When the tail of x[i]
is not empty, its exponent is defined as e[i] = min(E(x[i]),minexp).
The absolute value of each tail is strictly less than 2^e[i]. This will
give an error bound on the computed sum at each iteration:
rn · 2^(sup(e[i])) ≤ 2^(sup(e[i]) + logn).
If the computed result is 0 (meaning full cancellation), set maxexp to
the maximum exponent of the tails, set minexp so that it corresponds to
the least significant bit of the accumulator, and reiterate. Otherwise,
let e and err denote the exponents of the computed result and of the
error bound respectively. While e − err is less than some given bound
denoted prec, shift the accumulator, update maxexp and minexp, and
reiterate. For the caller, this bound must be large enough in order to
reach some wanted accuracy. However it cannot be too large since the
accumulator has a limited precision: we will need to make sure that
if a reiteration is needed, then the cause is a partial cancellation,
so that the shift count is nonzero, otherwise minexp would not change
and one would get an infinite loop. Details and formal definitions are
given later.
Basically, this can be regarded as a sum by block, where each block
has the form ⟦minexp,maxexp⟦.
Notes:
* The reiterations will end when there are no more tails, but in the
code, this is detected only when needed.
* The definition of the tails allows one to skip gaps between inputs
in case of full cancellation (e.g. 1 + (−1) + 2^tiny_exponent).
* We choose not to include maxexp in the exponent interval in order
to match the convention chosen to represent floating-point numbers
in MPFR, where the significand is in [1/2,1[, i.e., the exponent of
a floating-point number is the one of the most significant bit + 1.
This also means that the current minexp will be maxexp at the next
iteration, unless there is a gap between the inputs, as explained
above.
Inputs of the sum_raw function (see the code for the types):
* wp: pointer to the accumulator (least significant limb first).
* ws: size of the accumulator (in limbs).
* wq: precision of the accumulator (ws * GMP_NUMB_BITS).
* x: array of the input numbers.
* n: size of this array (number of inputs, regular or not).
* minexp: exponent of the least significant bit of the first block.
* maxexp: exponent of the first block (exponent of its MSB + 1).
* tp: pointer to a temporary area (preallocated).
* ts: size of this temporary area.
* logn: ceil(log2(rn)), where rn is the number of regular inputs.
* prec: lower bound for e − err (as described above).
* ep: pointer to mpfr_exp_t (see below), or a null pointer.
* minexpp: pointer to mpfr_exp_t (see below), or a null pointer.
* maxexpp: pointer to mpfr_exp_t (see below), or a null pointer.
We require as preconditions (explanations are given later):
* prec ≥ 1
* wq ≥ logn + prec + 2
This function returns 0 if the accumulator is 0 (which implies that
the exact sum for this sum_raw invocation is 0), otherwise the number
of cancelled bits (≥ 1), defined as the number of identical bits on
the most significant part of the accumulator. In the latter case, it
also returns the following data in variables passed by reference,
unless ep, minexpp and maxexpp are null pointers (such data are useful
only after the first invocation of sum_raw):
* in ep: the exponent e of the computed result;
* in minexpp: the last value of minexp;
* in maxexpp: the last value of maxexp2 (which would be the new value
of maxexp for the next iteration, i.e. the first iteration of the
second invocation of sum_raw in case of TMD resolution).
Notes:
* All the bits of the accumulator are meaningful; minexp is also
the exponent of the least significant bit of the accumulator.
* The temporary area must be large enough to hold a shifted input
block, and the value of ts is used only when the full assertions
are checked (i.e., with the --enable-assert configure option), to
check that a buffer overflow doesn't occur. In the code, ts will
be chosen to hold at least wq − cq + GMP_NUMB_BITS − 1 bits (it
will not be modified for the second sum_raw invocation, but this
is fine as wq is not increased for the second sum_raw invocation).
We just need to check that this choice is correct each time the
accumulator is shifted. Here for sum_raw, we will need to prove
that maxexp − minexp ≤ wq − cq at each iteration.
* Contrary to the returned value of minexp (the value in the last
iteration), the returned value of maxexp is the one for the next
iteration (= maxexp2 of the last iteration, see below).
* By definition, the exponent of a positive number is the one of the
least significant 0 in the semi-infinite leading string of 0's of
the significand. But in two's complement, the negative significands
are represented with a leading string of 1's, and we keep a similar,
simple definition for the exponent of a negative number: it is the
one of the least significant 1 in the leading string of 1's. Thus
a value represented by a finite number of bits is in [2^(e−1),2^e[
for positive values, and in [−2^e,−2^(e−1)[ for negative values;
the interval for the sign + magnitude representation would be:
]−2^e,−2^(e−1)] for negative values.
Note: This definition can also be justified as follows. Negating a
number in two's complement (in order to obtain its magnitude) can
be done by complementing all the bits of the infinite series, and
the intervals would be the same if one considered noncanonical
binary expansions, i.e. with an infinite trailing string of 1's:
[−2^e,−2^(e−1)]. The difference between the open/closed bounds for
[−2^e,−2^(e−1)[ and ]−2^e,−2^(e−1)] is due to the fact that a number
represented by a finite number of bits (i.e., a canonical binary
expansion) is changed to a noncanonical binary expansion after
complement: the infinite trailing string of 0's is changed to an
infinite trailing string of 1's, and the conversion to a canonical
binary expansion could change the exponent.
Details about the sum_raw implementation are given below. Here is some
notation:
E(v): the exponent of a MPFR number v.
P(v): the precision of a MPFR number v.
Q(v) = E(v) − P(v): the exponent of the ulp of a MPFR number v.
A maxexp2 variable will contain the maximum exponent of the tails.
Thus it is initialized to the minimum value of the exponent type:
MPFR_EXP_MIN; this choice means that at the end of the loop below,
maxexp2 will be MPFR_EXP_MIN iff there are no more tails (this case
implies that the truncated sum is exact). If a new iteration is needed,
then maxexp2 will be assigned to the maxexp variable for this new
iteration.
Then one has a loop over the inputs x[i]. Each input is processed with
the following steps:
1. If the input is not regular, skip it.
Note: if there are many singular inputs, it may be more efficient
to have an array pointing to the regular inputs only, but such a
case is assumed to be rare, and the number of iterations of this
inner loop is also limited by the relatively small number of regular
inputs.
2. If E(x[i]) ≤ minexp, then no bits of x[i] need to be considered here.
We set maxexp2 ← max(maxexp2,E(x[i])), then go to the next input.
3. Now, we have: E(x[i]) > minexp. If the tail of x[i] is not empty,
i.e., if Q(x[i]) < minexp, then we set maxexp2 ← minexp.
4. We prepare the input for the accumulation. First, this means that if
its significand is not aligned with the accumulator, then we need to
align it by shifting a part of the significand (containing bits that
will be accumulated at this iteration), storing the result to the
temporary area tp. In both cases, we compute:
* the address and size of the destination, corresponding either to
the full accumulator or to some most significant part (which can
occur if Q(x[i]) > minexp);
* the address and size of the block (of the input or its shifted
version) to accumulate to the destination;
* the number of the most significant bits of the block that need
to be masked. Masking may be needed if the input was not shifted
because in such a case, the input is regarded as a read-only area.
5. If x[i] is positive: an addition with carry out is done with
mpn_add_n; if the most significant limb needs to be masked, then it
isn't taken into account in the addition, but the masked part is just
added to the carry; carry propagation is done with mpn_add_1 if the
size of the destination is larger than the size of the block.
Note: There may still be a carry out, but it is just ignored. This
occurs when a negative value in the accumulator becomes nonnegative,
and this fact is part of the usual two's complement arithmetic.
If x[i] is negative, we do similar computations by using mpn_sub_n
for the subtraction and mpn_sub_1 to propagate borrows.
Note: The steps 2, 3, and 4 above are currently done by distinguishing
two cases:
A. Q(x[i]) < minexp, covering cases like:
  [-+- accumulator ---]
[---│----- x[i] ------│--]
    │   [----- x[i] --│--]
    │                 │[----- x[i] -----]
    │                 │ [----- x[i] -----]
  maxexp            minexp
B. Q(x[i]) ≥ minexp, covering cases like:
          [-+-- accumulator --]
 [- x[i] -] │                 │
        [---│-- x[i] ------]  │
     [------│-- x[i] ---------]
            │    [- x[i] -]   │
          maxexp            minexp
It might be possible to merge these cases in a future version of the
code.
After the loop over the inputs, we need to see whether the accuracy
of the truncated sum is sufficient. We first determine the number of
cancelled bits, defined as the number of consecutive identical bits
starting with the most significant one in the accumulator. At the
same time, we can determine whether the truncated sum is 0 (all the
bits are identical and their value is 0). If it is 0, we have two
cases: if maxexp2 is equal to MPFR_EXP_MIN (meaning no more tails),
then we return 0, otherwise we reiterate at the beginning of sum_raw
with:
* minexp = cq + maxexp2 − wq
* maxexp = maxexp2
so that we still have in this subcase: wq = cq + maxexp − minexp.
We can now assume that the truncated sum is not 0.
Let us note that our computation of the number of cancelled bits
(variable cancel) was limited to the accumulator representation,
while from a mathematical point of view, the binary expansion is
unlimited and the bits of exponent less than minexp are regarded
as 0's:
[--- accumulator ---]000000...
minexp ─┘└─ first nonrepresented bit = 0
So, we need to check that the computed value of cancel matches this
mathematical point of view:
* If the cancelled bits are 0's, then, since the truncated sum is
not 0, there must be at least one 1 bit in the accumulator.
* If the cancelled bits are 1's, then this sequence entirely fits
in the accumulator, as the first nonrepresented bit is a 0.
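The cancelled-bit count can be illustrated as follows; count_cancelled_bits is a hypothetical helper working on a most-significant-limb-first array (a simplification for illustration, not the MPFR representation):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch: count the number of consecutive bits equal to
   the sign bit, starting from the most significant bit of a two's
   complement accumulator stored as 'n' 64-bit limbs, most significant
   limb first.  This corresponds to the 'cancel' variable. */
static unsigned long count_cancelled_bits(const uint64_t *acc, size_t n)
{
    uint64_t sign = acc[0] >> 63;                /* 0 or 1 */
    uint64_t ref = sign ? ~(uint64_t) 0 : 0;     /* all-1 or all-0 limb */
    unsigned long cancel = 0;
    for (size_t i = 0; i < n; i++) {
        if (acc[i] == ref) {
            cancel += 64;
        } else {
            uint64_t x = acc[i] ^ ref;  /* first 1 bit of x marks the
                                           first non-cancelled bit */
            while (!(x & ((uint64_t) 1 << 63))) {
                cancel++;
                x <<= 1;
            }
            break;
        }
    }
    return cancel;
}
```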
The analysis below maps the truncated sum to the destination without
considering rounding yet.
Let us denote:
* e = maxexp + cq − cancel = minexp + wq − cancel,
* err = maxexp2 + logn.
Then e is the exponent of the least significant cancelled bit, thus the
absolute value of the truncated sum is in [2^(e−1),2^e] (binade closed
on both ends due to two's complement). Since there are at most 2^logn
regular inputs and the absolute value of each tail is < 2^maxexp2, the
absolute value of the error is strictly less than 2^err. If maxexp2 is
equal to MPFR_EXP_MIN (meaning no more tails), then the error is 0.
Note: The C implementation requires prec ≥ 0 due to the use of unsigned
integer arithmetic in the ≥ test between e − err and prec in order to
avoid a potential integer overflow. This is not an issue since, to be
able to get the sign of the result, one needs prec ≥ 1. Actually, for this
reason, the C code uses prec ≥ 1 as a precondition.
If e − err ≥ prec, then the sum_raw function returns as described above.
Otherwise, due to cancellation, we need to reiterate after shifting the
value of the accumulator to the left and updating minexp and maxexp. Let
shiftq denote the shift count, which must satisfy: 0 < shiftq < cancel.
The left inequality must be strict to ensure termination, and the right
inequality ensures that the value of the accumulator will not change
with the updated minexp: shiftq is subtracted from minexp at the same
time. The reiteration is done with maxexp = maxexp2, as said above.
Example. If there is an additional iteration with maxexp2 = minexp − 4
and a shift of 26 bits (due to cancellation), here is the accumulator
before and after the shift:
Before shift: [--------]-----------------------------------]
cq ├─ maxexp sq + dq minexp ─┤
00000000000000000000000000001----------------]
<--- identical bits (0) --->
<------- 26 zeros ------->
After shift: 001-----------------00000000000000000000000000
This iteration: minexp ─┘ ├─ maxexp2 │
Next iteration: └─ maxexp minexp ─┘
We now need to determine the value of shiftq. We prefer it to be as
large as possible: this is some form of normalization. Moreover, it
must satisfy the above double inequality and be such that:
(1) the new value of minexp is smaller than the new value of maxexp,
i.e., minexp − shiftq < maxexp2, i.e., shiftq > minexp − maxexp2;
(2) overflows will still be impossible in the new iteration.
Note that since maxexp2 ≤ minexp, (1) will imply shiftq > 0. And (2)
is an extension of shiftq < cancel. Thus the double inequality above
is a weak form of what is actually required.
Since we prefer shiftq to be maximum, we focus on (2) first. The
absolute value of the accumulator at the end of the next iteration
will be strictly bounded by 2^e + 2^err ≤ 2^(e + 1 + max(0, err − e)).
This means that if we do not shift the value in the accumulator, then
at the end of the next iteration, the accumulator will contain at least
cancel − 1 − max(0, err − e) identical bits on its most significant
part. Only the last of these bits is needed (which gives the sign) and
the other ones are redundant. Therefore, in order to satisfy (2), we
can choose:
shiftq = cancel − 2 − max(0, err − e).
Now, let us prove that for this value, (1) is satisfied. At the same
time, let us also prove that maxexp2 − (minexp − shiftq) ≤ wq − cq to
guarantee that the chosen size ts is still sufficient (intuitively,
this is implied by the impossibility of overflow).
* If err − e ≥ 0, then:
shiftq = cancel − 2 − (err − e)
= cancel − 2 − (maxexp2 + logn) + (minexp + wq − cancel)
= minexp − maxexp2 + (wq − logn − 2) > minexp − maxexp2
since wq ≥ logn + prec + 2 ≥ logn + 3 on input.
Starting with the above equality,
maxexp2 − (minexp − shiftq) = wq − logn − 2 ≤ wq − cq.
* If err − e < 0, then this is the case where the error can be
potentially small: to be able to prove the inequality, we need
to use the fact that the stop condition was not satisfied, i.e.,
e − err < prec. Thus
(minexp + wq − cancel) − (maxexp2 + logn) < prec
and
shiftq − (minexp − maxexp2) = cancel − 2 − (minexp − maxexp2)
> wq − logn − prec − 2 ≥ 0
since wq ≥ logn + prec + 2 on input.
Since err − e < 0 and cancel = shiftq + 2, we have:
(maxexp2 + logn) − (minexp + wq − shiftq − 2) < 0,
thus
maxexp2 − (minexp − shiftq) < wq − logn − 2 ≤ wq − cq.
QED.
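As a minimal sketch of the shift-count formula (choose_shiftq is an illustrative name; the sample values below match the 26-bit shift example above, with cancel = 28, maxexp2 = minexp − 4 and logn = 5):

```c
#include <assert.h>

typedef long exp_t;   /* stand-in for mpfr_exp_t in this sketch */

/* Sketch of the shift-count choice:
     shiftq = cancel - 2 - max(0, err - e),
   with e = minexp + wq - cancel and err = maxexp2 + logn, as defined
   in the text. */
static exp_t choose_shiftq(exp_t minexp, exp_t wq, exp_t cancel,
                           exp_t maxexp2, exp_t logn)
{
    exp_t e = minexp + wq - cancel;
    exp_t err = maxexp2 + logn;
    exp_t excess = err > e ? err - e : 0;   /* max(0, err - e) */
    return cancel - 2 - excess;
}
```

With these sample values one can also check property (1), shiftq > minexp − maxexp2, and the bound maxexp2 − (minexp − shiftq) ≤ wq − cq.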
Note: It is expected in general that when a cancellation occurs so that
a new iteration is needed, the cancellation is not very large (but this
really depends on the problem), in which case the new additions will
take place only in a small part of the accumulator, except in case of
long carry propagation (see the end of this file).
Old discussion about the choice of the shift count (no longer very
useful, but its examples can help understanding):
Concerning the choice of the shift count shiftq, letting only one
identical bit in the most significant part may not be sufficient;
for instance, if
a. maxexp2 = minexp;
b. the accumulator contains:
0000000011111111111111111111111111110101
This iteration: minexp ─┘
c. the accumulator is shifted to:
0111111111111111111111111111101010000000
Next iteration: └─ maxexp
d. there are at least 12 numbers to add at the next iteration;
then one could end up with something like:
1000000000000000000000000000000000010001
└─ maxexp
i.e., an overflow in two's complement. But leaving at least
2 + max(0, err − e) identical bits in the most significant part,
such as
0011111111111111111111111111110101000000
└─ maxexp
is sufficient. The second term of the max covers cases like:
┌─ err = maxexp2 + logn
0000000000000000000000000000000000000111
e ─┘
(2^err being a bound on the error term, thus a bound on the value
that will be added to the accumulator at the next iteration), for
which the accumulator can be shifted to:
┌─ err
0000000111000000000000000000000000000000
e ─┘
without triggering an overflow at the next iteration, but
┌─ err
0000001110000000000000000000000000000000
e ─┘
is incorrect as a 1 could appear in the MSB (making the accumulator
value negative) just with additions of positive numbers.
Said otherwise, leaving at least i identical bits allows one to
represent numbers in [−2^(e+i−1),2^(e+i−1)[. The current sum is in
[−2^e,2^e[, and taking into account the error terms, one wants to
be able to represent arbitrary numbers in ]−2^e−2^err,2^e+2^err[.
So, i must be chosen in such a way that 2^e + 2^err ≤ 2^(e+i−1),
i.e., 2^0 + 2^(err−e) ≤ 2^(i−1). The smallest power of two larger
than or equal to 2^0 + 2^(err−e) has exponent 1 + max(0, err − e).
Hence the choice i ≥ 2 + max(0, err − e).
Algorithm and Implementation: Back to the Generic Case
──────────────────────────────────────────────────────
Let us recall the form of the accumulator for the summation itself,
i.e., the first sum_raw invocation:
Accumulator: [--------]-----------------------------------]
cq └─ maxexp sq + dq minexp ─┘
where cq = logn + 1, sq is the target precision, and dq ≥ logn + 2.
A single chunk of memory is allocated both for the accumulator and for
the temporary area needed by sum_raw. We also chose to allocate memory
for the possible TMD resolution (as explained later) in the same chunk;
this second accumulator will be useless in most cases (it is necessary
only if the TMD occurs and some input is reused as the output), but in
the current implementation, it takes at most two limbs in practice,
so that this does not make a noticeable difference. For performance
reasons, the memory is allocated in the stack instead of the heap if
its size is small enough (see MPFR_TMP_LIMBS_ALLOC macro). No other
memory allocation will be needed (except for auto variables).
Note: Having a small-size accumulator for sum_raw, either for the main
computation or for the TMD resolution, is not the best choice for the
worst-case complexity. For the time being, we focus on correctness and
make sure that the implementation is fast on almost all cases and not
too slow on corner cases. In the future, we may want to fix a minimal
size for the accumulator or allow it to grow dynamically, for instance
in a geometric progression after a few iterations (similarly to what is
done for Ziv loops in the TMD resolution for mathematical functions).
The accumulator is zeroed and sum_raw is invoked to compute an
accurate approximation of the sum. Among its parameters, maxexp was
computed during the preliminary steps, minexp = maxexp − (wq − cq),
and prec = sq + 3, which satisfies the wq ≥ logn + prec + 2 sum_raw
precondition:
wq = cq + sq + dq ≥ logn + 1 + sq + 4 = logn + prec + 2.
If the sum_raw function returns 0, then the exact sum is 0, so that we
just set the MPFR number sum to 0 with the correct sign according to the
IEEE 754 rules (positive, except for MPFR_RNDD, where it is negative),
and return with ternary value = 0.
Now, the accumulator contains the significand of a good approximation
to the nonzero exact sum. The corresponding exponent is e and the sign
is determined from one of the cancelled bits. The exponent of the ulp
for the target precision is u = e − sq. We have:
* err = maxexp + logn (where maxexp is the maxexp2 value returned by
sum_raw) as in sum_raw;
* e − err ≥ prec = sq + 3.
Thus err ≤ u − 3, i.e. the absolute value of the error is strictly less
than ulp(computed value) × 2^(−3) = 2^(u−3).
Here is a representation of the accumulator and the cancelled bits,
with the two cases depending on the sign of the truncated sum, where
the x's correspond to the sq−1 represented bits following the initial
value bit (1 if positive sum, 0 if negative), r is the rounding bit,
and the bits f are the following bits:
][------------------- accumulator -------------------]
][---- cancel ----]----------------------------------]
]0000000000000000001xxxxxxxxxxxxxxxxxxxxxrffffffffffff0000...
]1111111111111111110xxxxxxxxxxxxxxxxxxxxxrffffffffffff0000...
└─ e u ─┘ minexp ─┘
Note that the iterations in sum_raw may have stopped even in the case
of a large cancellation: it suffices that the error term be small
enough, i.e., that the tails for the last iteration consisted only of
inputs x[i] whose exponent was very small compared to minexp. In such
a case, the bit r and some of the least significant bits x may fall
outside of the accumulator, in which case they are regarded as 0's
(still due to truncation). In the following, we will make sure that
we do not try to read nonrepresented bits.
When maxexp (corresponding to the maxexp2 value returned by sum_raw)
is different from MPFR_EXP_MIN, i.e., when some bits of the inputs
have still not been considered, we will need to determine whether the
TMD occurs. In this case, we will compute d = u − err, which is ≥ 3,
and can be very large if maxexp is very small; nevertheless, d is
representable in a mpfr_exp_t since:
* If maxexp < minexp, then maxexp is the exponent of an input x[i],
so that maxexp ≥ MPFR_EMIN_MIN; and since u ≤ MPFR_EMAX_MAX, we
have d ≤ MPFR_EMAX_MAX − MPFR_EMIN_MIN, which is representable in
a mpfr_exp_t as per definition of MPFR_EMIN_MIN and MPFR_EMAX_MAX
in MPFR (this choice was made to simplify internal computations).
* If maxexp = minexp, then d ≤ (minexp + wq) − maxexp = wq, which
is representable in a mpfr_exp_t since this type can contain all
precision values (type mpfr_prec_t).
The TMD occurs when the sum is close enough to a "breakpoint", which is
defined as a discontinuity point of the function that maps a real input
to the correctly rounded value and the ternary value. This is either a
machine number (i.e., a number whose significand fits on sq bits) or a
midpoint between two consecutive machine numbers, depending on the
rounding mode:
┌───────────────┬────────────────┐
│ Rounding mode │ Breakpoint │
├───────────────┼────────────────┤
│ to nearest │ midpoint │
│ to nearest │ machine number │
│ directed │ machine number │
└───────────────┴────────────────┘
(in the second case, the correctly rounded sum can be determined, but
not the ternary value, and this is why the TMD occurs). More precisely,
the TMD occurs when:
* directed rounding modes: the d bits following the ulp bit are
identical;
* round-to-nearest mode: the d−1 bits following the rounding bit are
identical.
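The directed-rounding criterion can be sketched on a toy 64-bit window of the accumulator (tmd_directed and the window layout are assumptions for illustration, with bits numbered from the MSB and sq + d ≤ 64; the real code must also handle nonrepresented bits):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: detect the TMD for directed rounding modes on a
   toy 64-bit window of the accumulator whose MSB is the first bit of
   the sq-bit significand.  The TMD occurs when the d bits following
   the ulp bit (the last bit of the significand) are identical, i.e.,
   all 0's or all 1's.  Assumes sq >= 1, d >= 1, sq + d <= 64. */
static int tmd_directed(uint64_t window, unsigned sq, unsigned d)
{
    /* drop the sq significand bits, then keep the next d bits */
    uint64_t bits = (window << sq) >> (64 - d);
    return bits == 0 || bits == ((uint64_t) 1 << d) - 1;
}
```

For round-to-nearest, the analogous test would look at the d−1 bits following the rounding bit instead.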
Several things need to be considered, in arbitrary order:
* the copy to the destination (if the destination is used by an input,
the TMD may need to be resolved first);
* a shift (for the alignment), if the shift count is nonzero
(this is the most probable case);
* a negation/complement if the value is negative (cancelled bits = 1),
since the significand of MPFR numbers uses the conventional
sign + absolute value representation;
* rounding (the TMD needs to be resolved first if it occurs).
It is more efficient to merge some of these operations, i.e., do them at
the same time, and this possibility depends on the operations provided
by the mpn layer of GMP. Ideally, all these operations should be merged
together, but this is not possible with the current version of GMP
(6.1.1).
For practical reasons, the shift should be done before the rounding,
so that all the bits are represented for the rounding. The copy itself
should be done together with the shift or the negation, because this
is where most of the limbs are changed in general. We chose to do it
with the shift as it is assumed that the proportion of nonzero shift
counts is higher than the proportion of negations.
Moreover, for negative values, the difference between negation and
complement is similar to the difference between rounding directions
(these operations are identical on the real numbers, i.e., in infinite
precision), so that negation/complement and rounding can naturally be
merged, as detailed later.
We will do the following:
1. Determine how the result will be rounded. If the TMD occurs, it is
resolved at this step.
2. Copy the truncated accumulator (shifted) to the destination. For
simplicity, after this step, the trailing bits of the destination
(present when the precision is not a multiple of GMP_NUMB_BITS)
contain garbage. Since rounding needs a specific operation on the
least significant limb, these trailing bits (located in this limb)
will be zeroed in the next step.
3. Take the complement if the result is negative, and at the same time,
do the rounding and zero the trailing bits.
4. Set the exponent and handle a possible overflow/underflow.
Details for each of these 4 steps are given below.
Algorithm and Implementation: Rounding Information
──────────────────────────────────────────────────
The values of 3 variables are determined:
* inex: 0 if the final sum is known to be exact (which can be the
case only if maxexp = MPFR_EXP_MIN), otherwise 1.
* rbit: the rounding bit (0 or 1) of the truncated sum, changed to 0
for halfway cases that will round toward −infinity if the rounding
mode is to nearest (so that this bit gives the rounding direction),
as explained below.
* tmd: three possible values:
0: the TMD does not occur.
1: the TMD occurs on a machine number.
2: the TMD occurs on a midpoint.
Note: The value of inex will be used only if the TMD does not occur
(i.e. tmd = 0). So, inex could be left indeterminate when tmd ≠ 0,
but this would not simplify the current code.
This is done by considering two cases:
* u > minexp. The rounding bit, which is represented, is read. Then
there are two subcases:
* maxexp = MPFR_EXP_MIN. The sum in the accumulator is exact.
Thus inex will be the value of rounding bit || sticky bit.
In round to nearest, rbit = 1 will mean that the value is
to be rounded toward +infinity, even for halfway cases as
it is easier to handle these cases now. The variable rbit
is initially set to the value of the rounding bit. We need
to determine the sticky bit if
* rbit = 0, or
* rbit = 1 and rnd = MPFR_RNDN and the least significant bit
of the truncated sq-bit significand (i.e., the bit before
the rounding bit) is 0; if the sticky bit is 0, this
halfway value will have to be rounded toward −infinity,
so that rbit is changed to 0. Note that for sq ≥ 2, the
parity of the rounded significand does not depend on the
representation (two's complement or sign + magnitude);
that is why, even though the significand is currently
represented in two's complement, we round to even. To
illustrate this point, let us give an example with a
negative value:
…1110.1100[100000] (two's complement)
…1110.1100 (rounded with rbit = 0, to even)
…0001.0100 (magnitude)
where the bits inside the brackets are those after the
truncated sq-bit significand. If we had converted the
accumulator first, we would have obtained:
…0001.0011[100000] (magnitude)
…0001.0100 (rounded with rbit = 1, to even)
i.e., the same result.
For sq = 1, the IEEE 754 rule for halfway cases is to choose
the value larger in magnitude, i.e., round away from zero.
Discussion about this rule and errata:
http://grouper.ieee.org/groups/754/email/msg03907.html
http://speleotrove.com/misc/IEEE754-errata.html
therefore, in this case, we want to keep rbit to 1 for
positive values, and set it to 0 for negative values; but
it happens that this corresponds to the rule chosen for
sq ≥ 2 (since the only bit of the truncated significand
is 1 for positive values and 0 for negative values),
so that there is no need to distinguish cases in the code.
And tmd is set to 0 because one can round correctly, knowing
the exact sum.
* maxexp ≠ MPFR_EXP_MIN. We do not know whether the final sum is
exact, so that we set inex to 1. We also determine the value of
tmd as briefly described above (the code is quite complex since
we need to take into account the fact that not all the bits are
represented).
* u ≤ minexp. The rounding bit is not represented (its value is 0),
thus rbit is set to 0. If maxexp = MPFR_EXP_MIN, then both inex and
tmd are set to 0; otherwise they are set to 1 (the bits following
the ulp bit are not represented, thus are all 0's, implying that
the TMD occurs on a machine number).
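The claim that rounding to even commutes with the two's complement/magnitude conversion (for sq ≥ 2) can be checked on the example above: the value …1110.1100[100000] with six dropped bits is −1248/64 as an integer scaled by 2^(−6). Here rne6 is a toy helper, not MPFR code, rounding x/64 to nearest with ties to even:

```c
#include <assert.h>

/* Toy check of the round-to-even claim: round x/64 to the nearest
   integer, ties to even.  Rounding the two's complement value and then
   taking the magnitude gives the same result as taking the magnitude
   first and then rounding: |rne6(-1248)| = rne6(1248) = 20. */
static long rne6(long x)
{
    long q = x / 64, r = x % 64;     /* C division truncates toward 0 */
    if (r < 0) { q -= 1; r += 64; }  /* convert to floor division */
    if (r > 32 || (r == 32 && (q & 1)))
        q += 1;                      /* round up; ties go to even */
    return q;
}
```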
We also determine the sign of the result: a variable neg is set to the
value of the most significant bit of the accumulator, and a variable
sgn to the corresponding sign. In short:
┌──────────┬─────┬─────┐
│ number │ neg │ sgn │
├──────────┼─────┼─────┤
│ positive │ 0 │ 1 │
│ negative │ 1 │ −1 │
└──────────┴─────┴─────┘
Now we seek to determine how the value will be rounded, more precisely,
what correction will be done to the significand that will be copied.
We currently have a significand, a trailing term t in the accumulator
such that 0 ≤ t < 1 ulp (nonnegative thanks to the two's complement
representation), and an error on the trailing term bounded by
t' ≤ 2^(u−3) = 2^(−3) ulp in absolute value, so that the error ε on the
significand satisfies −t' ≤ ε < 1 ulp + t'. Thus one has 4 correction
cases, denoted by an integer value corr between −1 and 2, which depends
on ε, the sign of the result (neg / sgn), rbit, and the rounding mode:
−1: equivalent to nextbelow;
0: no correction;
+1: equivalent to nextabove;
+2: equivalent to 2 consecutive nextabove.
We also determine the ternary value inex that will be provided as the
second argument of mpfr_check_range.
We distinguish two cases:
* tmd = 0. The TMD does not occur, so that the error has no influence
on the rounding and the ternary value (one can assume t' = 0). One
has inex = 0 if and only if t = 0, so that inex is currently the
absolute value of the ternary value. Therefore:
* for MPFR_RNDD, corr ← 0;
* for MPFR_RNDU, corr ← inex;
* for MPFR_RNDZ, corr ← inex && neg;
* for MPFR_RNDA, corr ← inex && !neg;
* for MPFR_RNDN, corr ← rbit.
If inex ≠ 0 and corr = 0, we set inex to −1. The variable inex now
contains the ternary value.
* tmd ≠ 0. The TMD occurs and will be solved by determining the sign
(−1, 0 or +1) of a secondary term with a second sum_raw invocation
using a small-precision accumulator.
Note: In the code written before the support of reused inputs as the
output, the accumulator had already been copied to the destination,
so that a part of the memory of this accumulator could be reused for
the small-size accumulator for the TMD resolution. This is no longer
possible, but currently not a problem since the accumulator for the
TMD resolution takes at most only 2 limbs in practice; however, in
the future, we might want the accumulators to grow dynamically, as
explained above.
We set up a new accumulator of size cq + dq (= wq − sq) rounded up
to the next multiple of the word size (GMP_NUMB_BITS); let us call
this size zq (it will correspond to wq in sum_raw). From the old
accumulator, bits whose exponent is in ⟦minexp,u⟦ (when u > minexp)
will not be copied to the destination; these bits will be taken
into account as described below.
Let us recall that the d−1 bits from exponent u−2 to u−d (= err)
are identical. We distinguish two subcases:
* err ≥ minexp. The last two of the d−1 identical bits and the
following bits, i.e., the bits from err+1 to minexp, are copied
(possibly with a shift) to the most significant part of the new
accumulator.
The minexp value of this new accumulator is thus defined as:
minexp = err + 2 − zq,
so that:
maxexp − minexp = (err − logn) − (err + 2 − zq)
= zq − logn − 2 ≤ zq − cq,
satisfying the condition for ts.
* err < minexp. Here at least one of the identical bits is not
represented, meaning that it is 0 and all these bits are 0's.
Thus the accumulator is set to 0. The new minexp is determined
from maxexp, with cq bits reserved to avoid overflows, just
like in the main sum.
Then sum_raw is called with prec = 1, satisfying the first sum_raw
precondition (prec ≥ 1). And we have:
wq ≥ cq + dq ≥ logn + 3 = logn + prec + 2,
corresponding to the second sum_raw precondition.
The sign of the secondary term (−1, 0, or +1), corrected for the
halfway cases, is stored in a variable sst. In detail: if the
value returned by sum_raw (the number of cancelled bits) is not 0,
then the secondary term is not 0, and its sign is obtained from the
most significant bit (MSB) of the accumulator: positive if MSB = 0,
negative if MSB = 1. Otherwise the secondary term is 0, and so is
its sign; however, for the halfway cases (tmd = 2), we want to
eliminate the ambiguity of their rounding due to the even-rounding
rule by choosing a nonzero value for the sign: −1 if the truncated
significand is even, +1 if it is odd.
Then, from the values of the variables rnd (rounding mode), tmd,
rbit (rounding bit), sst (sign of the secondary term, corrected
for the halfway cases), and sgn (sign of the sum), we determine:
* the ternary value (negative, zero, or positive);
* the correction case (between −1 and +2).
The different cases are summarized in the table below. The two lines
corresponding to halfway cases are not possible since sst has been
changed to an equivalent nonzero value as said above. The rounding
modes MPFR_RNDZ and MPFR_RNDA are not in this table since they are
handled like MPFR_RNDD and MPFR_RNDU depending on the value of sgn
(see the MPFR_IS_LIKE_RNDD and MPFR_IS_LIKE_RNDU macros).
rnd tmd rbit sst correction ternary
───────────────────────────────────────────────
N 1 0 − 0 +
N 1 0 0 0 0
N 1 0 + 0 −
N 1 1 − +1 +
N 1 1 0 +1 0
N 1 1 + +1 −
───────────────────────────────────────────────
N 2 0 − 0 −
N 2 0 0 n/a n/a (halfway case)
N 2 0 + +1 +
N 2 1 − 0 −
N 2 1 0 n/a n/a (halfway case)
N 2 1 + +1 +
───────────────────────────────────────────────
D 1 0 − −1 −
D 1 0 0 0 0
D 1 0 + 0 −
D 1 1 − 0 −
D 1 1 0 +1 0
D 1 1 + +1 −
───────────────────────────────────────────────
U 1 0 − 0 +
U 1 0 0 0 0
U 1 0 + +1 +
U 1 1 − +1 +
U 1 1 0 +1 0
U 1 1 + +2 +
───────────────────────────────────────────────
At this point, the variable inex contains the correct ternary value
(before the overflow/underflow detection) and we know the correction
that needs to be applied to the significand.
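Both decision procedures can be sketched as plain C case analyses; corr_tmd0 and tmd_correction are hypothetical encodings of the rules and of the table above (rounding modes as single characters, with MPFR_RNDZ/MPFR_RNDA reduced to 'D'/'U' according to sgn, as noted), not the MPFR implementation:

```c
#include <assert.h>

/* tmd = 0: the error plays no role; the correction depends only on the
   mode, the inexact flag, the sign bit neg, and the rounding bit. */
static int corr_tmd0(char rnd, int inex, int neg, int rbit)
{
    switch (rnd) {
    case 'D': return 0;
    case 'U': return inex;
    case 'Z': return inex && neg;   /* before the reduction to D/U */
    case 'A': return inex && !neg;
    default:  return rbit;          /* 'N' */
    }
}

/* tmd != 0: one row of the table above; returns the correction and
   stores the sign of the ternary value (-1, 0 or +1) in *tern.
   sst is -1, 0 or +1, already corrected for halfway cases. */
static int tmd_correction(char rnd, int tmd, int rbit, int sst, int *tern)
{
    if (rnd == 'N') {
        if (tmd == 2) {             /* sst != 0: halfway cases resolved */
            *tern = sst > 0 ? 1 : -1;
            return sst > 0 ? 1 : 0;
        }
        *tern = -sst;               /* tmd == 1 */
        return rbit;
    }
    if (rnd == 'D') {
        *tern = sst == 0 ? 0 : -1;
        return rbit - (sst < 0);
    }
    *tern = sst == 0 ? 0 : 1;       /* 'U' */
    return rbit + (sst > 0);
}
```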
Algorithm and Implementation: Copy/Shift to the Destination
───────────────────────────────────────────────────────────
First, we can set the sign of the MPFR number from the value of sgn.
The bits of the accumulator that need to be taken into account for the
destination are those of exponents in ⟦max(u,minexp),e⟦ (if u < minexp,
the nonrepresented bits are seen as 0's). We distinguish two cases:
* u > minexp. We need to copy the bits of exponents in ⟦u,e⟦, i.e.
all the bits are represented in the accumulator. We just have a
left shift (mpn_lshift) or a copy (MPN_COPY). In the process, some
bits of exponent < u can have been copied to the trailing bits; they
are seen as garbage. Since rounding will need a specific operation
on the least significant limb, these trailing bits (located in this
limb) will be zeroed at the same time in the next step.
* u ≤ minexp. We just have a left shift (bits that are shifted in are
0's as specified by GMP, which is what we want) or a copy, and if
there are remaining low significant limbs in the destination, they
are zeroed with MPN_ZERO.
Note: By definition of e, the most significant bit that is copied is
the first bit after the cancelled bits: 1 for a positive number, 0 for
a negative number.
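The shift-and-copy can be sketched in the style of mpn_lshift (lshift_copy is an illustrative stand-in; GMP's mpn_lshift additionally returns the bits shifted out of the top limb):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy mpn_lshift-style copy (illustration only): shift n limbs, least
   significant limb first, left by cnt bits, with 0 < cnt < 64.  Bits
   shifted in at the bottom are 0's, as the text relies on for the
   u <= minexp case. */
static void lshift_copy(uint64_t *dst, const uint64_t *src,
                        size_t n, unsigned cnt)
{
    for (size_t i = n - 1; i > 0; i--)
        dst[i] = (src[i] << cnt) | (src[i - 1] >> (64 - cnt));
    dst[0] = src[0] << cnt;
}
```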
Algorithm and Implementation: Complement and Rounding
─────────────────────────────────────────────────────
Note: The various cases below are not in the same order as in the code,
but this is better for the explanations.
For the moment, let us assume that sq ≥ 2. We distinguish two cases:
* neg = 0 (positive sum). Since the significand can contain garbage
in the trailing bits (present when the precision is not a multiple
of GMP_NUMB_BITS), we set these trailing bits to 0 as required by
the format of MPFR numbers. If corr > 0, we need to add corr to
the significand (we can see that this works even if corr = 2 and
the significand contains all 1's, which was not obvious). This is
done with mpn_add_1, but corr must be shifted by sd bits to the
left, where sd is the number of trailing bits. If corr = 2 and
sd = GMP_NUMB_BITS − 1, the mathematical result of the shift does
not hold in the variable; in this case, the value 1 is added with
mpn_add_1 starting at the second limb, which necessarily exists,
otherwise this would mean that the precision of the MPFR number
is 1, and this is not possible (we assumed sq ≥ 2). In case of
carry out, meaning a change of binade, the most significant bit
of the significand is set to 1 without touching the other bits
(this is important because if corr = 2 and the significand has
only one limb, the least significant nontrailing bit may be 1),
and e is incremented. If corr < 0, then it is −1, so that we
subtract 1 from the significand with mpn_sub_1. In case of
borrow out, meaning a change of binade, the most significant bit
of the significand is set to 1 so that all the (nontrailing) bits
of the significand are 1's, and e is decremented.
* neg = 1 (negative sum). Contrary to the positive case, where we
could do MPN number + limb (with mpn_add_1) or MPN number − limb
(with mpn_sub_1), GMP does not provide a function to do an operation
like: limb − MPN number. However, we will show that this can be
emulated with mpn_neg, which does a negation, and mpn_com, which
does a complement. This allows us to avoid separate mpn_com (or
mpn_neg) and mpn_add_1 (or mpn_sub_1) operations, as they could
yield two loops in some particular cases involving a long sequence
of 0's in the low significant bits.
Let us focus on the negation and complement operations and what
happens at the bit level. For the complement operation, all the
bits are inverted and there is no dependency between them. The
negation of an integer is equivalent to its complement + 1:
neg(x) = com(x) + 1. Said otherwise, after an initial carry
propagation on the least significant sequence of 1's in com(x),
the bits are just inverted, i.e., one has a complement operation
on the remaining bits. This is why we will regard complement as
the core operation in the following.
Now, we want to compute:
abs(x + corr) = neg(x + corr)
= neg(x) − corr
= com(x) + (1 − corr)
where −1 ≤ 1 − corr ≤ 2. We consider two subcases, leading to a
nonnegative case for the correction, and a negative case:
* corr ≤ 1, i.e., 1 − corr ≥ 0.
We first compute the least significant limb by setting the
trailing bits to 1, complementing the limb, and adding the
correction term 1 − corr properly shifted. This can generate
a carry. In the case where corr = −1 (so that 1 − corr = 2)
and the shift count is GMP_NUMB_BITS − 1, the shift of the
correction term overflows, but this is equivalent to having a
correction term equal to 0 and a carry.
* If there is a carry, we apply mpn_neg on the next limbs
(if the significand has more than one limb). If there is
still a carry, i.e., if the significand has exactly one
limb or if there is no borrow out of the mpn_neg, then
we handle the change of binade just like in the positive
case for corr > 0.
* If there is no carry, we apply mpn_com on the next limbs
(if the significand has more than one limb). There cannot
be a change of binade in this case since a complement
cannot have a carry out.
* corr = 2, i.e., 1 − corr = −1.
Here we want to compute com(x) − 1, but GMP does not provide
an operation for that. The fact is that a sequence of low
significant bits 1 is invariant, and we need to do the loop
ourselves in C instead of using an optimized assembly version
from GMP. However this may not be a problem in practice, as
the difference is probably not noticeable. When a limb with
a zero is reached (there is at least one since the most
significant bit of the significand is a 0), we compute its
complement − 1 (the "− 1" corresponds to a borrow in). If
there are remaining limbs, we complement them and a change
of binade is not possible. Otherwise the complement − 1 on
the most significant limb can lead to a change of binade;
more precisely, this happens on the significand 01111…111,
whose complement is 10000…000 and com(x) − 1 is 01111…111.
The change of binade is handled like in the positive case for
corr < 0.
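The negative case can be sketched on little-endian 64-bit limb arrays; abs_with_corr is a simplified two-pass illustration (complement everything, then propagate 1 − corr), whereas the real code merges the passes into a single one with mpn_neg/mpn_com and also handles trailing bits and binade changes:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of abs(x + corr) = com(x) + (1 - corr) for a
   negative x held in two's complement as n little-endian 64-bit limbs,
   with -1 <= corr <= 2 and x + corr < 0.  For corr = 2 (i.e., adding
   -1 to com(x)), the borrow stops at the first limb of com(x) that is
   not all 1's; the text guarantees such a limb exists since the most
   significant bit of com(x) is 0. */
static void abs_with_corr(uint64_t *p, size_t n, int corr)
{
    size_t i;
    for (i = 0; i < n; i++)
        p[i] = ~p[i];                      /* com(x) */
    long add = 1 - corr;                   /* in {-1, 0, 1, 2} */
    if (add > 0) {                         /* add with carry */
        uint64_t c = (uint64_t) add;
        for (i = 0; i < n && c; i++) {
            p[i] += c;
            c = p[i] < c;
        }
    } else if (add < 0) {                  /* subtract 1 with borrow */
        for (i = 0; i < n; i++) {
            uint64_t old = p[i];
            p[i] -= 1;
            if (old != 0)
                break;                     /* borrow stops here */
        }
    }
}
```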
If sq = 1, the solution described above does not work when we need to
add 2 to the significand since 2 is not representable on 1 bit. And
as this case sq = 1 is actually simpler, we prefer to consider it
separately. First, we can force the only limb to MPFR_LIMB_HIGHBIT,
which is the value 1 shifted GMP_NUMB_BITS−1 bits to the left, i.e.,
with MSB = 1, the other bits being 0 (these are the trailing bits):
this is the only possible significand in precision 1. Now we need to
add the correction term, which corresponds to a modification of the
exponent. In the positive case, we just add corr to the exponent e.
In the negative case, as forcing the only limb to MPFR_LIMB_HIGHBIT
corresponds to the computation of com(x), we just add (1 − corr) to
the exponent e, following the formula given in the case sq > 1.
Algorithm and Implementation: Exponent Consideration
────────────────────────────────────────────────────
Finally, we set the exponent of the MPFR number to e (not with
MPFR_SET_EXP because this exponent may not be in the current exponent
range) and call mpfr_check_range as usual to check whether e is in
the current exponent range and handle a possible overflow/underflow.
────────────────────────────────────────────────────────────────────────
Summary of the timings copied from:
^/misc/sum-timings/results/ReadMe@10244
Below, ~ means "more or less equivalent" and > means faster, with
more > signs indicating a larger speedup.
Summary of the timings with:
* sum_old: mpfr_sum from MPFR 3.1.4 (old algo).
* sum_new: mpfr_sum from the trunk patched for MPFR 3.1.4 (new algo).
* sum_add: basic sum implementation with mpfr_add, inaccurate and
sensitive to the order of the arguments.
size   precx  precy  emax   Cancel
----------------------------------
10^1   10^1   10^7   10^0          sum_new >> sum_add > sum_old
                     10^8          sum_new > sum_add >> sum_old
10^1   10^7   10^1            N    sum_new >>>> sum_add >> sum_old
                              Y    sum_add >> sum_old >> sum_new
10^1   10^7   10^7   10^0          sum_add ~ sum_new > sum_old
                     10^8     N    sum_add >> sum_new >> sum_old  [1]
                     10^8     Y    sum_add >> sum_new >> sum_old
10^3   10^1   10^5   10^0          sum_new >>> sum_add ~ sum_old
                     10^8          sum_new >> sum_add >> sum_old
10^3   10^5   10^1   10^0     N    sum_new >> sum_add > sum_old
                     10^0     Y    sum_add > sum_old >> sum_new
                     10^8     N    sum_new > sum_add >> sum_old
                     10^8     Y    sum_add >> sum_old >> sum_new
10^3   10^5   10^5   10^0          sum_new > sum_add > sum_old
                     10^8     N    sum_new > sum_add >>> sum_old
                     10^8     Y    sum_add ~ sum_new >> sum_old  [2]
10^5   10^1   10^1   10^0          sum_add ~ sum_new > sum_old
                     10^8     N    sum_new >> sum_add >> sum_old
                     10^8     Y    sum_add ~ sum_new >> sum_old
10^5   10^1   10^3   10^0          sum_new > sum_add > sum_old
                     10^8          sum_new > sum_add >> sum_old
10^5   10^3   10^1   10^0     N    sum_add ~ sum_new > sum_old
                     10^0     Y    sum_add > sum_old >> sum_new
                     10^8     N    sum_new >> sum_add >> sum_old
                     10^8     Y    sum_add >> sum_new > sum_old
10^5   10^3   10^3   10^0          sum_add ~ sum_new > sum_old
                     10^8          sum_new > sum_add >> sum_old
[1] In some cases, sum_new is faster than sum_add.
[2] In some cases, sum_new is much faster than sum_add, or the opposite!
Conclusion:
* Depending on the cases, sum_new is faster or slower than sum_add.
The fact that it can be much faster shows the importance of the
low-level algorithm.
* In most cases, sum_new is much faster than sum_old. The only cases
  where it is (much) slower are when precy << precx and there is a
  cancellation; this is due to the fact that the reiterations are
  always done in a small precision, which is regarded as sufficient
  with a large probability. Such pathological cases might be solved
  in the future without slowing down the common cases much, thanks
  to auto-detection, but this may need a reallocation.
────────────────────────────────────────────────────────────────────────
An example:
[1] [2] A [3]
u0 = ***** | | . |
u1 = ******|** | . |
u2 = ********|***** | . |
u3 = | *****| . |
u4 = | | ****|**
u5 = | | ***|*
At iteration 1, minexp is determined to be [1]; thus u0, a part of u1,
and a part of u2 are taken into account for the truncated sum. Then it
appears that an important cancellation occurred, and another iteration
is needed. Since u1 was truncated, the new maxexp will be minexp, i.e.,
[1]. At iteration 2, minexp is determined to be [2]; thus a part of u1,
a part of u2, and u3 are taken into account for the truncated sum.
Now assume that, in this example, the error is small enough but its
sign is unknown: the TMD occurs. Since no numbers were truncated at
the previous iteration, maxexp is the maximum exponent of the
remaining numbers, here the one of u4, and minexp is determined to
be [3]. Assume that the sign
of the error can be determined now, so that we can return the rounded
result with the ternary value.
Note 1: Data locality can be improved after the first iteration if the
shifted values are stored at the end of the temporary area instead of
the beginning. The reason is that only the least significant part of
the accumulator will be used once a good approximate value of the sum
is known, and the accumulator lies just after the temporary area. But
the gain would probably not be noticeable in practice.
Note 2: We considered determining the number of cancelled limbs
instead of the number of cancelled bits, in order to avoid a
nontrivial shift between iterations. Moreover, on platforms with an
MMU, thus most platforms where GNU MPFR would be used, we could
arrange to reserve a
specific address space for the accumulator and choose the shift count
to be a multiple of a memory block size (or actually not to do any
shift/copy at all, just free the most significant part and allocate
a new part when needed), so that the shift (copy) of the accumulator
contents could just be MMU related changes, without any physical memory
copy. However, in practice, only large precisions with specific inputs
would benefit from such a choice, which would require non portable code,
and the relative speedup would remain small since the sum itself is
expected to take most of the time.
*** To be considered in future versions ***
It seems that carry propagation (mpn_add_1 & mpn_sub_1 in the code) is
most often limited. But consider the following cases, where all inputs
have the minimal precision 2, and the output precision is p:
u0 = 1
u_i = (−1)^i × 2^(−p) for i > 0
Here long carry propagation will occur for each addition of the
initial iteration, so that the complexity will be O(n·p), instead
of the O(n+p) obtained if we choose to delay carry propagation
(however such a choice may slow down the average case and take more
memory, e.g. around 3p instead of 2p).
When a new iteration is needed due to cancellation, a second
accumulator was considered in an early version of the algorithm:
the temporary
results of the computations during the new iteration would be stored in
this second accumulator, which would generally be small, thus limiting
carry propagation; this method is actually equivalent to delaying carry
propagation. It could help in some cases, such as:
u0 = 2^q with some q > 0
u1 = 1
u2 = −2^q
u_i = (−1)^i × 2^(−p) for i > 2
but such examples are very specific cases, and as seen with the first
example, a better way must be chosen if avoiding long carry propagation
is regarded as important (in all cases). Moreover, while the use of two
accumulators does not take much more memory (since both accumulators can
occupy the same area, with a flexible limit between them), it probably
makes the code more complex, and noticeably slower if n is small.