summaryrefslogtreecommitdiff
path: root/pod
diff options
context:
space:
mode:
authorJarkko Hietaniemi <jhi@iki.fi>2001-11-27 15:04:37 +0000
committerJarkko Hietaniemi <jhi@iki.fi>2001-11-27 15:04:37 +0000
commit6a71076728f899e0bcf4c9adc7d1af200f3e6e47 (patch)
tree00185a325ed336a025ac74b8b2a9adb7c1428407 /pod
parent40b946947964007f370835171536b8012270d49a (diff)
downloadperl-6a71076728f899e0bcf4c9adc7d1af200f3e6e47.tar.gz
The Story of sort(), from John P. Linderman.
p4raw-id: //depot/perl@13319
Diffstat (limited to 'pod')
-rw-r--r--pod/perldelta.pod67
1 files changed, 67 insertions, 0 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index c775277204..27e4d7d7de 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -1121,6 +1121,73 @@ worst-case run time behaviour), and that sort() is now stable
(meaning that elements with identical keys will stay ordered as they
were before the sort). See the C<sort> pragma for information.
+The story in more detail: suppose you want to serve yourself a little
+slice of Pi.
+
+ @digits = ( 3,1,4,1,5,9 );
+
+A numerical sort of the digits will yield (1,1,3,4,5,9), as expected.
+Which C<1> comes first is hard to know, since one C<1> looks pretty
+much like any other. You can regard this as totally trivial,
+or somewhat profound. However, if you just want to sort the even
+digits ahead of the odd ones, then what will
+
+ sort { ($a % 2) <=> ($b % 2) } @digits;
+
+yield? The only even digit, C<4>, will come first. But how about
+the odd numbers, which all compare equal? With the quicksort algorithm
+used to implement Perl 5.6 and earlier, the order of ties is left up
+to the sort. So, as you add more and more digits of Pi, the order
+in which the sorted even and odd digits appear will change.
+and, for sufficiently large slices of Pi, the quicksort algorithm
+in Perl 5.8 won't return the same results even if reinvoked with the
+same input. The justification for this rests with quicksort's
+worst case behavior. If you run
+
+ sort { $a <=> $b } ( 1 .. $N , 1 .. $N );
+
+(something you might approximate if you wanted to merge two sorted
+arrays using sort), doubling $N doesn't just double the quicksort time,
+it I<quadruples> it. Quicksort has a worst case run time that can
+grow like N**2, so-called I<quadratic> behaviour, and it can happen
+on patterns that may well arise in normal use. You won't notice this
+for small arrays, but you I<will> notice it with larger arrays,
+and you may not live long enough for the sort to complete on arrays
+of a million elements. So the 5.8 quicksort scrambles large arrays
+before sorting them, as a statistical defence against quadratic behaviour.
+But that means if you sort the same large array twice, ties may be
+broken in different ways.
+
+Because of the unpredictability of tie-breaking order, and the quadratic
+worst-case behaviour, quicksort was I<almost> replaced completely with
+a stable mergesort. I<Stable> means that ties are broken to preserve
+the original order of appearance in the input array. So
+
+ sort { ($a % 2) <=> ($b % 2) } (3,1,4,1,5,9);
+
+will yield (4,3,1,1,5,9), guaranteed. The even and odd numbers
+appear in the output in the same order they appeared in the input.
+Mergesort has worst case O(NlogN) behaviour, the best value
+attainable. And, ironically, this mergesort does particularly
+well where quicksort goes quadratic: mergesort sorts (1..$N, 1..$N)
+in O(N) time. But quicksort was rescued at the last moment because
+it is faster than mergesort on certain inputs and platforms.
+For example, if you really I<don't> care about the order of even
+and odd digits, quicksort will run in O(N) time; it's very good
+at sorting many repetitions of a small number of distinct elements.
+The quicksort divide and conquer strategy works well on platforms
+with relatively small, very fast, caches. Eventually, the problem gets
+whittled down to one that fits in the cache, from which point it
+benefits from the increased memory speed.
+
+Quicksort was rescued by implementing a sort pragma to control aspects
+of the sort. The B<stable> subpragma forces stable behaviour,
+regardless of algorithm. The B<_quicksort> and B<_mergesort>
+subpragmas are heavy-handed ways to select the underlying implementation.
+The leading C<_> is a reminder that these subpragmas may not survive
+beyond 5.8. More appropriate mechanisms for selecting the implementation
+exist, but they wouldn't have arrived in time to save quicksort.
+
=item *
Hashes now use Bob Jenkins "One-at-a-Time" hashing key algorithm