A simple shuffle that proved not so simple after all

Yesterday I found the following snippet of Ruby magic to shuffle an array:

class Array
  def shuffle
    sort { rand(3) - 1 }
  end
end

arr = (1..10).to_a
# => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

arr.shuffle
# => [1, 8, 6, 10, 9, 3, 7, 2, 5, 4]

arr.shuffle
# => [3, 7, 10, 4, 5, 8, 2, 6, 9, 1]

There is one thing wrong with it, but the original site no longer permits commenting on this entry, so I thought I’d share my thoughts here.

Of course, the above solution works OK: it returns a nicely shuffled array every time it is called. Its succinctness is admirable, and using sort to do the actual shuffling is a clever hack.

But does it work efficiently? No. Using sort means that the complexity of our shuffle method is O(n log n), the same as sorting itself. Is this a bad thing? No, as long as you only shuffle small arrays. But as soon as you use it to shuffle something huge, performance will suffer.

On “Big O Notation” — a small interruption

If you wonder what the hell this O(n log n) is, you can check the Wikipedia entry on Big O Notation, but it’s very long and theoretical, so here is a short and hopefully easier explanation.

Big O Notation is a tool used to analyze algorithms and estimate their efficiency, i.e. how fast they solve a given problem. Thus you can compare, for example, various array sorting algorithms. To find the complexity of an algorithm, we must find out, or estimate, how many steps (operations that we can assume take roughly unit time) it takes to solve the problem, as a function of the problem size, usually called n.

So, saying “algorithm A is O(n log n), but algorithm B is O(n²)” means that when sorting an array of size n, algorithm A needs a number of steps proportional to n log n, and B needs a number of steps proportional to n².

But beware: “Big O” won’t give you measurements in seconds. The right question to ask is: “If n (the number of elements in the sorted array) increased, how much more time would the algorithm need?” This gives us a useful estimate: if we tried to sort a 10 times bigger array, algorithm B would need 100 times more time!

A hidden constant

There is one more important thing about these estimates: a hidden constant. As you may have noticed, the above paragraph doesn’t directly compare algorithms A and B. All we can estimate is how fast the execution time of an individual algorithm will grow, given some increase in input size. The real execution time depends on so many factors (like processor speed, operating system resources, programming language, quality of the algorithm’s actual implementation) that it cannot be reliably used to compare algorithms. So all those factors are “swept under the carpet” in the form of a hidden constant.

If we assume that the input size grows sufficiently large, the impact of a hidden constant (no matter how big) can be safely ignored. How is that? Here’s an example: let’s assume that the hidden constant of A is C_A = 1000, because the algorithm runs on an ancient computer and was written not very efficiently in some slow language. On the other hand, we assume that C_B = 1, because B runs on a fast computer and was carefully hand-coded in assembly.

So, for n = 10, algorithm A needs 1000 * 10 * log 10 ≈ 40,000 steps (taking the logarithm base 2 and rounding it up to 4), and algorithm B needs 1 * 10 * 10 = 100 steps. If we assume that completing 1000 steps takes 1 second, the execution times will be 40s and 0.1s, respectively. Looks like B is the winner? Wait a minute. Let’s see what happens when the value of n increases:

        n               time of A (C_A * n log n * 0.001s)    time of B (C_B * n² * 0.001s)
10¹ = 10                           40s                                   0.1s
10² = 100                         700s                                    10s
10³ = 1,000                    10,000s                                 1,000s
10⁴ = 10,000                  140,000s                               100,000s
10⁵ = 100,000               1,700,000s                            10,000,000s
10⁶ = 1,000,000            20,000,000s                         1,000,000,000s
10⁷ = 10,000,000          240,000,000s                       100,000,000,000s

As we can see, for 100,000 elements and more, our rusty, ancient computer starts to beat the fast, modern one. And the difference for 10,000,000 (ten million) elements is truly devastating: 7.6 years versus 3,171 years! Three thousand years!
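
If you want to play with these numbers yourself, here is a minimal Ruby sketch of the same calculation (my own illustration, using the table’s assumptions: C_A = 1000, C_B = 1, logarithm base 2 rounded up, and 1000 steps per second):

C_A = 1000                # hidden constant of the O(n log n) algorithm A
C_B = 1                   # hidden constant of the O(n²) algorithm B
SECONDS_PER_STEP = 0.001  # 1000 steps take one second

# logarithm base 2, rounded up, as in the table above
def log2_ceil(n)
  (Math.log(n) / Math.log(2)).ceil
end

(1..7).each do |e|
  n = 10**e
  time_a = C_A * n * log2_ceil(n) * SECONDS_PER_STEP
  time_b = C_B * n * n * SECONDS_PER_STEP
  puts "n = 10^#{e}: A needs #{time_a}s, B needs #{time_b}s"
end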

Back to shuffling

As I already said, O(n log n) is not the best complexity for such a simple task as shuffling an array. Intuitively, the number of steps should be directly proportional to n, or O(n) in Big O Notation. This is often called linear complexity.

So we need to find a linear shuffle — an algorithm that shuffles the array in one pass. Strictly speaking, it can do so in any number of passes, even 1000, as long as this number is constant.

What’s really funny is that many people try to find faster and faster implementations of shuffling, but as long as they use sort, their solutions will be flawed in terms of complexity. We can see that in a comment on the site. Also, most of the discussion on comp.lang.ruby was focused on finding the fastest shuffling-by-sorting implementation. There was not a single post directly mentioning algorithmic complexity. At least some people proposed linear shuffle algorithms, but several of them had another problem: they were biased (which means that they weren’t truly random).

Another useful finding in the newsgroup discussion is the following link: http://c2.com/cgi/wiki?LinearShuffle. You should certainly pay that page a visit if you want a better understanding of the various ways to do a linear shuffle and how to avoid biased results. It has code examples in many languages, but sadly no Ruby version, so we’ll try to create one here.

We want a method that’s both short and efficient. So I created the following method, using a slightly modified version of one of the algorithms from the aforementioned Wiki page:

class Array 
  def shuffle 
    arr = self.dup 
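    # walk backwards through the copy, from the last element down to the second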
    arr.size.downto 2 do |j| 
      r = rand j 
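      # swap the j-th element (index j-1) with a random element at index 0...j (possibly itself)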
      arr[j-1], arr[r] = arr[r], arr[j-1] 
    end 
    arr 
  end 
end
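
Usage looks the same as before; the output below is just one example run, yours will of course differ:

arr = (1..10).to_a
arr.shuffle
# => e.g. [7, 2, 10, 5, 1, 9, 4, 8, 3, 6]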

Admittedly, it’s not as short as the sort version, but is it faster? There is only one loop in this version, which runs through the array (backwards, from the last element to the second) and swaps the current element with a randomly chosen element at or before its position (possibly itself). The first element is skipped because there is nothing left to swap it with. Our version has linear complexity, so it should be faster for sufficiently big values of n. Let’s find this n.

Benchmarking

Here is my benchmarking code, borrowed from the newsgroup, but with a few enhancements:

require 'benchmark'                         

class Array 
  def shuffle 
    arr = self.dup 
    arr.size.downto 2 do |j| 
      r = rand j 
      arr[j-1], arr[r] = arr[r], arr[j-1] 
    end 
    arr 
  end 
  def shuffle_sort 
    sort {rand(3) - 1} 
  end 
  def shuffle_sort_by 
    sort_by {rand} 
  end 
end                         

def gc 
  GC.enable 
  GC.start 
  GC.disable 
end                      

[10, 100, 1000, 10000, 100000, 1000000].each do |n| 
  a = Array.new(n) { |i| i } 
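  # normalize total work: each size gets k shuffles so that n * k is always 1,000,000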
  k = 1000000 / n 
  Benchmark.bm(25) do |bb| 
    gc 
    bb.report("shuffle ("+n.to_s+")         ") do 
      k.times do 
        a.shuffle 
      end 
    end 
    gc 
    bb.report("shuffle_sort ("+n.to_s+")    ") do 
      k.times do 
        a.shuffle_sort 
      end 
    end 
    gc 
    bb.report("shuffle_sort_by ("+n.to_s+") ") do 
      k.times do 
        a.shuffle_sort_by 
      end 
    end 
  end 
end

First comes the Array class, enhanced with three new methods:

  • shuffle is my implementation of the linear shuffle,
  • shuffle_sort is taken from the above-mentioned snippet,
  • shuffle_sort_by is probably the fastest implementation of a sort-based shuffle; it was proposed several times in the newsgroup discussion.

Next is a little utility function gc that runs the garbage collector and then disables it. We’ll call it between the benchmarked methods. This should presumably minimize their interference with each other and also prevent the garbage collector from impacting our results.

Then comes the real benchmarking code. We try consecutive powers of ten as the array size n, from 10 to 1,000,000. For each n, we choose k as the number of times the shuffle methods are called, such that n * k = 1,000,000. This means that an array of 10 elements will be shuffled 100,000 times, but an array of 1,000,000 elements only once. The variable k acts as a normalization factor: if we chose a constant number of iterations instead (like 10 or 1,000,000), the measurements would be either fractions of a second for small arrays or days for huge ones.

Results

Here are the results from my computer:

                               user     system      total        real 
shuffle (10)               3.100000   0.360000   3.460000 (  3.548542) 
shuffle_sort (10)          1.470000   0.010000   1.480000 (  1.500005) 
shuffle_sort_by (10)       0.890000   0.000000   0.890000 (  0.892586) 
                               user     system      total        real 
shuffle (100)              2.960000   0.030000   2.990000 (  3.030168) 
shuffle_sort (100)         3.730000   0.100000   3.830000 (  3.880254) 
shuffle_sort_by (100)      1.180000   0.000000   1.180000 (  1.204529) 
                               user     system      total        real 
shuffle (1000)             2.930000   0.020000   2.950000 (  3.046020) 
shuffle_sort (1000)        6.240000   0.200000   6.440000 (  6.553730) 
shuffle_sort_by (1000)     1.620000   0.010000   1.630000 (  1.637530) 
                               user     system      total        real 
shuffle (10000)            2.940000   0.020000   2.960000 (  3.026883) 
shuffle_sort (10000)       8.810000   0.380000   9.190000 (  9.353987) 
shuffle_sort_by (10000)    2.040000   0.010000   2.050000 (  2.143209) 
                               user     system      total        real 
shuffle (100000)           3.040000   0.010000   3.050000 (  3.124412) 
shuffle_sort (100000)     11.290000   0.120000  11.410000 ( 11.649619) 
shuffle_sort_by (100000)   2.710000   0.020000   2.730000 (  2.800463) 
                               user     system      total        real 
shuffle (1000000)           3.090000   0.010000   3.100000 (  3.153007) 
shuffle_sort (1000000)     13.870000   0.140000  14.010000 ( 14.473133) 
shuffle_sort_by (1000000)   3.730000   0.030000   3.760000 (  3.857054)

Interpretation

We can notice several things in these results:

  1. The shuffle method always takes about 3s to run, while the other methods take more and more time. This confirms that shuffle has linear complexity while the others do not.
  2. shuffle_sort_by is about 4 times faster than shuffle_sort for bigger arrays. This suggests that they both have the same complexity (in terms of Big O Notation) and differ only by a hidden constant.
  3. shuffle is faster than shuffle_sort for 100 elements and more.
  4. shuffle is faster than shuffle_sort_by for 1,000,000 elements and more. The difference at 1,000,000 is quite big, so I experimented a little and found out that the “crossing point” is about 350,000 on my computer. For any array size greater than 350,000, shuffle is the fastest method (a rough way to look for this crossing point is sketched below).
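
Here is a rough sketch of how one could look for that crossing point (not the exact code I ran; it assumes the Array extensions from the benchmark above are already loaded):

require 'benchmark'

# try a few sizes around the suspected crossing point and compare
# the pure-Ruby linear shuffle with the sort_by-based one
[200_000, 300_000, 350_000, 400_000, 500_000].each do |n|
  a = Array.new(n) { |i| i }
  Benchmark.bm(22) do |bb|
    bb.report("shuffle (#{n})")         { a.shuffle }
    bb.report("shuffle_sort_by (#{n})") { a.shuffle_sort_by }
  end
end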

Conclusion

So, which method is the best? If you know that you will never ever shuffle anything bigger than a few hundred elements, use shuffle_sort_by. In any other case (especially in library code, etc.) I would recommend playing it safe and using shuffle. Even if you never throw anything huge at it, I doubt this method will be a bottleneck in your code. And as always, before you start optimizing through random tweaking, I recommend some profiling sessions or (at the very least) a benchmarking suite like the one above.

One could say that 350,000 (or whatever number it is on your computer) is quite a big number, especially for an array size. That’s true, but let’s not forget that Ruby’s sort and sort_by are written in C and use the standard library’s qsort function, which is probably highly optimized, while our own shuffle is pure Ruby code. If we implemented our method in C, it would probably be the fastest from the start (and I would certainly do such an experiment if this blog entry weren’t so long already).

And that’s why the sort version is four times slower than the sort_by version. The former has to call back from C into Ruby every time it compares two elements; the latter calls its block just once per element and then stores the result.
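
To see this difference for yourself, here’s a small sketch (my illustration, not from the newsgroup) that counts how often each block actually gets called when “shuffling” a 1,000-element array:

a = (1..1000).to_a

comparisons = 0
a.sort { comparisons += 1; rand(3) - 1 }
puts "sort block calls:    #{comparisons}"   # one call per comparison, roughly n log n

keys = 0
a.sort_by { keys += 1; rand }
puts "sort_by block calls: #{keys}"          # exactly one call per element, i.e. 1000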

Just one more thing

While writing this article I experimented with various versions of the methods and discovered one more flaw in the original version (renamed by me to shuffle_sort). I was wrong when I said “it returns a nicely shuffled array every time it is called.”

The results are not nicely shuffled at all. They are biased. Badly. That means that some permutations (i.e. orderings) of elements are more likely than others. Here’s another snippet of code to prove it, again borrowed from the newsgroup discussion:

N = 100000 
A = %w(a b c) 
Score = Hash.new { |h, k| h[k] = 0 } 
N.times do 
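  # swap in A.shuffle_sort or A.shuffle_sort_by here to test the other variants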
  sorted = A.shuffle 
  Score[sorted.join("")] += 1 
end                     

Score.keys.sort.each do |key| 
  puts "#{key}: #{Score[key]}" 
end

This code shuffles an array of three elements (a, b, c) 100,000 times and records how many times each possible result was achieved. In this case there are only six possible orderings, and we should get each one about 16,666.66 times. If we try an unbiased version of shuffle (shuffle or shuffle_sort_by), the results are as expected:

abc: 16517 
acb: 16893 
bac: 16584 
bca: 16568 
cab: 16476 
cba: 16962

Of course, there are some deviations, but they shouldn’t exceed a few percent of the expected value, and they should be different each time we run this code. We can say that the distribution is even.

OK, what happens if we use the shuffle_sort method?

abc: 44278 
acb: 7462 
bac: 7538 
bca: 3710 
cab: 3698 
cba: 33314

This is not an even distribution at all. Again?

abc: 44444 
acb: 7494 
bac: 7320 
bca: 3683 
cab: 3611 
cba: 33448

As we can clearly see, the shuffle_sort method favors the abc ordering (44% of results), then cba (33%), then acb and bac (about 7% each), then bca and cab (less than 4% each). The results for other array sizes are biased as well (I checked sizes from 2 to 5). Imagine this algorithm used in a lottery!


10 responses to “A simple shuffle that proved not so simple after all”

  • Jetru

    Superb! One of the better blogs i’ve come across! I’m a regular! :) And awesome site design :D

  • Tarun

    Sir, Please tell me how do we prove that one algorithm is faster than the other by finding out the hidden constants

  • szeryf

    Tarun: I’m sorry but you must have got that entirely backwards. The hidden constant is hidden because it doesn’t matter when you compare two algorithms. You don’t have to know how big it is.

    Consider two functions: F(n) = A * n and G(n) = B * n * n, where A and B could be arbitrarily large, but constant numbers. Do you need to know A and B values to tell that given n big enough, G(n) > F(n)?

    Or, more formally, we can easily prove that there is some N that for every n > N, G(n) > F(n) holds. Which means that it holds for “almost all” numbers. And this is enough for algorithmic theory to say “G is slower(bigger) than F” because the theory compares their asymptotic complexities, that is where n increases to infinity.

  • Daniel Bernier

    Very nice explanation & illustration of Big O notation…those benchmark numbers lay it out pretty clearly.

    delicious’ed.

  • not-just-yeti

    Maybe I’m an idiot, but is it obvious that
    arr.sort{rand(3)-1}
    really gives a uniform distribution? It seems to depend on the details of the sort.

    (For example, if we were to use bubblesort, the probability that the first element ends up last is only 1/2^n, instead of 1/n. So for quicksort, I need some convincing that the results really are uniform.)

    (Also, at the very least, that cute-but-inefficient version would want to use
    arr.sort{ 2*rand(2)-1 }
    )

  • szeryf

    not-just-yeti: it doesn’t. If you read the last part of this article (under “just one more thing”) you’ll see that I show that the results are not uniform.

    Also see http://www.codinghorror.com/blog/archives/001015.html if you want more proof.

  • not-just-yeti

    Ah, thanks.

    After a bunch of thinking, it can be proved that *no* sort algorithm will make shuffle_sort work (meaning that this isn’t just the details of quicksort that muck things up; it won’t work for mergesort or anything else):
    If the sort algorithm makes k comparisons (i.e. k calls to rand), then there are 2^k possible executions, which can’t be *uniformly* mapped onto the n! possible shuffled outputs.

    (As I write that, I realize it’s the same idea made about a non-sorting algorithm, on the post you linked to.)

  • Student

    Style cop!

    class Array
      def shuffle!
        (1...size).each do |j|
          r = rand(j+1)
          self[r], self[j] = at(j), at(r)
        end
        self
      end
      def shuffle
        dup.shuffle!
      end
    end

    at is supposed to be a bit faster than [], but it is rhs only.
    minimize the arithmetic inside the loop.
    minimize the arithmetic outside the loop.
    allow the user to decide if he wants to pay for the dup or not.
