Monday, June 11, 2007

Bogo-Permute

So here's today's random question, apropos of, er, nothing. What's the performance of the following incredibly poor algorithm for randomly permuting a list of numbers?

from random import randrange

def bogo_permute(in_list):
    N = len(in_list)
    out_list = []
    # Keep track of whether we have dealt with each element of the input list
    seen = [False] * N
    for iteration in range(N):
        # Keep picking random entries until we hit one we haven't used yet
        while True:
            which = randrange(N)
            if not seen[which]:
                out_list.append(in_list[which])
                seen[which] = True
                break
    return out_list

There are N selections needed for the output list. The first time through, the randomly selected entry from the input list will definitely not have been processed already, so it can be transferred across to the output list. The second time through, there's a chance that the random selection process will select the same entry again; for later iterations, this chance of selecting something that has already been selected will get higher and higher. So, on average, how many of these repeated picks are needed?
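Before getting into the maths, here's a quick empirical sketch (not from the original post): instrument the selection loop to count its calls to randrange, and average that count over a number of runs.

# Sketch: count the total number of random picks (calls to randrange) made by
# one run of the selection loop above, then average over many trials.
from random import randrange

def count_picks(N):
    seen = [False] * N
    picks = 0
    for iteration in range(N):
        while True:
            picks += 1
            which = randrange(N)
            if not seen[which]:
                seen[which] = True
                break
    return picks

trials = 1000
for N in (10, 100, 1000):
    average = sum(count_picks(N) for _ in range(trials)) / float(trials)
    print(N, average)

To work out what these averages ought to be, start with a simpler question.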

Given a coin that comes up heads with probability p (and so tails with probability q=1-p), what's the average number of coin tosses needed before the first heads comes up?

As an example, imagine that the first head turns up on the fourth toss. This means that the sequence of tosses is TTTH. The probability of this sequence of coin tosses is then q³p = (1-p)³p. This is easy to generalize:

Number of tosses    Sequence    Probability
1                   H           p
2                   TH          qp
3                   TTH         q²p
4                   TTTH        q³p
m                   T…TH        qᵐ⁻¹p

We want the average number of tosses needed to get a head, also known as the expected value. This is the sum of each possible value multiplied by the probability of that value occurring. In other words, the average number of tosses needed is:

(1*(probability of first heads on first toss) + 2*(probability of first heads on second toss) + 3*(probability of first heads on third toss) + …)

Put into maths, this means that the average number is an infinite sum: Σₘ₌₁ m qᵐ⁻¹p. From a trusty binomial expansion, the Σₘ₌₁ m qᵐ⁻¹ part pops out as (1-q)⁻², and since q = 1-p this means that the whole thing simplifies down to just p/p² = 1/p.
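Spelling that step out (here via the equivalent trick of differentiating the geometric series term by term, rather than a binomial expansion):

\[
\sum_{m=1}^{\infty} m q^{m-1}
  = \frac{d}{dq} \sum_{m=0}^{\infty} q^m
  = \frac{d}{dq} \frac{1}{1-q}
  = \frac{1}{(1-q)^2}
\]

and so the expected number of tosses is p/(1-q)² = p/p² = 1/p.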

(This seems reasonable: for a fair coin, it takes an average of 2 throws to get a head; for a coin that only comes up heads one time in a hundred, it takes an average of a hundred throws.)
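This is easy to check with a quick simulation (again, a sketch rather than anything from the original post):

# Sketch: toss a biased coin until the first head appears, and check that the
# average number of tosses needed comes out close to 1/p.
from random import random

def tosses_until_head(p):
    count = 1
    while random() >= p:  # random() < p counts as a head
        count += 1
    return count

trials = 100000
for p in (0.5, 0.1, 0.01):
    average = sum(tosses_until_head(p) for _ in range(trials)) / float(trials)
    print(p, average, 1.0 / p)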

So back to the original problem. For the i-th selection (with i running from 1 to N), i-1 entries have already been used, so the chance of picking an entry that has not yet been selected is p = (N+1-i)/N, and so (from above) the average number of picks needed is 1/p = N/(N+1-i). The first selection needs N/N = 1 attempt, the second selection needs N/(N-1) attempts, the third selection needs N/(N-2) attempts, …, the last selection needs N/1 = N attempts (on average).

Taken together, this means that the average number of random picks for the whole process is Σᵢ₌₁ᴺ N/(N+1-i). Rearranging (with j=N+1-i) this is the same as N Σⱼ₌₁ᴺ 1/j, which is N times a partial sum of the harmonic series, known as a harmonic number H_N. In the large-N limit, the harmonic numbers tend to H_N ≈ ln(N) + γ (where γ ≈ 0.5772 is the Euler-Mascheroni constant), and so the asymptotic behaviour of the algorithm is O(N ln(N)).
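As a final check (another sketch; the value of γ below is just the Euler-Mascheroni constant, truncated), here is the exact expectation N·H_N alongside the approximation N(ln(N) + γ). The numbers should also line up with the empirical averages from the sketch near the top of the post.

# Sketch: compare the exact expected number of picks, N * H_N, with the
# asymptotic approximation N * (ln(N) + gamma).
from math import log

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_picks(N):
    # Sum of N / (N + 1 - i) over the N selections, i.e. N * H_N
    return sum(N / float(N + 1 - i) for i in range(1, N + 1))

for N in (10, 100, 1000, 10000):
    print(N, expected_picks(N), N * (log(N) + GAMMA))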

So it's bad, but not quite as bad as I'd expected (I'd guessed it would be O(N²)).