Random KPermutations and KSubsets in O(k) space

The first part of this article described some algorithms for the generation of random permutations, kpermutations and ksubsets. The algorithm for generating permutations is very efficient; however, the algorithms for generating kpermutations and ksubsets can be improved significantly.

The following algorithm was given for the generation of random kpermutations, and ksubsets:

for (i=1 to n) ar[i] = i;
for (i=1 to k) swap(ar[i], ar[Random(i,n)]);

The big problem with this algorithm is that it uses space inefficiently. Selecting a Kpermutation of size k from n elements can require O(n) space. For instance, when selecting a KPermutation of 1000 items from a collection of 100,000 items, the algorithm requires an array of 100,000 elements. This is clearly suboptimal and undesirable. The algorithm also initializes the array of 100,000 elements, and this results in O(n) time as well.

Ideally the algorithm should use only O(k) space and O(k) time. This turns out to be possible but requires a few changes to the basic algorithm. The crucial insight is that while generating a Kpermutation, the algorithm disturbs at most O(k) elements in the array containing the item numbers. This allows using a sparse array of O(k) space to represent an array of n elements.

The sparse array contains the numbers 1 to n arranged in ascending order, and it is modeled by a hashtable. Elements that get disturbed by the kpermutation algorithm are placed in the hashtable, and all other elements are assumed to be in their correct places in the array. For example, an array with the following contents:

1 2 3 4  6 5  7 8 9
can be represented by a hashtable containing the following key-value pairs:
(5 6)
(6 5)
An array reference for element i is satisfied by first searching for the key i in the hashtable, and returning the associated value if it is in there; if the key is not in the hashtable the value i is returned. For example, an array reference for 3 in the above array returns 3, but an array reference for 6 returns 5 (the key 6 is in the hashtable).

The kpermutation algorithm only uses a swap operation, so by implementing a swap operation for the sparse array representation the original algorithm can be used without any significant modifications. The algorithm will change to the following pseudo code:

ar = init_htable();
for (i=1 to k) swap_keys(ar, i, Random(i,n));

The swap_keys operation is fairly straightforward to implement, but as the keys might or might not be present in the hashtable, there are four cases to be handled. The four cases and the actions required to swap the keys are summarized below:

Key i     Key j      Action

present   present    swap values, i.e., (i,n)(j,m) -> (i,m)(j,n) 
present   absent     add (j,j) to htable and swap values 
absent    present    add (i,i) to htable and swap values
absent    absent     add (i,i) (j,j) to htable and swap values 

The algorithm takes O(k) space and O(k) time as the swap_keys function executes k times, and on each invocation it uses O(1) time and can allocate at most O(1) additional space. After the algorithm has finished executing, the kpermutation can be retrieved from the hashtable by retrieving keys 1 to k in order. Actually, the algorithm can be easily optimized so that the kpermutation is directly stored in a separate array. Other optimizations are also possible but they don't improve the O(k) time and space behavior of the algorithm beyond constant factors.

Unlike the original random kpermutation algorithm, this algorithm requires a reinitialization of the hashtable whenever a new kpermutation needs to be generated. Generating many kpermutations without reinitialization of the hashtable will lead the algorithm to use more than O(k) space; this behavior being the result of the swap_keys operation executing more than k times.

As discussed in part I of this article, the random ksubset generation algorithm is the same as the kpermutation algorithm, so the above algorithm can generate ksubsets using O(k) space and O(k) time. However, if the ksubset is required to be in sorted order then a bucket sort can be used to sort the output of this algorithm. A bucket sort will work very well in this scenario as the items of the random ksubset will be distributed randomly over the interval 1 to n. Combining bucket sort with this algorithm will allow sorted ksubsets to be generated in O(k) time.

by Usman Latif  [Mar 26, 2004]

Related Links:

Random Permutations, KPermutations and KSubsets