水塘抽样
Reservoir sampling(水塘抽样) is a family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn’t fit into main memory. For example, a list of search queries in Google and Facebook.
So we are given a big array (or stream) of numbers (to simplify), and we need to write an efficient function to randomly select k numbers where 1 <= k <= n. Let the input array be stream[].
A simple solution is to create an array reservoir[] of maximum size k. One by one randomly select an item from stream[0..n-1]. If the selected item is not previously selected, then put it in reservoir[]. To check if an item is previously selected or not, we need to search the item in reservoir[]. The time complexity of this algorithm will be O(k^2). This can be costly if k is big. Also, this is not efficient if the input is in the form of a stream.
Problem¶
- Choose
kentries fromnnumbers. Make sure each number is selected with the probability ofk/n
Basic idea:
- Choose
1, 2, 3, ..., kfirst and put them into the reservoir. - For
k+1, pick it with a probability ofk/(k+1), and randomly replace a number in the reservoir. - For
k+i, pick it with a probability ofk/(k+i), and randomly replace a number in the reservoir. - Repeat until
k+ireachesn
Proof¶
- For
k+i, the probability that it is selected and will replace a number in the reservoir isk/(k+i) -
For a number in the reservoir before (let's say
X), the probability that it keeps staying in the reservoir is-
P(X was in the reservoir last time)×P(X is not replaced by k+i)=
P(X was in the reservoir last time)× (1-P(k+i is selected and replaces X)) =k/(k+i-1)× (1-k/(k+i)×1/k) =k/(k+i)
-
-
When
k+ireachesn, the probability of each number staying in the reservoir isk/n
Example¶
- Choose
3numbers from[111, 222, 333, 444]. Make sure each number is selected with a probability of¾ - First, choose
[111, 222, 333]as the initial reservior - Then choose
444with a probability of¾ -
For
111, it stays with a probability ofP(444 is not selected)+P(444 is selected but it replaces 222 or 333)=¼+¾*⅔=¾ -
The same case with
222and333. - Now all the numbers have the probability of
¾to be picked.
Java¶
具体的Java代码如下:
// A function to randomly select k items from stream[0..n-1].
static void selectKItems(int stream[], int n, int k)
{
int i; // index for elements in stream[]
// reservoir[] is the output array. Initialize it with
// first k elements from stream[]
int reservoir[] = new int[k];
for (i = 0; i < k; i++)
reservoir[i] = stream[i];
Random r = new Random();
// Iterate from the (k+1)th element to nth element
for (; i < n; i++)
{
// Pick a random index from 0 to i.
int j = r.nextInt(i + 1);
// If the randomly picked index is smaller than k,
// then replace the element present at the index
// with new element from stream
if(j < k)
reservoir[j] = stream[i];
}
}