水塘抽样
Reservoir sampling(水塘抽样) is a family of randomized algorithms for randomly choosing k
samples from a list of n
items, where n
is either a very large or unknown number. Typically n
is large enough that the list doesn’t fit into main memory. For example, a list of search queries in Google and Facebook.
So we are given a big array (or stream) of numbers (to simplify), and we need to write an efficient function to randomly select k numbers where 1 <= k <= n. Let the input array be stream[].
A simple solution is to create an array reservoir[]
of maximum size k
. One by one randomly select an item from stream[0..n-1]
. If the selected item is not previously selected, then put it in reservoir[]
. To check if an item is previously selected or not, we need to search the item in reservoir[]
. The time complexity of this algorithm will be O(k^2). This can be costly if k
is big. Also, this is not efficient if the input is in the form of a stream.
Problem¶
- Choose
k
entries fromn
numbers. Make sure each number is selected with the probability ofk/n
Basic idea:
- Choose
1, 2, 3, ..., k
first and put them into the reservoir. - For
k+1
, pick it with a probability ofk/(k+1)
, and randomly replace a number in the reservoir. - For
k+i
, pick it with a probability ofk/(k+i)
, and randomly replace a number in the reservoir. - Repeat until
k+i
reachesn
Proof¶
- For
k+i
, the probability that it is selected and will replace a number in the reservoir isk/(k+i)
-
For a number in the reservoir before (let's say
X
), the probability that it keeps staying in the reservoir is-
P(X was in the reservoir last time)
×P(X is not replaced by k+i)
=
P(X was in the reservoir last time)
× (1
-P(k+i is selected and replaces X)
) =k/(k+i-1)
× (1
-k/(k+i)
×1/k
) =k/(k+i)
-
-
When
k+i
reachesn
, the probability of each number staying in the reservoir isk/n
Example¶
- Choose
3
numbers from[111, 222, 333, 444]
. Make sure each number is selected with a probability of¾
- First, choose
[111, 222, 333]
as the initial reservior - Then choose
444
with a probability of¾
-
For
111
, it stays with a probability ofP(444 is not selected)
+P(444 is selected but it replaces 222 or 333)
=¼
+¾
*⅔
=¾
-
The same case with
222
and333
. - Now all the numbers have the probability of
¾
to be picked.
Java¶
具体的Java代码如下:
// A function to randomly select k items from stream[0..n-1].
static void selectKItems(int stream[], int n, int k)
{
int i; // index for elements in stream[]
// reservoir[] is the output array. Initialize it with
// first k elements from stream[]
int reservoir[] = new int[k];
for (i = 0; i < k; i++)
reservoir[i] = stream[i];
Random r = new Random();
// Iterate from the (k+1)th element to nth element
for (; i < n; i++)
{
// Pick a random index from 0 to i.
int j = r.nextInt(i + 1);
// If the randomly picked index is smaller than k,
// then replace the element present at the index
// with new element from stream
if(j < k)
reservoir[j] = stream[i];
}
}