跳转至

水塘抽样

Reservoir sampling(水塘抽样) is a family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn’t fit into main memory. For example, a list of search queries in Google and Facebook.

So we are given a big array (or stream) of numbers (to simplify), and we need to write an efficient function to randomly select k numbers where 1 <= k <= n. Let the input array be stream[].

A simple solution is to create an array reservoir[] of maximum size k. One by one randomly select an item from stream[0..n-1]. If the selected item is not previously selected, then put it in reservoir[]. To check if an item is previously selected or not, we need to search the item in reservoir[]. The time complexity of this algorithm will be O(k^2). This can be costly if k is big. Also, this is not efficient if the input is in the form of a stream.

Problem

  • Choose k entries from n numbers. Make sure each number is selected with the probability of k/n

Basic idea:

  • Choose 1, 2, 3, ..., k first and put them into the reservoir.
  • For k+1, pick it with a probability of k/(k+1), and randomly replace a number in the reservoir.
  • For k+i, pick it with a probability of k/(k+i), and randomly replace a number in the reservoir.
  • Repeat until k+i reaches n

Proof

  • For k+i, the probability that it is selected and will replace a number in the reservoir is k/(k+i)
  • For a number in the reservoir before (let's say X), the probability that it keeps staying in the reservoir is

    • P(X was in the reservoir last time) × P(X is not replaced by k+i)

      = P(X was in the reservoir last time) × (1 - P(k+i is selected and replaces X)) = k/(k+i-1) × (1 - k/(k+i) × 1/k) = k/(k+i)

  • When k+i reaches n, the probability of each number staying in the reservoir is k/n

Example

  • Choose 3 numbers from [111, 222, 333, 444]. Make sure each number is selected with a probability of ¾
  • First, choose [111, 222, 333] as the initial reservior
  • Then choose 444 with a probability of ¾
  • For 111, it stays with a probability of

    P(444 is not selected) + P(444 is selected but it replaces 222 or 333) = ¼ + ¾* = ¾

  • The same case with 222 and 333.

  • Now all the numbers have the probability of ¾ to be picked.

Java

具体的Java代码如下:

// A function to randomly select k items from stream[0..n-1]. 
static void selectKItems(int stream[], int n, int k) 
{ 
    int i;   // index for elements in stream[] 

    // reservoir[] is the output array. Initialize it with 
    // first k elements from stream[] 
    int reservoir[] = new int[k]; 
    for (i = 0; i < k; i++) 
        reservoir[i] = stream[i]; 

    Random r = new Random(); 

    // Iterate from the (k+1)th element to nth element 
    for (; i < n; i++) 
    { 
        // Pick a random index from 0 to i. 
        int j = r.nextInt(i + 1); 

        // If the randomly  picked index is smaller than k, 
        // then replace the element present at the index 
        // with new element from stream 
        if(j < k) 
            reservoir[j] = stream[i];             
    } 

}

Reference