# 蓄水池抽样浅说 (2)

#### 跳跳跳！—— Algorithm X

• 确定该跳过多少个元素 $S(k, n)$
• 跳过 $S(k, n)$ 个元素
• 从前 $k$ 个元素中随机产生一个要替换的元素，用下一个元素替换

\begin{align} F_S(s) & =\mathbb{P}(S(k,n) \leq s) \\ &= 1 – \mathbb{P}(S(k,n) > s) \\ &= 1 – \left(1 – \frac{k}{n+1}\right)\left(1 – \frac{k}{n+2}\right)\cdots\left(1 – \frac{k}{n+s+1}\right) \\ &= 1 – \frac{(n+1-k)(n+2-k)\cdots(n+s-k+1)}{(n+1)(n+2)\cdots(n+s+1)} \end{align}

$$F_X(a) = \mathbb{P}(X \leq a) = \mathbb{P}(F^{-1}(U) \leq a) = \mathbb{P}(U \leq F(a)) = F(a)$$

import random
import sys
import time

def getS(n, k):
u = random.random()
S = 0
n += 1
quot = (n - k) / n
while (quot > u):
S += 1
n += 1
quot *= (n - k) / n
return S

def reservoirSampleX(stream, sample_size):
result = []
s = 0
for index, line in enumerate(stream):
if index < sample_size:
result.append(line)
else:
if not s:
result[int(random.random() * sample_size)] = line
s = getS(index + 1, sample_size)
else:
s -= 1
return result

1. Vitter, Jeffrey S. “Random sampling with a reservoir.” ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57.