Statistics How To

Reservoir Sampling

Sampling >

Reservoir sampling is a quota-based random sampling method, used to get a sample of a particular size when you don’t know the population size (i.e. when you’re dealing with a data stream of unknown length). The reservoir can be updated with replacement, or without replacement.

It’s called reservoir sampling because the selected items are placed into a reservoir (i.e. a holding set). As each stream tuple is received, the algorithm updates dynamically.

Originally developed for one-pass processing from magnetic tapes (Andrade et al. 2014), reservoir sampling is now used for one pass stream processing in data mining.

Reservoir Sampling Without Replacement

A reservoir sample without replacement is one where every distinct element has an equal probability of being selected:
reservoir sampling

Where:

  • n = population size.
  • m = a distinct element.

As the sampling is done without replacement, each element in the set is distinct (i.e. only selected once).

Reservoir Sampling With Replacement

Sampling with replacement means that every element has the possibility of being chosen for the reservoir more than once. It should guarantee that every element in the sample has an equal chance (1/n) of being placed in a certain position in the sample, no matter what elements are in the other positions. Formally, this is written as:

P(𝒥 – {i1, i2, …, im,}) = 1/nm

References

Andrade, H. et al. (2014). Fundamentals of Stream Processing: Application Design, Systems, and Analytics. Cambridge University Press.
Park, B. et al. Reservoir-Based Random Sampling with Replacement from Data Stream. (1987). In Proceedings of the Fourth SIAM International Conference on Data Mining (Proceedings in Applied Mathematics) 4th ed. Edition. Society for Industrial and Applied Mathematics. pp. 492-496.
Vitter, J. (1985). Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11, No. 1, March.

------------------------------------------------------------------------------

Need help with a homework or test question? With Chegg Study, you can get step-by-step solutions to your questions from an expert in the field. Your first 30 minutes with a Chegg tutor is free!

Statistical concepts explained visually - Includes many concepts such as sample size, hypothesis tests, or logistic regression, explained by Stephanie Glen, founder of StatisticsHowTo.

Comments? Need to post a correction? Please post a comment on our Facebook page.

Check out our updated Privacy policy and Cookie Policy