Statistics How To

Undersampling and Oversampling in Data Analysis


Undersampling in data analysis is a technique that attempts to reduce the bias associated with imbalanced classes of data. Undersampling and oversampling are two techniques for dealing with imbalance in a training set: you can undersample the majority class, oversample the minority class, or combine the two.

In general, undersampling the majority class (rather than oversampling the minority class) works best for large data sets. That’s because oversampling adds more data points, which can lead to a data set that’s too massive for classifiers like support vector machines (García-Pedrajas, 2010).
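To make the size argument concrete, here is a small arithmetic sketch. It assumes both techniques aim for a 1:1 class ratio, which is only one possible target; the function name and the example class counts are made up for illustration.

```python
def balanced_sizes(n_majority, n_minority):
    """Return (undersampled_total, oversampled_total) for a 1:1 class ratio.

    Assumption: undersampling shrinks the majority class down to the
    minority count, and oversampling grows the minority class up to the
    majority count.
    """
    undersampled_total = 2 * n_minority
    oversampled_total = 2 * n_majority
    return undersampled_total, oversampled_total


# Hypothetical training set: 1,000,000 majority rows and 10,000 minority rows.
under_total, over_total = balanced_sizes(1_000_000, 10_000)
print(under_total, over_total)
```

With these made-up counts, the undersampled set has 20,000 rows while the oversampled set has 2,000,000, nearly double the original 1,010,000.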

Random Undersampling

With random undersampling, you randomly remove members of the majority class until you reach a preset threshold.
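A minimal sketch of this procedure, using only the Python standard library, might look like the following. The function name, the `target_size` parameter (standing in for the preset threshold), and the toy labels are assumptions made for illustration, not part of any particular library.

```python
import random
from collections import Counter


def random_undersample(rows, labels, majority_label, target_size, seed=0):
    """Randomly drop majority-class rows until target_size of them remain.

    Rows from every other class are kept untouched.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]

    # Nothing to remove if the majority class is already at or below the threshold.
    if target_size >= len(majority_idx):
        return list(rows), list(labels)

    # Pick which majority rows survive; all others are discarded.
    keep_majority = set(rng.sample(majority_idx, target_size))
    keep = [
        i for i, y in enumerate(labels)
        if y != majority_label or i in keep_majority
    ]
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: 90 "neg" (majority) rows and 10 "pos" (minority) rows.
rows = list(range(100))
labels = ["neg"] * 90 + ["pos"] * 10

new_rows, new_labels = random_undersample(rows, labels, "neg", target_size=10)
counts = Counter(new_labels)
print(counts)
```

Here the threshold is set equal to the minority count, so the result is balanced; a different threshold would leave some imbalance in place.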

One advantage of random selection here is that you don’t have to decide which points are important and which are not: you simply let the random process do the work. Several studies have shown that random selection performs as well as, if not better than, processes where deliberate removal choices are made.

However, a distinct disadvantage is that the process could remove important members. Problems tend to arise in data that is non-smooth or that has boundaries or small features (Dey, n.d.). One way to avoid this pitfall is to combine undersampling and boosting (Liu et al., as cited in García-Pedrajas, 2010). You might also want to manually resample the data or repair any holes in it algorithmically.

References

Dey, T. (n.d.). Undersampling and Oversampling in Sample Based Shape Modeling. Retrieved December 16, 2019 from: https://graphics.stanford.edu/courses/cs468-03-fall/Papers/deygiesen_undersampling.pdf
García-Pedrajas, N. et al. (2010). Trends in Applied Intelligent Systems: 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2010, Cordoba, Spain, June 1-4, 2010, Proceedings. Springer Science & Business Media.
