Undersampling in data analysis is a technique that attempts to reduce the bias associated with imbalanced classes of data. Undersampling and oversampling are two techniques for dealing with imbalance in a training set: you can undersample the majority class, oversample the minority class, or combine the two.
In general, undersampling the majority class (rather than oversampling the minority class) works best for large data sets. That’s because oversampling adds data points, which can make a data set too large for classifiers such as support vector machines (García-Pedrajas, 2010).
With random undersampling, you randomly remove members of the majority class until you reach a preset threshold.
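As a rough illustration of random undersampling, here is a minimal Python sketch that uses only the standard library. The function name random_undersample, its parameters, and the toy data are assumptions made for this example; they do not come from any particular library or from the cited sources.

```python
import random
from collections import Counter


def random_undersample(samples, labels, majority_label, target_count, seed=0):
    """Randomly drop majority-class rows until only target_count remain.

    Hypothetical helper for illustration; target_count plays the role
    of the preset threshold.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    if target_count > len(majority_idx):
        raise ValueError("target_count exceeds the size of the majority class")
    # Choose which majority rows survive; every other majority row is removed.
    keep_majority = set(rng.sample(majority_idx, target_count))
    kept = [
        i for i, y in enumerate(labels)
        if y != majority_label or i in keep_majority
    ]
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (assumed): 90 majority rows (label 0) and 10 minority rows (label 1).
samples = list(range(100))
labels = [0] * 90 + [1] * 10

X_res, y_res = random_undersample(samples, labels, majority_label=0, target_count=10)
print(Counter(y_res))
```

Here the threshold is set to the size of the minority class, which gives a balanced result. A different threshold may suit other data sets, and the minority class is left untouched in every case.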
One advantage of random selection is that you don’t have to decide which points are important and which are not: you simply let the random process do the work. Several studies have shown that random selection performs as well as, if not better than, methods that choose deliberately which points to remove.
However, a distinct disadvantage is that the process could remove important members. Problems tend to arise in data that is non-smooth or that has boundaries or small features (Dey, n.d.). One way to avoid this pitfall is to combine undersampling with boosting (Liu et al., as cited in García-Pedrajas, 2010). You might also want to resample manually or repair any holes in the data algorithmically.
Dey, T. (n.d.). Undersampling and Oversampling in Sample Based Shape Modeling. Retrieved December 16, 2019 from: https://graphics.stanford.edu/courses/cs468-03-fall/Papers/deygiesen_undersampling.pdf
García-Pedrajas, N. et al. (2010). Trends in Applied Intelligent Systems: 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2010, Cordoba, Spain, June 1-4, 2010, Proceedings. Springer Science & Business Media.