Without peeking into the future, oversample your imbalanced dataset to balance it
If you are constrained by a project deadline, feel free to skip directly to the code.
Getting high accuracy on the test set has always been the end goal, especially when tackling a classification problem. This attitude led me to a fairly decent accuracy on a multi-class classification problem. The imbalance in the dataset did not seem to hurt the accuracy, so I ignored it. However, when I looked at the distribution of the predictions, it was clear that the model blindly predicted the class with the highest number of occurrences in the imbalanced dataset. The accuracy still looked decent because most of the instances belonged to that majority class, which is exactly what the model always predicted.
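To see how this trap plays out, here is a small illustrative sketch (the data and class counts are made up, not the dataset from my project): a classifier that always predicts the majority class still reports a high accuracy, and only the distribution of its predictions gives the game away.

```python
from collections import Counter

import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                   # made-up features
y = np.array([0] * 900 + [1] * 80 + [2] * 20)    # ~90% of the labels belong to class 0

# A "model" that always predicts the most frequent class it saw during training.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = clf.predict(X)

print("accuracy:", clf.score(X, y))                  # 0.9 without learning anything
print("prediction distribution:", Counter(preds))    # every prediction is class 0
```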
One way of reducing the bias introduced by the imbalance is random oversampling. The key thing to keep in mind while oversampling is not to peek into the future: we should oversample only the training set, never the test set. The way we oversample will make it clear why it would be cheating to split the dataset into train and test after oversampling.
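As a rough sketch of why the order matters (using imbalanced-learn's RandomOverSampler on made-up data, which may differ from the code later in this post): random oversampling duplicates rows, so oversampling before splitting puts exact copies of training rows into the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                   # made-up features
y = np.array([0] * 900 + [1] * 80 + [2] * 20)    # imbalanced toy labels

# The WRONG order: oversample the whole dataset, then split it.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_over, y_over, test_size=0.2, random_state=0)

# Random oversampling duplicates rows, so some "unseen" test rows are exact
# copies of rows the model was trained on: the evaluation is contaminated.
train_rows = {tuple(row) for row in X_tr}
leaked = sum(tuple(row) in train_rows for row in X_te)
print(f"{leaked} of {len(X_te)} test rows also appear in the training set")
```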
Thus, the first step is to split the imbalanced dataset into a train and a test set. I used stratified sampling so that the class distribution is preserved in both splits.
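Here is a sketch of that first step on the same kind of toy data, again assuming scikit-learn and imbalanced-learn rather than the exact code used in my project: a stratified split first, then random oversampling of the training set only.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))                   # made-up features
y = np.array([0] * 900 + [1] * 80 + [2] * 20)    # imbalanced multi-class labels

# stratify=y keeps the class proportions identical in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample only the training set; the test set keeps its natural distribution.
X_train_bal, y_train_bal = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

print("train before:", Counter(y_train))         # still imbalanced
print("train after: ", Counter(y_train_bal))     # every class matches the majority count
print("test (untouched):", Counter(y_test))      # original distribution, no duplicates
```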