Consider a scenario where you are working on a machine learning classification problem. You get an accuracy of 98% and you are very happy. But that happiness doesn’t last long once you look at the confusion matrix and realize that the majority class makes up 98% of the data and every example has been classified as the majority class. Welcome to the real world of imbalanced data sets!
Some well-known examples of imbalanced data sets are:
1- Fraud detection, where the number of fraud cases can be much smaller than the number of legitimate transactions.
2- Prediction of disputed or delayed invoices, where the problem is to predict which invoices will be defaulted on or disputed.
3- Predictive maintenance data sets, where failures are rare events.
In all the above examples, the cost of misclassifying the minority class can be very high: if I cannot identify the fraud cases correctly, the model is not useful.
There are multiple ways of handling imbalanced data sets, including collecting more data, trying different ML algorithms, modifying class weights, penalizing the models, using anomaly detection techniques, and oversampling and undersampling techniques.
In this article I focus mainly on SMOTE-based oversampling techniques. As a data scientist, you might not have direct control over collecting more data, which may require approvals from the client or top management and can take time. Applying class weights or doing too much parameter tuning can lead to overfitting, and undersampling can discard important information. Oversampling techniques do not suffer from these drawbacks to the same extent, and they can easily be tried and embedded in your framework.
Oversampling Algorithms Based on SMOTE
1- SMOTE: The Synthetic Minority Oversampling Technique (SMOTE) applies a KNN approach: it selects the K nearest neighbors of a minority instance and creates synthetic samples along the line segments joining them in feature space. The algorithm takes a minority feature vector and one of its nearest neighbors and computes the difference between the two vectors. The difference is multiplied by a random number between 0 and 1 and added back to the original feature vector to produce a synthetic sample. SMOTE is the pioneering algorithm, and many other algorithms are derived from it.
Reference: SMOTE
R Implementation: smotefamily, unbalanced, DMwR
Python Implementation: imblearn
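To make this concrete, here is a minimal sketch of SMOTE with imblearn; the make_classification data set and the 95:5 class split are arbitrary stand-ins for your own imbalanced data.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build an illustrative two-class data set with roughly a 95:5 imbalance.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # e.g. Counter({0: 949, 1: 51})

# k_neighbors is SMOTE's K; 5 is the default.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```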
2- ADASYN: ADAptive SYNthetic (ADASYN) sampling is based on the idea of adaptively generating minority data samples according to their distributions using K nearest neighbors. The algorithm adaptively updates the distribution, and no assumptions are made about the underlying distribution of the data; Euclidean distance is used for the KNN step. The key difference between ADASYN and SMOTE is that the former uses a density distribution as a criterion to automatically decide the number of synthetic samples that must be generated for each minority sample, adaptively changing the weights of the different minority samples to compensate for the skewed distribution. The latter generates the same number of synthetic samples for each original minority sample.
Reference: ADASYN
R Implementation: smotefamily
Python Implementation: imblearn
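A minimal imblearn sketch for ADASYN, assuming the same X and y from the SMOTE example above; note that the neighborhood size is exposed as n_neighbors here rather than k_neighbors.

```python
from imblearn.over_sampling import ADASYN

# ADASYN generates more synthetic points for minority samples that are
# harder to learn, i.e. those surrounded by many majority neighbors.
X_res, y_res = ADASYN(n_neighbors=5, random_state=42).fit_resample(X, y)
```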
3- ANS: Adaptive Neighbor Synthetic (ANS) dynamically adapts the number of neighbors needed for oversampling around different minority regions. The algorithm eliminates SMOTE's parameter K and assigns a different number of neighbors to each positive instance. Every parameter is set automatically within the algorithm, making it parameter-free.
Reference: ANS
R Implementation: smotefamily
4- Borderline-SMOTE: Borderline-SMOTE generates synthetic samples along the borderline between the minority and majority classes, which also helps in separating the two classes.
Reference: Borderline-SMOTE
R Implementation: smotefamily
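Although only the R implementation is listed above, imblearn also ships a BorderlineSMOTE class; a minimal sketch, again reusing X and y from earlier:

```python
from imblearn.over_sampling import BorderlineSMOTE

# Only minority samples near the class border ("danger" points) seed the
# synthetic samples; 'borderline-1' interpolates toward minority
# neighbors, while 'borderline-2' can also interpolate toward majority ones.
X_res, y_res = BorderlineSMOTE(kind='borderline-1',
                               random_state=42).fit_resample(X, y)
```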
5- Safe-Level SMOTE: The safe level is defined as the number of positive instances among the k nearest neighbors. If the safe level of an instance is close to 0, the instance is nearly noise; if it is close to k, the instance is considered safe. Each synthetic instance is generated in a safe position by considering the safe-level ratio of instances. In contrast, SMOTE and Borderline-SMOTE may generate synthetic instances in unsuitable locations, such as overlapping regions and noise regions.
Reference: SLS
R Implementation: smotefamily
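To make the "safe level" concrete, here is a small sketch (not a full Safe-Level SMOTE implementation) that computes the safe level of each minority instance with scikit-learn; the safe_levels function name is my own.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def safe_levels(X, y, minority_label=1, k=5):
    """Count minority-class members among each minority instance's k
    nearest neighbors: a value near 0 means near-noise, near k means safe."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    X_min = X[y == minority_label]
    _, idx = nn.kneighbors(X_min)        # column 0 is the point itself
    return (y[idx[:, 1:]] == minority_label).sum(axis=1)
```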
6- DBSMOTE: The Density-Based Synthetic Minority Oversampling Technique (DBSMOTE) is based on the DBSCAN clustering algorithm. Minority-class clusters are discovered by DBSCAN, and DBSMOTE generates synthetic instances along the shortest path from each positive instance to a pseudo-centroid of its minority-class cluster.
Reference: DBSMOTE
R Implementation: smotefamily
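A simplified sketch of the idea, not the full algorithm: the paper walks a shortest path through the cluster's density graph, whereas this sketch simply interpolates each member straight toward the cluster's pseudo-centroid after clustering with DBSCAN.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbsmote_sketch(X_min, eps=0.5, min_samples=5, seed=0):
    """Simplified DBSMOTE idea: cluster the minority class with DBSCAN,
    then interpolate each member toward its cluster's pseudo-centroid."""
    rng = np.random.default_rng(seed)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_min)
    synthetic = []
    for c in set(labels) - {-1}:              # -1 marks DBSCAN noise points
        members = X_min[labels == c]
        centroid = members.mean(axis=0)       # pseudo-centroid of the cluster
        gaps = rng.random((len(members), 1))  # random step sizes in (0, 1)
        synthetic.append(members + gaps * (centroid - members))
    return np.vstack(synthetic) if synthetic else np.empty((0, X_min.shape[1]))
```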
Combining Oversampling and Undersampling:
1- SMOTETomek: Tomek links can be used as an undersampling method or as a data cleaning method. In SMOTETomek, Tomek links are applied to the SMOTE-oversampled training set as a data cleaning step: instead of removing only the majority-class examples that form Tomek links, examples from both classes are removed.
Reference: SMOTE Tomek
Python Implementation: imblearn
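A minimal sketch with imblearn's SMOTETomek, assuming the X and y from the earlier examples:

```python
from imblearn.combine import SMOTETomek

# Oversample with SMOTE, then remove Tomek links from both classes.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X, y)
```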
2- SMOTEENN: Just like Tomek links, Edited Nearest Neighbours (ENN) can be used as a data cleaning method: it removes any example whose class label differs from the class of at least two of its three nearest neighbors. In particular, the ENN step removes instances of the majority class whose prediction by the KNN method differs from the majority class. ENN can remove both noisy and borderline examples, providing a smoother decision surface. Because ENN tends to remove more examples than Tomek links does, it is expected to provide more in-depth data cleaning.
Python Implementation: imblearn
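The corresponding imblearn sketch, again assuming the X and y from earlier:

```python
from imblearn.combine import SMOTEENN

# Oversample with SMOTE, then clean with Edited Nearest Neighbours.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X, y)
```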
Below is a list of the packages and algorithms available in Python and R.
Python
1- imblearn
The following undersampling algorithms are implemented (a short usage sketch follows the list):
- Cluster Centroids
- Condensed Nearest Neighbor
- Edited Nearest Neighbors
- Repeated Edited Nearest Neighbors
- All KNN
- Instance Hardness Threshold
- Near Miss
- Neighborhood Cleaning Rule
- One Sided Selection
- Random Under Sampler
- Tomek Links
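As a taste of the undersampling side, here is a sketch using two of the listed methods, assuming the X and y from the earlier examples:

```python
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Randomly drop majority samples until the classes are balanced...
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)

# ...or only remove the majority samples participating in Tomek links.
X_tl, y_tl = TomekLinks().fit_resample(X, y)
```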
The following oversampling algorithms are implemented:
- ADASYN
- Random Over Sampler
- SMOTE
Combinations of oversampling and undersampling are covered by SMOTEENN and SMOTETomek.
Link: imblearn
R:
1- ROSE: The package implements Random Over-Sampling Examples (ROSE).
Link: ROSE
2- DMwR: The package name reads as “Data Mining with R”, and it comes with an implementation of the SMOTE algorithm, which uses the nearest-neighbor concept to oversample the minority class.
Link: DMwR
3- unbalanced: The unbalanced package implements the following algorithms:
- Tomek Link
- Condensed Nearest Neighbor (CNN)
- One Sided Selection (OSS)
- Edited Nearest Neighbor (ENN)
- Neighborhood Cleaning Rule (NCL)
- SMOTE
Link: unbalanced
The unbalanced package also offers an implementation of a racing algorithm that suggests the best method for a given unbalanced dataset.
4- smotefamily:
- ADASYN
- ANS
- Borderline-SMOTE
- DBSMOTE
- RSLS
- SLS
- SMOTE
Link: smotefamily