
Exploring oversampling and under-sampling: Core techniques for balancing imbalanced datasets in ML

Introduction

Imbalanced datasets, in which one class significantly outweighs the others, are common in machine learning. The imbalance can lead to biased models that favour the majority class and deliver poor predictive performance on minority-class instances once the model is in production. This article addresses the issue using resampling techniques, namely over-sampling and under-sampling, which help balance datasets and improve model performance.

We will explore the following techniques, covering their implementations, advantages, limitations, and impact on machine learning model performance:

  • Random Oversampling
  • Synthetic Minority Over-sampling Technique (SMOTE)
  • Random Under-sampling

Figure 1: Balanced and imbalanced

Understanding the problem of imbalanced datasets

A dataset is imbalanced when one class (majority) has significantly more instances than another (minority).

This imbalance leads to:

  • Poor Model Generalization: The model is biased toward the majority class.
  • Skewed Evaluation Metrics: Accuracy becomes misleading as the model often predicts the majority class.
  • Limited Learning of Minority Class Patterns: The minority class may not contribute enough to the training process.

Figure 2: Problem of Imbalanced Datasets

Common use cases where imbalanced datasets occur:

  • Fraud detection (fraudulent transactions are rare)
  • Medical diagnosis (certain diseases have fewer positive cases)
  • Spam detection (spam emails are fewer than legitimate ones)

Fraud detection: Identifying rare fraudulent transactions

Fraud detection is a critical aspect of financial security that aims to identify fraudulent activities within vast volumes of legitimate transactions. One significant challenge in fraud detection is that fraudulent transactions are rare, making them difficult to spot while minimizing disruptions to regular customer activity.

The main challenge in fraud detection is data imbalance, as illustrated by the scenario below:

  • Fraudulent transactions make up a tiny fraction of overall transactions, leading to highly imbalanced datasets in fraud detection systems.
  • Traditional machine learning models may struggle to identify fraudulent cases as everyday transactions overshadow them.

Impact of imbalanced datasets on fraud detection

Fraud detection is significantly influenced by imbalanced datasets, where fraudulent transactions comprise only a tiny fraction of total transactions. This imbalance presents several challenges for machine learning models, leading to biased predictions, poor generalization, and difficulty detecting rare fraud cases.

Class imbalance and its challenges

In fraud detection, the dataset typically consists of:

  • Legitimate transactions (majority class) → 99% or more
  • Fraudulent transactions (minority class) → Less than 1%

Since fraud cases are rare, standard machine learning models tend to favor the majority class (legitimate transactions) and may overlook fraudulent transactions.

Challenges introduced by imbalanced data

  • Biased Model Learning
      ◦ Models tend to classify all transactions as legitimate because that minimizes overall error.
      ◦ Example: If a model predicts all transactions as legitimate, it might still achieve 99% accuracy, but it ultimately fails at detecting fraud (see the sketch after Figure 3).
  • Poor Sensitivity to Fraud Cases (Low Recall)
      ◦ Even if some fraud cases exist, models trained on imbalanced data might fail to identify them, leading to low recall (high false negatives).
      ◦ Fraudulent activities go unnoticed, increasing financial losses.
  • High False Positives & Customer Friction
      ◦ If the model compensates for imbalance by flagging more transactions as fraudulent, it may increase false positives (legitimate transactions wrongly flagged as fraudulent).
      ◦ This can frustrate customers whose genuine transactions get blocked.

Figure 3: Problem of Imbalanced Datasets (Use cases)
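
To make the accuracy paradox above concrete, here is a minimal, hypothetical sketch: a toy set of 1,000 transaction labels with 1% fraud, scored against a naive "classifier" that flags nothing as fraud. The numbers are illustrative only.

from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions
y_true = [0] * 990 + [1] * 10

# A naive model that predicts every transaction as legitimate
y_pred = [0] * 1000

print("Accuracy:", accuracy_score(y_true, y_pred))      # 0.99, looks excellent
print("Fraud recall:", recall_score(y_true, y_pred))    # 0.0, catches no fraud

Despite 99% accuracy, recall for the fraud class is zero, which is why accuracy alone is misleading on imbalanced data.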

Medical Diagnosis: Medical diagnosis relies on machine learning models to detect diseases based on patient data. However, some diseases, such as rare cancers, genetic disorders, and certain infections, occur infrequently in the population. This creates a significant class imbalance problem, where most cases are healthy individuals (negative cases), while actual disease cases (positive cases) are rare.

Impact of Imbalanced Data on Medical Diagnosis

A. High False Negatives (Missed Diagnoses)

  • Since positive cases are rare, machine learning models trained on imbalanced data may fail to detect them, classifying most patients as healthy (majority class).
  • Example: A cancer-screening model can achieve roughly 98% accuracy simply by predicting that no patient has cancer; if only 2% of patients actually have cancer, it misses most or all of the true cases, leading to delayed treatment and worse outcomes.

B. High False Positives (Unnecessary Anxiety & Testing)

  • If a model tries to compensate for imbalance by detecting more positive cases, it may flag too many healthy individuals as sick, leading to false positives.
  • Example: A model for detecting a rare genetic disorder may mistakenly flag many patients who do not have the disease, leading to unnecessary medical tests and emotional distress.

C. Bias in Machine Learning Models

  • Models trained on imbalanced medical datasets learn to predict the majority class (healthy cases) more confidently, overlooking the minority class (disease cases).
  • Result: The model may fail to generalize and may underperform when diagnosing actual patients.

To mitigate these issues, we use resampling techniques, which fall into two broad categories:

  • Oversampling: Increasing the number of minority class instances.
  • Under-sampling: Reducing the number of majority class instances.

Oversampling Methods

What is Oversampling?

Oversampling is a technique that balances imbalanced datasets by increasing the number of samples in the minority class. This can be done in two main ways:

  • Random Oversampling – Duplicates existing minority class samples to match the majority class.
  • Synthetic Oversampling (e.g., SMOTE) – Creates new synthetic samples by interpolating between existing minority class instances instead of direct duplication.

Oversampling helps prevent models from being biased toward the majority class, improving their ability to detect rare events, such as fraud detection or rare diseases in medical diagnosis. However, it should be applied carefully to avoid overfitting.

Random Oversampling: Random oversampling involves duplicating instances of the minority class randomly until both classes are balanced. This prevents the model from being biased towards the majority class.

Implementation in Python Using Random Oversampling

from imblearn.over_sampling import RandomOverSampler
from collections import Counter

# Sample imbalanced dataset: 12 majority-class (0) and 5 minority-class (1) samples
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Display class distribution before oversampling
before_sampling = Counter(y)
print("Class distribution before oversampling:", before_sampling)

# Apply Random Oversampling
ros = RandomOverSampler(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

# Display class distribution after oversampling
after_sampling = Counter(y_resampled)
print("Class distribution after oversampling:", after_sampling)

Output

Class distribution before oversampling: Counter({0: 12, 1: 5})

Class distribution after oversampling: Counter({0: 12, 1: 12})

Pros

  • It is simple and easy to implement.
  • Helps models learn more from the minority class.

Cons

  • This can lead to overfitting as duplicated samples do not introduce new patterns.

Synthetic Minority Over-sampling Technique (SMOTE)

Synthetic Minority Over-sampling Technique (SMOTE) is a data augmentation method that balances imbalanced datasets by generating synthetic samples for the minority class. Instead of simply duplicating existing samples, SMOTE creates new data points by interpolating between existing minority class instances, which helps machine learning models learn better decision boundaries and reduces the overfitting caused by duplicate data. SMOTE is widely used in applications such as fraud detection, medical diagnosis, and anomaly detection, where rare events need better representation.

It creates new data points by:

  • Selecting a minority class instance.
  • Finding its k-nearest minority class neighbors.
  • Generating a new point along the line connecting the instance and its neighbor (sketched below).
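
The interpolation step can be illustrated with a minimal sketch. The 2-D points below are hypothetical, and this is not the library's internal implementation, only the core idea.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class instance and one of its k-nearest minority neighbors
x = np.array([2.0, 3.0])
neighbor = np.array([4.0, 5.0])

# Pick a random point on the line segment between the two points
gap = rng.random()                    # random value in [0, 1)
synthetic = x + gap * (neighbor - x)  # new synthetic minority sample
print(synthetic)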

Implementation in Python Using SMOTE

from imblearn.over_sampling import SMOTE
from collections import Counter

# Sample imbalanced dataset: 12 majority-class (0) and 6 minority-class (1) samples
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Display class distribution before SMOTE
before_smote = Counter(y)
print("Class distribution before SMOTE:", before_smote)

# Apply SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)

# Display class distribution after SMOTE
after_smote = Counter(y_smote)
print("Class distribution after SMOTE:", after_smote)

Output

Class distribution before SMOTE: Counter({0: 12, 1: 6})

Class distribution after SMOTE: Counter({0: 12, 1: 12})

Pros

  • Reduces overfitting by introducing synthetic but realistic samples.
  • Enhances model generalization for the minority class.

Cons

  • May create borderline or noisy samples.
  • Can increase training time.

Under-sampling Methods

Under-sampling is a technique for balancing imbalanced datasets by reducing the number of samples in the majority class. This helps prevent machine learning models from being biased toward the dominant class.

Types of Under-sampling:

  • Random Under-sampling (RUS): Randomly removes majority class samples to equalize class distribution.
  • Cluster Centroid Under-sampling: Replaces majority class samples with their cluster centroids, preserving key information.
  • NearMiss: Selects majority class samples closest to minority class instances, ensuring better class separation.

While under-sampling reduces dataset size and speeds up training, it risks losing valuable information from the majority class, potentially affecting model performance.
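
For the Cluster Centroid and NearMiss variants listed above, the following is a minimal sketch using imbalanced-learn on a small synthetic dataset; the dataset parameters are illustrative only.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids, NearMiss

# Small illustrative imbalanced dataset (roughly 90% / 10%)
X, y = make_classification(n_classes=2, weights=[0.9, 0.1], n_samples=200, random_state=42)
print("Original:", Counter(y))

# Cluster Centroid Under-sampling: majority samples are replaced by K-Means centroids
cc = ClusterCentroids(random_state=42)
X_cc, y_cc = cc.fit_resample(X, y)
print("ClusterCentroids:", Counter(y_cc))

# NearMiss: keeps the majority samples closest to the minority class
nm = NearMiss(version=1)
X_nm, y_nm = nm.fit_resample(X, y)
print("NearMiss:", Counter(y_nm))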

Random Under-sampling: As discussed earlier, it randomly removes instances from the majority class to balance class distribution.

Implementation in Python Using Random Under-sampling

from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Sample imbalanced dataset: 12 majority-class (0) and 5 minority-class (1) samples
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Display class distribution before undersampling
before_undersampling = Counter(y)
print("Class distribution before undersampling:", before_undersampling)

# Apply Random Undersampling
rus = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

# Display class distribution after undersampling
after_undersampling = Counter(y_resampled)
print("Class distribution after undersampling:", after_undersampling)

Output

Class distribution before undersampling: Counter({0: 12, 1: 5})

Class distribution after undersampling: Counter({0: 5, 1: 5})

Pros

  • Reduces model training time by decreasing dataset size.
  • Works well when the dataset is large.

Cons

  • Risk of information loss: Important majority class samples may be removed.
  • Can lead to underfitting.


Comparing the impact of resampling on model performance

To understand the impact of resampling, let’s apply it to a classification model and compare performance before and after resampling.

Step 1: Create an imbalanced dataset

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import seaborn as sns

# Generate imbalanced data
X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=42)

# Plot class distribution
sns.histplot(y, discrete=True)
plt.title("Original Class Distribution")
plt.show()


Figure 4: Imbalanced dataset (histplot)

Step 2: Apply resampling and train a model

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to the training set only
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_smote, y_train_smote)

# Evaluate on the untouched test set
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


Figure 5: Classification Report

# Plot class distribution after resampling
sns.histplot(y_train_smote, discrete=True)
plt.title("Class Distribution After SMOTE")
plt.show()


Figure 6: Balanced dataset (histplot)
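
For comparison, a baseline model can be trained on the original, un-resampled training split and evaluated in the same way. This is a minimal sketch that reuses the X_train, y_train, X_test, and y_test variables from Step 2.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Baseline: train on the imbalanced training data, without any resampling
baseline_clf = RandomForestClassifier(random_state=42)
baseline_clf.fit(X_train, y_train)

# Compare this report with Figure 5; minority-class recall is typically lower here
print(classification_report(y_test, baseline_clf.predict(X_test)))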

Key observations

  • Without resampling, the model may predict mostly the majority class.
  • After applying SMOTE and Random Oversampling, the recall for the minority class improves significantly.
  • Random undersampling can lead to the loss of valuable majority class samples but is useful when dataset size is a concern.

Choosing the Right Resampling Strategy

Method                  When to Use         Pros                          Cons
Random Oversampling     Small datasets      Simple; preserves all data    Overfitting risk
SMOTE                   Complex datasets    Generates diverse samples     Synthetic noise risk
Random Undersampling    Large datasets      Faster training               Information loss

Conclusion

Resampling techniques play a crucial role in handling imbalanced datasets in machine learning. While oversampling (Random Oversampling, SMOTE) helps generate synthetic or duplicated samples, under-sampling (Random Under-sampling) removes excessive majority-class data to balance the dataset.

The choice of technique depends on the dataset size, problem complexity, and trade-offs between overfitting and underfitting. Experimenting with different methods and evaluating their impact on model performance using proper metrics (precision, recall, F1-score) is essential for optimal results.
