Handling Imbalanced Data in Machine Learning (Python Examples)

In many real-world datasets, one class appears far more frequently than another. For example, in fraud detection, fraudulent transactions are rare compared to legitimate ones. In medical diagnosis, patients with a disease may be far fewer than healthy individuals. In spam detection, spam emails may make up only a small fraction of all messages, depending on the dataset.

When such imbalance exists, many machine learning models become biased toward the majority class. A model may show high overall accuracy while failing completely on the minority class. This creates a dangerous illusion of performance. Therefore, understanding how to detect, measure, and solve imbalance is essential for building reliable predictive systems.

What is Imbalanced Data?

A dataset is called imbalanced when the number of observations in one target class is significantly larger than in the others. In binary classification, this usually means one class dominates while the other has very few examples.

Suppose a dataset has 10,000 records where 9,800 belong to class 0 and only 200 belong to class 1. This means class 1 represents only 2% of the total data. A model that predicts every sample as class 0 would achieve 98% accuracy, yet it would be useless because it never detects class 1.

This is why accuracy alone becomes misleading in imbalanced problems.
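
To see the illusion concretely, here is a minimal sketch, assuming a hypothetical 98/2 dataset, where a baseline that always predicts the majority class reaches 98% accuracy yet never detects class 1.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical dataset: 98 majority samples, 2 minority samples
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 98 + [1] * 2)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
predictions = baseline.predict(X)

print("Accuracy:", accuracy_score(y, predictions))          # 0.98
print("Recall for class 1:", recall_score(y, predictions))  # 0.0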

Since majority examples dominate training, the model learns patterns that favor them. Minority examples contribute little to the loss function unless special techniques are used.

As a result, the classifier may have poor recall, low precision, weak probability estimates, and unstable decision boundaries for rare classes. In practical domains such as healthcare, cybersecurity, finance, and manufacturing, missing minority events can be expensive or dangerous.

To detect imbalance, the first step is to inspect the target distribution. Counting class frequencies immediately reveals the severity of the imbalance. A ratio such as 95:5 is moderate, while 99.9:0.1 can be extreme.

In Python, one may inspect class counts as follows.
import pandas as pd

# Create a sample dataset
data = pd.DataFrame({
    "target": [0] * 98 + [1] * 2
})

# Display class counts
print(data["target"].value_counts())

# Display class proportions
print(data["target"].value_counts(normalize=True))
Output:
target
0    98
1     2
Name: count, dtype: int64
target
0    0.98
1    0.02
Name: proportion, dtype: float64
98% of rows belong to class 0, and 2% belong to class 1.

This is why engineers rely on metrics such as precision, recall, F1-score, ROC-AUC, and especially PR-AUC when positive cases are rare; a short computation example follows the list below.

- Precision measures how many predicted positive cases were truly positive. It is important when false alarms are costly.
- Recall measures how many actual positive cases were found. It matters when missing positives is dangerous.
- F1-score balances precision and recall into a single measure.
- ROC-AUC measures ranking quality across thresholds, while PR-AUC is often more informative for severe imbalance.
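
As a quick illustration, here is a minimal sketch computing these metrics with scikit-learn; the label and score arrays are hypothetical values chosen for demonstration.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

# Hypothetical ground truth and predicted probabilities
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.7, 0.9]
y_pred = [1 if p >= 0.5 else 0 for p in y_scores]  # hard labels at threshold 0.5

print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("ROC-AUC:", roc_auc_score(y_true, y_scores))
# average precision is a common single-number summary of the PR curve
print("PR-AUC:", average_precision_score(y_true, y_scores))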

Strategies for Handling Imbalanced Data

1. Resampling the Dataset

A common solution is to change the class distribution in the training set. This can be done by oversampling minority cases or undersampling majority cases.

1.1 Random Oversampling

Oversampling duplicates minority examples so the learner sees them more often. It is simple and often effective, but excessive duplication may cause overfitting.
import pandas as pd
from sklearn.utils import resample

# Create a sample dataset
data = pd.DataFrame({
    "target": [0] * 98 + [1] * 2
})

# Show original distribution
print("Before Oversampling:")
print(data["target"].value_counts())
print(data["target"].value_counts(normalize=True))

# Separate majority and minority classes
majority = data[data["target"] == 0]
minority = data[data["target"] == 1]

# Randomly oversample minority class
minority_oversampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42
)

# Combine both classes
balanced_data = pd.concat([majority, minority_oversampled])

# Show new distribution
print("\nAfter Oversampling:")
print(balanced_data["target"].value_counts())
print(balanced_data["target"].value_counts(normalize=True))
Output:
Before Oversampling:
target
0    98
1     2
Name: count, dtype: int64
target
0    0.98
1    0.02
Name: proportion, dtype: float64

After Oversampling:
target
0    98
1    98
Name: count, dtype: int64
target
0    0.5
1    0.5
Name: proportion, dtype: float64

1.2 Random Undersampling

Undersampling removes majority examples. This reduces training size and balances classes, but useful information may be lost.
import pandas as pd
from sklearn.utils import resample

# Create a sample dataset
data = pd.DataFrame({
    "target": [0] * 98 + [1] * 2
})

# Show original distribution
print("Before Undersampling:")
print(data["target"].value_counts())
print(data["target"].value_counts(normalize=True))

# Separate majority and minority classes
majority = data[data["target"] == 0]
minority = data[data["target"] == 1]

# Randomly undersample majority class
majority_undersampled = resample(
    majority,
    replace=False,
    n_samples=len(minority),
    random_state=42
)

# Combine both classes
balanced_data = pd.concat([majority_undersampled, minority])

# Show new distribution
print("\nAfter Undersampling:")
print(balanced_data["target"].value_counts())
print(balanced_data["target"].value_counts(normalize=True))
Output:
Before Undersampling:
target
0    98
1     2
Name: count, dtype: int64
target
0    0.98
1    0.02
Name: proportion, dtype: float64

After Undersampling:
target
0    2
1    2
Name: count, dtype: int64
target
0    0.5
1    0.5
Name: proportion, dtype: float64
Oversampling is useful when the dataset is small and minority examples are precious. Undersampling is useful when the majority class is huge and largely redundant.

2. SMOTE

SMOTE stands for Synthetic Minority Over-sampling Technique. Instead of duplicating minority records, it creates synthetic samples between nearby minority points. This often improves generalization.
import pandas as pd
from imblearn.over_sampling import SMOTE

# Create a sample dataset
data = pd.DataFrame({
    "feature1": range(100),
    "target": [0] * 98 + [1] * 2
})

# Separate features and target
X = data[["feature1"]]
y = data["target"]

# Show original distribution
print("Before SMOTE:")
print(y.value_counts())
print(y.value_counts(normalize=True))

# Apply SMOTE
smote = SMOTE(random_state=42, k_neighbors=1)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Show new distribution
print("\nAfter SMOTE:")
print(y_resampled.value_counts())
print(y_resampled.value_counts(normalize=True))
Output:
Before SMOTE:
target
0    98
1     2
Name: count, dtype: int64
target
0    0.98
1    0.02
Name: proportion, dtype: float64

After SMOTE:
target
0    98
1    98
Name: count, dtype: int64
target
0    0.5
1    0.5
Name: proportion, dtype: float64
SMOTE can be powerful, but it should be applied carefully. If classes overlap strongly, synthetic samples may create noisy regions.

3. Class Weights

Many algorithms allow assigning higher penalties to minority mistakes. This avoids changing the dataset and instead changes learning priorities.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Create a sample dataset
data = pd.DataFrame({
    "feature1": range(100),
    "target": [0] * 98 + [1] * 2
})

# Separate features and target
X = data[["feature1"]]
y = data["target"]

# Show original distribution
print("Class Distribution:")
print(y.value_counts())
print(y.value_counts(normalize=True))

# Train model with class weights
model = LogisticRegression(class_weight="balanced", random_state=42)
model.fit(X, y)

# Show learned class weights effect
print("\nModel trained using balanced class weights.")
print("Minority class errors receive higher penalty.")
Output:
Class Distribution:
target
0    98
1     2
Name: count, dtype: int64
target
0    0.98
1    0.02
Name: proportion, dtype: float64

Model trained using balanced class weights.
Minority class errors receive higher penalty.
This method is elegant and works especially well with linear models, trees, and boosting algorithms.
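
To see what "balanced" actually means, the implied weights can be inspected directly; here is a minimal sketch using scikit-learn's compute_class_weight on the same hypothetical 98/2 split.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Same hypothetical 98/2 class split as above
y = np.array([0] * 98 + [1] * 2)

# "balanced" assigns n_samples / (n_classes * count_of_class) to each class
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
# {0: 0.51..., 1: 25.0} -- each minority error weighs roughly 49x more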

4. Threshold Tuning

Many classifiers output probabilities. The default threshold is often 0.5, but for imbalanced data this may be suboptimal. Lowering the threshold can increase recall, while raising it may improve precision.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Create a sample dataset
data = pd.DataFrame({
    "feature1": range(100),
    "target": [0] * 98 + [1] * 2
})

# Separate features and target
X = data[["feature1"]]
y = data["target"]

# Train classifier
model = LogisticRegression(class_weight="balanced", random_state=42)
model.fit(X, y)

# Get predicted probabilities for class 1
probs = model.predict_proba(X)[:, 1]

# Apply different thresholds
pred_50 = (probs >= 0.50).astype(int)
pred_30 = (probs >= 0.30).astype(int)

# Count predicted positives
print("Threshold = 0.50")
print(pd.Series(pred_50).value_counts())

print("\nThreshold = 0.30")
print(pd.Series(pred_30).value_counts())
Output:
Threshold = 0.50
0    96
1     4
Name: count, dtype: int64

Threshold = 0.30
0    96
1     4
Name: count, dtype: int64
In this tiny synthetic example both thresholds happen to produce identical counts; on realistic data, lowering the threshold typically flags more positives. Either way, threshold tuning should be guided by business cost rather than guesswork.
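
One way to choose a threshold more systematically is to scan the precision-recall trade-off. A minimal sketch, reusing y and probs from the code above:
from sklearn.metrics import precision_recall_curve

# Candidate thresholds with their precision/recall trade-offs
precisions, recalls, thresholds = precision_recall_curve(y, probs)

# Print every 20th candidate to keep the output short
for p, r, t in list(zip(precisions, recalls, thresholds))[::20]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")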

5. Ensemble Methods

Tree ensembles such as Random Forest, XGBoost, LightGBM, and Balanced Random Forest often perform well on imbalanced datasets. These models capture nonlinear relationships and interact well with class weighting.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Create a sample dataset
data = pd.DataFrame({
    "feature1": range(100),
    "target": [0] * 98 + [1] * 2
})

# Separate features and target
X = data[["feature1"]]
y = data["target"]

# Show original distribution
print("Class Distribution:")
print(y.value_counts())
print(y.value_counts(normalize=True))

# Train Random Forest with class weights
model = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",
    random_state=42
)

model.fit(X, y)

# Predict classes
predictions = model.predict(X)

print("\nPredicted Class Counts:")
print(pd.Series(predictions).value_counts())
Output:
Class Distribution:
target
0    98
1     2
Name: count, dtype: int64
target
0    0.98
1    0.02
Name: proportion, dtype: float64

Predicted Class Counts:
0    98
1     2
Name: count, dtype: int64
They are widely used in industry because they combine predictive strength with flexible handling of skewed distributions.
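
Since Balanced Random Forest was mentioned above: imbalanced-learn provides an implementation that undersamples the majority class inside each bootstrap sample. A minimal sketch, assuming the same synthetic data:
import pandas as pd
from imblearn.ensemble import BalancedRandomForestClassifier

# Same synthetic dataset as above
data = pd.DataFrame({
    "feature1": range(100),
    "target": [0] * 98 + [1] * 2
})
X = data[["feature1"]]
y = data["target"]

# Each tree trains on a bootstrap where the majority class is undersampled
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

print(pd.Series(model.predict(X)).value_counts())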

Cross Validation Best Practices

When validating imbalanced data, use stratified cross-validation. This preserves class proportions in each fold. Without stratification, some folds may contain too few minority examples.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Create a sample dataset
data = pd.DataFrame({
    "feature1": range(100),
    "target": [0] * 98 + [1] * 2
})

X = data[["feature1"]]
y = data["target"]

# Stratified Cross Validation
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    y_train = y.iloc[train_idx]
    y_test = y.iloc[test_idx]

    print(f"Fold {fold}")
    print("Training Class Counts:")
    print(y_train.value_counts())

    print("Testing Class Counts:")
    print(y_test.value_counts())
    print("-" * 30)
Output:
Fold 1
Training Class Counts:
target
0    49
1     1
Name: count, dtype: int64
Testing Class Counts:
target
0    49
1     1
Name: count, dtype: int64
------------------------------
Fold 2
Training Class Counts:
target
0    49
1     1
Name: count, dtype: int64
Testing Class Counts:
target
0    49
1     1
Name: count, dtype: int64
------------------------------
Also remember that resampling must be done only on training folds, never before splitting, to avoid data leakage.

Pipeline Example

A safe workflow uses scaling, SMOTE, and a classifier inside one pipeline.
import pandas as pd
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a sample dataset
data = pd.DataFrame({
    "feature1": range(100),
    "feature2": range(100, 200),
    "target": [0] * 98 + [1] * 2
})

# Separate features and target
X = data[["feature1", "feature2"]]
y = data["target"]

# Show original distribution
print("Original Class Distribution:")
print(y.value_counts())

# Build pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42, k_neighbors=1)),
    ("model", LogisticRegression(random_state=42))
])

# Train pipeline
pipeline.fit(X, y)

print("\nPipeline trained successfully.")
print("Steps applied: Scaling โ†’ SMOTE โ†’ Logistic Regression")
Output:
Original Class Distribution:
target
0    98
1     2
Name: count, dtype: int64

Pipeline trained successfully.
Steps applied: Scaling → SMOTE → Logistic Regression
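
Because the imblearn Pipeline applies SMOTE only while fitting, it can be passed straight to cross-validation without leaking synthetic samples into test folds. A minimal sketch, reusing the pipeline from above on a slightly larger hypothetical dataset (SMOTE with k_neighbors=1 needs at least two minority samples in every training fold):
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical 92/8 split so each training fold keeps several minority samples
data_cv = pd.DataFrame({
    "feature1": range(100),
    "feature2": range(100, 200),
    "target": [0] * 92 + [1] * 8
})
X_cv = data_cv[["feature1", "feature2"]]
y_cv = data_cv["target"]

# SMOTE runs only on each training fold; test folds stay untouched
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_cv, y_cv, cv=cv, scoring="f1")
print("F1 per fold:", scores)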

Conclusion

If imbalance is mild, start with proper metrics and class weights. If imbalance is moderate, test SMOTE and threshold tuning. If imbalance is extreme, combine anomaly detection, specialized ensembles, calibrated probabilities, and domain-specific cost analysis.

There is no universal recipe. Practical experimentation with disciplined validation is the true path.

Handling imbalanced data is fundamentally about ensuring rare but important events are learned correctly. Standard accuracy can hide failure, while proper metrics reveal truth. Techniques such as oversampling, undersampling, SMOTE, class weights, threshold tuning, and ensemble learning provide strong solutions when used carefully.