Underfitting vs Overfitting: Finding the Balance in Machine Learning

Machine learning models often fail in two opposite ways. Some models learn too little and remain weak no matter how much data or tuning we provide. Others learn too much and become trapped inside the training data they were supposed to understand. These two failures are known as underfitting and overfitting.

A recommendation engine may fail because it ignores user behavior and shows the same products to everyone. Another may fail because it memorized yesterday's clicks instead of learning lasting preferences. The goal of machine learning is not to achieve perfection on historical records. Historical data is only rehearsal. The real examination begins when new and unseen data arrives.

A successful model must learn genuine patterns and apply them reliably to future cases. This ability is called generalization. Underfitting and overfitting are two of its greatest enemies.

What is Underfitting?

Underfitting occurs when a model is too simple, too weak, or too restricted to capture meaningful relationships in the data. It performs poorly during training and continues performing poorly during validation because it never learned enough to begin with. Imagine predicting house prices using only square footage while ignoring location, neighborhood quality, transport access, amenities, age of the property, and market demand. The model may look mathematically elegant, but it remains practically blind.

price = a * area + b

This equation may produce rough estimates, yet it misses much of reality. That is underfitting.
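A minimal sketch on synthetic data makes the gap visible. The coefficients and the "location" feature below are invented for illustration, not taken from any real housing dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)

# Synthetic prices driven by area AND a hypothetical "location quality" score
area = rng.uniform(50, 250, size=300)
location = rng.uniform(0, 10, size=300)
price = 1000 * area + 40000 * location + rng.normal(0, 5000, size=300)

# Underfit model: price = a * area + b, ignoring location entirely
area_only = LinearRegression().fit(area.reshape(-1, 1), price)
pred_area = area_only.predict(area.reshape(-1, 1))
print("R2 using area only:", round(r2_score(price, pred_area), 3))

# Richer model: both features
X_full = np.column_stack([area, location])
full = LinearRegression().fit(X_full, price)
print("R2 using area + location:", round(r2_score(price, full.predict(X_full)), 3))
```

Even on its own training data, the one-feature model explains only a fraction of the variation, because most of the signal lives in a feature it never sees.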

When underfitting is present, both training and validation scores are usually low and close to one another.

Training Accuracy = 67%
Validation Accuracy = 65%

The model is not overthinking. It is barely thinking at all.

How to Fix Underfitting

When a model underfits, the remedy is usually to help it learn more. This may involve choosing a stronger algorithm, engineering richer features, reducing excessive regularization, or allowing more training time. Sometimes the issue is not the algorithm itself but the poverty of the information given to it.
from sklearn.ensemble import RandomForestClassifier

# An ensemble of decision trees can model nonlinear relationships
# and feature interactions that a single linear equation cannot
model = RandomForestClassifier()
A nonlinear model often captures patterns that a rigid linear model cannot.
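As a quick illustration, consider scikit-learn's synthetic "moons" dataset, where the boundary between the two classes is curved. The dataset and parameters below are chosen purely for demonstration:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Two interleaving half-circles: no straight line separates them well
X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A linear model can only draw a straight decision boundary
linear = LogisticRegression().fit(X_train, y_train)

# A forest of trees can bend around the curve
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)

print("Linear validation accuracy:", round(linear.score(X_test, y_test), 3))
print("Forest validation accuracy:", round(forest.score(X_test, y_test), 3))
```

On data like this, the linear model underfits no matter how long it trains, while the nonlinear model closes most of the gap.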

What is Overfitting?

Overfitting occurs when a model learns the training data too precisely. It captures not only genuine patterns but also noise, random coincidences, and accidental quirks. The model becomes excellent at remembering the past while becoming unreliable at predicting the future. Imagine a fraud detection model that learns strange rules such as purchases made at exactly 3:07 PM being suspicious, or cart totals ending in 13 being risky. These patterns may have appeared in training data by chance rather than truth.

When overfitting appears, training performance looks excellent while validation performance falls behind.

Training Accuracy = 99%
Validation Accuracy = 76%

The model has mastered the training set but failed to understand the real problem.

How to Fix Overfitting

When a model overfits, the remedy is usually to control complexity and improve discipline. This may involve collecting more data, simplifying the model, pruning trees, increasing regularization, applying dropout, or removing noisy features.
from sklearn.tree import DecisionTreeClassifier

# Limiting depth caps how much detail the tree can memorize
model = DecisionTreeClassifier(max_depth=5)
Restricting depth often prevents memorization and improves validation performance.
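A small sketch on synthetic noisy data shows the effect. The dataset parameters are illustrative; the point is the contrast between an unrestricted tree and a depth-limited one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy labels (flip_y) make memorization easy and harmful
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Unrestricted tree: grows until it fits every training example, noise included
deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Deep tree    train:", round(deep.score(X_train, y_train), 3),
      " validation:", round(deep.score(X_test, y_test), 3))

# Depth-limited tree: forced to keep only the broad patterns
shallow = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
print("Shallow tree train:", round(shallow.score(X_train, y_train), 3),
      " validation:", round(shallow.score(X_test, y_test), 3))
```

The unrestricted tree reaches perfect training accuracy while its validation score lags well behind, the classic overfitting signature.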

Real-World Example: Recommendation Systems

Suppose an online store uses a simple recommender system that shows only the best-selling products to every visitor. It does not care whether the customer prefers books, fashion, electronics, or sports equipment. It ignores browsing history, purchase history, price sensitivity, and category taste. Everyone receives nearly the same suggestions.
products = ["Phone Case", "Running Shoes", "Laptop Bag"]

def recommend(user):
    # Every user receives the same list; the argument is never used
    return products[:3]
This system likely suffers from underfitting because it treats all users as identical. A gamer may receive kitchen tools. A parent may receive gaming accessories. A student may receive luxury goods. The model has learned one shallow truth—that popular products sell—but it has ignored the deeper structure of customer behavior.

Now imagine replacing this simple system with a giant deep learning model trained on every click, hover, scroll, pause, accidental tap, and random interaction. If the dataset is messy, sparse, or noisy, the model may begin memorizing patterns unique to yesterday's traffic.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=500,  # many trees
    max_depth=None     # no depth limit: each tree can grow until it memorizes
)
Such a model may produce impressive training metrics.

Training Accuracy = 99%
Validation Accuracy = 74%

The gap between training and validation performance suggests overfitting. The model has become a historian of yesterday instead of a predictor of tomorrow.

The winning system usually lies between these extremes. A better recommender uses meaningful signals such as recent views, previous purchases, category interests, similar users, and price range, while controlling unnecessary complexity.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,    # enough trees for stable averages
    max_depth=8,         # deep enough for real patterns, not noise
    min_samples_leaf=5   # every rule must be supported by several examples
)
Such a balanced system may produce healthier results.

Training Accuracy = 88%
Validation Accuracy = 86%

The scores are strong and close together, which often signals better generalization.

Model Evaluation & Reliability

One train-test split can be deceptive. A lucky split may hide overfitting, while an unlucky split may exaggerate weakness. Cross-validation offers a stronger estimate of real performance by repeating evaluation across multiple folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Load sample dataset
data = load_iris()
X = data.data
y = data.target

# Create model
model = LogisticRegression(max_iter=200)

# Perform 5-fold cross validation
scores = cross_val_score(model, X, y, cv=5)

# Print results
print("Fold Scores:", scores)
print("Average Accuracy:", scores.mean())
print("Standard Deviation:", scores.std())
Fold Scores: [0.96666667 1.         0.93333333 0.96666667 1.        ]
Average Accuracy: 0.9733333333333334
Standard Deviation: 0.02494438257849294
Consistent fold scores suggest stability. Wide variation may indicate fragile learning or data issues.

Regularization discourages unnecessary complexity. It is one of the strongest defenses against overfitting. In linear models it may limit coefficient growth. In neural networks it may appear as dropout or weight decay.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Create sample regression dataset
X, y = make_regression(
    n_samples=200,
    n_features=5,
    noise=15,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create Ridge model
model = Ridge(alpha=1.0)

# Train model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("R2 Score:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
R2 Score: 0.9782108327747819
MSE: 189.3151775901967
A small increase in bias often creates a large reduction in variance, making the trade worthwhile.

Sometimes outstanding metrics are not signs of intelligence but signs of leakage. Leakage happens when future information or target-related clues accidentally enter training data. For example, predicting customer churn using a feature created after cancellation may produce false excellence.

The model appears brilliant but collapses in production. Whenever results look suspiciously perfect, leakage should be investigated.
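A toy demonstration makes the danger concrete. The "churn" setup below is entirely made up: one feature is derived from the label itself, mimicking a value logged after cancellation:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Honest features carry only a weak signal about the outcome
X_honest = rng.normal(size=(300, 3))
y = (X_honest[:, 0] + rng.normal(scale=2.0, size=300) > 0).astype(int)

# Leaky feature: computed from the label itself (e.g. recorded after churn)
leak = y + rng.normal(scale=0.01, size=300)
X_leaky = np.column_stack([X_honest, leak])

model = DecisionTreeClassifier(random_state=0)
print("Honest features:", round(cross_val_score(model, X_honest, y, cv=5).mean(), 3))
print("With leakage:  ", round(cross_val_score(model, X_leaky, y, cv=5).mean(), 3))
```

The leaky version scores near perfection, yet in production that feature would not exist at prediction time, and the apparent brilliance vanishes.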

Learning curves compare performance as training size increases. If both training and validation scores remain low, underfitting is likely. If training remains high while validation stays much lower, overfitting is likely. If both converge high and remain close, the model is usually healthy.

These curves are among the most useful diagnostic tools in machine learning.
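scikit-learn's learning_curve helper produces these numbers directly. Here is a minimal sketch on the iris dataset used earlier; the chosen training-size steps are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Evaluate the model at increasing training-set sizes
# (shuffle so every subset contains all three classes)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=200), X, y,
    train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0],
    cv=5, shuffle=True, random_state=42
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:3d}  train={tr:.3f}  validation={va:.3f}")
```

Plotting the two columns against n gives the learning curve: converging high scores suggest health, a persistent gap suggests overfitting, and twin low scores suggest underfitting.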

Many teams trust offline metrics too much. Real systems require continuous monitoring. A model that once generalized well may later fail because user behavior changed, competitors entered the market, or product trends shifted.

Watch for falling click-through rate, rising false positives, drifting feature distributions, or widening gaps between offline and live performance. Production reality is always the final judge.

Diagnosis begins with training and validation metrics.

Case A: Train 69% Validation 67%
Case B: Train 99% Validation 78%
Case C: Train 91% Validation 89%

Case A suggests underfitting. Case B suggests overfitting. Case C suggests healthy generalization.
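These heuristics can be captured in a small triage helper. The thresholds below are illustrative defaults, not universal rules; sensible values depend on the problem and the metric:

```python
def diagnose(train_acc, val_acc, low=0.75, gap=0.10):
    """Rough triage of a train/validation score pair (thresholds illustrative)."""
    if train_acc < low and val_acc < low:
        # Both scores weak and close together: the model never learned enough
        return "underfitting"
    if train_acc - val_acc > gap:
        # Strong training score, lagging validation: likely memorization
        return "overfitting"
    return "healthy"

print(diagnose(0.69, 0.67))  # Case A -> underfitting
print(diagnose(0.99, 0.78))  # Case B -> overfitting
print(diagnose(0.91, 0.89))  # Case C -> healthy
```

A helper like this is a starting point for a dashboard alert, not a verdict; the deeper questions below still apply.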

Experienced practitioners ask deeper questions than "Which algorithm should I use?" They ask whether the data is representative, whether features are meaningful, whether leakage exists, whether the model is too rigid or too flexible, and whether metrics remain stable over time. They understand that model quality emerges from systems thinking, not from one clever library call.

Conclusion

Underfitting happens when a model is too simple and misses important patterns. Overfitting happens when a model is too complex and memorizes noise. Underfitting usually shows low training and low validation performance, while overfitting often shows high training performance and weaker validation results.

The best models live in the disciplined middle. They learn enough to capture truth and remain humble enough to ignore noise. That balance is where practical machine learning becomes engineering.
Nagesh Chauhan
Principal Engineer | Java · Spring Boot · Python · Microservices · AI/ML

Principal Engineer with 14+ years of experience in designing scalable systems using Java, Spring Boot, and Python. Specialized in microservices architecture, system design, and machine learning.
