Data Cleaning & Preprocessing for ML (Missing Values, Outliers, Normalization)

In machine learning, many people focus on algorithms, model tuning, and deployment pipelines. Yet experienced engineers know that a large share of success is decided much earlier, during data cleaning and preprocessing. A sophisticated model trained on weak data often performs worse than a simple model trained on clean and well-prepared data.

Raw data from real systems is rarely perfect. Customer records may contain blank fields, financial tables may include impossible numbers, sensors may fail temporarily, logs may arrive in inconsistent formats, and numerical features may exist on wildly different scales. If these issues are ignored, the model learns confusion rather than truth.

Data cleaning is the discipline of correcting, filtering, and improving raw data. Preprocessing is the process of transforming that cleaned data into a form suitable for machine learning systems. Together, they form one of the most valuable stages in the entire ML lifecycle.

A machine learning model does not understand business meaning. It only sees patterns in numbers and categories. If age contains missing values, income contains typing errors, and transaction amounts include impossible spikes, the model cannot distinguish signal from noise unless engineers prepare the data carefully.

This is why teams often say: garbage in, garbage out. Better input usually creates better output more reliably than endlessly switching algorithms.

A model trained on noisy customer churn data may blame the wrong factors. A fraud model may miss suspicious behavior because abnormal records were not handled properly. A recommendation engine may underperform because inconsistent user histories were never cleaned.

Handling Missing Values

Missing values are among the most common problems in production datasets. A customer may skip optional profile fields. A payment service may fail to return one attribute. A sensor may stop reporting temporarily. A manual data entry process may leave blanks.

Missing values are dangerous because many algorithms cannot process them directly, and even when they can, missingness itself may carry business meaning.

Imagine a loan application dataset where annual income is blank. That absence may not be random. It could indicate uncertainty, incomplete forms, or higher-risk applicants. Therefore, missing data should be handled thoughtfully rather than mechanically.

One common method is imputation, which means filling missing values using reasonable substitutes.
import pandas as pd

# df is an existing DataFrame with a numeric "age" column and a categorical "city" column
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("Unknown")
Median values are often useful for skewed numeric columns, while categories may use a placeholder such as Unknown.
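When the absence itself may carry meaning, as in the loan example above, one option is to keep a flag recording that a value was missing alongside the imputed value. A minimal sketch with hypothetical data and an illustrative column name:
import pandas as pd
import numpy as np

# Hypothetical loan data where annual_income is sometimes blank
loans = pd.DataFrame({"annual_income": [52000, np.nan, 61000, np.nan, 48000]})

# Record the fact of missingness as its own feature, then impute the value
loans["income_missing"] = loans["annual_income"].isna().astype(int)
loans["annual_income"] = loans["annual_income"].fillna(loans["annual_income"].median())

print(loans)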

Another strategy is removing rows or columns when missingness is extreme and the data has little value. However, aggressive deletion can reduce dataset size and introduce bias.
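A minimal sketch of selective removal, assuming df is the working DataFrame; the 50 percent threshold is an illustrative choice, not a rule:
# Drop columns where more than half of the values are missing (illustrative threshold)
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)

# Drop any remaining rows that still contain a missing value
df = df.dropna()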

Understanding Outliers

Outliers are observations that differ sharply from the rest of the dataset. They may represent genuine rare behavior, data entry mistakes, fraud attempts, system glitches, or extraordinary events.

For example, in an employee salary dataset, values such as 50,000 and 60,000 may be common, while 9,999,999 may be suspicious. In an e-commerce dataset, a single order of 100,000 units may indicate either enterprise demand or corrupted records.

Outliers matter because they can distort averages, inflate variance, and mislead many algorithms. Linear regression, distance-based models, and clustering methods can be especially sensitive.
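A quick illustration of that distortion, using the salary figures mentioned above:
import pandas as pd

salaries = pd.Series([50000, 55000, 60000, 52000, 9999999])

print(salaries.mean())    # about 2,043,400 -- pulled far upward by a single record
print(salaries.median())  # 55,000 -- barely affected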

A common statistical method uses the interquartile range.
import pandas as pd

# df is an existing DataFrame with a numeric "price" column
Q1 = df["price"].quantile(0.25)
Q3 = df["price"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
clean_df = df[(df["price"] >= lower) & (df["price"] <= upper)]
This code uses the IQR (interquartile range) method to detect and remove outliers from the price column. It calculates the 25th percentile (Q1) and 75th percentile (Q3), then defines lower and upper limits using 1.5 × IQR.

Any rows with prices outside this range are treated as outliers and removed, creating a cleaned dataset called clean_df.

This approach flags unusually low or high values relative to the majority of the data.

Yet outliers should not always be removed. A fraud detection system may depend precisely on unusual behavior. A demand forecasting model must sometimes learn from rare holiday spikes. Context determines whether an outlier is noise or gold.
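When unusual records may be the signal, one option is to flag them instead of dropping them, reusing the IQR bounds computed above:
# Keep every row, but record which prices fall outside the IQR bounds as a feature
df["price_is_outlier"] = ((df["price"] < lower) | (df["price"] > upper)).astype(int)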

Normalization and Feature Scaling

Many datasets contain variables on different numeric scales. One column may store age values between 18 and 80. Another may store salary values in lakhs or millions. Another may store website visits between 0 and 5000.

When scales differ dramatically, some algorithms give excessive importance to larger-number features. Distance-based methods such as K-Nearest Neighbors and K-Means are especially sensitive. Gradient-based models may also train inefficiently.
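A quick illustration of why scale matters for distance-based methods; the numbers are made up:
import numpy as np

# Two customers described by (age, annual income); income dominates the raw distance
a = np.array([25, 50000])
b = np.array([45, 52000])

print(np.linalg.norm(a - b))  # about 2000.1 -- the 20-year age gap barely registers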

Normalization and scaling solve this issue by bringing features into comparable ranges.

A common technique is Min-Max scaling.
from sklearn.preprocessing import MinMaxScaler

# df is an existing DataFrame with numeric "age" and "income" columns

scaler = MinMaxScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
By default, this transforms each feature into the range 0 to 1.

Another popular technique is Standardization, which centers data around zero with unit variance.
from sklearn.preprocessing import StandardScaler

# df is an existing DataFrame with numeric "age" and "income" columns

scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
Standardization is often useful for linear models, logistic regression, neural networks, and many optimization-based methods.

Example

Imagine building a customer churn model for a subscription platform. The raw dataset contains common quality problems:

- Missing values in customer age
- Impossible values such as negative monthly charges
- Different feature scales like login count versus annual revenue

If this data is used directly, the model may produce poor and unstable predictions.
import pandas as pd
import numpy as np

# Raw customer data
data = {
    "age": [25, 32, np.nan, 45, 29, np.nan, 41, 38],
    "monthly_charge": [49, 59, -20, 75, 65, 9999, 55, 60],
    "login_count": [12, 8, 5, 20, 15, 3, 10, 11],
    "annual_revenue": [12000, 18000, 9000, 25000, 21000, 15000, 22000, 17000]
}

df = pd.DataFrame(data)

print("Raw Data:\n")
print(df)

# -----------------------------------
# 1. Handle impossible values
# -----------------------------------

# Replace negative charges with NaN
df["monthly_charge"] = df["monthly_charge"].apply(
    lambda x: np.nan if x < 0 else x
)

# Cap unusually high charges
df["monthly_charge"] = df["monthly_charge"].clip(upper=500)

# -----------------------------------
# 2. Fill missing values
# -----------------------------------

# Fill missing age with median age
df["age"] = df["age"].fillna(df["age"].median())

# Fill missing monthly_charge with median
df["monthly_charge"] = df["monthly_charge"].fillna(
    df["monthly_charge"].median()
)

# -----------------------------------
# 3. Final Cleaned Data
# -----------------------------------

print("\nCleaned Data:\n")
print(df)
Raw Data:

    age  monthly_charge  login_count  annual_revenue
0  25.0              49           12           12000
1  32.0              59            8           18000
2   NaN             -20            5            9000
3  45.0              75           20           25000
4  29.0              65           15           21000
5   NaN            9999            3           15000
6  41.0              55           10           22000
7  38.0              60           11           17000

Cleaned Data:

    age  monthly_charge  login_count  annual_revenue
0  25.0            49.0           12           12000
1  32.0            59.0            8           18000
2  35.0            60.0            5            9000
3  45.0            75.0           20           25000
4  29.0            65.0           15           21000
5  35.0           500.0            3           15000
6  41.0            55.0           10           22000
7  38.0            60.0           11           17000

Practical Workflow

First, profile the dataset. Second, measure missingness. Third, investigate unusual values. Fourth, scale numerical variables where needed. Fifth, validate the transformed data. Only then, train the model.
Step 1: Profile columns 
Step 2: Detect missing values 
Step 3: Investigate outliers 
Step 4: Apply scaling 
Step 5: Validate transformed data 
Step 6: Train model 
This process reduces avoidable downstream problems.
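A minimal profiling pass covering the first three steps, assuming df is the working DataFrame:
# Steps 1-3: profile columns, measure missingness, and look for suspicious extremes
print(df.describe(include="all"))
print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column
print(df.select_dtypes("number").agg(["min", "max"]))  # quick check for impossible values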

Many beginners remove every outlier automatically, even when rare cases are valuable. Others fill all missing values with zero, which can create false meaning. Some normalize tree-based model inputs unnecessarily while ignoring more important issues like leakage or duplicate rows.

Another common mistake is fitting scalers on the full dataset before train-test splitting. This leaks information from validation data into training transformations.

The correct order is usually split first, then fit transformations on training data only.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X is the feature matrix, y is the target vector

X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
This preserves honest evaluation.

Not every unusual record is wrong. Not every missing field is harmful. Not every variable requires scaling. Excessive cleaning can remove valuable business signals and create sterile datasets that no longer reflect reality.

The goal is not perfect-looking data. The goal is useful, trustworthy, representative data.

In real systems, preprocessing must be repeatable. The same transformations used during training must also run during inference. If the training pipeline uses median imputation and scaling, the production system must apply identical logic to incoming requests.

Inconsistency between training and production preprocessing is a common hidden cause of model failure.
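One common way to keep training and serving consistent is to wrap the steps in a single fitted object and persist it. A sketch using scikit-learn's Pipeline, assuming X_train and X_request are numeric feature matrices defined elsewhere:
import joblib
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Median imputation followed by standardization, fitted once on training data
preprocess = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_train_prepared = preprocess.fit_transform(X_train)

# Persist the fitted pipeline so the serving system applies identical logic
joblib.dump(preprocess, "preprocess.joblib")

# At inference time, load and reuse the same fitted transformations
preprocess = joblib.load("preprocess.joblib")
X_request_prepared = preprocess.transform(X_request)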

Summary

- Missing values should be analyzed and handled through imputation or selective removal.
- Outliers may be noise or valuable rare signals, depending on context.
- Normalization and scaling help many algorithms learn fairly across features.
- Leakage prevention requires fitting transformations only on training data.
- Strong ML systems depend heavily on disciplined preprocessing pipelines.

Master data preparation, and model building becomes far more effective.

Nagesh Chauhan
Principal Engineer | Java · Spring Boot · Python · Microservices · AI/ML

Principal Engineer with 14+ years of experience in designing scalable systems using Java, Spring Boot, and Python. Specialized in microservices architecture, system design, and machine learning.
