Data Validation in Machine Learning (Great Expectations and Schema Drift)

In many organizations, data issues are not caused by complex algorithms but by missing values, broken formats, duplicate records, unexpected schema changes, or silent upstream failures. This is why data validation has become a foundational discipline in data engineering.

Data validation is the process of testing data against defined quality rules before it is trusted for downstream use. These rules may check row counts, column types, allowed ranges, uniqueness, null percentages, freshness, relationships between fields, or schema consistency. Instead of discovering errors after business damage occurs, validation systems detect issues early.

Great Expectations is among the most widely used tools in this area, while schema drift, where data structure changes unexpectedly over time, is one of the most important operational concerns.

This article explains data validation practically, why it matters, how Great Expectations works, and how teams manage schema drift in production pipelines.

What is Great Expectations?

Great Expectations is an open-source framework for testing, documenting, and monitoring data quality. It allows teams to define rules called expectations that datasets should satisfy. These expectations are then executed automatically during pipelines or scheduled checks.

Examples include expecting a column to be non-null, expecting values to fall within a range, expecting unique primary keys, or expecting row counts above a threshold.

Great Expectations also generates human-readable documentation and validation reports, making data quality visible across teams.

Software engineers write unit tests for code. In a similar spirit, data engineers write expectations for datasets. Instead of assuming customer_id is always present, they explicitly test it. Instead of trusting order totals are positive, they verify it.

This mindset changes data quality from reactive debugging into a proactive engineering discipline.

As an example, a dataset of customers may require unique IDs and non-null emails:
import pandas as pd
import great_expectations as gx

# Create sample dataset
data = pd.DataFrame({
    "customer_id": [101, 102, 103, 103],
    "email": [
        "a@example.com",
        "b@example.com",
        None,
        "d@example.com"
    ]
})

# Get Great Expectations context
context = gx.get_context()

# Add pandas datasource
datasource = context.data_sources.add_pandas(name="my_pandas_source")

# Add dataframe asset
data_asset = datasource.add_dataframe_asset(name="customers_data")

# Define batch
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    name="customers_batch"
)

# Create validator directly
validator = context.get_validator(
    batch_request=batch_definition.build_batch_request(
        batch_parameters={"dataframe": data}
    )
)

# Add expectations
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_unique("customer_id")
validator.expect_column_values_to_not_be_null("email")

# Run validation
results = validator.validate()

print(results)
Output:
Calculating Metrics: 100%|██████████| 6/6 [00:00<00:00, 3941.40it/s]
Calculating Metrics: 100%|██████████| 8/8 [00:00<00:00, 5947.26it/s]
Calculating Metrics: 100%|██████████| 6/6 [00:00<00:00, 15382.53it/s]
Calculating Metrics: 100%|██████████| 12/12 [00:00<00:00, 15897.55it/s]
{
  "success": false,
  "results": [
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id": "my_pandas_source-customers_data",
          "column": "customer_id"
        },
        "meta": {},
        "severity": "critical"
      },
      "result": {
        "element_count": 4,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "partial_unexpected_list": []
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    },
    {
      "success": false,
      "expectation_config": {
        "type": "expect_column_values_to_be_unique",
        "kwargs": {
          "batch_id": "my_pandas_source-customers_data",
          "column": "customer_id"
        },
        "meta": {},
        "severity": "critical"
      },
      "result": {
        "element_count": 4,
        "unexpected_count": 2,
        "unexpected_percent": 50.0,
        "partial_unexpected_list": [
          103,
          103
        ],
        "missing_count": 0,
        "missing_percent": 0.0,
        "unexpected_percent_total": 50.0,
        "unexpected_percent_nonmissing": 50.0
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    },
    {
      "success": false,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id": "my_pandas_source-customers_data",
          "column": "email"
        },
        "meta": {},
        "severity": "critical"
      },
      "result": {
        "element_count": 4,
        "unexpected_count": 1,
        "unexpected_percent": 25.0,
        "partial_unexpected_list": [
          null
        ]
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    }
  ],
  "suite_name": "default",
  "suite_parameters": {},
  "statistics": {
    "evaluated_expectations": 3,
    "successful_expectations": 1,
    "unsuccessful_expectations": 2,
    "success_percent": 33.33333333333333
  },
  "meta": {
    "great_expectations_version": "1.17.0",
    "expectation_suite_name": "default",
    "run_id": {
      "run_name": null,
      "run_time": "2026-04-24T16:25:37.475446+05:30"
    },
    "batch_spec": {
      "batch_data": "PandasDataFrame"
    },
    "batch_markers": {
      "ge_load_time": "20260424T105537.445152Z",
      "pandas_data_fingerprint": "277ed982077761077a61b05d655faa60"
    },
    "active_batch_definition": {
      "datasource_name": "my_pandas_source",
      "data_connector_name": "fluent",
      "data_asset_name": "customers_data",
      "batch_identifiers": {
        "dataframe": ""
      }
    },
    "validation_time": "20260424T105537.475400Z",
    "checkpoint_name": null
  },
  "id": null
}
The report immediately reveals both broken assumptions, the duplicate customer_id and the null email, before the data reaches production systems.
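In a pipeline, the top-level success flag is typically what gates downstream steps. A minimal sketch of that pattern, using a stand-in dict shaped like the report above rather than a live Great Expectations result:

```python
# Stand-in for the validation report printed above, trimmed to the fields used here
report = {
    "success": False,
    "statistics": {"evaluated_expectations": 3, "successful_expectations": 1},
}

def gate(report: dict) -> None:
    """Halt the pipeline when validation fails."""
    if not report["success"]:
        stats = report["statistics"]
        raise RuntimeError(
            f"Validation failed: {stats['successful_expectations']}"
            f"/{stats['evaluated_expectations']} expectations passed"
        )

try:
    gate(report)
except RuntimeError as e:
    print(e)  # Validation failed: 1/3 expectations passed
```

The same idea applies whether the check runs in an orchestrator task, a scheduled job, or a CI step: failed validation stops the load instead of letting bad data flow onward.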

The most useful validations are often simple. Critical IDs should be unique. Required fields should not be null. Numeric values should remain within sensible ranges. Dates should parse correctly. Categorical columns should contain approved values. Row counts should remain within expected bounds. Timestamps should be recent enough to indicate pipeline freshness.

These checks prevent a large percentage of operational incidents.
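The checks above can be sketched in plain pandas before reaching for any framework. The column names and thresholds below are illustrative, not from the article's dataset:

```python
import pandas as pd

# Illustrative dataset; column names and bounds are assumptions
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "amount": [250.0, 300.0, 125.5],
    "status": ["active", "active", "churned"],
    "created_at": ["2024-01-10", "2024-01-11", "2024-01-11"],
})

problems = []

# Critical IDs should be unique and non-null
if df["customer_id"].isna().any():
    problems.append("customer_id contains nulls")
if df["customer_id"].duplicated().any():
    problems.append("customer_id contains duplicates")

# Numeric values should remain within sensible ranges
if not df["amount"].between(0, 10_000).all():
    problems.append("amount outside expected range")

# Categorical columns should contain approved values
if not df["status"].isin({"active", "churned"}).all():
    problems.append("status contains unapproved values")

# Dates should parse correctly
if pd.to_datetime(df["created_at"], errors="coerce").isna().any():
    problems.append("created_at contains unparseable dates")

# Row counts should remain within expected bounds
if not (1 <= len(df) <= 100_000):
    problems.append("row count outside expected bounds")

print(problems)  # [] when every check passes
```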

Manual Python checks can work for small systems, but they often become scattered and undocumented. Great Expectations provides reusable expectations, structured reporting, documentation, and stronger governance.

As systems grow, standardized frameworks usually outperform ad hoc scripts.

What is Schema Drift?

Schema drift occurs when the structure of incoming data changes unexpectedly over time. In data systems, a schema defines how data is organized, including column names, data types, formats, order, and nested fields. When this structure changes without proper coordination, downstream pipelines can fail or produce incorrect results.

Schema drift may happen in many ways. A column can be renamed, removed, added, reordered, or changed to a different data type. For example, a field named amount may change from an integer to a string, signup_date may start arriving in a new date format, or nested JSON objects may suddenly be introduced into an API response.

Consider a payment pipeline that expects this CSV file:
customer_id,amount,payment_date
101,250,2024-01-10
102,300,2024-01-11
A reporting job may assume that amount is numeric and can be summed directly.

Now imagine the source application is updated and starts sending:
customer_id,total_amount,payment_date
101,"250 USD",10/01/2024
102,"300 USD",11/01/2024
Several schema changes happened at once:

- amount was renamed to total_amount
- Numeric values became text strings such as "250 USD"
- Date format changed from YYYY-MM-DD to DD/MM/YYYY

As a result, downstream jobs may fail, dashboards may show missing values, or machine learning features may become corrupted.

Because many systems depend on stable schemas, even small changes can create serious problems. A machine learning model trained on numeric amount values may break if production data suddenly provides strings. A warehouse load job may fail if a required column disappears.
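Drift like the payment example can be caught at load time by comparing the incoming dataframe's structure to an expected schema. A minimal sketch, where the expected column set and dtype predicates are assumptions for illustration:

```python
import io
import pandas as pd

# Expected schema for the payments feed (names and dtype checks are assumptions)
EXPECTED = {
    "customer_id": pd.api.types.is_numeric_dtype,
    "amount": pd.api.types.is_numeric_dtype,
    "payment_date": pd.api.types.is_datetime64_any_dtype,
}

def schema_problems(df: pd.DataFrame) -> list:
    """Compare a dataframe's structure against the expected schema."""
    problems = []
    for col, dtype_ok in EXPECTED.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif not dtype_ok(df[col]):
            problems.append(f"unexpected dtype for {col}: {df[col].dtype}")
    problems += [f"unexpected column: {c}" for c in df.columns if c not in EXPECTED]
    return problems

# The drifted feed from the example above
drifted = pd.read_csv(io.StringIO(
    "customer_id,total_amount,payment_date\n"
    '101,"250 USD",10/01/2024\n'
    '102,"300 USD",11/01/2024\n'
))
print(schema_problems(drifted))
```

Running this flags the missing amount column, the unexpected total_amount column, and the unparsed payment_date, so the problem surfaces at ingestion rather than in a downstream dashboard.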

Schema drift often happens when software developers modify source systems without informing data teams. It may also appear after vendor API upgrades, database migrations, CSV export changes, manual spreadsheet edits, or multiple services producing inconsistent event formats.

Sometimes schema changes are intentional improvements, such as adding a new column like discount_code. The real danger is not change itself, but unmanaged change without validation, communication, or backward compatibility.

Modern data teams reduce schema drift risk using schema registries, validation tools, contracts, automated tests, and monitoring systems that alert when incoming structures no longer match expectations.
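A lightweight form of such a contract is simply pinning the expected header of a feed and alerting before the file is loaded at all. A sketch using only the standard library, with the payments header as an assumed contract:

```python
import csv
import io

# Assumed contract: the agreed header for the payments feed
EXPECTED_HEADER = ["customer_id", "amount", "payment_date"]

def check_header(csv_text: str) -> list:
    """Alert when the CSV header no longer matches the contract."""
    header = next(csv.reader(io.StringIO(csv_text)))
    added = [c for c in header if c not in EXPECTED_HEADER]
    removed = [c for c in EXPECTED_HEADER if c not in header]
    alerts = []
    if added:
        alerts.append(f"new columns: {added}")
    if removed:
        alerts.append(f"removed columns: {removed}")
    return alerts

alerts = check_header('customer_id,total_amount,payment_date\n101,"250 USD",10/01/2024\n')
print(alerts)  # flags total_amount as new and amount as removed
```

In practice a schema registry plays this role more robustly, versioning schemas and enforcing compatibility rules, but the principle is the same: the structure is checked against an agreed contract before data is trusted.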

Conclusion

Data validation protects organizations from one of the oldest risks in technology: trusting broken inputs. Great Expectations helps teams define clear quality rules, while managing schema drift ensures structural consistency as systems evolve.

Reliable pipelines are not built merely by moving data quickly. They are built by ensuring that what moves is correct, complete, and trustworthy. That is the true purpose of validation engineering.
Nagesh Chauhan
Principal Engineer | Java · Spring Boot · Python · Microservices · AI/ML