Data Version Control with DVC and LakeFS

In modern data systems, code is rarely the only asset that changes. Datasets evolve, features are regenerated, labels are corrected, schemas are updated, and machine learning experiments depend heavily on historical data states. While software engineers have long used version control systems such as Git for source code, data teams often struggle to manage changing datasets with the same discipline. This challenge gave rise to data versioning.

Data versioning is the practice of tracking changes to datasets over time so that teams can reproduce results, compare revisions, audit transformations, and safely collaborate. Among the most important tools in this area are DVC and LakeFS. DVC extends Git-style workflows for data science projects, while LakeFS brings Git-like branching and commits to object storage data lakes.
In short, DVC provides Git-style versioning and pipelines for ML and data science projects (files, experiments, models, reproducibility), while LakeFS provides Git-like branching and commits for large object storage data lakes (S3, GCS, Azure Blob). Each can cover part of what the other does, but neither covers it all on its own.

Imagine a machine learning model that performed well three months ago but now produces weaker results. If nobody knows which dataset version was used during training, reproducibility becomes nearly impossible. If a bad batch of data enters a pipeline and overwrites trusted records, rollback becomes difficult. If multiple analysts modify shared data simultaneously, collaboration turns chaotic.

Data versioning solves these problems by creating traceable snapshots of data states. It allows teams to answer practical questions such as which data trained this model, when did values change, who introduced a schema update, and how can we restore a previous version.

This article explains the principles, tools, workflows, and production practices of data versioning using DVC and LakeFS.

What is DVC?

DVC, short for Data Version Control, is a tool designed for machine learning and analytics workflows. It integrates with Git while storing large files outside the Git repository. Instead of placing multi-gigabyte datasets directly in Git, DVC stores lightweight metadata files that point to external storage such as local disks, S3, Azure Blob, or Google Cloud Storage.

This creates a familiar development experience where code remains in Git and data is tracked through DVC references.

DVC Basic Workflow

A common DVC workflow begins with initializing DVC in a project, adding datasets, connecting remote storage, and committing metadata to Git.
git init 
dvc init 
dvc add data/train.csv 
git add data/train.csv.dvc .gitignore 
git commit -m "Track training dataset"
1. dvc init: Initializes DVC in your Git repository. It creates DVC configuration files such as .dvc/ and .dvcignore.

2. dvc add data/train.csv: Tells DVC to track the file data/train.csv.
- DVC creates a metadata file: data/train.csv.dvc. This file stores the file path, checksum/hash, size, and version reference.
- DVC stores the real CSV file in its local cache: .dvc/cache/
- The original CSV is usually added to .gitignore, so Git does not store the large file directly.

3. git add data/train.csv.dvc .gitignore: Tells Git to track:
- data/train.csv.dvc — pointer file
- .gitignore — ignores the raw dataset

4. git commit -m "Track training dataset": Creates a commit so you can later return to this exact dataset version.

Instead of Git storing train.csv (2GB), Git stores train.csv.dvc (a small text file), while DVC stores the real train.csv in the cache or remote storage.
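
For reference, the pointer file that DVC commits to Git is a small YAML document. A minimal sketch of what data/train.csv.dvc might contain (the hash and size values below are purely illustrative, and the exact fields vary slightly between DVC versions):
outs:
- md5: a1b2c3d4e5f67890a1b2c3d4e5f67890
  size: 2147483648
  path: train.csv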

Most teams configure cloud storage as the data remote.
dvc remote add -d storage s3://ml-project-bucket/dvcstore 
dvc push
This uploads tracked data to remote storage while preserving version references.
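
Because dvc remote add writes the remote configuration to .dvc/config, which is committed alongside the code, a teammate can usually reproduce the full project with a clone followed by a pull. A minimal sketch (the repository URL is hypothetical):
git clone https://github.com/example-org/ml-project.git
cd ml-project
dvc pull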

Suppose a model trained on an earlier commit needs to be rerun.
git checkout abc123 
dvc pull
DVC restores the dataset associated with that commit, allowing exact reproduction of historical experiments.

A Git commit stores pointers, not copies of the data (which may live, for example, in S3). The remote stores content-addressed objects keyed by hash, for example s3://bucket/cache/files/ab/c123... and s3://bucket/cache/files/xy/z999..., so storage is organized by file content hashes, not by Git commit IDs.

That means:
- If two commits use identical data, same object reused
- If data changes, new hash/object stored
- No need to duplicate full copies for every commit
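
As a rough illustration of this deduplication, re-adding a modified dataset records a new hash in the pointer file, and the next push uploads only objects that the remote does not already have:
# after editing data/train.csv, re-track it and commit the updated pointer
dvc add data/train.csv
git commit -am "Update training dataset"
# uploads only the new content-addressed object; unchanged objects are reused
dvc push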

DVC also supports pipeline stages, where preprocessing, training, and evaluation steps are tracked.
dvc stage add -n preprocess \
  -d src/preprocess.py \
  -d data/raw.csv \
  -o data/clean.csv \
  python src/preprocess.py
This creates dependency tracking so changed inputs automatically signal which stages need rerunning.
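
The stage definition is written to dvc.yaml, which for the example above looks roughly like this:
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
    - src/preprocess.py
    - data/raw.csv
    outs:
    - data/clean.csv
Running dvc repro then executes only the stages whose dependencies or code have changed, while unchanged stages reuse their cached outputs.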

What is LakeFS?

LakeFS is a version control layer for object storage systems such as Amazon S3, S3-compatible storage, Azure Blob Storage, and Google Cloud Storage. It brings software-engineering style versioning practices to modern data lakes by introducing concepts similar to Git: branches, commits, merges, and rollbacks.

In a traditional Git repository, developers version source code files. LakeFS applies a similar idea to data stored in cloud object storage. Instead of tracking Python scripts or text files inside a code repository, LakeFS tracks datasets, tables, partitions, parquet files, CSV files, and other objects directly at the storage layer.

This makes LakeFS especially valuable for enterprise analytics platforms, lakehouse architectures, machine learning pipelines, and teams that manage large shared datasets. Rather than copying terabytes of data into separate buckets for testing, teams can create lightweight branches and safely experiment before publishing changes to production.

For example, imagine an e-commerce company storing daily sales data in Amazon S3. The analytics team depends on this data for dashboards, finance reports, and forecasting. If a broken ETL job accidentally overwrites production files, many downstream systems may fail. With LakeFS, engineers can first load new data into a temporary branch, validate totals, compare results, and merge only after checks pass.

Similarly, a healthcare company storing patient event logs in Azure Blob Storage may need to apply new cleaning logic to remove duplicate records. Instead of modifying the live data lake directly, engineers can create a branch, run transformations, test outputs, and merge only when the data quality team approves the result.

Because versioning happens at the storage layer, LakeFS works across many tools already used in modern data stacks. Engines such as Apache Spark, Trino, dbt, Airflow, and machine learning pipelines can interact with branched datasets without requiring teams to redesign their workflows.

LakeFS Basic Workflow

LakeFS introduces Git-like workflows for data lakes. A team first creates a LakeFS repository that points to an object storage location such as an S3 bucket. Inside that repository, the main branch typically represents trusted production data.

Suppose a retail company receives millions of transaction records every night. A data engineer wants to test a new ingestion pipeline that fixes timestamp errors and removes duplicates. Instead of writing directly into production storage, the engineer creates a branch called feature-cleanup. This branch behaves like an isolated workspace while still referencing the same underlying storage.

The engineer runs the pipeline on the branch, generating cleaned parquet files and updated partitions. Analysts can then compare row counts, revenue totals, null values, and partition sizes between main and feature-cleanup. If the branch passes validation, the engineer commits the changes and merges them into main.

If problems are discovered—for example, missing rows from one country or duplicated orders—the branch can simply be discarded. Production data remains untouched throughout the testing process.
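
With the lakectl command-line client, that workflow looks roughly like the sketch below (the repository name sales-lake is hypothetical; the branch name matches the example above):
# create an isolated branch from production data
lakectl branch create lakefs://sales-lake/feature-cleanup --source lakefs://sales-lake/main
# ...run the ingestion pipeline against the feature-cleanup branch...
# compare the branch to production and commit the cleaned data
lakectl diff lakefs://sales-lake/main lakefs://sales-lake/feature-cleanup
lakectl commit lakefs://sales-lake/feature-cleanup -m "Fix timestamps and remove duplicate transactions"
# merge into main once validation passes
lakectl merge lakefs://sales-lake/feature-cleanup lakefs://sales-lake/main
# or discard the branch entirely if problems are found
lakectl branch delete lakefs://sales-lake/feature-cleanup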

This is far safer than older workflows where teams duplicated buckets, copied massive folders, or tested directly inside production paths. Copying large datasets wastes time and drives up storage costs, while writing directly to live environments creates operational risk.

LakeFS reduces that friction by making isolated experimentation fast and lightweight. Teams can test schema changes, partition rewrites, deduplication logic, backfills, and new transformations without endangering production data.

As a result, organizations gain faster experimentation, easier collaboration, safer releases, cleaner rollback options, and stronger governance over shared datasets.

DVC vs LakeFS

DVC and LakeFS both bring version-control ideas to data, but they solve different problems and operate at different layers of the modern data stack.

DVC is especially strong for machine learning projects, experiment reproducibility, dataset tracking tied to Git repositories, and local developer workflows. It is commonly used by data scientists and ML engineers who want to version datasets, models, and pipelines alongside source code.

For example, a machine learning engineer building a fraud detection model may keep Python code in Git while using DVC to track training datasets, feature files, and saved models. Each Git commit can be linked to a specific dataset version, allowing the engineer to reproduce past experiments exactly. If model performance drops after a new data refresh, the team can easily return to an earlier dataset version and compare results.

LakeFS, by contrast, is especially strong for enterprise data lakes, shared storage governance, branching production-scale data, and large collaborative environments. It is designed for teams managing large datasets stored in systems such as Amazon S3, Azure Blob Storage, or Google Cloud Storage.

For example, a bank may store years of transaction logs in a central data lake used by analysts, compliance teams, and machine learning pipelines. Before applying a new transformation to billions of rows, engineers can create a LakeFS branch, test the update safely, validate totals, and merge only after approval.

In simple terms, DVC is usually project-centric, while LakeFS is infrastructure-centric. DVC focuses on helping individual ML projects remain reproducible. LakeFS focuses on helping organizations safely manage shared production data at scale.

Many organizations use both tools together. DVC may be used for model development, experiment tracking, and dataset versions within ML repositories, while LakeFS manages the centralized enterprise lake where raw and processed data lives.

For instance, a data science team may train models using curated datasets pulled from a LakeFS-managed data lake, then use DVC to version features, experiments, and trained models inside their machine learning repository. In this way, the two systems complement each other rather than compete.
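
A rough sketch of that combination (the repository, branch, and paths below are hypothetical): the team exports a curated dataset from the lakeFS-managed lake and then versions it with DVC inside the ML repository.
# download a curated file from a lakeFS branch
lakectl fs download lakefs://enterprise-lake/main/curated/transactions.parquet data/transactions.parquet
# track and version the snapshot with DVC in the ML project
dvc add data/transactions.parquet
git add data/transactions.parquet.dvc
git commit -m "Track curated transactions snapshot from the data lake"
dvc push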

Conclusion

Data versioning brings engineering discipline to one of the most valuable assets in modern organizations: data itself. DVC helps data scientists manage datasets and experiments alongside Git workflows, while LakeFS brings branching, commits, and safe collaboration to enterprise data lakes.

When teams can reproduce the past, test changes safely, and recover from mistakes quickly, they move faster with greater confidence. That is the true power of versioned data systems.
Nagesh Chauhan
Principal Engineer | Java · Spring Boot · Python · Microservices · AI/ML

Principal Engineer with 14+ years of experience in designing scalable systems using Java, Spring Boot, and Python. Specialized in microservices architecture, system design, and machine learning.
