MLOps: Data versioning with DVC — Part Ⅰ

Yizhen Zhao
7 min read · Jul 13, 2021


While machine learning and AI are spreading through industry, academia and research, and slowly becoming the mainstream approach to solving complex real-world problems, AI teams still have a difficult time tracking ML experiments and managing training/validation data. In general, AI development teams follow ad-hoc approaches to managing the entire ML lifecycle, including model deployment in production.

It’s not easy to keep track of all the data you used in experiments and the models they produced. Git is used for versioning the code, but it is not suitable for keeping large data or model files. And running predictions on a target dataset in your local environment is not the ultimate goal. If you are having trouble versioning ML artefacts (e.g. models and other data files) during your experiments, this is the right blog for you!

In this article, part Ⅰ, you’ll learn:

  • How to use DVC to manage and keep track of your data files and other ML artefacts.
  • How to use DVC to build ML pipelines so that you can easily reproduce your ML process.

I’ll introduce how to deploy custom models in production with AWS SageMaker batch transform in part Ⅱ.

Data versioning in ML projects

Data Version Control, or DVC, is a data and ML experiment management tool that takes advantage of the existing engineering toolset that you’re already familiar with (Git, CI/CD, etc.).

For version control in ML projects, we need to consider not only the code, but also the data and models. DVC is an easy-to-use tool that works on top of Git. DVC uses a .dvc file to help you version your ML artefacts, while Git is responsible for versioning the code and that .dvc file.

Now you know what DVC is. In the following sections, I’ll show you:

  • How to version data in ML projects
  • How to set up remote storage that keeps your ML artefacts, and how to retrieve ML artefacts from remote storage back into your local project
  • How to keep track of data files when you make changes or add a new dataset
  • How to switch between different versions
  • How to build ML pipelines with DVC

Before starting versioning, you’ll need DVC installed on your system; check the official DVC documentation for installation instructions. Then let’s assume we have a simple example Git repository with the structure below:

example-dvc-repo/
|
|-- data/
|   |-- training.csv
|-- metrics/
|-- model/
|-- src/
|   |-- train.py
  • data: in this example we use a CSV file as our training set
  • model: the trained model will be stored in this folder
  • metrics: other ML artefacts can be stored in this folder
  • src: Python scripts will be stored in this folder (e.g. train.py, validate.py)

The workflow of following sections is shown in Figure 1 below:

Figure 1. Data versioning with DVC

Versioning ML artefacts

DVC uses a so-called *.dvc file, which contains a unique md5 hash, to link the dataset to the project. DVC stores a copy of the data file in the DVC cache, using the first two characters of the hash as the folder name. Then a Git command (e.g. git commit) records this stage.

# Using dvc add to start tracking files and generate .dvc file
$ dvc add data/training.csv
# Using git commit to version the .dvc file
$ git add data/training.csv.dvc data/.gitignore
$ git commit -m "Add training data"

After this step, some new files are generated in your project folder, shown below (only the folders and files relevant to this step are listed):

example-dvc-repo/
|
|-- data/
|   |-- training.csv
|   |-- training.csv.dvc          --> .dvc file for training.csv
|-- .dvc/
|   |-- cache/
|   |   |-- a3/                   --> contains a copy of training.csv
|   |       |-- 04afb96060aad90176268345e10355
|   |-- .gitignore
|   |-- config                    --> keeps info about remote storage (see next section)
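To see how that cache layout works, here is a minimal Python sketch (not DVC’s actual implementation) of how a cache path is derived from a file’s md5 hash: the first two hex characters become the subdirectory, the rest the file name.

```python
import hashlib
from pathlib import Path

def cache_path(data_file: str, cache_dir: str = ".dvc/cache") -> Path:
    """Derive a DVC-style cache path from a file's md5 hash:
    first two hash characters -> directory, the rest -> file name."""
    md5 = hashlib.md5(Path(data_file).read_bytes()).hexdigest()
    return Path(cache_dir) / md5[:2] / md5[2:]

# A tiny stand-in for data/training.csv
Path("training.csv").write_text("id,label\n1,0\n2,1\n")
print(cache_path("training.csv"))  # e.g. .dvc/cache/<2 chars>/<30 chars>
```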

Storing versioned ML artefacts to remote storage

DVC supports various cloud storage backends, which enables us to store datasets or models remotely (e.g. S3, Google Drive). To do this, we first set up the remote storage using DVC, then push the ML artefacts to it. This way, we do not need to keep the large dataset in our Git repository; only the lightweight, human-readable *.dvc file, which contains the link to the real dataset, is kept in Git.

# Setup remote storage using dvc (in this example, we use S3)
$ dvc remote add -d [storage_name] s3://[bucket]/[dvc_storage]
# Upload dataset to remote storage S3
$ dvc push

After executing the commands above, a remote is added to the DVC config, which looks like:

['remote "storage_name"']
    url = s3://[bucket]/[dvc_storage]
[core]
    remote = storage_name

Retrieving ML artefacts

Once DVC-tracked data is stored remotely, it can be downloaded into the Git repository when needed. Since the .dvc file stored in the Git repository contains the hash that uniquely identifies the data, and the remote storage information is stored in the DVC config, DVC knows where to find the data in remote storage and downloads it to the local project.

# Using dvc pull to download data
$ dvc pull

Making changes on dataset

When you make changes to the dataset locally, the dvc add command tracks the latest version of the dataset. It updates the md5 hash inside the .dvc file, so the new version of the dataset is linked to the project. This basically follows the same steps as in Versioning ML artefacts.

# Using dvc add to track the new version of the dataset
$ dvc add data/training.csv
# Push the new version to remote storage
$ dvc push
# Record the new version of the dataset in Git
$ git add data/training.csv.dvc
$ git commit -m "training set updated"
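The reason dvc add notices the change is simply that the file’s content hash changes. A small illustrative sketch:

```python
import hashlib
from pathlib import Path

def file_md5(path: str) -> str:
    """md5 of a file's contents, as DVC uses to identify a version."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

Path("training.csv").write_text("id,label\n1,0\n")
v1 = file_md5("training.csv")

# Append a new row, as if new training data arrived
with open("training.csv", "a") as f:
    f.write("2,1\n")
v2 = file_md5("training.csv")

print(v1 != v2)  # True: the hash changed, so DVC records a new version
```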

Switch between versions

When we want to roll back to a certain version of the dataset, git checkout checks out a commit or a revision of the .dvc file, and then dvc checkout synchronises the data.

# Using git checkout to checkout a commit you want
$ git checkout <...>
# Using dvc checkout to sync data
$ dvc checkout
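Conceptually, dvc checkout just looks up the hash recorded in the checked-out .dvc file and restores the matching content from the cache. A toy sketch, with the cache modeled as a dict (real DVC uses the on-disk cache layout shown earlier):

```python
import hashlib

def md5(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()

cache = {}  # hash -> content, standing in for .dvc/cache

# Two versions of the dataset added over time
v1 = b"id,label\n1,0\n"
v2 = b"id,label\n1,0\n2,1\n"
cache[md5(v1)] = v1
cache[md5(v2)] = v2

# `git checkout` restores an old .dvc file, i.e. an old hash;
# `dvc checkout` then pulls that content back out of the cache.
old_hash = md5(v1)
restored = cache[old_hash]
print(restored == v1)  # True: the old version of the data is back
```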

DVC supports building ML pipelines

So far we have covered how to version the dataset in your ML project. DVC also lets you build ML pipelines: a dvc.yaml file describes each stage of the pipeline and can be written manually or generated by DVC commands. Another file, dvc.lock, is similar to the .dvc file mentioned in Versioning ML artefacts: it contains the md5 hashes of the related ML artefacts. This enables us to easily run and reproduce any stage of your ML pipeline (e.g. the training stage or the validation stage).

First you’ll need DVC installed, your data versioned and a remote storage added. The example workflow for establishing the following ML pipeline is described below.

Figure 2. ML pipelines
# Create a dvc pipeline stage that runs the training stage:
# specify the stage name (-n), inputs (-d), output (-o),
# followed by the command to run
$ dvc run -n train \
      -d src/train.py -d data/training.csv \
      -o model/model.pkl \
      python src/train.py
# Do the same for the validation stage
$ dvc run -n validate \
      -d src/validate.py -d data/validation.csv \
      -M metrics/accuracy.json \
      --plots-no-cache metrics/confusion_matrix.png \
      python src/validate.py

After successfully executing the commands above, a dvc.yaml file is generated containing these two stages. Accordingly, a dvc.lock file is created containing several hashes, each of which uniquely identifies one file.

# dvc.yaml
stages:
  train:
    cmd: python src/train.py
    deps:
    - data/training.csv
    - src/train.py
    outs:
    - model/model.pkl
  validate:
    cmd: python src/validate.py
    deps:
    - data/validation.csv
    - src/validate.py
    metrics:
    - metrics/accuracy.json:
        cache: false
    plots:
    - metrics/confusion_matrix.png:
        cache: false
# dvc.lock
train:
  cmd: python src/train.py
  deps:
  - path: data/training.csv
    md5: a304afb96060aad90176268345e10355
    size: 123407
  - path: src/train.py
    md5: 094631814e0233a8c189b5f5dd214f61
    size: 4445
  outs:
  - path: model/model.pkl
    md5: 1b320cd680be56cbbf71eb649ff29d25
    size: 704750
validate:
  cmd: python src/validate.py
  deps:
  - path: data/validation.csv
    md5: e57e5c9144f8eb84563b692d41c80f46
    size: 50112
  - path: src/validate.py
    md5: cd59a289500b05233548fc48fa0f13dd
    size: 4876
  outs:
  - path: metrics/confusion_matrix.png
    md5: 8149ea4e39ea587930d9d16c75d80d8f
    size: 15926
  - path: metrics/accuracy.json
    md5: 0be41efa33b74e9cc33f1e3cf12d4f69
    size: 126

Now you have two stages in your ML pipeline. You can use git add and git commit to let Git version this state. In the future, if you change anything in your project (e.g. the data), first follow Versioning ML artefacts above to let DVC track the new version of the data; then you can easily rerun the training or validation stage with dvc repro. If you want to reproduce the state at a point in time, follow Switch between versions above to sync your local project with that specific state, change anything you like, and finally use dvc repro STAGE_NAME to quickly rerun certain stages of your own ML pipeline.
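At the heart of dvc repro is change detection: hash each stage’s dependencies, compare them with what dvc.lock recorded, and rerun only the stages whose dependencies changed. A simplified sketch of that idea (not DVC’s actual code):

```python
import hashlib
from pathlib import Path

def file_md5(path: str) -> str:
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def stage_is_stale(deps: list[str], locked: dict[str, str]) -> bool:
    """Rerun the stage if any dependency's current hash differs
    from the hash recorded in dvc.lock at the last run."""
    return any(file_md5(d) != locked.get(d) for d in deps)

# Toy project: one dependency, hash recorded at the last run
Path("training.csv").write_text("id,label\n1,0\n")
locked = {"training.csv": file_md5("training.csv")}

print(stage_is_stale(["training.csv"], locked))  # False: nothing changed

Path("training.csv").write_text("id,label\n1,0\n2,1\n")
print(stage_is_stale(["training.csv"], locked))  # True: dep changed, rerun
```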

