How to Version Machine Learning Experiments Instead of Tracking Them

Written by mariakhalusova | Published 2021/12/09
Tech Story Tags: machine-learning | experiment-tracking | version-control | data-science | mlops | open-source | artificial-intelligence | ai

TL;DR: Treating experiments as code, by versioning code, data, pipelines, metrics, and parameters, lets you not only track your experiments but also reproduce them whenever needed. On top of that, it leads to cleaner training code and less clutter when comparing experiments.

Hi! Let’s talk about versioning machine learning experiments and how it differs from tracking them. You may have heard of experiment tracking tools such as Weights & Biases, MLflow, or Neptune. They offer an API you can call from your code to log experiment information, which is then stored in a database, and a dashboard where you can visually compare your numerous experiments. This is a massive improvement over manually logging machine learning experiments in a spreadsheet.

However, if you want to reproduce or restore a particular experiment, you still need to retrace your steps and stitch together the exact code, data, and pipeline that produced it. Wouldn’t it be better to automate this? We actually can, if we move from tracking experiments to versioning them! This is possible with a feature recently added to DVC, an open-source tool best known for data versioning. Let’s see how it works.

Setup

To start versioning experiments with DVC, initialize it from any Git repo with dvc exp init -i. Make sure your project is structured in the following way:
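(A sketch of the default layout that dvc exp init looks for; the exact directory and file names can be customized during interactive initialization, and the ones below are just common defaults.)

    .
    ├── data/            # raw and prepared data, tracked by DVC
    ├── src/
    │   └── train.py     # training code
    ├── models/          # trained model artifacts, tracked by DVC
    ├── params.yaml      # hyperparameters the training script reads
    └── metrics.json     # metrics the training script writes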

This not only helps to organize the project in a logical way, but also helps DVC to version all the moving parts that make up an experiment.

Next, instead of logging experiment metadata and metrics with a tracking tool, we’ll need to store them in files, so make sure that your training script reads the parameters from params.yaml, and writes the resulting metrics into a metrics.json.
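
For illustration only (the parameter names and values below are made up), params.yaml could be as simple as:

    train:
      epochs: 10
      learning_rate: 0.001

and the training script could finish by dumping something like this into metrics.json:

    {"accuracy": 0.92, "loss": 0.31}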

You may notice that, during initialization, DVC automatically creates a basic pipeline and stores it in a dvc.yaml file. With this setup, your training code, pipeline, parameters, and metrics now live in files that can be versioned.
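
As a rough sketch, assuming the hypothetical layout and parameter names above (the real file depends on the answers you give during dvc exp init -i), the generated dvc.yaml could look something like this:

    stages:
      train:
        cmd: python src/train.py
        deps:
          - data
          - src
        params:
          - train.epochs
          - train.learning_rate
        outs:
          - models
        metrics:
          - metrics.json:
              cache: false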

Once you set up your project like this, you can already point out a few benefits:

  • Your training code has fewer dependencies and doesn’t need to call an external API to log experiment information.
  • Your training code can use any programming language you like. Python, R, Julia, Java even? Pick your favorite!
  • Your project is well organized.
  • You can version control parameters, metrics, the pipeline, code, and data (DVC can handle large data that won’t fit in Git).
  • The pipeline in dvc.yaml captures everything needed to reproduce an experiment, and it can be versioned too.

In addition to these, there are some perks that may be less obvious at first glance.

Large Data and Model Versioning

Large data and models aren’t easily tracked in Git, but with DVC you can keep them in your own storage while still making them Git-compatible. On initialization, DVC starts tracking the models folder: Git ignores it, but DVC stores and versions it, so you can back up every version anywhere you like and check it out alongside your experiment code.
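
For example, you can point DVC at storage you already have (the remote name and URL below are placeholders) and push the versioned data and models there:

    # configure your own storage as the default DVC remote (S3, GCS, SSH, a shared drive, ...)
    dvc remote add -d mystorage s3://my-bucket/dvc-storage
    # upload DVC-tracked data and models
    dvc push
    # later, or on another machine: check out the code, then restore the matching data and models
    dvc pull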

Less noise when comparing experiments

In a dashboard, you can see all of your experiments, and I mean ALL of them. It can be overwhelming to go through them even with filtering and sorting options. With version control you have more flexibility and control over what you share and how you organize things.

You can keep your experiments in branches and choose not to push the ones with uninspiring results. That already reduces some of the clutter.

But DVC can take it one step further. If you were to create a branch for each and every experiment, especially when doing hyperparameter tuning, this would result in way too many branches to fit any Git branching workflow. To help with that, DVC tracks your experiments so you don’t need to create commits or branches for each one:
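(A rough sketch: dvc exp run executes the pipeline as a lightweight experiment, --set-param overrides a value from params.yaml on the fly, and dvc exp show lists the runs side by side. The parameter name is the hypothetical one from the examples above.)

    # try out a couple of values without committing anything
    dvc exp run --set-param train.learning_rate=0.01
    dvc exp run --set-param train.learning_rate=0.001
    # compare parameters and metrics across all experiments
    dvc exp show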

Once you’ve decided which of these experiments are worth sharing with the team, you can convert them into actual Git branches:
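(A sketch; the experiment and branch names below are placeholders. dvc exp branch turns a tracked experiment into a regular Git branch, which you can then push and share like any other.)

    dvc exp branch exp-1ab2c share-this-experiment
    git push origin share-this-experiment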

Single-command reproducibility

Finally, codifying the entire experiment pipeline is a good first step towards reproducibility, but it still leaves it to the user to execute that pipeline. With DVC you can reproduce the entire experiment with a single command. Not only that, it will also check for cached inputs and outputs and skip recomputing data that has been generated before, which can be a massive time saver.
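
To make that concrete: in DVC, that single command is dvc exp run (or dvc repro, which reproduces the pipeline exactly as committed); a minimal sketch:

    # run the whole experiment; stages whose inputs haven't changed are restored from cache
    dvc exp run
    # or reproduce the committed pipeline as-is
    dvc repro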


Conclusion

To summarize, experiment versioning lets you codify your experiments in such a way that your experiment logs are always connected to the exact data, code, and pipeline that went into them. You have control over which experiments end up shared with your colleagues for comparison, which prevents clutter. Finally, reproducing a versioned experiment becomes as easy as running a single command, and it can even take less time than the original run if some of the pipeline steps have cached outputs that are still relevant.

Thanks for reading! And if you want to learn more about versioning machine learning experiments, check out DVC docs.




