Why is DevOps for Machine Learning so Different?

Written by ryandawsonuk | Published 2019/11/11
Tech Story Tags: ai | mlops | aiops | devops | devops-adoption-challenges | devops-for-machine-learning | ml | machine-learning


The term ‘MLOps’ is appearing more and more. Many from a traditional DevOps background might wonder why this isn’t just called ‘DevOps’. In this article we’ll explain why MLOps is so different from mainstream DevOps and see why it poses new challenges for the industry.
The key differences between DevOps and MLOps come from how machine learning uses data: the need to handle data volume, transformation and quality affects the whole MLOps lifecycle.

Current State of DevOps vs MLOps

DevOps is a well-established set of practices to ensure smooth build-deploy-monitor cycles. It is based around CI/CD and infrastructure, with a tool space that includes git, Jenkins, Jira, docker, kubernetes and so on.
MLOps has not achieved the same level of maturity. As much as 87% of machine learning projects never go live.
ML infrastructure is complex and workflows extend beyond production of artifacts to include data collection, prep and validation. The types of hardware resources involved can be specialised (e.g. GPUs) and require management. The data flowing through the model and the quality of predictions can also require monitoring, resulting in a complex MLOps landscape.

Why So Different?

The driver behind all these differences can be found in what machine learning is and how it is practised. Software performs actions in response to inputs, and in this respect ML and mainstream programming are alike. But the way actions are codified differs greatly.
Traditional software codifies actions as explicit rules. The simplest programming examples tend to be ‘hello world’ programs that simply codify that a program should output ‘hello world’. Control structures can then be added for more complex ways to perform actions in response to inputs; as we add more control structures, we learn more of the programming language. This rule-based input-output pattern is easy to understand in relation to older terminal systems, where inputs are all via the keyboard and outputs are almost all text. But it is also true of most of the software we interact with, though the types of inputs and outputs can be very diverse and complex.
ML does not codify rules explicitly. Instead, rules are set indirectly by capturing patterns from data. This makes ML suited to a more focused type of problem that can be treated numerically. For example, predicting salary from data points/features such as experience, education, location etc. This is a regression problem, where the aim is to predict the value of a variable (salary) from the values of other variables by using previous data. Machine learning is also used for classification problems, where instead of predicting a value for a variable, the model outputs a probability that a data point falls into a particular class. Example classification problems are:
  • Given hand-written samples of numbers, predict which digit each sample represents.
  • Classify images of objects by category, e.g. types of flowers.
We don’t need to understand all the details here of how ML is done. However, it will help to have a picture of how ML models are trained. So let’s consider at a high level what is involved in a regression problem such as predicting salary from data for experience, education, location etc. This can be addressed by programmatically drawing a line of best fit through the data points.
The line is embodied in an equation of the form y = w1*x1 + w2*x2 + … + wn*xn + b, where the x's are the input features (experience, education, location etc.) and the w's are coefficients/weights.
The coefficients/weights get set to initial values (e.g. at random). The equation can then be used on the training data set to make predictions. In the first run the predictions are likely to be poor. Exactly how poor can be measured by the error, for example the sum of the distances of all the output variable (e.g. salary) samples from the prediction line. We can then update the weights to try to reduce the error and repeat the process of making new predictions and updating the weights. This process is called ‘fitting’ or ‘training’ and the end result is a set of weights that can be used to make predictions.
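To make this concrete, below is a minimal sketch in Python of that fit/predict/update loop for a linear model, using gradient descent on made-up salary-style data. The feature names, numbers and learning rate are purely illustrative.

# A minimal sketch of the fit/predict/update loop described above: plain
# numpy gradient descent on invented 'salary' data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))                          # e.g. experience, education, location (encoded)
y = X @ np.array([40.0, 10.0, 5.0]) + 20.0 + rng.normal(0, 1.0, 100)   # noisy 'salary'

w = np.zeros(3)                                   # weights start at arbitrary values
b = 0.0
learning_rate = 0.1

for step in range(2000):
    predictions = X @ w + b                       # predict with the current weights
    error = predictions - y
    grad_w = 2 * X.T @ error / len(y)             # gradients of the mean squared error
    grad_b = 2 * error.mean()
    w -= learning_rate * grad_w                   # update weights to reduce the error
    b -= learning_rate * grad_b

print("fitted weights:", w, "fitted intercept:", b)

In practice a data scientist would more likely reach for a library fit routine (e.g. scikit-learn's LinearRegression), but the loop above is essentially what ‘training’ means.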
So the basic picture centres on running training iterations to update weights and progressively improve predictions. This helps to reveal how ML is different from traditional programming. The key points to take away from a DevOps perspective are:
  1. The training data and the code together drive fitting.
  2. The closest thing to an executable is a trained/weighted model. These vary by ML toolkit (tensorflow, scikit-learn, R, h2o, etc.) and model type.
  3. Retraining can be necessary. For example, if your model is making predictions for data that varies a lot by season, such as predicting how many items of each type of clothing will sell in a month. In that case training on data from summer may give good predictions in summer but will not give good predictions in winter.
  4. Data volumes can be large and training can take a long time.
  5. The data scientist’s working process is exploratory and visualisations can be an important part of it.
This leads to different workflows for traditional programming and ML development.

Workflows

Consider a traditional programming workflow:
  1. User Story
  2. Write code
  3. Submit PR
  4. Tests run automatically
  5. Review and merge
  6. New version builds
  7. Built executable deployed to environment
  8. Further tests
  9. Promote to next environment
  10. More tests etc.
  11. PROD
  12. Monitor - stacktraces or error codes
The trigger for a build is a code change in git. The packaging for an executable is normally docker.
With machine learning the driver for a build might be a code change. Or it might be new data. The data likely won’t be in git due to its size.
Tests on ML are not likely to be a simple pass/fail since you’re looking for quantifiable performance. One might choose to express performance numerically with an error level. The level can vary a lot by business context. For example, consider a model that predicts a likelihood of a financial transaction being fraudulent. Then there may be little risk in predicting good transactions as fraudulent so long as the customer is not impacted directly (there may be a manual follow-up). But predicting bad transactions as good could be very high risk.
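As a sketch of what such a quantified check might look like in a pipeline, the snippet below asserts that a hypothetical fraud model stays within separate error budgets for missed fraud and false alarms. The function name and thresholds are invented for illustration, not recommendations.

# A sketch of a quantified quality gate for a hypothetical fraud model.
# The rates would come from evaluating the model on a labelled test set.
def check_model_quality(false_negative_rate, false_positive_rate):
    # Missing real fraud is high risk, so it gets the tighter budget.
    assert false_negative_rate < 0.02, "too much fraud slipping through"
    # Flagging good transactions is lower risk (manual follow-up), so more slack.
    assert false_positive_rate < 0.10, "too many good transactions flagged"

# In a CI job this would run against a held-out test set and fail the build
# if either threshold is breached.
check_model_quality(false_negative_rate=0.015, false_positive_rate=0.06)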
The ML workflow can also differ depending on whether the model can learn while it is being used (online learning) or if the training takes place separately from making live predictions (offline learning). For simplicity let’s assume the training takes place separately.
A high-level MLOps workflow could look like:
  1. Data inputs and outputs. Preprocessed. Large.
  2. Data scientist tries stuff locally with a slice of data.
  3. Data scientist tries with more data as long-running experiments.
  4. Collaboration - often in jupyter notebooks & git
  5. Model may be pickled/serialized
  6. Integrate into a running app e.g. add REST API (serving)
  7. Integration test with app.
  8. Rollout and monitor performance metrics
The monitoring for performance metrics part can be particularly challenging and may involve business decisions. Let’s say we have a model being used in an online store and we’ve produced a new version. In these cases it is common to check the performance of the new version by performing an A/B test. This means that a percentage of live traffic is given to the existing model (A) and a percentage to the new model (B). Let’s say that over the period of the A/B test we find that B leads to more conversions/purchases. But what if it also correlates with more negative reviews or more users leaving the site entirely or is just slower to respond to requests? A business decision may be needed.
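For the conversion side of such a comparison, the statistics can be straightforward. Below is a sketch that checks whether the difference in conversion rate between A and B is likely to be more than chance, using a chi-squared test; the counts are invented.

# A sketch of comparing conversion counts from an A/B test with a
# chi-squared test. The numbers are invented for illustration.
from scipy.stats import chi2_contingency

conversions_a, visits_a = 480, 10000   # existing model (A)
conversions_b, visits_b = 540, 10000   # new model (B)

table = [
    [conversions_a, visits_a - conversions_a],
    [conversions_b, visits_b - conversions_b],
]
chi2, p_value, _, _ = chi2_contingency(table)
print(f"conversion A: {conversions_a / visits_a:.2%}, "
      f"B: {conversions_b / visits_b:.2%}, p-value: {p_value:.3f}")

A low p-value suggests the difference is unlikely to be chance, but as noted above the business decision may still hinge on other metrics such as reviews, bounce rate or latency.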
The role of MLOps is to support the whole flow of training, serving, rollout and monitoring. Let's better understand the differences from mainstream DevOps by looking at some MLOps practices and tools for each stage of this flow.

Training

Initially training jobs can be run on the data scientist’s local machine. But as the size of the dataset or the processing grows, a tool is needed that can leverage specialised cloud hardware, parallelise steps and allow long-running jobs to run unattended. One tool for this is Kubeflow Pipelines.
Steps can be broken out as reusable operations and run in parallel. This helps address needs for steps to split the data into segments and apply cleaning and pre-processing on the data. The UI allows for inspecting the progress of steps. Runs can be given different parameters and executed in parallel. This allows data scientists to experiment with different parameters and see which result in a better model. Similar functionality is provided by MLFlow experiments, polyaxon and others.
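As a rough illustration, here is what a small two-step pipeline might look like with the Kubeflow Pipelines SDK (v1-style API). The container images and arguments are placeholders; a real pipeline would point at your own preprocessing and training images.

# A rough sketch of a two-step Kubeflow pipeline (v1-style SDK).
# The container images and arguments below are placeholders.
import kfp
from kfp import dsl

@dsl.pipeline(name="train-pipeline", description="preprocess then train")
def train_pipeline(data_path: str = "gs://my-bucket/raw-data"):
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="example.com/preprocess:latest",
        arguments=["--input", data_path, "--output", "/out/clean.csv"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="example.com/train:latest",
        arguments=["--data", "/out/clean.csv"],
    )
    train.after(preprocess)    # training runs once preprocessing has finished

# Compile to a workflow definition that can be uploaded and run from the UI.
kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")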
Many training platforms can be hooked up with Continuous Integration. For example, a training run could be triggered on a commit to git and the model could be pushed to make live predictions. But deciding whether a model is good for live use can involve a mixture of factors. It might be that the main factors can be tested adequately at the training stage (e.g. model accuracy on test data). Or it might be that only initial checks are done at the training stage and the new version is only cautiously rolled out for live predictions. We’ll look at rollout and monitoring later - first we should understand what live predictions can mean.

Live Predictions and Model Serving

For some models the predictions are made on a whole file of data points, perhaps a new file each week. This kind of scenario is called offline prediction. In other cases predictions need to be made on demand. For these live use-cases the model is typically made available to respond to HTTP requests. This is called real-time serving.
One approach to serving is to package a model by serializing it as a python pickle file and hosting that file for the serving solution to load. For example, this is a serving manifest for kubernetes using the Seldon serving solution (a tool on which I work):
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: sklearn
spec:
  name: iris
  predictors:
  - graph:
      children: []
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
      name: classifier
    name: default
    replicas: 1
The ‘SeldonDeployment’ is a kubernetes custom resource. Within that resource we specify which toolkit was used to build the model (here scikit-learn) and where to obtain the model (in this case a Google Cloud Storage bucket).
Some serving solutions also cater for the model to be baked into a docker image, but python pickles are common as a convenient option for data scientists. Submitting the serving resource to kubernetes will make an HTTP endpoint available that can be called to get predictions. Often the serving solution will automatically apply any routing/gateway configuration needed, so that data scientists don’t have to do so manually.
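As a sketch, calling such an endpoint might look like the following. The host, namespace and exact path depend on your cluster, Seldon version and ingress setup, so treat the URL as a placeholder.

# A sketch of calling the deployed iris model over REST.
# The URL is a placeholder - the real one depends on your ingress setup.
import requests

url = "http://<ingress-host>/seldon/<namespace>/sklearn/api/v1.0/predictions"
payload = {"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}}   # one iris sample

response = requests.post(url, json=payload)
print(response.json())    # class probabilities for the sample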

Rollout

Self-service for data scientists can also be important for rollout. Rollout needs careful handling because the model has been trained on a particular slice of data, and the live data might turn out to differ from it. The key strategies used to reduce this risk are:
1) Canary rollouts
With a canary rollout a percentage of the live traffic is routed to the new model while most of the traffic goes to the existing version. This is run for a short period of time as a check before switching all traffic to the new model.
2) A/B Test
With an A/B test the traffic is split between two versions of a model for a longer period of time. The test may run until a sufficient sample size is obtained to compare metrics for the two models. For some serving solutions (e.g. Seldon, KFServing) the traffic-splitting part of this can be handled by setting percentage values in the serving resource/descriptor. Again, this is to enable data scientists to set this without getting into the details of traffic-routing or having to make a request to DevOps.
3) Shadowing
With shadowing all traffic is sent to both the existing and new versions of the model. Only the existing/live version’s predictions are returned as responses to live requests. The new model’s predictions are not returned; instead they are just tracked to see how well it is performing.
Deciding between different versions of a model naturally requires monitoring.

Monitoring

With mainstream web apps it is common to monitor requests to pick up on any HTTP error codes or an increase in latency. With machine learning the monitoring may need to go much deeper into domain-specific metrics. For example, for a model making recommendations on a website it can be important to track metrics such as how often a customer makes a purchase versus choosing not to, or moves to another page versus leaving the site entirely.
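As a sketch, such domain metrics can be exposed with the Prometheus Python client so they can be scraped and graphed alongside standard request metrics. The metric names below are made up for illustration.

# A minimal sketch of exposing domain metrics for a recommender with the
# Prometheus Python client. Metric names are made up for illustration.
from prometheus_client import Counter, start_http_server

recommendations_served = Counter(
    "recommendations_served_total", "Recommendations shown to customers")
purchases_after_recommendation = Counter(
    "purchases_after_recommendation_total", "Purchases following a recommendation")

start_http_server(8000)   # metrics exposed for scraping on :8000/metrics

def record_recommendation(purchased: bool):
    recommendations_served.inc()
    if purchased:
        purchases_after_recommendation.inc()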
It can also be important to monitor the data points in the requests to see whether they are approximately in line with the data that the model was trained on. If a particular data point is radically different from any in the training set then the quality of the prediction for that data point could be poor. Such a data point is termed an ‘outlier’, and in cases where poor predictions carry high risk it can be valuable to monitor for outliers.
If a large number of data points differ radically from the training data then the model risks giving poor predictions across the board - this is termed ‘concept drift’.
Monitoring for these can be an advanced undertaking, as the boundaries for what counts as an outlier or drift may take some experimentation to decide upon.
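As a naive sketch of the idea, one could compare each incoming data point against per-feature statistics from the training set and flag anything too far away. Real outlier and drift detection is usually more sophisticated than a simple z-score threshold, but the principle is the same: compare live data to what the model was trained on.

# A naive sketch of flagging outliers by comparing incoming data points to
# per-feature statistics computed from the training set.
import numpy as np

train_mean = np.array([5.8, 3.0, 3.8, 1.2])   # computed from the training data
train_std = np.array([0.8, 0.4, 1.8, 0.8])

def looks_like_outlier(x, threshold=4.0):
    z_scores = np.abs((np.asarray(x) - train_mean) / train_std)
    return bool((z_scores > threshold).any())

print(looks_like_outlier([5.1, 3.5, 1.4, 0.2]))    # False: close to the training data
print(looks_like_outlier([50.0, 3.5, 1.4, 0.2]))   # True: far outside anything seen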
For metrics that can be monitored in real-time it may be sufficient to expose dashboards with a tool such as grafana. However, sometimes the information that reveals whether a prediction was good or not is only available much later. For example, there may be a customer account opening process that flags a customer as risky. This could lead to a human investigation and only later will it be decided whether the customer was risky or not. For this reason it can be important to log the entire request and the prediction and also store the final decision. Then offline analysis run over a longer period can provide a wider view of how well the model is performing. 
Support for custom metrics, request logging and advanced monitoring varies across serving solutions. In some cases a serving solution comes with out-of-the-box integrations (e.g. Seldon) and in other cases the necessary infrastructure may have to be set up and configured separately.

Governance

If something goes wrong with running software then we need to be able to recreate the circumstances of the failure. With mainstream applications this means tracking which code version was running (docker image), which code commit produced it and the state of the system at the time. That enables a developer to recreate that execution path in the source code. This is reproducibility.
Achieving reproducibility for machine learning involves much more. It involves knowing what data was sent in (full request logging), which version of the model was running (likely a python pickle), what source code was used to build it, what parameters were set on the training run and what data was used for training. The data part can be particularly challenging as this means retaining the data from every training run that goes to live and in a form that can be used to recreate models. So any transformations on the training data would need to be tracked and reproducible.
The tool scene for tracking across the ML lifecycle is currently dynamic. There are tools such as ModelDB, kubeflow metadata, pachyderm and Data Version Control (DVC), among others. As yet few standards have emerged as to what to track and how to track it. Typically platforms currently just integrate to a particular chosen tool or leave it to the users of the platform to build any tracking they need into their own code.
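As one illustration of the kind of tracking involved, the sketch below records the ingredients of a training run with MLflow tracking (mentioned earlier). Exactly what to record is a team decision; the parameter names and values here are invented.

# A sketch of recording the ingredients of a training run with MLflow
# so that the run can later be reproduced. Names and values are invented.
import mlflow

with mlflow.start_run():
    mlflow.log_param("training_data_uri", "s3://datasets/salaries/2019-11-01/")
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_param("git_commit", "abc1234")
    mlflow.log_metric("test_error", 0.042)
    # assuming the training step wrote the pickled model to model.pkl
    mlflow.log_artifact("model.pkl")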
There are also wider governance challenges for ML concerning bias and ethics. Without care, models might end up being trained using data points that a human would consider unethical to use in decision-making. For instance, a loan approval system might be trained on historic loan repayment data. Without a conscious decision about which data points are to be used, it might end up making decisions based on race or gender.
Given concerns about bias, some organisations are putting an emphasis on being able to explain why a model made the prediction it did in a given circumstance. This goes beyond reproducibility, as being able to explain why a prediction was made can be a data science problem in itself (‘explainability’). Some types of models such as neural networks are referred to as ‘black box’ because it is not easy to see why a prediction came about by inspecting their internal structure. Black-box explanation techniques are emerging (such as Seldon's Alibi library), but for now many organisations for whom explainability is a key concern are sticking to white-box modelling techniques.

Summary

MLOps is an emerging area. MLOps practices are distinct from mainstream DevOps because the ML development lifecycle and artifacts are different. Machine learning works by using patterns from training data - this makes the whole MLOps workflow sensitive to data changes, volumes and quality.
There are a wide range of MLOps tools available but most are young and compared with mainstream DevOps the tools may not yet interoperate very well. There are some initiatives towards standardisation but currently the landscape is quite splintered with big commercial players (including major cloud providers) each focusing primarily on their own end-to-end ML platform offering. Large organisations are having to choose whether an end-to-end offering meets their machine learning platform needs or if they instead want to assemble a platform themselves from individual (likely open source) tools.

Written by ryandawsonuk | Principal Data Consultant at ThoughtWorks. Hackernoon Contributor of the Year - Engineering.