Model Calibration in Machine Learning: An Important but Inconspicuous Concept

Written by sanjaykn170396 | Published 2023/01/28
Tech Story Tags: artificial-intelligence | machine-learning | data-science | statistics | binary-classification | evaluation-metrics | predictive-modeling | hackernoon-top-story

TLDR: Calibration is one of the most important concepts in machine learning. It tells us how much we can trust a model's predictions, especially in classification models. A good grasp of calibration is necessary for the meaningful interpretation of the numerical outputs of machine learning classifiers. In this article, we'll discuss the theory behind machine learning model calibration and its importance through some simple real-life examples.

Table of Contents

  • Introduction
  • The concept behind model calibration
  • Some real-world applications of model calibration
  • Conclusion
  • References

Introduction

Although calibration is one of the most important concepts in machine learning, it is not talked about enough among beginners in the AI/ML space. Calibration tells us how much we can trust a model's predictions, especially in classification models. A good grasp of calibration is necessary for the meaningful interpretation of the numerical outputs of machine learning classifiers. In this article, we'll discuss the theory behind machine learning model calibration and its importance through some simple real-life examples.

The concept behind model calibration

A machine learning model is calibrated if it produces calibrated probabilities. More specifically, probabilities are calibrated if, whenever the model predicts a class with confidence p, that prediction is correct 100*p percent of the time.

Looks complicated?

Let us understand through a simple example:

Let us consider that we need to build a machine-learning model to predict whether or not it will rain on a particular day. Since there are only 2 possible outcomes - "Rain" and "No Rain" - we can treat this as a binary classification problem.

Here, "Rain" is the positive class, represented as 1, and "No Rain" is the negative class, represented as 0.

If the model's prediction for a particular day is 1, we can take it to mean that the day is expected to be rainy.

Similarly, if the model's prediction for a particular day is 0, we can take it to mean that the day is not expected to be rainy.

In practice, machine learning models usually represent the prediction as a numerical value expressing a probability.

So we will not always get a value of exactly 0 or 1. Usually, if the predicted value is greater than or equal to 0.5, it is considered 1, and if the predicted value is less than 0.5, it is considered 0.

For example, if the model’s prediction for a particular day is 0.66 then we can consider it as 1. Similarly, if the model’s prediction for a particular day is 0.24 then we can consider it as 0.
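
As a quick sketch in Python, here is a minimal illustration of the 0.5 thresholding rule, using the example values above plus 0.50 to show the boundary case:

    import numpy as np

    # The thresholding rule: probabilities >= 0.5 become class 1 ("Rain"),
    # everything below 0.5 becomes class 0 ("No Rain").
    probs = np.array([0.66, 0.24, 0.81, 0.50])
    preds = (probs >= 0.5).astype(int)
    print(preds)  # [1 0 1 1]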

Let's assume that our model predicted the outcomes for the upcoming 10 days like this:

We can see that if the probability value is greater than or equal to 0.5, the prediction is "Rain".

Similarly, if the probability value is less than 0.5, the prediction is "No Rain".

Now, the statistical question is -

"Are the probability values the real likelihoods of the outcome?"

In other words, if the model outputs a probability value of 0.8, does that mean there is an 80% chance that the day will be rainy?

And if it outputs a probability value of 0.2, does that mean there is a 20% chance that the day will be rainy?

Statistically, if I claim that my model is calibrated, then the answer should be "Yes".

The probability values should not be mere threshold values for deciding the output class. Instead, they should represent the real likelihood of the outcome.

Here, Day 1 has a probability value of 0.81, but Day 10 has a probability value of only 0.76. This means that although there is a chance of rain on both days, Day 1 has a 5-percentage-point greater chance of rain than Day 10. This is the strength of a probabilistic forecast: given a calibrated model, a good statistician can infer many such patterns from a large number of outcomes.

Let's see how statisticians interpret model calibration graphically.

Consider a graph with the values from 0 to 1 on the X-axis, divided into equal buckets: 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, and 0.8-1.0-

Now, in each bucket, plot the data points according to their probability values.

For example,

In the 0.6-0.8 bucket, we have 4 data points - Day 4, Day 8, Day 9, and Day 10.

Similarly, we can follow the same procedure for all other buckets-

Until now, we have plotted only predicted values.

Since our positive class is "Rain", let us mark the values in each bucket whose actual outcome is "Rain".

Now, find the fraction of data points in each bucket whose actual outcome is "Rain":
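
Here is a minimal sketch of this bucketing step in Python. Day 1 (0.81) and Day 10 (0.76) match the article's examples; the remaining probabilities and actual outcomes are hypothetical values invented for illustration:

    import numpy as np

    # Hypothetical 10-day data (Day 1 and Day 10 match the article).
    probs = np.array([0.81, 0.45, 0.15, 0.65, 0.52, 0.30, 0.55, 0.72, 0.70, 0.76])
    actual = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 1])  # 1 = "Rain", 0 = "No Rain"

    edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = probs <= hi if hi == 1.0 else probs < hi  # last bucket includes 1.0
        in_bucket = (probs >= lo) & upper
        if in_bucket.any():
            print(f"Bucket {lo:.1f}-{hi:.1f}: {in_bucket.sum()} days, "
                  f"fraction of 'Rain' = {actual[in_bucket].mean():.2f}")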

Once this stage is reached, just plot these fractional values as a line along the Y-axis-

The line is not a proper straight (diagonal) line. This means that our model is not well-calibrated. A well-calibrated model's chart would have looked like this-
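
For readers who want to draw this kind of chart themselves, here is a minimal matplotlib sketch. The per-bucket "Rain" fractions continue the hypothetical data from the earlier sketch:

    import numpy as np
    import matplotlib.pyplot as plt

    # Bucket midpoints and per-bucket fractions of actual "Rain" days
    # (hypothetical values from the earlier 10-day sketch).
    midpoints = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
    rain_fraction = np.array([0.0, 0.0, 0.33, 0.75, 1.0])

    plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
    plt.plot(midpoints, rain_fraction, "o-", label="Our model")
    plt.xlabel("Predicted probability of 'Rain'")
    plt.ylabel("Observed fraction of 'Rain'")
    plt.legend()
    plt.show()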

Ideally, for the 3rd bucket (0.4-0.6), a well-calibrated model would see "Rain" actually occur on around 40%-60% of the days. However, for our model, only about 30% of the outcomes in this bucket are actually "Rain". This is a significant deviation, and similar deviations can be seen in the other buckets as well.

Some statisticians use the area between the perfectly calibrated curve and the model's curve to evaluate the model's calibration. The smaller the area, the better the calibration, since the model's curve then lies closer to the perfectly calibrated one.
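
scikit-learn can compute the per-bucket fractions directly via calibration_curve. As a rough stand-in for the "area" idea, the sketch below uses the mean absolute gap between the two curves (a simplified relative of the expected calibration error); the data is the same hypothetical 10-day set as before:

    import numpy as np
    from sklearn.calibration import calibration_curve

    probs = np.array([0.81, 0.45, 0.15, 0.65, 0.52, 0.30, 0.55, 0.72, 0.70, 0.76])
    actual = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 1])

    # prob_true: observed fraction of positives per bucket,
    # prob_pred: mean predicted probability per bucket.
    prob_true, prob_pred = calibration_curve(actual, probs, n_bins=5)

    # A perfectly calibrated model satisfies prob_true == prob_pred,
    # so the mean absolute gap is a simple summary of miscalibration.
    print("Per-bucket gaps:", np.abs(prob_true - prob_pred))
    print("Mean calibration gap:", np.mean(np.abs(prob_true - prob_pred)))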

Some real-world applications of model calibration in machine learning

There are many real-world scenarios in which the end users of ML applications depend on model calibration for effective and insightful decision-making, such as-

  1. Let's consider that we are building a ranking-based model for an e-commerce platform. If the model is well-calibrated, then its probability values can be trusted for recommendation purposes. For example, if the model says there is an 80% chance that the user likes Product A and a 65% chance that the user likes Product B, we can recommend Product A to the user as the first preference and Product B as the second preference (see the sketch after this list).

  2. In the case of clinical trials, consider that some doctors are evaluating drugs. Suppose the model predicts that 2 drugs are very effective for the treatment - Drug A and Drug B. The doctors must choose the best available option from the list, since a trial dealing with human life leaves no room for risk. If the model gives a probability value of 95% for Drug A and 90% for Drug B, the doctors will obviously go ahead with Drug A, provided those probabilities can be trusted, which is what calibration provides.
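
A minimal sketch of the first scenario (the product names and probabilities are hypothetical): with a calibrated model, sorting items by predicted probability yields a trustworthy recommendation order.

    # Hypothetical calibrated like-probabilities per product.
    products = {"Product A": 0.80, "Product B": 0.65, "Product C": 0.40}

    # Rank products from most to least likely to be liked.
    ranked = sorted(products.items(), key=lambda item: item[1], reverse=True)
    for rank, (name, prob) in enumerate(ranked, start=1):
        print(f"{rank}. {name} (predicted like-probability: {prob:.0%})")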

Conclusion


In this article, we've gone through the theoretical basis of model calibration and discussed, through some simple real-life examples, the importance of understanding whether a classifier is calibrated or not. Building "reliability" into machine learning models is often a bigger challenge for researchers than developing or deploying them. Model calibration is extremely valuable wherever the predicted probability itself is of interest. It gives insight into the uncertainty of a model's predictions and, in turn, helps the end user understand how far the model can be trusted, especially in critical applications.

I hope this write-up has given you an introduction to this concept and a sense of its importance. You can refer to the materials in the reference section for an in-depth understanding.

References

  1. Calibration - Wikipedia
  2. Gebel, Martin (2009). Multivariate Calibration of Classifier Scores into the Probability Space. PhD thesis, University of Dortmund.
  3. Garczarek, U. M. (2002). Classification Rules in Standardized Partition Spaces. Dissertation, Universität Dortmund.
  4. Hastie, T. and Tibshirani, R. (1998). "Classification by pairwise coupling". In: M. I. Jordan, M. J. Kearns and S. A. Solla (eds.), Advances in Neural Information Processing Systems, volume 10. Cambridge, MA: MIT Press.


Written by sanjaykn170396 | Data scientist | ML Engineer | Statistician
Published by HackerNoon on 2023/01/28