Machine learning is, at its core, an optimisation problem. Researchers use mathematical functions called “losses” to guide that optimisation for a specific use case. A loss can be seen as a distance between the true values of the problem and the values predicted by the model. The greater the loss, the larger the errors the model made on the data. Performance evaluation metrics such as accuracy, precision, recall and F1 score are closely related to loss functions, since both measure how far the predictions fall from the truth. Researchers have implemented many loss functions, for example-

For Regression problems - Mean Squared Error, Mean Absolute Error, Huber Loss, Log Cosh Loss, Quantile Loss etc.
For Binary classification problems - Binary Cross Entropy Loss, Hinge Loss etc.
For Multi-class classification problems - Multi-Class Cross Entropy Loss, KL-Divergence etc.

In this article, I will introduce you to a loss function called “Hinge loss”, which is discussed in some of the most recommended textbooks on predictive modelling. I have tried to explain it in a lucid manner, both visually and mathematically, to help beginners in the machine learning field.

The concept behind the Hinge loss

Hinge loss is a function popularly used in support vector machine (SVM) algorithms. It penalises data points according to how far they lie on the wrong side of the decision boundary, or how close they lie to it on the correct side. This helps quantify incorrect and low-confidence predictions and evaluate the model's performance.

Some of the other popularly used loss functions in classification algorithms are-

Gini Impurity
Logarithmic loss
Misclassification error etc.

The support vector machine is a supervised machine learning algorithm. It is trained on labelled data points and is popularly used for predicting the category of new data points.

For example-

Predicting whether a person is male or female
Predicting whether the fruit is an apple or orange
Predicting whether a student will pass or fail the exams etc.

SVM uses a flat decision surface that extends across all the dimensions (features) of the data to make its predictions. Such a surface is called a hyperplane. Higher dimensions are very difficult to picture, since we can naturally visualize only up to 3 dimensions.

Let’s take a simple example to understand this scenario.

We have a classification problem to predict whether a student will pass or fail the examination. We have the following features as independent variables-

Marks in internal exams
Marks in projects
Attendance percentage

So, these 3 independent variables become 3 dimensions of a space like this-

Let’s consider that our data points look like this where-

The green colour represents the students who passed the examination
The red colour represents the students who failed the examination

Now, SVM will create a hyperplane that passes through this 3-dimensional space to separate the failed and passed students-

So, the model now treats every data point that falls on one side of the hyperplane as a student who passed the exam, and every data point on the other side as a student who failed. This hyperplane is called the decision boundary. SVM picks the boundary with the largest margin between the two classes, which is why it is also called the maximum margin hyperplane. The distance from a data point to the decision boundary shows the strength of the prediction.
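To make the idea of "which side of the hyperplane" concrete, here is a minimal Python sketch. The weights, the bias and the two students are made-up values chosen only for illustration; a trained SVM would learn the weights and bias from the data.

```python
import math

# Hypothetical weights and bias, chosen only for illustration; a trained SVM
# would learn these values from the data.
WEIGHTS = [1.0, 1.0, 1.0]  # internal exam marks, project marks, attendance %
BIAS = -180.0


def decision_value(features):
    """f(x) = w . x + b; exactly 0 means the point lies on the hyperplane."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def signed_distance(features):
    """Geometric distance to the hyperplane; the sign tells which side."""
    return decision_value(features) / math.sqrt(sum(w * w for w in WEIGHTS))


def predict(features):
    """+1 (pass) if the point is on the positive side, otherwise -1 (fail)."""
    return 1 if decision_value(features) > 0 else -1


strong_student = [80, 90, 95]  # made-up marks and attendance
weak_student = [30, 40, 50]
print(predict(strong_student), round(signed_distance(strong_student), 1))
print(predict(weak_student), round(signed_distance(weak_student), 1))
```

The sign of the decision value gives the predicted class, and its size (after dividing by the length of the weight vector) gives the distance from the boundary.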

The following image shows better visualization-

Logically,

If the distance between the decision boundary and the data point is relatively large then it means that the model is more confident about its prediction.
If the distance between the decision boundary and the data point is relatively small then it means that the model is less confident about its prediction.

The following image will give you a better intuition -

Here,

The value of the decision boundary is zero.

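“+” indicates the prediction of the positive class.
“-” indicates the prediction of the negative class.
y·f(x) is the signed distance of a data point from the decision boundary, i.e. the true class y (+1 or -1) multiplied by the model output f(x). The hinge loss is calculated from this value.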
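For reference, the standard definition of the hinge loss for a single data point can be written as follows, where y is the true class (+1 or -1) and f(x) is the raw output of the model. The symbol ℓ for the loss is my own notation here.

```latex
\ell\bigl(y, f(x)\bigr) = \max\bigl(0,\ 1 - y \cdot f(x)\bigr), \qquad y \in \{-1, +1\}
```

So the loss is zero when y·f(x) is 1 or more, it is exactly 1 on the decision boundary, and it grows linearly as the data point moves further onto the wrong side.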

There are 2 primary scenarios where the researchers use hinge loss in SVM-

Scenario 1 (In training data): To build a model in multi-dimensional space that reduces misclassification and strengthens the decision-making ability.

It also helps to find the best-fit decision boundary: out of the many candidate boundaries, the training algorithm selects the one with the minimum hinge loss (usually combined with a regularisation term), while hyperparameter tuning adjusts settings such as the regularisation strength. This approach is similar to finding the best-fit line in linear regression during the training process.
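As a rough illustration of scenario 1, the sketch below trains a linear boundary by subgradient descent on the hinge loss with a small L2 penalty. This is only one simple way to minimise the hinge loss, not how library SVM solvers are implemented; the toy data, the learning rate, the penalty strength and all function names are assumptions made for this example.

```python
def decision_value(w, b, x):
    """Raw model output f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def mean_hinge_loss(w, b, points, labels):
    """Average of max(0, 1 - y * f(x)) over the data set."""
    losses = [max(0.0, 1.0 - y * decision_value(w, b, x))
              for x, y in zip(points, labels)]
    return sum(losses) / len(losses)


def train(points, labels, learning_rate=0.1, reg=0.01, epochs=200):
    """Subgradient descent on hinge loss plus a small L2 penalty on w."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            active = y * decision_value(w, b, x) < 1  # hinge loss is non-zero
            # The L2 penalty shrinks the weights a little on every step.
            w = [wi * (1.0 - learning_rate * reg) for wi in w]
            if active:
                # Nudge the boundary so this point moves towards the correct side.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


# Toy, made-up 2-feature data: label +1 = pass, -1 = fail.
points = [[2, 2], [3, 3], [2, 3], [0, 0], [-1, 0], [0, -1]]
labels = [1, 1, 1, -1, -1, -1]

w, b = train(points, labels)
loss_before = mean_hinge_loss([0.0, 0.0], 0.0, points, labels)
loss_after = mean_hinge_loss(w, b, points, labels)
print(loss_before, loss_after)
```

With all weights at zero every point has a hinge loss of 1, so the loss before training is 1.0. Training moves the boundary until the points sit on their correct sides, which drives the mean hinge loss down.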

Scenario 2 (In testing data): To evaluate the performance of the SVM model.

Let us understand the calculation of hinge loss in SVM with respect to scenario 1. The image below is a visual representation of the hinge loss function.

The X-axis shows the signed distance from the decision boundary to the data point (positive on the correctly predicted side, negative on the incorrectly predicted side).
The Y-axis shows the loss size or penalty that the hinge loss function calculates for a specific data point.

The dotted line marks the value 1 on the X-axis. If a data point is correctly predicted by the model and its distance from the decision boundary is 1 or greater then the loss is zero.

If the data point is placed exactly at the decision boundary then the hinge loss will have a value of 1 (Obviously, the distance between the decision boundary and the data point will be zero).

If a data point does not sit on the decision boundary then there are 2 possibilities, depending on the sign of its distance y·f(x)-

Possibility 1: The distance between the data point and the decision boundary is in a positive direction, i.e. the data point is on the correct side of the boundary (Data point 1 in the above image).

Possibility 2: The distance between the data point and the decision boundary is in a negative direction, i.e. the data point is on the wrong side of the boundary and is incorrectly predicted (Data point 3 in the above image).

In possibility 1, the hinge loss will not increase, i.e. the loss value will be low, and it will be zero once the distance is 1 or more.

For example,

Let’s assume that the value of the decision boundary is 0 and the value of the predicted data point is +2.5. Here, the difference between the predicted value and the decision boundary is +2.5, which is beyond the dotted line at 1. Hence, the hinge loss will be zero (This is depicted as data point 1 in the above image).

In possibility 2, the hinge loss will increase linearly with the distance, i.e. the loss value will be high.

For example,

Let’s assume that the value of the decision boundary is 0 and the value of the predicted data point is -1.5 (but the actual value should be above 0 since it is a wrong prediction). Here, the difference between the predicted value and the decision boundary is -1.5. Hence, the hinge loss will be high (This is depicted as data point 3 in the above image).

Let us calculate the hinge loss for these 2 possibilities-

As we discussed earlier,

The value of hinge loss is 1 when the value of the predicted data point is 0.
According to possibility 1, if the value of the predicted data point is +2.5 then the hinge loss of that prediction is zero.
According to possibility 2, if the value of the predicted data point is -1.5 then the hinge loss of that prediction is 2.5.
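These three values can be checked with a few lines of Python. The sketch assumes that the "value of the predicted data point" is the signed distance y·f(x) and uses the standard hinge loss formula max(0, 1 - y·f(x)); the function name is my own.

```python
def hinge_loss(signed_distance):
    """Hinge loss for one point, where signed_distance = y * f(x)."""
    return max(0.0, 1.0 - signed_distance)


# On the boundary, possibility 1 (+2.5) and possibility 2 (-1.5).
for value in (0.0, 2.5, -1.5):
    print(value, hinge_loss(value))
```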

Similarly, the hinge loss can be calculated for every data point. When an SVM model is constructed in a multidimensional space, we should always try to minimize the hinge loss as much as possible to increase the predictive ability of the model.

A practical example with a sample dataset

Imagine that we have a binary classification problem to predict whether a student will pass/fail the examination based on the following predictor variables-

Attendance percentage
Marks scored in the internal exams
Marks scored in the assignments

We trained our model with 1000 records and now we have the following table as the test data-

We evaluate the model on this test data. Our predictions are as follows-

The predicted value is a number between -1 and 1, and we use a margin of 0.2. If the value is less than or equal to zero then the predicted class will be considered -1 and if the value is greater than zero then the predicted class will be considered +1.

Since the margin is 0.2 and the decision boundary is 0, the hinge loss is calculated for-

All incorrect predictions
All correct predictions within the range of [-0.2, +0.2]
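The rule above can be sketched as a small check in Python. The function names and the scores used to try it out are hypothetical; only the class rule, the margin of 0.2 and the two conditions come from this example.

```python
MARGIN = 0.2  # margin used in this example


def predicted_class(score):
    """Class rule from the text: above zero is +1, otherwise -1."""
    return 1 if score > 0 else -1


def incurs_hinge_loss(true_class, score, margin=MARGIN):
    """True for wrong predictions and for correct ones inside the margin."""
    wrong = predicted_class(score) != true_class
    inside_margin = -margin <= score <= margin
    return wrong or inside_margin


# Hypothetical scores, not taken from the test table.
print(incurs_hinge_loss(1, 0.9))   # correct and outside the margin
print(incurs_hinge_loss(1, 0.1))   # correct but inside the margin
print(incurs_hinge_loss(1, -0.5))  # incorrect
```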

The Hinge loss is as follows-

The data points with unique Ids 1, 5, 7, 8, and 9 are predicted correctly. Hence there is no hinge loss for those instances.
The data points with unique Ids 4, 6, and 10 are predicted incorrectly. Hence hinge loss is calculated for those data points.
The data points with unique Ids 2 and 3 are predicted correctly but their value is within the range of the margin. Hence hinge loss is calculated for those data points.
The overall hinge loss of the model is the mean over all 10 test data points, where the correctly predicted points outside the margin contribute zero = (1.6 + 1.8 + 3.1 + 3 + 5 ) / 10 = 1.45
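The same arithmetic in Python, as a quick check. The five non-zero losses are the values quoted above, and the remaining five test points are assumed to contribute a loss of zero; the variable names are my own.

```python
# Non-zero hinge losses quoted above (5 of the 10 test points).
nonzero_losses = [1.6, 1.8, 3.1, 3.0, 5.0]
n_test_points = 10

# The other points were predicted correctly outside the margin, so loss = 0.
per_point_losses = nonzero_losses + [0.0] * (n_test_points - len(nonzero_losses))

mean_hinge_loss = sum(per_point_losses) / len(per_point_losses)
print(round(mean_hinge_loss, 2))
```

Note that the sum is divided by all 10 test points, not only by the 5 points that have a non-zero loss.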

Conclusion

Although hinge loss might initially look a little complicated, I hope this article has given you a fundamental intuition for the concept. Many complex use cases that demand a binary classification algorithm (especially Support Vector Machines) can be solved using this technique. Hinge loss is available as a built-in function in popular data science libraries for languages such as Python or R. Hence, it is easy to implement once you understand the theoretical intuition. I have listed some advanced materials in the references section where you can dive deeper into the calculations if you are interested.

References

Hinge loss - Wikipedia
Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?" (PDF). Neural Computation. 16 (5): 1063–1076. CiteSeerX 10.1.1.109.6786. doi:10.1162/089976604773135104. PMID 15070510.
Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study" (PDF). Multiple Classifier Systems. LNCS. Vol. 3541. pp. 278–285. CiteSeerX 10.1.1.110.6789. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.
Rennie, Jason D. M.; Srebro, Nathan (2005). Loss Functions for Preference Levels: Regression with Discrete Ordered Labels (PDF). Proc. IJCAI Multidisciplinary Workshop on Advances in Preference Handling.

Hinge Loss - A Steadfast Loss Evaluation Function for the SVM Classification Models in AI & ML
