An Introduction to Machine Learning Model Scoring

Written by modzy | Published 2021/05/31
Tech Story Tags: modzy | data-science | ai | mlops | machine-learning | ai-model-scoring | artificial-intelligence | good-company

TLDR: There is a plethora of metrics that can be used to evaluate machine learning models. Deciding which metric to optimize is a crucial step in obtaining a model that performs as expected. Different domains carry different consequences for model error, some more critical than others. At Modzy, each set of metrics is carefully chosen for optimization and evaluation based on the domain requirements for each model, and multiple metrics are used to provide a comprehensive and transparent understanding of the performance of Modzy models.

There is a plethora of metrics that can be used to evaluate machine learning models, and identifying the right metric to use is a crucial step for accurately assessing a model’s performance and whether its predictions are trustworthy.
One of the major challenges is that a model can simply memorize its training data (overfitting) and consequently perform poorly on new, unseen samples.
In the case of classification, a model can also favor one class over another because the training dataset contained imbalanced numbers of samples from each class.
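Both failure modes are straightforward to surface. As a minimal sketch (using scikit-learn; the dataset and model here are purely illustrative), comparing scores on the training set against a held-out test set reveals a model that has merely memorized its data:

```python
# A minimal sketch using scikit-learn (library and numbers assumed for
# illustration): evaluating on held-out data exposes memorization.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy imbalanced dataset: ~95% of samples belong to one class
X, y = make_classification(n_samples=2_000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# An unpruned decision tree can memorize the training set perfectly
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```

A large gap between the two scores signals that the model will not generalize to unseen samples.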
Examples like these demonstrate why choosing the right metrics to evaluate machine learning models comprehensively is essential. At Modzy, we understand the importance of using and sharing the right performance metrics to communicate AI model scoring and performance, and all of the partners in our marketplace adhere to the same transparency standard.

What You Need to Know about AI Model Scoring

Evaluation metrics are designed to score different aspects of model performance across many diverse use cases. For example, in the context of airport security and threat detection, reducing the number of false negative predictions made by a model would be a huge improvement. Misclassifying a genuine threat as passive could cause much more damage than incorrectly flagging something passive as threatening.
Suppose someone claims to have constructed a machine learning model that can identify threats boarding U.S. flights with 99% accuracy (the ratio of correct predictions to the total number of samples). This sounds like a near-perfect model, but is it trustworthy?
Given the roughly 800 million passengers on U.S. flights per year and the negligible number of threats that actually boarded U.S. flights between 2000 and 2017 [1], the model could simply label all passengers as non-threatening and still achieve a high accuracy score. This is an example of how imbalanced data can distort the perceived performance of a model when the wrong metric is chosen for evaluation.
Rather than only reporting accuracy, reporting recall would be more insightful because it describes the model’s ability to find all relevant samples in a given dataset. Even though the threat detection model in this example achieves a near-perfect accuracy score, its recall would be 0, because it fails to identify a single actual threat.
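To make this concrete, here is a minimal sketch with hypothetical numbers, in which a "model" labels every one of 10,000 passengers as non-threatening:

```python
# Hypothetical, heavily imbalanced data: 1 = threat, 0 = non-threat
y_true = [0] * 9_999 + [1]     # one real threat among 10,000 passengers
y_pred = [0] * 10_000          # "model" predicts non-threat for everyone

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)              # 0.9999 -- looks great
recall = tp / (tp + fn) if tp + fn else 0.0   # 0.0    -- misses the threat
print(f"accuracy={accuracy:.4f}, recall={recall:.4f}")
```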
Now assume a new model is designed to label all passengers as threats. This would produce a perfect recall score of 1, yet the model would be useless in practice (every passenger would have to be banned). The way to expose this flaw is to calculate the model’s precision, which scores the model’s ability to identify only the relevant samples.
In the case of the threat detection model, precision quantifies the ratio of correctly identified threats to all passengers the model flagged as threats, whether correctly or not.
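Continuing the hypothetical sketch, the flag-everyone model earns perfect recall while its precision collapses:

```python
tp = 1          # the one real threat, correctly flagged
fp = 9_999      # every harmless passenger, incorrectly flagged
fn = 0          # no threats missed

recall = tp / (tp + fn)       # 1.0    -- finds every threat
precision = tp / (tp + fp)    # 0.0001 -- almost every flag is wrong
```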
In AI model scoring, some real-world scenarios require maximizing one metric at the expense of the other. In preliminary disease screenings, for instance, it would be ideal to achieve a near-perfect recall score (catching all patients who truly have the disease) at the expense of precision (the screening can be rerun to weed out false positives). For cases where precision and recall need to be balanced, the F1 score, the harmonic mean of precision and recall, can be used.
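A minimal implementation of the F1 score shows how it penalizes any large imbalance between the two metrics:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(precision=0.0001, recall=1.0))  # ~0.0002: flag-everyone model
print(f1_score(precision=0.9, recall=0.9))     # 0.9: balanced model scores high
```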

Modzy Differentiation

At Modzy, machine learning models are designed and developed across a multitude of domains, and we work hard to provide the best possible models for each domain. To conduct AI model scoring, each set of metrics is carefully chosen for optimization and evaluation based on the domain requirements for each model. Multiple metrics are used to provide a comprehensive and transparent understanding of the performance of Modzy models.
Deciding which metric to optimize is a crucial step in obtaining a model that performs as expected. Different domains carry different consequences for model error, some of which are more critical than others.
To ensure that no critical prediction error goes unnoticed, the right metric must be used to assess model performance. At Modzy, we optimize all of our models against the appropriate metrics to provide quality assessments and effective models.
References:
  [1] “Airline Information for Download.” Bureau of Transportation Statistics, United States Department of Transportation, 22 May 2019, https://www.bts.gov/.

Written by modzy | A software platform for organizations and developers to responsibly deploy, monitor, and get value from AI - at scale.