Explainability Concerns Hide Poor AI

Written by mtucci | Published 2023/05/02
Tech Story Tags: ai | explainable-ai | alternative-credit-scoring | explainability-of-ai-systems | artificial-intelligence | credit-score | ai-applications

TL;DR: AI solutions in traditional banking and lending still largely remain "black boxes", where no one really seems to know how AI models are generating their results.

High-profile examples of bias in AI models have set off worries among consumers and regulators. Amazon has been criticized for an AI recruiting model that appeared to downgrade applications containing words like "women's" and the names of two all-women's colleges. And a facial recognition algorithm used by police worldwide made significantly more mistakes identifying black women than white women.
In an effort to deal with the situation, explainability experts, conferences, and mathematical models abound, yet without making much apparent headway. Is this an issue that requires new mathematics, or just better AI fundamentals?
AI and credit scoring
In my area of expertise, credit scoring, AI solutions in traditional banking and lending still largely remain "black boxes", where no one really seems to know how AI models are generating their results. As a result, lenders in the US and UK remain cautious about a credit score that is even partially generated by AI, with concerns about bias in terms of ethnicity, gender, or even ZIP code.
Consumers who are denied credit have the right to know why their application was rejected and to request corrections to any incorrect or outdated data that may have contributed to the rejection. For example, if their latest loan repayment was made but not recorded, a customer can request an update from TransUnion, an American consumer credit reporting agency that provides credit profiles on over one billion individual consumers in more than 30 countries, including nearly every credit-active consumer in the United States.
Innovation in credit scoring has, of course, moved on from the conventional analysis of credit history still in use at most bureaus. Fintechs are building AI lending models on alternative data sources, including digital footprints derived from smartphone metadata that serve as proxies for repayment risk.
The concept originated with Brown University economist Daniel Björkegren, who examined the phone records of 3,000 borrowers from a bank in Haiti. By analyzing when calls were made, how long they lasted, and how much money users spent on their phones, Björkegren found that the bank could have reduced defaults by 43%.
Since this work in 2015, the science of behavioral analysis of metadata has evolved considerably and offers many new ways of enabling people without conventional credit files to obtain the finance they need. 
AI bias remains a concern
AI bias remains a concern, however, whether we are talking about AI models based on historical data about purchases and repayments or based on alternative data from smartphones. 
From my perspective as a fintech expert who works with data enrichment solutions, the explainability panic looks like a smokescreen covering bad AI. Badly conceived models with unclear targets, running on leaky data, leave their owners with no clear idea of what is going on and an inevitable explainability gap.
They also produce commercially dubious outputs. In an attempt to get usable results, AI whizkids then decide to fix things with complex algorithms in black- or grey-box solutions that just render results even more opaque.  
It doesn’t have to be like this. Clear principles at the start of a project can avoid much, maybe all of the heartache. Let me explain.  
Garbage in…
The obvious starting principle is "garbage in - garbage out". It applies to the data used for “feature construction” and data used for “target calculation”. In machine learning a feature is an individual, measurable property or characteristic. Feature construction is when you use one or more existing features to deduce another. For example, if I know a person’s date of birth, I can deduce age. The target is the feature you want to understand more clearly - risk in my case - by uncovering relationships and patterns between that feature and historical data.
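To make feature construction concrete, here is a minimal sketch of the date-of-birth example in pandas. The column names and the idea of anchoring age to the application date are my own illustration, not any particular scoring pipeline.

```python
# A minimal sketch of feature construction: deriving "age" from an existing
# "date_of_birth" column. Column names are illustrative, not a real schema.
import pandas as pd

applications = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "date_of_birth": pd.to_datetime(["1985-03-12", "1992-11-30", "1978-07-04"]),
    "application_date": pd.to_datetime(["2023-01-15", "2023-02-01", "2023-02-20"]),
})

# Age at the moment of application, not "age today" -- otherwise the feature
# would drift every time the dataset is refreshed.
applications["age"] = (
    (applications["application_date"] - applications["date_of_birth"]).dt.days // 365
)

print(applications[["customer_id", "age"]])
```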
If you create useless features, or variables with data leaks (the use of information from outside the training dataset that would not normally be available at the time of prediction), you are likely to overestimate the predictive ability of your model in production. You certainly won't get a stable, well-performing model.
For example, suppose you are building a credit risk scoring model that measures the creditworthiness of a customer based on the history of all loans the customer has had. You decide to create a new variable called "Number of Existing Loans" which indicates the total number of loans, both with your bank and with other lenders.
To avoid any data leakage, you must ensure that this new variable only includes information that was available at the time of application of each of the loans that the customer applied for, not after. In other words, if you feed the model with information that was collected after the time of application, you generate a form of data leakage that affects the predictive power of the model and makes it unreliable.
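As a hedged sketch of what that looks like in practice, the snippet below computes a hypothetical "Number of Existing Loans" feature point-in-time: for each application, it counts only loans opened strictly before that application's date, so nothing the model sees would have been unknown at decision time.

```python
# Illustrative, leak-free construction of "Number of Existing Loans".
# Table and column names are hypothetical.
import pandas as pd

loans = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "open_date": pd.to_datetime(["2021-05-01", "2022-03-10", "2023-04-01", "2022-08-15"]),
})

applications = pd.DataFrame({
    "application_id": [101, 102],
    "customer_id": [1, 2],
    "application_date": pd.to_datetime(["2022-06-01", "2023-01-10"]),
})

def loans_before(row: pd.Series) -> int:
    """Count the customer's loans opened before the application date only."""
    mask = (
        (loans["customer_id"] == row["customer_id"])
        & (loans["open_date"] < row["application_date"])
    )
    return int(mask.sum())

# The 2023-04-01 loan is ignored for application 101: it did not exist yet.
applications["number_of_existing_loans"] = applications.apply(loans_before, axis=1)
print(applications)
```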
Another example, perhaps easier to grasp and closer to the way credolab builds features, is the Device Velocity feature. Velocity checks the number of times that certain behavior occurs on a customer’s mobile device within certain intervals and looks for anomalies or similarities to known fraudulent behavior. To avoid any data leakage, credolab only processes the data generated from the device before the date for which we are processing the velocity feature.
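For illustration only, here is one way a velocity-style count could be computed so that it never looks past the scoring timestamp; the 24-hour window, the event table, and the column names are assumptions on my part, not credolab's proprietary logic.

```python
# Velocity-style feature: count device events in a 24-hour window ending at
# the scoring timestamp, using only events that happened before it.
import pandas as pd

events = pd.DataFrame({
    "device_id": ["d1"] * 5,
    "event_time": pd.to_datetime([
        "2023-03-01 08:00", "2023-03-01 09:30", "2023-03-01 21:00",
        "2023-03-02 10:00",  # after the scoring timestamp -> must be excluded
        "2023-02-27 12:00",  # outside the 24h window
    ]),
})

scoring_time = pd.Timestamp("2023-03-01 22:00")
window = pd.Timedelta(hours=24)

in_window = events[
    (events["event_time"] < scoring_time)
    & (events["event_time"] >= scoring_time - window)
]
velocity_24h = in_window.groupby("device_id").size()
print(velocity_24h)  # d1 -> 3
```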
The same holds true - probably even more so - when it comes to the target. To get stable results, you have to know exactly what it is you want to detect when you construct the model. And you definitely need to have the right data to deduce your target. If the target is “understand default risk”, as it usually is in my industry, then you absolutely have to have a source of data with a proven link to repayments. Even if that is alternative data, at some stage this will have to connect to customers’ agreed payments schedules and what actually occurred in terms of payment dates and amounts.
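A simplified sketch of that connection between agreed schedules and actual payments: derive days past due per loan, then flag anything 30 or more days late as the default target. The 30-day threshold and the column names are illustrative assumptions.

```python
# Deducing a default-risk target from repayment data. Thresholds and column
# names are assumptions for illustration.
import pandas as pd

repayments = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "due_date": pd.to_datetime(["2023-01-10", "2023-01-15", "2023-01-20"]),
    "paid_date": pd.to_datetime(["2023-01-09", "2023-03-01", None]),
    "as_of_date": pd.to_datetime(["2023-04-01"] * 3),
})

# Days past due: a late payment uses the payment date; a missing payment
# uses the reporting ("as of") date.
effective_date = repayments["paid_date"].fillna(repayments["as_of_date"])
repayments["days_past_due"] = (effective_date - repayments["due_date"]).dt.days.clip(lower=0)

# Binary target: 1 = bad (30+ days past due), 0 = good.
repayments["target_default_30dpd"] = (repayments["days_past_due"] >= 30).astype(int)
print(repayments[["loan_id", "days_past_due", "target_default_30dpd"]])
```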
Let's take a real-life example from one of our clients in Brazil. Ideally, data should be collected before the loan is granted, not after; only then can the score be confidently used for decisioning. If the data is collected after the loan has been approved (this might happen, for example, if a bank integrates the SDK and runs a backtest on historical data), the accuracy of the model may be compromised.
For instance, if the dataset mainly includes good customers who are still using the app, it may represent a lower risk profile. However, this does not necessarily mean that all customers are less likely to default on their payments. It's common for delinquent customers to uninstall the lender's app right before missing a payment.
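A minimal guard against that backtest pitfall, assuming hypothetical column names: drop any observation whose device data was collected after the loan was approved, since those records can only come from customers who kept the app installed.

```python
# Keep only observations whose data was collected on or before approval.
import pandas as pd

sample = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "approval_date": pd.to_datetime(["2023-01-10", "2023-01-12", "2023-01-15"]),
    "data_collected_at": pd.to_datetime(["2023-01-09", "2023-02-20", "2023-01-14"]),
})

# Loan 2 is dropped: its data was collected weeks after approval, so it only
# represents customers who kept the app installed -- a biased, "good" sample.
clean_sample = sample[sample["data_collected_at"] <= sample["approval_date"]]
print(clean_sample)
```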
Lenders can also use historical data from loans they have already approved, because it contains possible insights into the characteristics of customers who either have or have not made their repayments on schedule. It's also possible to use historical data to analyze how a bank has previously attracted and originated loans (and approved or rejected them) using conventional methods. You are then in a position to build a model based on alternative data and optimize for the best way to combine the two methods.
If you do either of these things, however, it's crucial to define the target consistently for all approved loans, ensuring that each loan has an equal chance of becoming good or bad. If the risk profile of a portfolio has significantly reduced in a short period, it may be due to an improvement in portfolio quality. However, in many cases, it's a result of a wrongly selected target, which could be linked to "ever delinquency" instead of a specific period, such as 30 days past due. When this happens, a model built on loans approved that have yet to become delinquent may miss identifying customers who pose a higher risk.
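The sketch below shows one way to apply that principle, under assumed column names and an assumed 180-day performance window: only loans old enough to have completed the full window are labelled, and "bad" means reaching 30 days past due within that window rather than "ever delinquent".

```python
# Consistent target definition over a fixed performance window.
# Window length and column names are assumptions for illustration.
import pandas as pd

loans = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "disbursed_on": pd.to_datetime(["2022-06-01", "2022-08-01", "2023-04-01"]),
    "first_30dpd_on": pd.to_datetime(["2022-09-15", None, "2023-04-25"]),
})

snapshot = pd.Timestamp("2023-05-01")
performance_window = pd.Timedelta(days=180)

# Only loans whose 180-day window has fully elapsed are eligible for labelling;
# younger loans (loan 3) have not had an equal chance to go bad yet.
mature = loans[loans["disbursed_on"] + performance_window <= snapshot].copy()

# Bad = reached 30 days past due *within* the performance window,
# not "ever delinquent" at any point in the loan's life.
mature["target_bad"] = (
    mature["first_30dpd_on"].notna()
    & (mature["first_30dpd_on"] <= mature["disbursed_on"] + performance_window)
).astype(int)
print(mature[["loan_id", "target_bad"]])
```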
Get the basics right and explainability explains itself
If you hold to those basic principles, the rest becomes surprisingly simple. You can analyze the data intensely, discover a lot of interrelationships and patterns, and create new features without data leaks. At this point it doesn't really matter which modeling algorithm you use - logistic regression, boosting, neural networks - they will all work in pretty much the same way unless you are working on the very largest datasets - say, 300-500k observations daily for each modeling scenario. And, if the choice of modeling algorithm is irrelevant, why would you choose a complex black- or grey-box approach? You can just apply the simplest algorithm - i.e. logistic regression.
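As a rough, self-contained illustration of that "simplest algorithm" route, the snippet below fits a plain logistic regression on synthetic, leak-free-style features; the feature names, data, and coefficients are invented purely to show how directly inspectable such a model is.

```python
# Plain logistic regression on synthetic tabular features with a binary
# default target. All data here is made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 5_000

# Synthetic stand-ins for leak-free features (age, number of existing loans,
# a 24h device-velocity count) and a default target loosely tied to them.
X = np.column_stack([
    rng.integers(21, 70, n),   # age
    rng.poisson(1.5, n),       # number_of_existing_loans
    rng.poisson(3.0, n),       # device_velocity_24h
])
logits = -2.0 + 0.4 * X[:, 1] + 0.2 * X[:, 2] - 0.02 * X[:, 0]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Coefficients are directly inspectable -- each feature's contribution to the
# score is a single number, which is where the explainability gap closes.
print("coefficients:", model.coef_[0])
print("test AUC:", round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), 3))
```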
Which brings us back to my starting argument. The apparent crisis of AI explainability vanishes if you have features constructed in the right way and properly calculated target variables. 


Written by mtucci | Michele Tucci is CSO and MD Americas at credolab, a leading behavioural data analytics platform
Published by HackerNoon on 2023/05/02