Fair and Balanced? Thoughts on Bias in Probabilistic Modeling

Written by cody.marie.wild | Published 2017/12/14
Tech Story Tags: machine-learning | bias | artificial-intelligence | ethics | probability

In recent months and years, the Machine Learning community has been doing a notable amount of soul-searching on the question of algorithmic bias: are our algorithms operating in ways that are fundamentally unfair towards specific groups within society?

This conversation intrigues me because it falls at an intersection between two areas of thought which are central to my interests: societal norms and probabilistic modeling. However, I’ve often found the discussion space to be a frustrating one, because it contains so many people talking past each other: so many different definitions of bias being implicitly conflated together, so much un-nuanced laying of blame. This post is my attempt at clarification: both for myself, and hopefully for the conversation as a whole.

Differing base rate bias

If you’ve been engaged in any conversations around machine learning and bias, you’ve doubtless seen this post from ProPublica, which asserts that the COMPAS recidivism prediction model was biased because of the differing composition of its errors: while black and white prisoners had roughly equal aggregate error rates, the errors made on black prisoners were likelier to be false positives (predicted to recidivate, but didn’t), whereas the errors made on white prisoners were likelier to be false negatives (not predicted to recidivate, but did). Further research into the accusations of unfairness that ProPublica leveled suggested the case wasn’t as cut and dried as ProPublica made it sound, despite the seductively compelling narrative of “naive, privileged techies cause harm”.

The original blog post presenting these findings is worth reading in its own right, but I’ll make an attempt to briefly summarize its ideas here: whether you believe this algorithm is fair depends on how you define fairness, and you generally cannot have an algorithm that is fair according to every reasonable definition at once. One possible way to define fairness is as “people in the same score bucket have equal true probabilities of reoffending”. This is often shorthanded as “calibration”. Northpointe, the company that builds the COMPAS score, asserts its score to be fair because it is well-calibrated: within a given score bucket, black and white defendants who score in that bucket are, in aggregate, equally likely to reoffend.

ProPublica’s definition of fairness focuses on the prisoners who did not re-offend, and shows that within that group, black prisoners were likelier to be deemed high-risk than white prisoners. Another way of thinking of this is: within the set of people who ultimately did not re-offend, ProPublica’s definition of fairness requires the average score of black prisoners to be equal to the average score of white ones. Let’s call that “negative class balance”. You may care about this because you don’t want one group to be systematically deemed more risky, conditional on being actually low-risk. You could also imagine a symmetric constraint on the positive class, where, within the set of people who ultimately did reoffend, black and white prisoners exhibit the same average score. Let’s call this “positive class balance”. You may care about this because you don’t want one group to be systematically “let off the hook” conditional on truly reoffending.
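To make these three notions concrete, here is a minimal sketch in Python that computes each of them from entirely synthetic, hypothetical data; the score distribution, the bucket boundaries, and the group shift below are arbitrary choices for illustration, not anything drawn from the COMPAS data.

```python
import numpy as np

# Entirely hypothetical data: a risk score in [0, 1], a group label, and the
# observed outcome. Base rates are deliberately made to differ between groups.
rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, size=n)                      # two demographic groups, 0 and 1
score = np.clip(rng.beta(2, 3, size=n) + 0.1 * group, 0, 1)
reoffended = rng.random(n) < score                      # outcome drawn directly from the score

def calibration(lo, hi):
    """Within one score bucket, the observed reoffense rate for each group."""
    in_bucket = (score >= lo) & (score < hi)
    return [reoffended[in_bucket & (group == g)].mean() for g in (0, 1)]

def class_balance(did_reoffend):
    """Average score per group, among people whose true outcome was `did_reoffend`."""
    mask = reoffended == did_reoffend
    return [score[mask & (group == g)].mean() for g in (0, 1)]

print("calibration (0.4-0.6 bucket):", calibration(0.4, 0.6))
print("negative class balance:      ", class_balance(False))
print("positive class balance:      ", class_balance(True))
```

Because the synthetic outcomes are drawn directly from the scores, the scores are calibrated by construction; yet with the group shift pushing base rates apart, the class-balance numbers generally come out different across groups, which is a preview of the result discussed next.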

This paper, by Kleinberg et al., proved mathematically that you cannot have all three of these conditions (calibration, negative class balance, and positive class balance) at once, except in the case where base rates (i.e. the aggregate rates of reoffense) are equivalent across groups, or where you have a perfect model. With anything less than a perfect model, when groups reoffend at different rates, you have to sacrifice at least one of these notions of fairness in order to satisfy the others.

At this point, it becomes a question of tradeoffs, depending on the domain in which you’re operating. I think there’s a quite valid argument to be made that, in cases where the state is choosing whether or not to further incarcerate someone, and the source of harm we most want to avoid is “unfair” further incarceration, we’d prefer to ensure that truly low-risk people are treated the same across demographic groups, and care less about score balance among the truly high-risk individuals. But I think the fundamental point here is: it’s an argument, and there’s a real trade-off being made; it’s not an obvious technical flaw of the algorithm that it wasn’t specifically designed to meet one contested moral criterion over another.

This question, of differing base rates among groups, is a controversial one. People frequently argue that, in cases like the recidivism one illustrated here, the data itself is unfair, because it reflects an oppressive society without an equal playing field. That’s an entirely reasonable argument to make. But, it goes fundamentally deeper than alleging that an algorithm is performing incorrectly. It asserts that any system that learns the statistical patterns of our current world is dangerous, because our current world is normatively flawed in serious ways, and any system that learns associations in this flawed world will be incorrect once these normative flaws are remedied, or, worse, will block those societal imbalances from being rectified.

Differing conditional distributions

Another area of potential algorithmic bias is that of differing accuracy across demographic groups, due to differing conditional distributions. By conditional distribution, I mean the distribution of the features X, conditional on output y.

For the sake of this running discussion, I’m going to reference the widely publicized incident in which Google Photos’ algorithm incorrectly tagged a photo of two people of color as “Gorillas”, setting off a publicity firestorm. I’m going to operate for the moment on the assumption that this observed error represents a broad, genuine problem, whereby Google is more likely to classify black faces as being nonhuman. I’m not entirely convinced that assumption is correct, and will address that skepticism at the end of this section.

But, for the moment, let’s imagine we can frame this problem as one where white faces have a low error rate when it comes to being classified as human, and black faces have a higher one. What could cause an outcome like that? One clear potential cause is differing conditional distributions. Your “X” here is made up of pixels (which eventually get turned into higher-level convolutional features). For simplicity, let’s reduce the number of categories and imagine Y is a binary of “human” vs “non-human”. If you have two distinct feature profiles (Caucasian and non-Caucasian) that both map to “human”, and Caucasian faces are a strong numerical majority in the data, that imbalance will pull the classifier towards treating the features associated with Caucasian faces as the ones most indicative of an image belonging to the class “human”.

The intuition behind this becomes easier if you imagine an extreme case: one where 99 out of every 100 samples labeled “human” in the dataset are Caucasian faces, and only 1 is non-Caucasian. In this case, most regularization schemes would incentivize the algorithm to learn a simpler mapping that checks for common Caucasian features, rather than spending model capacity to capture this smaller second subgroup. You’re going to have this problem to some degree whenever you have subgroups with different distributions over your features X that all need to get mapped to a single shared outcome Y. Generally, the easiest way to handle this is to specify these subgroups in advance and give them their own sublabels, so that you’re “telling” the model up front that there are multiple distinct groups it needs to be able to capture.
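Here is a minimal sketch of that dynamic, using invented two-dimensional features in place of image features and a deliberately simple scikit-learn logistic regression; the subgroup means, sizes, and the linear model itself are all illustrative assumptions, but the key step of breaking accuracy out by subgroup carries over to real models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Invented 2-D features standing in for image features. The "human" class is a
# mix of two subgroups with different feature profiles; subgroup B is a tiny
# numerical minority relative to subgroup A.
n_a, n_b, n_neg = 990, 10, 1000
human_a   = rng.normal(loc=[ 2.0,  0.0], scale=1.0, size=(n_a, 2))
human_b   = rng.normal(loc=[-2.0,  0.0], scale=1.0, size=(n_b, 2))
non_human = rng.normal(loc=[ 0.0, -3.0], scale=1.0, size=(n_neg, 2))

X = np.vstack([human_a, human_b, non_human])
y = np.array([1] * (n_a + n_b) + [0] * n_neg)           # 1 = "human", 0 = "non-human"

clf = LogisticRegression().fit(X, y)

# Break accuracy on the "human" class out by subgroup; with this setup the
# minority profile tends to fare markedly worse.
print("subgroup A accuracy:", clf.score(human_a, np.ones(n_a)))
print("subgroup B accuracy:", clf.score(human_b, np.ones(n_b)))
```

With numbers like these, the decision boundary is driven almost entirely by the majority subgroup, and accuracy on the minority subgroup suffers accordingly, even though aggregate accuracy looks perfectly respectable.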

One question a solution like this obviously raises is: what are the right groups along which to enforce equal performance? All levels of difference exist along a gradient: if you zoom in far enough, you can find smaller and smaller subgroups within bigger groups, and in the limit, enforcing equalized performance across every such subgroup devolves into requiring equal expected performance for every individual in the dataset. This is a potentially interesting problem, and one I don’t recall having seen addressed before, but it seems a quite difficult constraint to fulfill when you consider the two poles of someone smack in the middle of the distribution and someone who is a very strong outlier; most well-regularized models will end up performing better for the former individual.

Ultimately, the groupings we decide to enforce equal performance along are likely going to be contextual, and a function of histories of oppression and disenfranchisement, rather than representing categories inherently more salient than others in the data. There’s nothing necessarily wrong with this: people’s concrete experiences of interacting with an algorithm are going to be shaped by a societal backdrop that can make them particularly sensitive to errors like these, which makes this a problem worth specially addressing.

I should also add one aside here, addressing the actual case that sparked this furor: a single anecdote really doesn’t constitute a broad problem, or a well-formulated one. The well-supported claim isn’t “people of color are systematically misidentified as non-human entities”; it’s that this particular instance of mislabeling carries painful societal baggage that makes the error a particularly meaningful one. In order to stop situations actually analogous to this one, Google would have to systematically understand what kinds of errors carry similar kinds of baggage, and input that knowledge into its training system. But that assessment is fundamentally contextual, and fundamentally arbitrary: there isn’t a mathematically rigorous way of ensuring that no one gets image labels they find derogatory or insulting.

Bias in the underlying distribution P(X)

Where the prior two sections addressed differing base rates across demographic groups and numerical-minority feature subgroups that might be hard to learn, the last idea I want to focus on is that of bias embedded (heh) in free-form data itself, even without attaching specific targets or groupings to that data. A good example of this is the realization that Word2Vec word embeddings were exhibiting gender bias: when you capture the directional vector that represents gender (the difference between “boy” and “girl”, “man” and “woman”, etc.), you find stereotypically female professions or roles sitting far away from stereotypically male ones along that axis.
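A minimal sketch of how you might probe for this yourself, assuming the gensim library and its downloadable “word2vec-google-news-300” vectors; the specific model, word pairs, and profession list are just illustrative choices, and any pretrained embedding containing these words would behave similarly.

```python
import numpy as np
import gensim.downloader as api

# Assumes gensim's downloadable Google News word2vec vectors (a large download).
wv = api.load("word2vec-google-news-300")

# Estimate a gender direction by averaging a few he/she-style difference vectors.
pairs = [("he", "she"), ("man", "woman"), ("boy", "girl")]
gender_dir = np.mean([wv[a] - wv[b] for a, b in pairs], axis=0)
gender_dir /= np.linalg.norm(gender_dir)

# Project profession words onto that axis: positive values lean toward the
# "he" end of the direction, negative values toward the "she" end.
for word in ["engineer", "nurse", "programmer", "homemaker", "doctor", "receptionist"]:
    proj = float(np.dot(wv[word], gender_dir) / np.linalg.norm(wv[word]))
    print(f"{word:>14s}: {proj:+.3f}")
```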

This is certainly an uncomfortable thing to see. Many of us normatively believe that women are every bit as capable as men, and aspire to a world where gender ratios equalize across professions. But the word embedding algorithm didn’t get that association out of thin air. It absorbed millions and millions of sentences in which our gendered world, as it currently exists, was rendered in text. It learned gender bias as a semantically meaningful fact about the world because, in the absence of some kind of “moral regularizer” to tell it that this kind of correlation isn’t one worth capturing, it seemed as salient a reality as any other.

I’m becoming a broken record now, but just as before: while there are some cool technical solutions being proposed to this specific problem (see the link earlier in this section), in order to really pre-emptively address it, we’d need to specify a priori what kinds of semantic patterns fall into this niche: true correlations that we don’t want the algorithm to represent, as a matter of normative preference.
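To gesture at what those technical solutions look like, one common idea is to project the learned gender direction out of words that shouldn’t carry gender. Here is a minimal numpy sketch of that idea, reusing wv and gender_dir from the snippet above; it is a simplification of the proposed approaches, not a faithful reimplementation of any particular one.

```python
def neutralize(vec, direction):
    """Remove the component of `vec` that lies along `direction` (assumed unit-norm)."""
    return vec - np.dot(vec, direction) * direction

# After neutralizing, "nurse" has essentially no remaining component
# along the estimated gender axis.
debiased_nurse = neutralize(wv["nurse"], gender_dir)
print(np.dot(debiased_nurse, gender_dir))   # ~0
```

The hard part, of course, is not the projection itself but deciding which words and which directions deserve this treatment, which is exactly the normative judgement this section keeps circling back to.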

In summary:

Bias is an unavoidably normative question

“Bias” is often framed as simply a technical problem, or a societal one, pushing everyone to tut-tut at those narrow-visioned engineers who are biasing their algorithms to only work well for them. I think that does a disservice to anyone wishing to do clear, concrete work on the problem. Because, while technical concerns certainly weave themselves in with the moral ones, you can’t get anywhere without acknowledging that bias is fundamentally an assertion that something about the world we present to our models in the form of data is not as it should be. By definition, such an assertion isn’t something that can be proven or disproven, tested or checked, by referring to the world as it currently is.

This is most easily seen in the simple fact that almost all claims about bias are about outcomes differing between groups, and the question of which kinds of cross-group differences have moral significance is fundamentally a moral one.

A lot of these issues come down to two key questions: “what aspects of our current world do we wish to not have represented in our algorithms”, and “what kinds of inequalities or errors are the ones we care most deeply about, and wish to see corrected”. If we can think deeply and clearly about the answers to those — fundamentally normative — questions, I think we’ll be able to make more serious progress in solving these problems.
