What Is A Confounding Variable

Written by SeattleDataGuy | Published 2018/06/03
Tech Story Tags: healthcare | data-science | statistics | data | confounding-variable

TLDRvia the TL;DR App

Let’s say a group of researchers, or data scientists discover that the mortality rate in Florida is 20 deaths out of 1000 people a year compared to Washington State where it is 9.8 deaths out of 1000 people.

Being very concerned these researchers put in a proposal for millions of dollars to try to figure out how to decrease this mortality rate in Florida. However, they forgot to dig a little deeper. They forgot to exam the average age of the populations of the two states. If Florida’s average age is 52 and Washington’s is 25 that might play a role in the mortality rate. In this case, the age played the role of a confounding variable.

A variable that is not considered but plays a role in the outcome of an event is considered a confounding factor.

In epidemiology, a confounding variable refers to a variable that is a risk factor for a disease or is associated to the exposure of the disease but is not the actual exposure. Confounding variables are not limited to a disease. When developing algorithms for product recommendations, A/B testing or market segmentation confounding variables can creep in and mislead data scientists and analysts looking to create effective algorithms.

This can lead to invalid conclusions and incorrect comparisons. For instance, in the case of death rates in Washington and Florida, you are comparing two different populations. This is more like comparing apples to oranges. Although in this case you are comparing people to people, the populations have different attributes that make them unfair to compare.

In order to mitigate this there are two specific options. One is to develop the populations into more focused categories like age. Another method that could be used to standardize measurements like death rates is to use a direct adjustment rate. Using a direct adjustment rate allows you to create a standardized

If you would prefer, you can watch a video that will discuss this same concept.

A Real Life Example Of Confounding Variables

Let’s suppose you want to compare an outcome in two populations. In this case, let’s say you are comparing the case fatality rates of a trauma hospital to that of a general hospital. This example is actually based off real studies done because hospitals were having a difficult time fighting to keep their trauma centers open.

Let’s suppose that the trauma hospital has a case fatality rate of 8.6% where as the general hospital has a case fatality rate of 5.6%. From this perspective it would seem as if it would be safer if we went to a general hospital as exposure to a trauma hospital seems to lead to a greater chance of death. This was what some people assumed by reading these numbers and it is not such a far off conclusion based off the numbers.

However, in this case, we are ignoring the level of injury or trauma that the patients who visit both hospitals may have when they come in. In this case, we will refer to this as the triage level. There are technically 5 levels. In order to reduce the amount of math, I am only going to consider 1–4. Hopefully…there aren’t too many patients dying when they are considered level 5.

Looking at the spread of different patients who went to either a trauma or general hospital. You might notice that there is a distinct difference between which type of patients went where. We are seeing a much higher proportion of patients that visited trauma hospitals were triage level 1 compared to general hopsitals(320 patients for trauma hospitals and 80 patients for general hospitals). That makes sense, you would want to refer more of your triage level 1 cases to a trauma hospital.

However, the difference between the two data sets becomes more apparent in the next table.

This graph depicts the amount of patients who live and died after their hospital experience. L stands for lived, D stands for died.

Here we see the difference in death rates per triage level. So although the overall case fatality rate is much higher in the Trauma Hospitals, the overall percentage of deaths for triage level 1s is higher in the general hospital(9.4% for Trauma Hospitals compared to 25% for General Hospitals). That means, if there would have been more patients brought to the general hospital that are triage level 1, there is a good chance their case fatality rate would be worse.

Now the next question is, how do we adjust the rates so they more fairly and concisely compare each case fatality rate per hospital type?

That is where this final table comes into play. The goal will be to create one standardized populations by adding the two populations together. Then, we use the percentage rates per death for each of the hospital types on the new standardized population.

This creates a fairer comparison of the two populations. In the end, you end up without about 51 total deaths out of 1202 people for trauma hospitals and 123 deaths out of 1202 people for general hospitals.

This in turn provides a very different story and a more fairly compared one at that. Now the new death rates are 4.3% for trauma hospitals compared to 10.3% for general hospitals.

When these numbers are provided to hospital administrators, they will have a clearer picture on what is actually occuring. With this information they can make a more informed decision without having to ask to much about context and if they do, then you can explain your methodology. However, this method avoids bad assumptions being made without good questions being asked first.

Confounding variables are present in every area of research where correlations are being examined. Thus, it is important to look past the initial outputs and question whether there are possible underlying reasons for the numbers you are seeing.

When developing a report for directors and managers it is important to consider how you will simplify complex concepts like confounding variables. It is about telling a complete story. This occurs when you are able to concisely state a point across multiple data points and graphs.

If your team is looking for a team of data experts who can help develop your healthcare analytics, then feel free to reach out today.

If You’re interested in reading more about data science:

Time Series Model Development: Basic Models

How To Measure The Accuracy Of Predictive Models

What is A Decision Tree

How Algorithms Can Become Unethical and Biased

How Men’s Wearhouse Could Use Data Science Cont

Introduction To Time Series In R

How To Develop Robust Algorithms

4 Must Have Skills For Data Scientists


Published by HackerNoon on 2018/06/03