How To Meaningfully Interpret COVID-19 Data

Written by shreya_pathak | Published 2020/05/11
Tech Story Tags: covid19 | data-science | big-data | machine-learning | healthcare | data | ai | hackernoon-top-story

TLDR This article is a guide to understanding and interpreting the various forms of data being reported on the COVID-19 pandemic. We are all desperately hoping to see signs that the outbreak might be improving, so we need to look for trends and patterns in the data to see when an outbreak has been contained. Limited bandwidth for testing skews the reported daily numbers in a few ways, and those limits must be kept in mind when reading the data.

How Will We Know When the COVID-19 Pandemic is Getting Better?

With the endless stream of headline-grabbing numbers and figures being reported about the COVID-19 pandemic, it is easy to lose sight of what we might learn from the data. As a physician, I believe decisions must be guided by science and data, and for that we must understand data in context. Ultimately, we are all desperately hoping to see signs that the COVID-19 pandemic might be improving.
This article is a brief guide to understanding and interpreting the various forms of data being reported on COVID-19. The questions I hope to answer in this post include the following:
  • What are the daily numbers telling us?
  • How do limitations on testing affect the data?
  • How can we tell if social distancing is working?
  • What is not being reported?
  • What kind of data should we be asking for?
  • What are the signs that an outbreak has been contained?

Principles for interpreting COVID-19 data

1. Look for trends and patterns
Drawing conclusions from a single data point is one of the easiest mistakes to make. Because day-to-day variations in testing or reporting remain unpredictable (for example, testing capacity is often reduced over the weekend), we must be careful to verify a trend over several days. I suggest using at least a 3-day running average on daily data to smooth out the variance. In the figure below showing daily cases in Italy, you might have prematurely concluded that the epidemic had peaked after seeing declines on Mar. 10 or Mar. 13. While we cannot predict the future, using a 3-day running average makes trends easier to identify.
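As a minimal sketch of the smoothing described above, the function below computes a trailing running average. The function name and the daily counts are illustrative assumptions, not real data; the trailing window (each day averaged with the days before it) is one reasonable choice among several.

```python
def running_average(values, window=3):
    """Trailing running average over `window` days.

    The first window-1 days have no full window, so they are returned as None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averaged = []
    for i in range(len(values)):
        if i + 1 < window:
            averaged.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            averaged.append(sum(chunk) / window)
    return averaged


# Hypothetical daily case counts (not real data). The raw series dips on
# days 3, 5 and 7, which could be misread as a peak.
daily_cases = [100, 160, 130, 220, 190, 280, 250]
smoothed_cases = running_average(daily_cases)
print(smoothed_cases)
```

In this made-up series the raw counts fall three times, yet the 3-day average rises on every day, which is the kind of false peak the smoothing is meant to guard against.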
2. Understand the utility of logarithmic scales
Logarithmic scales are used to illustrate the rate of exponential growth. In the context of COVID-19, you will encounter logarithmic plots primarily to show the total number of cases (or deaths) over time. On a linear scale, an exponential growth curve looks like a hockey stick, quickly skyrocketing up the “handle” portion of the curve. On a logarithmic scale, however, the exponential growth rate is converted into the slope of the line. A straight line indicates a constant growth rate, and the steepness of the line indicates its magnitude. Therefore, any change in the slope of the line tells you whether growth is slowing down (flatter) or accelerating (steeper).
Let’s use the data from South Korea where a large outbreak was contained to illustrate:
South Korea experienced an outbreak originating from a member of a megachurch in Daegu starting in mid-February. Before this outbreak, there were very few new cases, so total cases are flat on both the linear and logarithmic scales. During the outbreak, the fastest growth on the linear plot appears to be between Feb. 27 and Mar. 4; in terms of absolute number of cases, this was the period with the most new cases. However, the logarithmic plot is steepest between Feb. 18 and Feb. 22, indicating that the highest growth rate came earlier in the outbreak before gradually flattening out (slowing) to a near-horizontal line today. The flattening of the line on a logarithmic scale allows us to easily visualize when the growth rate declined and the outbreak was contained.
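To make the difference between the two scales concrete, here is a small sketch that computes both the absolute daily increase and the slope of the logarithmic curve for the same series. The function names and the numbers are hypothetical, and the sketch assumes every total is greater than zero.

```python
import math


def daily_increases(total_cases):
    """Absolute new cases per day: the slope on a linear plot."""
    return [b - a for a, b in zip(total_cases, total_cases[1:])]


def log_slopes(total_cases):
    """Daily change in log(total cases): the slope on a logarithmic plot."""
    return [math.log(b / a) for a, b in zip(total_cases, total_cases[1:])]


def doubling_time(log_slope):
    """Days needed to double at a given (positive) logarithmic slope."""
    return math.log(2) / log_slope


# Hypothetical cumulative totals (not real data)
total_cases = [100, 200, 350, 550, 700]

increases = daily_increases(total_cases)   # [100, 150, 200, 150]
slopes = log_slopes(total_cases)

day_of_most_new_cases = increases.index(max(increases))
day_of_fastest_growth = slopes.index(max(slopes))
print(day_of_most_new_cases, day_of_fastest_growth)
```

In this invented series the largest absolute jump comes on the third interval, while the growth rate was already at its highest on the first, mirroring the pattern described for South Korea.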
3. Understand the number of cases in the context of testing limitations (sources: COVID Tracking, Politico)
There is a saying in medicine that you can’t find a fever if you don’t measure a temperature. Likewise, you can’t find COVID-19 cases if you don’t test for them. The delayed push for testing in the United States has been well-documented. While the situation has improved in the past week, testing is still largely limited to symptomatic high-risk patients (the elderly or immunocompromised) with a known COVID-19 contact, or to acutely symptomatic patients being considered for hospitalization. The limited bandwidth for testing skews the reported daily cases in a few ways. First, we know there are likely hundreds of thousands of undiagnosed COVID-19 cases in the community because those patients were not sick enough to be tested. Second, bear in mind that test results take 2–3 days, so a positive result today reflects a patient who was symptomatic 2–3 days ago. Third, regional variations in testing criteria and protocol make it difficult to compare data between areas. Fourth, and most importantly, I believe the most consistently tested group of patients is those who are acutely sick and require hospitalization.
To illustrate the variations in testing within the 9 states with the most cases on 3/25/20, consider the image below:
You should notice a few things. (1) Some states did not start testing widely until recently (NJ) while others started testing in early March (WA). (2) Some states are getting positive results on less than 10% of tests (MA, IL) while other states are getting over 50% positives (MI). (3) New York state has tested over 100,000 patients, while Michigan has tested fewer than 4,500.
The bottom line is that testing criteria and protocol still vary greatly from region to region, so the absolute number of cases should be taken with a grain of salt. In addition, each region may be at a different phase of its epidemic curve, so it is always best to look at the logarithmic growth curve for that locality to understand where it is headed.
4. Use data expressed in “per capita” if you want to compare between regions
The absolute number of patients tested, cases, or deaths in two regions can be misleading if you forget to account for differences in population. In other words, 1,000 cases in Rhode Island would be very different from 1,000 cases in Texas. Converting absolute numbers into per capita figures, such as per million people, allows for comparison of the magnitude of an outbreak in two different locations. As an example, Trump announced on Twitter on 3/25/20 that the United States had now performed more COVID-19 tests than South Korea. Per capita, however, the United States lags far behind at 1,400 tests per million compared to over 6,200 tests per million in South Korea.
As another shocking example, Italy has reported 136 deaths per million population compared to 9 deaths per million in the UK and 4 deaths per million in the US.
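The Rhode Island versus Texas comparison can be sketched in a few lines. The function name is illustrative, and the population figures are rough assumptions (about 1.06 million and 29 million) used only to show the scale of the difference.

```python
def per_million(count, population):
    """Convert an absolute count into a rate per million people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 1_000_000


# Rough, assumed populations for illustration only
populations = {"Rhode Island": 1_060_000, "Texas": 29_000_000}

# The same hypothetical 1,000 cases in each state
rates = {state: per_million(1_000, pop) for state, pop in populations.items()}

for state, rate in rates.items():
    print(f"{state}: {rate:.0f} cases per million")
```

Under these assumed populations, the same 1,000 cases work out to a per capita burden in Rhode Island that is more than 25 times the one in Texas.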
5. Be as granular as possible with data to answer your question
Having a well-formulated question before looking at data will give you the best chance to find a meaningful answer. For example — if you are interested in the outbreak in New York City, you should use city- and state-level data rather than national numbers. The corollary is that outbreaks in different regions may be at different phases of their epidemic curve. Absolute numbers don’t tell the whole story — you need to dig deeper and look at the growth rate of each region. When we say that “New Orleans is ten days behind New York City,” what we mean is that New Orleans has as many cases today as New York City did ten days ago with a similar growth rate and trajectory.
Here are two logarithmic plots to illustrate that epidemic curves vary widely between countries and states:
The country-by-country curves show Hong Kong and Singapore maintaining the lowest number of cases nearly a month after their 100th case. China and South Korea have flattened their curves after their initial outbreaks. The US, unfortunately, appears to be charting the steepest growth rate and now has the most cases in the world.
The state-by-state curves show Oregon and Nebraska maintaining a slow rate of growth whereas cases in Louisiana and Michigan are increasing at a rate faster than New York.
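Curves like these are usually aligned by days since the 100th case, and a statement such as "ten days behind" can be estimated from the same series. The sketch below is one hedged way to do both. The function names and numbers are hypothetical, both series are assumed to cover the same calendar dates ending today, and the check looks only at case counts, not at whether the growth rates are similar.

```python
def since_threshold(total_cases, threshold=100):
    """Drop the days before the total first reached `threshold`."""
    for i, total in enumerate(total_cases):
        if total >= threshold:
            return total_cases[i:]
    return []  # threshold never reached


def days_behind(leader, follower):
    """How many days ago the leader first reached the follower's latest total.

    Returns None if the leader has not yet reached that total.
    """
    current = follower[-1]
    for i, total in enumerate(leader):
        if total >= current:
            return len(leader) - 1 - i
    return None


# Hypothetical cumulative totals for two regions over the same six days
leader = [50, 120, 300, 700, 1500, 3200]
follower = [5, 20, 50, 120, 300, 700]

aligned_leader = since_threshold(leader)
aligned_follower = since_threshold(follower)
lag = days_behind(leader, follower)
print(aligned_leader, aligned_follower, lag)
```

Once aligned, the two invented series overlap exactly for their first three days, which is what a "similar trajectory" looks like on a days-since-100th-case chart.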
Here is how we should think about the following types of data being reported:
Total cases and deaths; daily cases and daily deaths
Focus on day-to-day trends and the growth rate. Use 3-day running averages to visualize trends and look at logarithmic charts to determine if the growth rate is slowing or increasing. Remember to convert data into per capita in order to compare magnitude of outbreaks between regions. Keep in mind any regional discrepancies in testing criteria and capacity.
Percentage change in daily cases or daily deaths
This data is typically presented visually as logarithmic graphs. As explained above, this is one of the most useful visuals to see whether the growth rate is slowing or increasing. The steeper the curve, the faster the growth. This is also the easiest way to tell if social distancing is having the intended effects — the mantra of #flattenthecurve is exactly what we’re looking for.
Percentage of COVID-19 tests that are positive
Keep in mind who is actually getting tested: in a state like Michigan, where only the sickest patients are being tested, you’ll get a much higher percentage of positive results. Generally I do not find this data helpful, except when a region maintains the same testing criteria and protocol over a period of time. In that case, you can look at the percentage of test results that are positive as a measure of the prevalence of COVID-19 within the population being tested.
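A short sketch of the calculation, using two invented regions with different testing criteria. The function name and every number here are assumptions for illustration, not reported figures.

```python
def positivity_rate(positive_tests, total_tests):
    """Fraction of completed tests that came back positive."""
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    return positive_tests / total_tests


# Hypothetical regions (not real data):
# Region A tests only the sickest patients; Region B tests broadly.
region_a = positivity_rate(positive_tests=500, total_tests=1_000)
region_b = positivity_rate(positive_tests=800, total_tests=10_000)

print(f"Region A: {region_a:.0%} positive, Region B: {region_b:.0%} positive")
```

Region B has more confirmed cases but a far lower positive rate, so the two percentages say more about who was tested than about which outbreak is larger.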
Mortality rate
Take all mortality rate data with a grain of salt — the full scope of the epidemic will not be understood until years from now. The number we are interested in is the number of deaths out of all COVID-19 patients. However, due to limited testing, we do not know the full extent of asymptomatic and undiagnosed cases. The mortality rates currently reported are calculated against the number of diagnosed cases. A few caveats:
  1. The numerator (deaths) depends on how a region classifies a COVID-19 death. Some countries will assign any death in a COVID-19 positive patient to COVID-19, whereas other countries may still assign that death to their co-morbid medical conditions.
  2. The denominator (cases) depends on the number of diagnosed cases, which is limited by testing. As testing is prioritized for the sickest patients, current mortality rates will likely be an over-estimate.
  3. Keep in mind that deaths trail diagnosis by a period of 5–7 days. In other words, a patient who dies today was likely diagnosed 5–7 days ago, so the mortality rate based on current data does not factor in those who were recently diagnosed and are still being treated. This effect causes the reported mortality rate to be an under-estimate. The figure below shows this 5-day delay between cases and deaths in Italy.
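The effect of the delay in the third caveat can be illustrated by dividing today's deaths by the case count from several days earlier instead of by today's. This is only a rough sketch: the function names and data are hypothetical, the 6-day lag is an assumed midpoint of the 5–7 day range, and it does nothing to correct for undiagnosed cases (the second caveat), which pushes the estimate in the opposite direction.

```python
def naive_fatality_rate(total_deaths, total_cases):
    """Deaths to date divided by diagnosed cases to date."""
    return total_deaths[-1] / total_cases[-1]


def lag_adjusted_fatality_rate(total_deaths, total_cases, lag_days=6):
    """Deaths to date divided by the diagnosed cases `lag_days` ago."""
    if len(total_cases) <= lag_days:
        raise ValueError("not enough history for the chosen lag")
    return total_deaths[-1] / total_cases[-1 - lag_days]


# Hypothetical cumulative series over seven days (not real data)
total_cases = [1000, 1300, 1700, 2200, 2900, 3800, 5000]
total_deaths = [10, 14, 19, 26, 35, 47, 60]

naive = naive_fatality_rate(total_deaths, total_cases)
adjusted = lag_adjusted_fatality_rate(total_deaths, total_cases)
print(f"naive: {naive:.1%}, lag-adjusted: {adjusted:.1%}")
```

While cases are growing quickly, the naive figure sits well below the lag-adjusted one, which is why a rate computed from same-day totals tends to understate mortality among diagnosed patients.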
Number of patients hospitalized; ventilator utilization
This is the dataset that I wish every city, state, and country would report. Given regional inconsistencies in testing, the most consistent measure of COVID-19 burden in a community is the number of sick patients requiring hospitalization. This group of patients does not have the option of self-quarantine without testing. From the number of hospital admissions, you can even work backwards to estimate the total number of cases in that area. For example, early data from China showed approximately 20% of cases required hospitalization, and that approximately 86% of COVID-19 cases were undiagnosed. This extrapolates to approximately 35 cases in the community per hospital admission.
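The extrapolation above can be written out step by step. This sketch assumes the 20% hospitalization rate applies to diagnosed cases, which is the reading that reproduces the figure of roughly 35; the function name is illustrative.

```python
def cases_per_admission(hospitalization_rate, undiagnosed_fraction):
    """Estimate total community cases represented by one hospital admission.

    Assumes `hospitalization_rate` is the share of *diagnosed* cases that are
    hospitalized, and `undiagnosed_fraction` is the share of all cases that
    are never diagnosed.
    """
    diagnosed_per_admission = 1 / hospitalization_rate   # 1 / 0.20 = 5
    diagnosed_fraction = 1 - undiagnosed_fraction        # 1 - 0.86 = 0.14
    return diagnosed_per_admission / diagnosed_fraction  # 5 / 0.14


estimate = cases_per_admission(hospitalization_rate=0.20, undiagnosed_fraction=0.86)
print(f"about {estimate:.1f} cases per admission")
```

The estimate is very sensitive to the undiagnosed fraction: under the same assumptions, changing it from 86% to 50% would cut the result from about 35 to 10.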
COVID Tracking has started to include the number of hospitalized patients in their dataset, but a state-by-state view shows that most states are not reporting this data. Ultimately, we know that healthcare is a limited resource. We have a fixed number of ICU beds and ventilators available. The key number for our healthcare system will be the number of new daily admissions — it must be balanced against discharges and deaths — or else capacity will be overwhelmed. This is complicated by the fact that even patients who recover may be expected to be in the hospital for 7–14 days.
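The balance between admissions, discharges, and deaths can be sketched as a simple running census checked against a fixed capacity. Everything here is hypothetical: the function name, the daily figures, and the capacity of 100 beds are assumptions chosen only to show the bookkeeping.

```python
def project_census(current_patients, admissions, discharges, deaths, capacity):
    """Track the number of hospitalized patients day by day.

    Returns the daily census and the index of the first day it exceeds
    `capacity` (or None if it never does).
    """
    census = []
    first_day_over = None
    patients = current_patients
    for day, (admitted, discharged, died) in enumerate(
        zip(admissions, discharges, deaths)
    ):
        patients += admitted - discharged - died
        census.append(patients)
        if first_day_over is None and patients > capacity:
            first_day_over = day
    return census, first_day_over


# Hypothetical four-day scenario (not real data)
census, first_day_over = project_census(
    current_patients=80,
    admissions=[10, 12, 15, 18],
    discharges=[4, 4, 5, 5],
    deaths=[1, 1, 1, 2],
    capacity=100,
)
print(census, first_day_over)
```

Because discharges lag admissions by the length of a hospital stay, rising admissions outpace them in this example and the census crosses capacity on the third day (index 2).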

What are the signs that an outbreak has been contained?

1. Decrease in rate of growth of daily cases (flattening of the logarithmic curve).
This will be the earliest sign that the spread of COVID-19 is slowing. We hope to see this approximately 2–3 weeks after social distancing measures are enacted. We are starting to see this sign in Italy’s logarithmic curve but not yet in the United States.
2. Decrease in absolute number of daily cases.
Fewer new cases day after day will always be a good sign. Use a 3-day running average and look for a sustained trend downwards. The number of cases in Italy appeared to stabilize for four days between Mar. 22 and Mar. 25 before ticking up again on Mar. 26.
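One hedged way to turn "a sustained trend downwards" into a check is to require the 3-day average to fall for several consecutive days. The function names, the four-day requirement, and the data below are all assumptions for illustration; the right threshold is a judgment call.

```python
def smoothed(values, window=3):
    """Trailing running average, starting at the first full window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def is_sustained_decline(daily_cases, window=3, min_days=4):
    """True if the running average fell on each of the last `min_days` days."""
    averages = smoothed(daily_cases, window)
    if len(averages) < min_days + 1:
        return False  # not enough data to judge
    recent = averages[-(min_days + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical daily case counts (not real data)
steady_decline = [1000, 950, 900, 800, 700, 650, 600, 500]
decline_then_uptick = [1000, 900, 800, 700, 600, 500, 400, 900]

print(is_sustained_decline(steady_decline))       # True
print(is_sustained_decline(decline_then_uptick))  # False
```

The second series shows why the check looks at several days: a single uptick at the end is enough to break the trend, much like the rise in Italy on Mar. 26.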
3. Decrease in rate of growth of daily deaths and absolute number of daily deaths.
Remember that deaths will trail cases by an average of 5–7 days. A healthcare system pushed past the capacity of its ICU beds and ventilators will see a higher mortality rate, so the death curve might peak higher and last longer than the case curve.
4. Decrease in hospital admissions and ventilator utilization
This is the holy grail. When the number of sick patients being admitted to hospitals begins to decrease, we will know that the number of COVID-19 cases in the community must be decreasing. Unfortunately, this data is not readily available or consistently reported. To get this data, we will have to listen to our doctors on the front lines in the ED and in the ICU. What they are seeing and reporting from across the country cannot be expressed in numbers alone.

Published by HackerNoon on 2020/05/11