Detecting Changes in COVID-19 Cases with Bayesian Models

Written by jramkiss | Published 2020/05/20
Tech Story Tags: data-science | cor | machine-learning | statistics | python | programming | kaggle | coronavirus

TLDR Bayesian change point model to estimate the date that the number of new COVID-19 cases starts to flatten in different countries. The model was written in Pyro, a probabilistic programming language built on PyTorch. It is useful for estimating the date of change, not for predicting what will happen with the virus. It should not be mistaken for an amazing epidemiology model that will tell us when the quarantine will end, but instead a way of describing what we have already observed.via the TL;DR App

Bayesian change point model to estimate the date that the number of new COVID-19 cases starts to flatten in different countries.

Problem

With the current global pandemic and its associated resources (data, analyses, etc.), I’ve been trying for some time to come up with an interesting COVID-19 problem to attack with statistics. After looking at the number of confirmed cases for some counties, it was clear that at some date, the number of new cases stopped being exponential and its distribution changed.
However, this date was different for each country (obviously). This post introduces and discusses a Bayesian model for estimating the date that the distribution of new COVID-19 cases in a particular country changes. The model coefficients are interpretable and can be used for future analysis on the effectiveness of social distancing measures adopted by different countries.
An important reminder before we get into it is that all models are wrong, but some are useful. This model is useful for estimating the date of change, not for predicting what will happen with COVID-19.
It should not be mistaken for an amazing epidemiology model that will tell us when the quarantine will end, but instead a way of describing what we have already observed with probability distributions.
All the code for this post can be found here and you can read the original version of this post here.

Model

We want to describe y, log of the number of new COVID-19 cases in a particular country each day, as a function of t, the number of days since the virus started in that country. We’ll do this using a segmented regression model. The point at which we segment will be determined by a learned parameter, τ, as described below:
In other words, y will be modelled as w₁ + b₁ for days up until day τ. After that it will be modelled as w₂ + b₂.
The model was written in Pyro, a probabilistic programming language built on PyTorch. Chunks of the code are included in this post, but the majority of code is in this notebook.

Data

The data used was downloaded from Kaggle. Available to us is the number of daily confirmed cases in each country, and the figure below shows this data in Italy. It is clear that there are some inconsistencies in how the data is reported, for example, in Italy there are no new confirmed cases on March 12th, but nearly double the expected cases on March 13th.
In cases like this, the data was split between the two days.
The virus also starts at different times in different countries. Because we have a regression model, it is inappropriate to include data prior to the virus being in a particular country. This date is chosen by hand for each country based on the progression of new cases and is never the date the first patient is recorded.
The “start” date is better interpreted as the date the virus started to consistently grow, as opposed to the date the patient 0 was recorded.

Prior Specification

Virus growth is sensitive to population dynamics of individual countries and we are limited in the amount of data available, so prior specification is really important here.
Starting with w₁ and w₂, these parameters can be loosely interpreted as the growth rate of the virus before and after the date change. We know that the growth will be positive in the beginning and is not likely to be larger than 1 (after all, w₁ is a gradient). With these assumptions, w₁ ~ N(0.5, 0.25) is a suitable prior. We’ll use similar logic for p(w), but will have to keep in mind flexibility.
Without a flexible enough prior here, the model won’t do well in cases where there is no real change point in the data. In these cases, w₂ ≈ w₁ and we’ll see and example of this in the Results section. For now, we want p(w) to be symmetric about 0, with the majority of values lying between (-0.5, 0.5). We’ll use w₂ ~ N(0, 0.25).
Next are the bias terms, b₁ and b₂. Priors for these parameters are especially sensitive to country characteristics. Countries that are more exposed to COVID-19 (for whatever reason), will have more confirmed cases at its peak than countries that are less exposed. This will directly affect the posterior distribution for b₂ (which is the bias term for the second regression).
In order to automatically adapt this parameter to different countries, we use the mean of the first and forth quartiles of y as the prior means of b₁ and b respectively. The standard deviation for b₁ is taken as 1, which makes p(b) a relatively flat prior. The standard deviation of p(b) is taken as a quarter of its prior mean so that the prior scales with larger mean values.
As for τ, since at this time we don’t have access to all the data (the virus is ongoing), we’re unable to have a completely flat prior and have the model estimate it. Instead, the assumption is made that the change is more likely to occur in the second half of the date range at hand, so we use τ ~ Beta(4, 3).
Below is the model written in Pyro.
class COVID_change(PyroModule):
    def __init__(self, in_features, out_features, b1_mu, b2_mu):
        super().__init__()
        self.linear1 = PyroModule[nn.Linear](in_features, out_features, bias = False)
        self.linear1.weight = PyroSample(dist.Normal(0.5, 0.25).expand([1, 1]).to_event(1))
        self.linear1.bias = PyroSample(dist.Normal(b1_mu, 1.))

        self.linear2 = PyroModule[nn.Linear](in_features, out_features, bias = False)
        self.linear2.weight = PyroSample(dist.Normal(0., 0.25).expand([1, 1])) #.to_event(1))
        self.linear2.bias = PyroSample(dist.Normal(b2_mu, b2_mu/4))

    def forward(self, x, y=None):
        tau = pyro.sample("tau", dist.Beta(4, 3))
        sigma = pyro.sample("sigma", dist.Uniform(0., 3.))
        # fit lm's to data based on tau
        sep = int(np.ceil(tau.detach().numpy() * len(x)))
        mean1 = self.linear1(x[:sep]).squeeze(-1)
        mean2 = self.linear2(x[sep:]).squeeze(-1)
        mean = torch.cat((mean1, mean2))
        obs = pyro.sample("obs", dist.Normal(mean, sigma), obs=y)
        return mean
Hamiltonian Monte Carlo is used for posterior sampling. The code for this is shown below.
model = COVID_change(1, 1,
                     b1_mu = bias_1_mean,
                     b2_mu = bias_2_mean)

num_samples = 600
# mcmc
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel,
            num_samples=num_samples,
            warmup_steps = 100,
            num_chains = 4)
mcmc.run(x_data, y_data)
samples = mcmc.get_samples()

Results

Since I live in Canada and have exposure to the dates precautions started, modelling will start here. We’ll use February 27th as the date the virus “started”.
Priors:
Posterior Distributions:
Starting with the posteriors for w₁ and w₂, if there was no change in the data we would expect to see these two distributions close to each other as they govern the growth rate of the virus. It is a good sign that these distributions, along with the posteriors for b₁ and b₂, don’t overlap. This is evidence that the change point estimated by our model is true.
This change point was estimated as: 2020–03–28
As a side note (warning: no science) my company issued a mandatory work from home policy on March 16th. Around this date is when most companies in Toronto would have issues mandatory work from home policies where applicable. Assuming the reported incubation period of the virus is up to 14 days, this estimated date change makes sense as it is 12 days after widespread social distancing measures began!
The model fit along with 90% credible interval bands can be seen in the plot below. On the left is log of the number of daily cases, which is what we used to fit the model, and on the right is the true number of daily cases. It is very difficult to visually determine a change point by simply looking at the number of daily cases, and even more difficult by looking at the total number of confirmed cases.

Assessing Convergence

When running these experiments, the most important step is to diagnose the MCMC for convergence. I adopt 3 ways of assessing convergence for this model by observing mixing and stationarity of the chains and R_hat. R_hat is the factor by which each posterior distribution will reduce by as the number of samples tends to infinity.
A perfect R_hat value is 1 but values less than 1.1 are a strong indication of convergence. We observe mixing and stationarity of the Markov chains in order to know if the HMC is producing appropriate posterior samples.
Below are trace plots for each parameter. Each chain is stationary and mixes well. Additionally, all R_hat values are less than 1.1.
After convergence, the last thing to check before moving on to other examples is how appropriate the model is for the data. Is it consistent with the assumptions made earlier? To test this we’ll use a residual plot and a QQ-plot, as shown below. I’ve outlined the estimated change point in order to compare residuals before and after the change to test for homoscedasticity.
The residuals follow a Normal distribution with zero mean, and no have dependence with time, before and after the date of change.

What About No Change?

Aha! What if the number of new cases has not started to flatten yet?! To test the model’s robustness to this case, we’ll look at data from Canada up until March 28th. This is the day that the model estimated curve flattening began in Canada.
Just because there isn’t a true change date doesn’t mean the model will output “No change” (lol). We’ll have to use the posterior distributions to reason that the change date provided by the model is inappropriate, and consequentially there is no change in the data.
Priors:
Posterior Distributions:
The posteriors for w₁ and w₂ have significant overlap, indicating that the growth rate of the virus hasn’t changed significantly. Posteriors for b₁ and b₂ are also overlapping. These show that the model is struggling to estimate a reasonable τ, which is good validation for us that the priors aren’t too strong.
Although we have already concluded that there is no change date for this data, we’ll still plot the model out of curiosity.
Similar to the previous example, the MCMC has converged. The trace plots below show sufficient mixing and stationarity of the chains, and most R_hat values less than 1.1.

Next Steps and Open Questions

This model is able to describe the data well enough to produce a reliable estimate of the day flattening the curve started. A useful byproduct of this is the coefficient term for the 2nd regression line, w₂. By calculating w₂ and b₂ for different countries, we can compare how effective their social distancing measures were. This analysis and more will likely come in a subsequent post.
Keep an eye out on my personal blog for related posts!
Thank you for reading, and definitely reach out to me by e-mail or other means if you have suggestions or recommendations, or even just to chat!

Published by HackerNoon on 2020/05/20