Basic Understanding of ARIMA/SARIMA vs Auto ARIMA/SARIMA using Covid-19 Data Predictions

Written by sharmi1206 | Published 2020/07/15
Tech Story Tags: time-series | timeseries | arima-model | covid19 | predictive-analytics | machine-learning | data-science | hackernoon-top-story

TLDR Python has 2 libraries StatsModels and Pyramid that helps to build forecasting models and predict values at a future time. In this blog, I try to summarise the functionalities of both of these libraries by demonstrating the Number of Active Cases for Covid-19 for any Indian state. We keep our scope limited to univariate time series analysis. Here, we have demonstrated using state Maharashtra data and modeling can be done using the processed data from https://github.com/sharmi1206/covid19-analysis.via the TL;DR App

Motivation

A time-series represents the sequence of a variable recorded with time over regular intervals of time. There are various applications of time series like Stock Market Prediction (with daily and hourly data), Annual Sales/Revenue, Seasonal Temperatures changes, monthly Cloud Infrastructure cost and second level prediction of
Python has 2 libraries StatsModels and Pyramid that helps to build forecasting models and predict values at a future time. As both these libraries provide forecasts for similar models namely,
AR (Auto-Regressive)MA (Moving Average)ARIMA (Auto-Regressive Integrated Moving Average)SARIMA (Seasonal Auto-Regressive Integrated Moving Average)
it will be interesting to see the features provided by each of the libraries along with the pros and cons of using each of them.
In this blog, I try to summarise the functionalities of both of these libraries by demonstrating the Number of Active Cases for Covid-19 for any Indian state. We keep our scope limited to univariate time series analysis.

DataSet

The scope of the data is fro 8th March-1st May where data from 8th March-14th April has been used for training and 15th March-1st May has been used for validation and prediction. Here, we have demonstrated using state Maharashtra, however all states data and modeling can be done using the processed data from https://github.com/sharmi1206/covid-19-analysis.

STATSMODEL

It primarily comprises of three main modules:
tsa– Comprises of functions useful for univariate time series analysis – autoregressive models (AR), vector autoregressive models (VAR), and univariate autoregressive moving average models (ARMA), along with Non-linear models like dynamic regression and autoregression. In addition, it also provides autocorrelation, partial autocorrelation function, periodogram, and other statistical properties related to ARMA, and methods that process moving average lag-polynomials.
statespace –  Equipped with functionality to model observations at any time instant with unobserved state vectors and irregular components. It has added flexibility to allow estimation with missing observations, forecasting, impulse response functions, etc. This flexibility helps it to differentiate from tsa by estimating additive and multiplicative seasonal effects on models (both on univariate and multi-variate), as well as arbitrary trend polynomials.
vector_ar-Allows simultaneous modeling and analyzing multiple time series.

ARIMA MODELS

An ARIMA model is composed of 3 constituent units which are :
ARAutoregression. This part explores any dependent relationship between an observation and some number of lagged variables.
IIntegrated. This part aims to make time-series stationery by subtracting or differencing an observation from observation at the previous time step of the same time series.
MAMoving Average. This part explores the relationship between an observation and a residual error by application of moving average to lagged observations, with any given time window.
In order to design a model (say ARIMA) by applying all of the above stated procedures we need to enable all parameters. We can also enable partial parameters and select to choose either AR or MA model. The parameters can be summarized as follows:
p: The number of lag observations included in the model, also called the lag order.
d: The number of times that the raw observations are differenced also called the degree of difference.
q: The size or the order of the moving average window.
To start with and in order to have different flavours of ARIMA model, the library imports are:
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf

MANUAL MODEL SELECTION ARIMA

The following diagram illustrates the mechanism of using AR/MA/ARIMA/SARIMA by self choosing the model orders (p, q, d) and self tuning it.
AR Model
This Auto-Regressive model estimates a dependent variable Yt at any timing instant based on some number of lagged observations. The variables can be summarized as follows and formulated by the below-given equation:
Y(t-1): Lag 1 of the time series
β: Coefficient of lag1
α: Intercept term
Dependence of Lag in AR Estimation, where β and α are model estimates
Number of Active cases for Covid-19 can be formulated using AR model as follows. Here we put p, i.e. number of lag observations as 2, while d and q are set to 0.
model_ar = ARIMA(trainActiveCases, order=(2, 0, 0))  
model_ar_fit = model_ar.fit()  

prediction_ar=model_ar_fit.forecast(len(validActiveCases))[0]

print(model_ar_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_ar_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
The summary and the predicted output obtained from the model are illustrated below:
AR Moel Summary
Predicted vs Actual AR
MA Model
The MA model also estimates a dependent variable Yt based on the lagged forecast errors.
MA Estimation with Lagged Errors
where the error terms are the errors introduced from respective lags. The errors Yt and Y(t-1) are the errors from the following equations :
Number of Active cases for Covid-19 can be formulated using MA model as follows. Here we put ma (moving average window), q as 2, while d and p are set to 0.
model_ma = ARIMA(trainActiveCases, order=(0, 0, 2))  
model_ma_fit = model_ma.fit()  

prediction_ma=model_ma_fit.forecast(len(validActiveCases))[0]

print(model_ma_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_ma_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())

model_scores.append(np.sqrt(mean_squared_error(validActiveCases,prediction_ma)))
print("Root Mean Square Error for MA Model: ",np.sqrt(mean_squared_error(validActiveCases,prediction_ma)))
The summary and the predicted output obtained from the model are illustrated below:
MA Model Summary
Predicted vs Actual MA

ARIMA Model
An ARIMA model comes along with both the flavours of AR and MA model where the time series is differenced at least once to make it stationary. The equation can be formulated as follows, where we see the term β coming from AR and ε–the error terms coming from MA model .
ARIMA – Differencing + Combination of Parameters from AR and MA model
Number of Active cases for Covid-19 can be formulated using MA model as follows. Here we put q i.e the ma (moving average window), d (order of differencing ) and p (number of lag observations) are set to 1.
The code snippet for this is :
model_arima = ARIMA(trainActiveCases, order=(1, 1, 1))  
model_arima_fit = model_arima.fit()  


prediction_arima=model_arima_fit.forecast(len(validActiveCases))[0]
valid_ml["ARIMA Model Prediction"]=list(np.exp(prediction_arima))

print(model_arima_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_arima_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe()
Predicted vs Actual ARIMA
We can see it’s better to use ARIMA than AR or MA model, as this is the model that closely resembles the actual outcome.

SARIMAX Model

SARIMAX model is built by extending the ARIMA model, discussed in our previous section. In addition to terms 
AR, I and MA terms, there are four seasonal elements that are not part of ARIMA that must be configured; they are:
P: Seasonal autoregressive order.
D: Seasonal difference order.
Q: Seasonal moving average order.
m: The number of time steps for a single seasonal period.
Number of Active cases for Covid-19 can be formulated using the SARIMAX model as follows. Here we put ma (moving average window), d (order of differencing ) and p (number of lag observations) are set to 1. We also enforce seasonal order as (0,0,0,12) to study its behavior, even though Covid-19 does not show any seasonality, evident from research and daily reports.
The code snippet for SARIMAX (without exogenous variable) can be given us:
model_sarima = sm.tsa.statespace.SARIMAX(trainActiveCases, order=(1, 1, 1), seasonal_order=(0,0,0,0),
                                 enforce_stationarity=False, enforce_invertibility=False)
model_sarima_fit = model_sarima.fit()  


prediction_sarima=model_sarima_fit.forecast(np.shape(validActiveCases)[0])
valid_ml["SARIMA Model Prediction"]=list(np.exp(prediction_sarima))

print(model_sarima_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_sarima_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())
The model summary and the predicted outcome are illustrated from the below plots.
SARIMAX Model Summary
Predicted vs Actual SARIMA
In order to plot the trained, validation and predicted data under proper dates, we use the following code snippet.
plt.figure(figsize=(10,5))

index= pd.date_range(start=train_dates[0], periods=len(train_dates), freq='D')
valid_index = pd.date_range(start=valid_dates[0], periods=len(valid_dates), freq='D')

train_active =  pd.Series(train_ml['Active Cases'].values, index)
valid_active =  pd.Series(valid_ml['Active Cases'].values, valid_index)
pred_active =  pd.Series(prediction_ar, valid_index)

f, ax = plt.subplots(1,1 , figsize=(12,10))
plt.plot(train_active, marker='o',color='blue',label ="Train Data Set")
plt.plot(valid_active, marker='o',color='green',label ="Valid Data Set")
plt.plot(pred_active, marker='o',color='red',label ="Predicted ARIMA/SARIMAX")

plt.legend()
plt.xlabel("Date Time")
plt.ylabel('Active Cases')
plt.title("Active Cases ARIMA/SARIMAX Model Forecasting for state " + stateName)
Auto-Correlation Plots
Differencing helps to make time-series stationary. In order to find a right order of differencing, we use the ACF plot and see that it reaches zero, fast and with a fairly static mean. With Covid-19 data for state Maharashtra, the ACF plot and the code snippet are given below.
# Original Series
fig, axes = plt.subplots(3, 2, sharex=True, figsize=(10,5))
axes[0, 0].plot(df_per_State_features['Active Cases']); axes[0, 0].set_title('Original Series')
plot_acf(df_per_State_features['Active Cases'], ax=axes[0, 1])

# 1st Differencing
axes[1, 0].plot(df_per_State_features['Active Cases'].diff()); axes[1, 0].set_title('1st Order Differencing')
plot_acf(df_per_State_features['Active Cases'].diff().dropna(), ax=axes[1, 1])

# 2nd Differencing
axes[2, 0].plot(df_per_State_features['Active Cases'].diff().diff()); axes[2, 0].set_title('2nd Order Differencing')
plot_acf(df_per_State_features['Active Cases'].diff().diff().dropna(), ax=axes[2, 1])

plt.show()

AUTO ARIMA MODELS

This library automatically discovers the optimal order for an ARIMA model with stepwise execution of hyperparameters and parallel fitting of models.
This objective of this library (auto_arima) is to identify the most optimal parameters for an ARIMA/SARIMA and return a fitted ARIMA model. It does not depend on the PACF/Auto-Correlation (manual computation of differencing), but instead, it conducts differencing tests (i.e., Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller or Phillips–Perron) on its own to determine the order of differencing.
One other difference from Stats Model’s Vanilla ARIMA is that in case the model does not converge, it raises ValueError to signify the re-evaluation of stationarity-inducing measures.
For seasonality, SARIMA identifies the optimal P and Q hyperparameters after conducting the Canova-Hansen to determine the optimal order of seasonal differencing D, so that the auto-regressive/moving average portions of the seasonal model are defined by start_P, start_Q, max_P, max_Q
The model is fitted within ranges of defined start_p order (or number of time lags for AR), max_p, start_q (order of the moving average MA), max_q ranges.
Hence it can be formulated as : (p, q, d)x(P, Q, D)
The best model is obtained by optimizing information_criterion (AIC – Akaike Information Criterion, AICC – Corrected Akaike Information Criterion, BIC – Bayesian Information Criterion, HQIC – Hannan-Quinn Information Criterion, or “out of bag”–for validation scoring–respectively to a minimal value.
It also operates with exogenous variables (just like state space methods/models) for predicting added features in the regression operation.
Summary of AR with Auto-ARIMA
The following code and figure depicts AR model with Auto ARIMA with start_p=0, start_q=2 (by default), max_p=5, max_q=0
model_ar= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=0,start_q=0,max_p=5,max_q=0,
                   suppress_warnings=True,stepwise=False,seasonal=False)
model_ar_fit = model_ar.fit(trainActiveCases)
prediction_ar=model_ar_fit.predict(len(validActiveCases))

print(model_ar_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_ar_fit.resid())
model_ar_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
Predicted vs Actual AR Auto-ARIMA
Now let's look at the Moving Average model by changing start_p (AutoRegressive Moving part - set to 0), start_q, max_p, max_q (set to 5). Here's the code snippet for it.
model_ma= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=0,start_q=0,max_p=0,max_q=5,
                   suppress_warnings=True,stepwise=False,seasonal=False)
model_ma_fit = model_ma.fit(trainActiveCases)
prediction_ma=model_ma_fit.predict(len(validActiveCases))

print(model_ma_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_ma_fit.resid())
model_ma_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
The hyper-parameter tuning for evaluating the lowest and best AIC with the residual components are given below.
Summary of MA with Auto-ARIMA
Predicted vs Actual MA Auto-ARIMA
The Auto ARIMA is then used to model ARIMA by using the following parameters ------------------(start_p=1,start_q=1,max_p=5,max_q=5) as shown in the below code snippet.
model_arima= auto_arima(trainActiveCases,trace=True, error_action='ignore', start_p=1,start_q=1,max_p=5,max_q=5,
                   suppress_warnings=True,stepwise=False,seasonal=False)
model_arima_fit = model_arima.fit(trainActiveCases)
prediction_arima=model_arima_fit.predict(len(validActiveCases))

print(model_arima_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_arima_fit.resid())
model_arima_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()
Summary of ARIMA with Auto-ARIMA
Predicted vs Actual Auto-ARIMA

AUTO SARIMA MODEL

The SARIMA model is tuned with hyper-parameters to find the best model with the lowest AIC. The code snippet is given below, which does parameter tuning for (p, q, d)x(P, Q, D)

Here (P, Q, D) and (m) are tuned with values (1,0,1), (1,0,0), (0,0,1) and 12 respectively.
The parameters are initialized as , start_p=0,start_q=2,max_p=5,max_q=5,m=12. start_q even set to 0, takes default value of 2.
model_sarima= auto_arima(trainActiveCases,trace=True, error_action='ignore', 
                         start_p=0,start_q=2,max_p=5,max_q=5,m=12,
                   suppress_warnings=True,stepwise=True,seasonal=True)
model_sarima_fit = model_sarima.fit(trainActiveCases)
prediction_sarima=model_sarima_fit.predict(len(validActiveCases))

print(model_sarima_fit.summary())
plt.figure(figsize=(10,10))
# plot residual errors
residuals = pd.DataFrame(model_sarima_fit.resid())
model_sarima_fit.plot_diagnostics()
plt.show()
residuals.plot(kind='kde')
plt.show()

Summary of SARIMA with Auto-SARIMA

Predicted vs Actual Auto-SARIMA

Seasonal Differencing SARIMA

We also try to explore seasonality with Covid-19 dataset by applying regular differencing, where the values are differenced from the previous season.
In the equation of SARIMA(p,d,q)x(P,D,Q):
P denotes Seasonal AR, D denote Seasonal Order of seasonal differencing and Q denotes Seasonal Moving Average, x is the frequency of the time series.
If a model has defined seasonal patterns, we set D=1 for a given frequency ‘x’. For a SARIMA model, we keep the total differencing + D’ <=2

SELECTION OF BEST MODEL with AUTO-ARIMA

The following figure illustrates how to select the best Auto ARIMA Model, compute its error metrics for any input time-series data.

Comparative Study of RMSE

Conclusion and Future Work
  • More work needs to be done by studying the impact of the number of people quarantined, hospital beds, icu units, ventilators, rate of tests done by using Vector Autoregressive Moving-Average with eXogenous regressors (VARMAX).
  • Evaluate the model performance by sm.tsa.statespace.ExponentialSmoothing (multiplicative models) and sm.tsa.ExponentialSmoothing (linear model), i.e. the two different modules of StatsModel.
  • A detailed study of using different smoothing techniques with latency and accuracy to analyze the tradeoff between the two.
  • Evaluate confidence intervals for forecasts, based on an assumption of Gaussian errors as provided by sm.tsa.statespace.ExponentialSmoothingAnalyzing the performance impact by concentrating initial values out of the objective function, particularly in situations when the seasonal periodicity is large.
  • With more number of COVID-19 cases evolving, the right order of differencing (if needed multiple orders of differencing) needs to be computed to arrive at a near-stationary series (ACF plot nearing zero).
  • Here we have only computed RMSE, but an ideal situation all error metrics Mean Absolute Percentage Error (MAPE), Mean Error (ME), Mean Absolute Error (MAE), Mean Percentage Error (MPE), Root Mean Squared Error (RMSE), Lag 1 Autocorrelation of Error (ACF1), Correlation between the Actual and the Forecast (core), Min-Max Error (min-max) needs to be computed.

The complete source code is available at https://github.com/sharmi1206/covid-19-analysis

References


Written by sharmi1206 | https://www.linkedin.com/in/sharmistha-chatterjee-7a186310/
Published by HackerNoon on 2020/07/15