Solving Time Series Forecasting Problems: Principles and Techniques

Written by teenl0ve | Published 2023/05/29
Tech Story Tags: data-science | timeseries | ai | machine-learning | data-engineering | feature-engineering | ml-model | data

TL;DR: This article delves into time series analysis, discussing its significance in decision-making processes. It elucidates various techniques such as cross-validation, decomposition, and transformation of time series, as well as feature engineering. It provides a deep understanding of different modeling approaches, including, but not limited to, Exponential Smoothing, ARIMA, Prophet, Gradient Boosting, Recurrent Neural Networks (RNNs), N-BEATS, and Temporal Fusion Transformers (TFT). Despite the wide range of techniques covered, the article emphasizes the need for experimentation to choose the method that yields the best performance given the data characteristics and problem specifics.

Time series analysis plays a critical role in the decision-making process for an array of industries and domains. Whether it's forecasting stock prices, predicting customer behavior, or analyzing sensor data, understanding and harnessing the patterns and trends within time-dependent data can provide invaluable insights.

However, time series analysis comes with its fair share of challenges, including handling seasonality, dealing with missing values, and selecting appropriate forecasting models. This article will outline some effective methods for addressing these challenges, offering practical strategies and techniques that can enhance the accuracy and reliability of time series analysis.

Table of contents

  • Cross-Validation for Time Series

  • Decomposition and Transformation of Time Series

  • Feature Engineering for Time Series

  • Modeling in Time Series

Cross-Validation for Time Series

When performing cross-validation on time series data, we must respect the temporal order of observations. This means that classical methods like K-Fold cross-validation are not appropriate, as they randomly shuffle the data, destroying the temporal dependencies.

There are two main methods for performing cross-validation on time series data: the "Rolling Forecast Origin" and the "Expanding Window" methods.

Rolling Forecast Origin

In this method, we use a fixed-size sliding window of data for training, with the test set immediately following the training window. After each iteration, we "roll" the window forward in time. This is also known as "walk-forward" validation.

In the next example, scikit-learn's TimeSeriesSplit produces 5 splits. Passing a fixed max_train_size caps the training window at the same length in every iteration, so the window rolls forward through the data while the model is tested on the observations that immediately follow it.

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Assume 'X' and 'y' are your features and target

tscv = TimeSeriesSplit(n_splits=5, max_train_size=100)  # fixed-size training window; adjust 100 to your data

for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Fit and evaluate your model here

Expanding Window

The expanding window method is similar to the rolling forecast origin method, but instead of using a fixed-size sliding window, the training set expands over time to include more data.

In this example, max_train_size=None (the default) places no cap on the training set, so it expands in each iteration. The model is first trained on the first fold, then the first and second folds, and so on, testing on the next fold each time.

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Assume 'X' and 'y' are your features and target

tscv = TimeSeriesSplit(n_splits=5, max_train_size=None)

for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Fit and evaluate your model here

Decomposition and Transformation of Time Series

Breaking a time series down into simpler parts can significantly facilitate analysis. This work falls into two categories: decomposition and transformation.

Decomposition of Time Series

Time series decomposition is a technique used to disassemble a time series into its constituent components to facilitate the understanding and analysis of underlying patterns and trends.

The main components of a time series are:

  1. Trend: This denotes the long-term movement or overall direction of the time series data. It represents the underlying pattern or the general course the data is following.

  2. Seasonality: These are repeating patterns or fluctuations in the time series data that occur at regular intervals, such as daily, weekly, monthly, or yearly. Seasonality is driven by external factors like weather, holidays, or other seasonal events.

  3. Irregular Component: Also known as noise or residual, this is the unexplained variation in the time series data after removing the trend and seasonal (and, where present, cyclical) components. It represents random, unpredictable fluctuations that cannot be attributed to any systematic pattern.

In Python, decomposition can be performed using the statsmodels or statsforecast libraries; the example below uses the MSTL model from statsforecast:

import matplotlib.pyplot as plt
from statsforecast import StatsForecast
from statsforecast.models import MSTL, AutoARIMA

# Assume 'data' is a DataFrame in statsforecast's long format
# with 'unique_id', 'ds', and 'y' columns

models = [
    MSTL(
        season_length=[12 * 7],  # seasonalities of the time series
        trend_forecaster=AutoARIMA(),  # model used to forecast the trend
    )
]

sf = StatsForecast(
    models=models,  # model used to fit each time series
    freq="D",  # frequency of the data
)
sf = sf.fit(data)

# Plot the fitted decomposition (trend, seasonal, and remainder components)
decomposition = sf.fitted_[0, 0].model_
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
decomposition.plot(ax=ax, subplots=True, grid=True)
plt.tight_layout()
plt.show()

Various models handle these components differently. Linear models and Prophet have built-in handling for all of these components. However, other models, such as Boosting Trees and Neural Networks, require explicit data transformation and feature engineering.

One general approach after decomposition is to model each component separately. Once the individual forecasts for each component are obtained, they can be combined to form the final prediction. This approach allows for more granular analysis of the time series data and can often lead to improved forecasting results.
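As a minimal sketch of this idea (assuming 'y' is a monthly pandas Series with a DatetimeIndex; the component models chosen here are illustrative, not prescriptive), one could decompose with statsmodels, forecast the trend separately, and add a naive seasonal forecast back on top:

import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Assume 'y' is a monthly pandas Series with a DatetimeIndex
decomposition = seasonal_decompose(y, model="additive", period=12)

# Forecast the trend component with a simple trend-only model
trend = decomposition.trend.dropna()
trend_forecast = ExponentialSmoothing(trend, trend="add").fit().forecast(12)

# Naive seasonal forecast: repeat the last observed seasonal cycle
seasonal_forecast = decomposition.seasonal[-12:].values

# Combine the component forecasts into the final prediction
final_forecast = trend_forecast.values + seasonal_forecast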

Transformation of Time Series

Transformations are often applied to time series data to stabilize the variance, reduce the impact of outliers, improve the overall forecasting accuracy, and make the series more conducive to modeling.

Here are a few common types of transformations:

  1. Differencing: Differencing involves subtracting the previous value (t-1) from the current value (t). This transformation is primarily used to remove the trend and seasonality, making the data stationary—a requirement for certain time series models like ARIMA.

    In Python, differencing can be performed using the diff() function in pandas:

import pandas as pd

df['diff'] = df['col'].diff()

  2. Log Transformation: Log transformation is useful when dealing with data that exhibits exponential growth or has multiplicative seasonality. This transformation can help to stabilize variance and reduce heteroscedasticity. In Python, the numpy library can be used to apply a log transformation:

import numpy as np

df['log_col'] = np.log(df['col'])

  3. Square Root Transformation: Square root transformation is used to stabilize variance and minimize the impact of extreme values or outliers. This transformation can be done using numpy:

df['sqrt_col'] = np.sqrt(df['col'])

  4. Box-Cox Transformation: This transformation includes both log and square root transformations as special cases and helps stabilize the variance and make the data more normally distributed. For a positive series y and parameter λ, the Box-Cox transformation can be represented as y(λ) = (y^λ − 1) / λ for λ ≠ 0, and y(λ) = ln(y) for λ = 0.

The SciPy library in Python provides a function for the Box-Cox transformation, which also estimates λ from the data:

from scipy import stats

df['boxcox_col'], lambda_param = stats.boxcox(df['col'])  # requires strictly positive values

After applying these transformations, it's crucial to remember that any forecasts made on the transformed scale need to be transformed back to the original scale before interpretation. For instance, if a log transformation was applied, the forecasted values should be exponentiated.
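As a small illustration (assuming 'log_forecast' and 'boxcox_forecast' are hypothetical arrays of predictions made on the transformed scales, and 'lambda_param' is the value returned by stats.boxcox above):

import numpy as np
from scipy.special import inv_boxcox

# Undo the log transformation by exponentiating the forecasts
forecast_from_log = np.exp(log_forecast)

# Undo the Box-Cox transformation using the lambda estimated during fitting
forecast_from_boxcox = inv_boxcox(boxcox_forecast, lambda_param)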

The choice of transformation depends on the specific problem, the characteristics of the data, and the assumptions of the subsequent modeling process. Each transformation has its strengths and weaknesses, and there is no one-size-fits-all solution. As such, it's recommended to experiment with different transformations and choose the one that results in the best model performance.

Feature Engineering for Time Series

Feature engineering in time series refers to the process of creating meaningful features or predictors from raw time series data to improve the performance of machine learning models. In contrast to ordinary tabular problems, time series data offers an extra dimension to work with: time itself.

Some effective feature engineering techniques for time series include (a short pandas sketch follows the list):

  1. Lag Features: Created by shifting the time series by a specific number of time steps (lags), these capture past behavior and temporal dependencies and are often among the most useful predictors.

  2. Rolling and Expanding Window Statistics: These involve computing summary statistics (e.g., mean, median, standard deviation) over a moving or expanding window of time to capture local patterns and trends.

  3. Seasonal and Trend Features: These expose the seasonal and trend components to the model as explicit inputs.

  4. Date/Time-Based Features: Details like the hour of the day, day of the week, month, or quarter can help capture recurring patterns in the data.

  5. Fourier Transformations: These can be useful to uncover periodic patterns in the data.
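A brief pandas sketch of several of these ideas (assuming a DataFrame 'df' with a datetime column 'date' and a daily numeric target 'col'; the column names are illustrative):

import numpy as np
import pandas as pd

# Assume 'df' has a datetime column 'date' and a numeric target column 'col'
df = df.sort_values('date')

# Lag features
df['lag_1'] = df['col'].shift(1)
df['lag_7'] = df['col'].shift(7)

# Rolling and expanding window statistics
df['rolling_mean_7'] = df['col'].rolling(window=7).mean()
df['rolling_std_7'] = df['col'].rolling(window=7).std()
df['expanding_mean'] = df['col'].expanding().mean()

# Date/time-based features
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month

# Fourier terms for a yearly cycle (assuming daily data)
day_of_year = df['date'].dt.dayofyear
df['fourier_sin_365'] = np.sin(2 * np.pi * day_of_year / 365.25)
df['fourier_cos_365'] = np.cos(2 * np.pi * day_of_year / 365.25)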

Use of Derivatives in Time Series

Derivatives in time series refer to the rate of change of a time series variable with respect to time. They measure how a variable's value changes over time and can be used as features in time series analysis or machine learning models as they provide information about the underlying trends, seasonality, and other patterns in the data.

Here are some ways derivatives can be used as features in time series analysis (a short sketch follows the list):

  1. First-Order Derivative: This represents the instantaneous rate of change of a variable with respect to time. It can help identify the direction and magnitude of the trend in the data.

  2. Second-Order Derivative: This represents the rate of change of the first-order derivative. It can help identify acceleration or deceleration in the trend and detect points of inflection.

  3. Seasonal Derivatives: These help capture seasonal variations in the data.

  4. Cross-Derivative: This captures how pairs of variables change together, reflecting interaction effects.
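A minimal pandas sketch of the first three ideas (again assuming a DataFrame 'df' with a regularly sampled numeric column 'col'):

# First-order derivative: discrete rate of change between consecutive observations
df['first_derivative'] = df['col'].diff()

# Second-order derivative: change of the change (acceleration or deceleration)
df['second_derivative'] = df['col'].diff().diff()

# Seasonal derivative: change relative to the same point in the previous cycle
# (here a 7-step lag, e.g., weekly seasonality in daily data)
df['seasonal_derivative'] = df['col'].diff(7)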

Other feature engineering approaches, with code examples and explanations, can be found in previous articles:

https://hackernoon.com/must-know-base-tips-for-feature-engineering-with-time-series-data

https://hackernoon.com/advanced-techniques-for-time-series-data-feature-engineering

Modeling in Time Series

Using baseline models in time series forecasting is a crucial step in model evaluation, as it helps establish a performance benchmark for more complex models. Exponential smoothing, ARIMA, and Prophet are all still excellent choices for baseline models.

Exponential Smoothing

In this example, we'll use the Holt-Winters method of exponential smoothing, which can handle trends and seasonality. The statsmodels library provides an implementation of this method.

from statsmodels.tsa.api import ExponentialSmoothing

# Assume 'df' is a pandas DataFrame and 'col' is the time series to be forecasted
y = df['col']

# Fit the model
model = ExponentialSmoothing(y, seasonal_periods=12, trend='add', seasonal='add')  # 12 periods per seasonal cycle (e.g., monthly data)
model_fit = model.fit()

# Forecast the next 5 periods
forecast = model_fit.forecast(5)

ARIMA

AutoRegressive Integrated Moving Average (ARIMA) is a popular method for forecasting univariate time series data. The statsmodels library provides an implementation of ARIMA.

from statsmodels.tsa.arima.model import ARIMA

# Fit the model
model = ARIMA(y, order=(1,1,1))  # (p, d, q) parameters
model_fit = model.fit()

# Forecast the next 5 periods
forecast = model_fit.forecast(steps=5)

The (p, d, q) parameters represent the order of the autoregressive, integrated, and moving average parts of the model, respectively.

Prophet

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It was developed by Facebook.

from prophet import Prophet  # in older versions the package was named fbprophet

# The input to Prophet is always a DataFrame with two columns: ds and y.
df = df.rename(columns={'date': 'ds', 'col': 'y'})

# Fit the model
model = Prophet()
model.fit(df)

# Create DataFrame for future predictions
future = model.make_future_dataframe(periods=5)

# Forecast the next 5 periods
forecast = model.predict(future)

In these examples, the model parameters (such as the order of the ARIMA model and the type of trend and seasonality in the Exponential Smoothing model) are chosen somewhat arbitrarily. In practice, you would use model diagnostics, cross-validation, and possibly grid search to find the optimal parameters for your specific data.
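As an illustration, a crude grid search over ARIMA orders by AIC might look like the sketch below (the search ranges are arbitrary and should be adapted to your data):

import itertools
from statsmodels.tsa.arima.model import ARIMA

best_aic, best_order = float('inf'), None

# Try a small grid of (p, d, q) values and keep the order with the lowest AIC
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        fit = ARIMA(y, order=(p, d, q)).fit()
        if fit.aic < best_aic:
            best_aic, best_order = fit.aic, (p, d, q)
    except Exception:
        continue  # some orders may fail to converge

print(f'Best order by AIC: {best_order}')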

On the other hand, newer libraries with a wide range of built-in models, such as Nixtla's statsforecast, offer different perspectives. However, as baseline models, the methods above provide a good starting point for time series forecasting and cover a wide range of time series characteristics.
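For instance, with statsforecast (already used for decomposition above), several baselines can be fitted in one call; this sketch assumes 'data' is a DataFrame in the library's long format with 'unique_id', 'ds', and 'y' columns and monthly frequency:

from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, SeasonalNaive

sf = StatsForecast(
    models=[
        AutoARIMA(season_length=12),
        AutoETS(season_length=12),
        SeasonalNaive(season_length=12),
    ],
    freq='M',  # monthly data assumed
)

# Forecast the next 5 periods for every series in 'data'
forecasts = sf.forecast(df=data, h=5)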

Advanced models like boosting trees, Neural Networks (NNs), N-BEATS, and Temporal Fusion Transformers (TFT) can be employed after getting some baseline scores.

Gradient Boosting

When it comes to time series analysis, one of the primary advantages of gradient boosting is that it does not make any assumptions about the underlying data distribution or the relationships between variables. This makes it highly flexible and capable of modeling complex, non-linear relationships, which are often present in time series data.

While the application of gradient boosting to time series forecasting is not as straightforward as traditional time series models like ARIMA, it can still be highly effective. The key is to frame the forecasting problem as a supervised learning problem by creating lagged features that capture the temporal dependencies in the data.
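A sketch of that framing (assuming 'series' is a pandas Series of the target; the helper 'make_supervised' is a hypothetical name used for illustration):

import pandas as pd

def make_supervised(series, n_lags=7):
    """Turn a univariate series into a lagged feature matrix X and a target y."""
    frame = pd.DataFrame({'y': series})
    for lag in range(1, n_lags + 1):
        frame[f'lag_{lag}'] = series.shift(lag)
    frame = frame.dropna()
    return frame.drop(columns='y'), frame['y']

X, y = make_supervised(series, n_lags=7)

# Chronological split: train on the earlier part, test on the most recent part
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]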

Here's an example of how to use the XGBoost library, a highly efficient implementation of gradient boosting, for time series forecasting:

import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Assume 'X_train', 'y_train', 'X_test', and 'y_test' are your training and test datasets

# Convert datasets into DMatrix (optimized data structure for XGBoost)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)

# Define the model parameters
params = {
    'max_depth': 3,
    'eta': 0.01,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse'
}

# Train the model
model = xgb.train(params, dtrain, num_boost_round=1000)

# Make predictions
y_pred = model.predict(dtest)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Test RMSE: {np.sqrt(mse)}')

Recurrent Neural Networks (RNNs)

We'll use the LSTM (Long Short-Term Memory) variant of RNNs, which can capture long-term dependencies and is less prone to the vanishing gradient problem.

import torch
from torch import nn

# Assume 'X_train' and 'y_train' are your training dataset
# 'n_features' is the number of input features
# 'n_steps' is the number of time steps in each sample

X_train = torch.tensor(X_train, dtype=torch.float32)  # shape: (samples, n_steps, n_features)
y_train = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)  # shape: (samples, 1)

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.fc(x[:, -1, :])  # we only want the last time step
        return x

# Define the model
model = LSTMModel(n_features, 50, 1)

# Define loss and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(200):
    model.train()
    output = model(X_train)
    loss = criterion(output, y_train)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Make a prediction
model.eval()
x_input = ...  # some new input data
x_input = torch.tensor(x_input, dtype=torch.float32)
x_input = x_input.view((1, n_steps, n_features))
yhat = model(x_input)

N-BEATS

N-BEATS is a state-of-the-art model for univariate time series forecasting, available in the PyTorch-based pytorch-forecasting package.

import pytorch_lightning as pl
from pytorch_forecasting import TimeSeriesDataSet, NBeats

# Assume 'data' is a pandas DataFrame containing your time series data
# 'group_ids' is the column(s) that identifies each time series
# 'target' is the column you want to forecast

# Create the dataset
max_encoder_length = 36
max_prediction_length = 6
training_cutoff = data["time_idx"].max() - max_prediction_length
context_length = max_encoder_length
prediction_length = max_prediction_length

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="target",
    group_ids=["group_ids"],
    time_varying_unknown_reals=["target"],  # N-BEATS expects only the target as a covariate
    max_encoder_length=context_length,
    max_prediction_length=prediction_length,
)

# Validation set covering the period after the training cutoff
validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1)

# Create the dataloaders
train_dataloader = training.to_dataloader(train=True, batch_size=128, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=128, num_workers=0)

# Define the model
pl.seed_everything(42)
trainer = pl.Trainer(gpus=0, gradient_clip_val=0.1)
net = NBeats.from_dataset(training, learning_rate=3e-2, weight_decay=1e-2, widths=[32, 512], backcast_loss_ratio=1.0)

# Fit the model
trainer.fit(
    net,
    train_dataloader=train_dataloader,
)

# Predict on the validation data
raw_predictions, x = net.predict(val_dataloader, return_x=True)

Temporal Fusion Transformers (TFT)

TFT is also available in the pytorch-forecasting package. It's a powerful model that can handle multivariate time series and meta-data.

# Import libraries
import torch
import pandas as pd
import pytorch_lightning as pl
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss
from pytorch_forecasting.data.examples import get_stallion_data

# Load example data
data = get_stallion_data()
data["time_idx"] = data["date"].dt.year * 12 + data["date"].dt.month
data["time_idx"] -= data["time_idx"].min()
data["month"] = data["date"].dt.month.astype(str).astype("category")  # month as a known categorical covariate

# Define dataset
max_prediction_length = 6
max_encoder_length = 24
training_cutoff = data["time_idx"].max() - max_prediction_length

context_length = max_encoder_length
prediction_length = max_prediction_length

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="volume",
    group_ids=["agency"],
    min_encoder_length=context_length,
    max_encoder_length=context_length,
    min_prediction_length=prediction_length,
    max_prediction_length=prediction_length,
    static_categoricals=["agency"],
    static_reals=["avg_population_2017"],
    time_varying_known_categoricals=["month"],
    time_varying_known_reals=["time_idx", "minimum", "mean", "maximum"],  # replace with real-valued covariates present in your data
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=["volume"],
)

validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1)

# Create dataloaders
batch_size = 16
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=0)

# Define model and trainer
pl.seed_everything(42)
trainer = pl.Trainer(
    gpus=0,
    # clipping gradients is a hyperparameter and important to prevent divergence
    # of the gradient for recurrent neural networks
    gradient_clip_val=0.1,
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    lstm_layers=1,
    dropout=0.1,
    hidden_continuous_size=8,
    output_size=7,  # 7 quantiles by default
    loss=QuantileLoss(),
    log_interval=10,
    reduce_on_plateau_patience=4,
)

# Fit the model
trainer.fit(
    tft,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
)

# Evaluate the model
raw_predictions, x = tft.predict(val_dataloader, mode="raw", return_x=True)

Most tasks will be covered well by boosting trees thanks to their scalability and solid performance. However, if time and resources are not a constraint, Recurrent Neural Networks (RNNs), N-BEATS, and TFT have shown strong performance on time series tasks, despite their higher computational cost and more demanding scaling requirements.

Conclusion

In conclusion, time series analysis benefits from a range of strategies: start by transforming the data and the task to extract more information about the data's nature, then bring time into your features and targets. Modeling can begin with common baselines, continue with gradient boosting (which is often where it ends), and only move on to neural networks and transformers when the simpler options fall short.

Lastly, using the Augmented Dickey-Fuller and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) stationarity tests together can help verify the robustness of the results, since the two tests have complementary null hypotheses.

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

# Assume 'data' is your time series data

# Dickey-Fuller test
result = adfuller(data)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

# KPSS test
result = kpss(data, regression='c')
print('\nKPSS Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[3].items():
    print('\t%s: %.3f' % (key, value))

