Unveiling the Hidden Gems: A Journey into Exploratory Data Analysis (EDA)

Written by sagol | Published 2023/05/15
Tech Story Tags: exploratory-data-analysis | data-insights | data-visualization | descriptive-statistics | data-cleaning | datascience | python | programming

TL;DR: Exploratory Data Analysis (EDA) is an essential step in data analysis. EDA analyzes and summarizes datasets to discover anomalies, patterns, or correlations. A well-executed EDA process can significantly impact the project’s overall success. I will examine strategies and recommended practices for performing EDA in this article.

Master the Art of Extracting Insights and Uncovering Patterns in Your Data to Make Informed Decisions

Introduction

Exploratory Data Analysis (EDA) is an essential step in data analysis, allowing developers and analysts to understand the structure of, and relationships within, the data they are working with. EDA analyzes and summarizes datasets to discover anomalies, patterns, or correlations. It also helps form hypotheses about the underlying processes that produce the data. A well-executed EDA process can significantly impact the project’s overall success, because the insights it provides feed directly into the subsequent analysis and modeling. In this article, I examine strategies and recommended practices for performing EDA, covering data cleaning, visualization, descriptive statistics, and hypothesis testing.

Data Cleaning

Data cleaning is the process of detecting and removing errors, inconsistencies, and anomalies in data that could compromise the accuracy of subsequent analysis and modeling.

Missing Values

One of the most common issues with real-world data is missing values. To deal with them, first determine why they are missing. If the data are missing at random, you can impute them using methods such as mean imputation, median imputation, or regression imputation. If the data are not missing at random, the imputation results can be biased and misleading. In the following example, I use the IterativeImputer class from the scikit-learn library to impute missing values in a dataset:

import numpy as np
import pandas as pd
# This import enables scikit-learn's experimental IterativeImputer API
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Generate two correlated columns so the imputer has other features to regress on
np.random.seed(0)
X_true = np.arange(20, dtype=float).reshape(-1, 2) + np.random.normal(scale=0.1, size=(10, 2))
X = X_true.copy()

# Randomly mask roughly 30% of the values as missing
mask = np.random.choice([True, False], size=X.shape, p=[0.3, 0.7])
X[mask] = np.nan

# Impute each missing value by modeling every feature as a function of the others
imp = IterativeImputer(max_iter=10, random_state=0)
X_imp = imp.fit_transform(X)

# Compare imputed values with true values
df = pd.DataFrame({'True': X_true.flatten(), 'Imputed': X_imp.flatten()})
print(df)

In this example, I create two strongly correlated columns and randomly mask about 30% of the values as np.nan. IterativeImputer then fills each missing value by modeling each feature as a function of the other features in a round-robin fashion. The imputed values are subsequently compared with the true ones to check how well the imputation worked.
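
For comparison, the simpler mean and median strategies mentioned above can be applied with scikit-learn's SimpleImputer. This is a minimal sketch that reuses the X array from the previous snippet:

from sklearn.impute import SimpleImputer

# Mean imputation: replace each missing value with the column mean
mean_imp = SimpleImputer(strategy="mean")
X_mean = mean_imp.fit_transform(X)

# Median imputation: replace each missing value with the column median
median_imp = SimpleImputer(strategy="median")
X_median = median_imp.fit_transform(X)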


In the next example, the values of a highly correlated column are used to fill in the missing values:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Generate two strongly correlated columns with a few missing values
np.random.seed(0)
data = np.arange(20).reshape(-1, 2) + np.random.normal(scale=0.1, size=(10, 2))
data[1, 1] = np.nan
data[2, 0] = np.nan

# Convert data to a pandas DataFrame
df = pd.DataFrame(data, columns=["col1", "col2"])

# Visualize the correlation between the columns
sns.heatmap(df.corr(), annot=True)
plt.show()

# Fill missing values in col1 with values from col2
df["col1"] = df["col1"].fillna(df["col2"])

# Fill remaining missing values in col2 with values from col1
df["col2"] = df["col2"].fillna(df["col1"])

I first inspect the correlation between the columns with a heatmap from the Seaborn library. The heatmap shows a strong association between the columns, with a correlation coefficient close to one, so I can fill in the missing values in col1 using values from col2, and the remaining ones in col2 using values from col1.

Outliers

Outliers are extreme values in data that can significantly distort statistical analysis and modeling results. Therefore, detecting and addressing outliers is necessary to obtain accurate results. Here, I will demonstrate how to use the Z-score method and visualization tools to identify and remove outliers from the data:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Generate data with outliers
def generate_outliers(n, mu, sigma):
    x = np.random.normal(mu, sigma, n)
    x[:5] = x[:5] - 3 * sigma
    x[-5:] = x[-5:] + 3 * sigma
    return x

np.random.seed(0)
X = generate_outliers(100, 1, 3)

# Detect outliers using Z-score method
mean = np.mean(X)
std = np.std(X)
z_scores = (X - mean) / std
outliers = np.where(np.abs(z_scores) > 3)
# Remove outliers from data
X_clean = np.delete(X, outliers, axis=0)

# Visualize data with outliers and without outliers
sns.boxplot(x=X.flatten())
plt.show()
sns.boxplot(x=X_clean.flatten())
plt.show()

The generated data contains extreme values at both ends: the first five values are shifted well below the rest, and the last five well above it. The Z-score method identifies outliers by computing a Z-score for each value and flagging those whose absolute Z-score exceeds 3. I then use np.delete to remove the flagged values from the dataset. Finally, box plots of the data before and after cleaning show the effect of removing the outliers.

Data Visualization

Data visualization aids in understanding the underlying data structure. There are many libraries for visualization, like Matplotlib, Seaborn, Plotly, and Bokeh. I will focus on the first three as the most popular.

Matplotlib

Matplotlib is a Python data visualization library that includes a variety of plots, such as line plots, scatter plots, bar plots, histograms, and more. The following is an example of a line plot:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
data = np.random.normal(size=100)

plt.plot(data)
plt.title("Line Plot using Matplotlib")
plt.xlabel("Index")
plt.ylabel("Value")
plt.show()
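
Matplotlib handles the other plot types mentioned above in much the same way. As a minimal sketch, here is a histogram of the same data, reusing the data array and plt from the snippet above:

# Histogram of the same data using Matplotlib
plt.hist(data, bins=20)
plt.title("Histogram using Matplotlib")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()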

Seaborn

Seaborn is a Python data visualization toolkit built on Matplotlib that provides a high-level interface for constructing various plots. The following is an example of the scatter plot:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(0)
data1 = np.random.normal(loc=0, scale=1, size=100)
data2 = np.random.normal(loc=2, scale=1, size=100)

sns.scatterplot(x=data1, y=data2)
plt.title("Scatter Plot using Seaborn")
plt.xlabel("Data 1")
plt.ylabel("Data 2")
plt.show()

Plotly

Plotly is a powerful Python data visualization package that generates interactive and visually appealing graphs. Below, I create several plots with it: a scatter plot, a line plot, a bar plot, a histogram, and a box plot.

import numpy as np
import plotly.express as px
import plotly.graph_objs as go

np.random.seed(0)
data1 = np.random.normal(loc=0, scale=1, size=100)
data2 = np.random.normal(loc=2, scale=1, size=100)
data3 = np.random.randint(low=1, high=10, size=100)

# Scatter Plot
fig = px.scatter(x=data1, y=data2)
fig.update_layout(title="Scatter Plot using Plotly")
fig.show()

# Line Plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(100), y=data1, name="Data 1"))
fig.add_trace(go.Scatter(x=np.arange(100), y=data2, name="Data 2"))
fig.update_layout(title="Line Plot using Plotly")
fig.show()

# Bar Plot
fig = px.bar(x=np.arange(100), y=data3)
fig.update_layout(title="Bar Plot using Plotly")
fig.show()

# Histogram
fig = px.histogram(data1)
fig.update_layout(title="Histogram using Plotly")
fig.show()

# Box Plot
fig = px.box(data2)
fig.update_layout(title="Box Plot using Plotly")
fig.show()

I first produce three sets of random data and then create several plots using the px.scatter, go.Figure, px.bar, px.histogram, and px.box functions. One advantage of Plotly is interactivity: the plots support zooming, panning, and hovering over points to display extra information. Plotly also supports exporting plots in various formats, including HTML, SVG, and PNG, making the visualizations simple to share and present.
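
For example, saving the last figure to disk takes one call per format. This is a minimal sketch; the file names are arbitrary, and static image export additionally requires the kaleido package to be installed:

# Save the last figure as an interactive HTML page and as a static image
fig.write_html("box_plot.html")
fig.write_image("box_plot.png")  # requires the kaleido package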

Let’s create a 3D plot for gradient descent to show the power of visualization.

Gradient descent is an optimization algorithm commonly used in machine learning to minimize a loss function. It is a first-order method that iteratively updates the parameters in the direction of the negative gradient of the loss function with respect to those parameters, i.e., the direction of steepest descent.

import numpy as np
import plotly.graph_objects as go

# Define the quadratic function and its gradient
def quadratic(x, y):
    return x**2 + y**2

def grad_quadratic(x, y):
    grad_x = 2 * x
    grad_y = 2 * y
    return grad_x, grad_y

# Visualize the surface of the quadratic function
x = np.linspace(-2, 2, 30)
y = np.linspace(-2, 2, 30)

X, Y = np.meshgrid(x, y)
Z = quadratic(X, Y)

fig = go.Figure(data=[go.Surface(z=Z, x=X, y=Y)])
fig.show()

# Optimize the quadratic function using gradient descent
def gradient_descent(x0, y0, n_iters, lr):
    path = np.zeros((n_iters+1, 2))
    path[0, :] = [x0, y0]
    
    for i in range(n_iters):
        grad_x, grad_y = grad_quadratic(path[i, 0], path[i, 1])
        path[i+1, 0] = path[i, 0] - lr * grad_x
        path[i+1, 1] = path[i, 1] - lr * grad_y
        
    return path

path = gradient_descent(x0=-1.5, y0=-1.5, n_iters=100, lr=0.01)

# Visualize the optimization process in 3D
fig = go.Figure(
    data=[go.Scatter3d(
       x=path[:, 0], y=path[:, 1], z=quadratic(path[:, 0], path[:, 1]),
       mode='markers',
       marker=dict(
           size=3, color=quadratic(path[:, 0], path[:, 1]),
           colorscale='Viridis', opacity=0.8
       )
    )]
)

fig.update_layout(
    scene=dict(xaxis_title='X', yaxis_title='Y', zaxis_title='Z'))
fig.show()

The code sample implements gradient descent optimization with 3D visualization using the plotly.graph_objects library. The code begins by defining a quadratic function and its gradient, grad_quadratic, employed in the optimization process. The plotly.graph_objects library is then used to display the quadratic function’s surface, where x and y values are defined using np.linspace, and z values are computed by evaluating the quadratic function over a grid of x and y values. A go.Surface instance is created with the x, y, and z values and added to a go.Figure instance. The show method is used to display the Figure instance.

Next, the gradient descent optimization is performed using the gradient_descent function, which takes the starting point x0 and y0, the number of iterations n_iters, and the learning rate lr. The sequence of points visited during optimization is stored in the path variable.

Finally, the optimization process is visualized in 3D using the plotly.graph_objects library. A go.Scatter3d instance is created with the x, y, and z values, which are the optimized path and the corresponding function values. The size and color of the markers are also defined in the marker parameter. The go.Scatter3d instance is then added to a go.Figure instance and the scene’s layout is updated with appropriate axis labels. The final visualization is displayed using the show method of the go.Figure instance.

This code provides a clear visual representation of the optimization process of gradient descent in 3D and can be a valuable tool for understanding and debugging optimization algorithms.

Descriptive Statistics

EDA relies heavily on descriptive statistics. They summarize a dataset’s main characteristics, such as central tendency, dispersion, and shape. I will review the fundamentals of descriptive statistics using examples from the top Python libraries. Pandas is one of the most used Python packages for descriptive statistics. It has a lot of methods for quickly calculating descriptive statistics on a dataset, such as mean, median, mode, standard deviation, and others. Here’s an example of mean and standard deviation:

import pandas as pd

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = pd.DataFrame(data, columns=['Values'])

mean = df['Values'].mean()
std = df['Values'].std()

print('Mean:', mean)
print('Standard Deviation:', std)
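
The same DataFrame exposes the other statistics mentioned above as well. As a minimal sketch, continuing with the df from the snippet above, here is a quick look at the median, the mode, and the full describe summary:

# A few more built-in pandas statistics
print('Median:', df['Values'].median())
print('Mode:', df['Values'].mode().tolist())
print(df['Values'].describe())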

Another popular library for descriptive statistics is SciPy, which includes functions for a variety of statistical tests and measures. For example, SciPy provides functions for computing the skewness and kurtosis of a dataset, which describe the shape of its distribution:

from scipy.stats import skew, kurtosis

skewness = skew(data)
kurt = kurtosis(data)

print('Skewness:', skewness)
print('Kurtosis:', kurt)

However, more advanced statistical approaches are often required, and a dedicated statistics library helps in these circumstances. Statsmodels is one such package: it offers a complete set of functions for statistical modeling, hypothesis testing, and data exploration. Here’s an example of how to use Statsmodels to examine a dataset’s distribution with a Q-Q plot:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Q-Q plot of the data (the list from the pandas example above)
# against a theoretical normal distribution
fig = sm.ProbPlot(np.asarray(data, dtype=float)).qqplot(line="s")
plt.show()

The Q-Q plot shows that the data deviates from a normal distribution, which is expected here: the sample is small and its values are evenly spaced rather than normally distributed.

Hypothesis Testing

Hypothesis testing is the final step in this analysis. It is a statistical procedure for determining whether an observed result is likely due to chance or is statistically significant. It is an essential component of EDA and is used to draw conclusions about population parameters from sample data.

A null hypothesis and an alternative hypothesis are established during hypothesis testing. The null hypothesis is the default assumption that no difference exists between the population parameters under consideration. The alternative hypothesis is its opposite: it states that there is a difference between the population parameters. To test the hypothesis, a test statistic is calculated from the sample data, and a p-value is derived, representing the likelihood of observing a result as extreme as or more extreme than the one observed if the null hypothesis is true. If the p-value is below a chosen threshold, such as the commonly used 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted.

Python provides a wide range of hypothesis testing tools, including but not limited to:

  • scipy.stats: This module contains a variety of statistical functions, such as hypothesis testing functions for t-tests, ANOVA, and others (a short ANOVA sketch follows this list).

  • statsmodels: This library includes a set of hypothesis testing routines, such as regression and time series analysis.

  • PyMC3: This probabilistic programming framework supports Bayesian hypothesis testing and Markov Chain Monte Carlo (MCMC) simulation.
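
To illustrate the first item, here is a minimal one-way ANOVA sketch using scipy.stats.f_oneway; the three groups and their parameters are made up purely for demonstration:

import numpy as np
from scipy import stats

# Hypothetical data: measurements for three groups (values are arbitrary)
np.random.seed(0)
group_a = np.random.normal(loc=10, scale=2, size=30)
group_b = np.random.normal(loc=11, scale=2, size=30)
group_c = np.random.normal(loc=12, scale=2, size=30)

# One-way ANOVA tests whether all group means are equal
f_statistic, p_value = stats.f_oneway(group_a, group_b, group_c)
print("F-statistic:", f_statistic, "p-value:", p_value)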

Let’s test the hypothesis that the mean cholesterol level of patients who received a new medicine is the same as that of patients who did not receive the treatment. To test the hypothesis, I will use a two-sample t-test. The t-statistic measures the difference between the two sample means in units of standard error. A large absolute t-statistic implies a substantial difference in means, whereas a small one suggests a minor difference.

import numpy as np
import scipy.stats as stats

# Generating the data
np.random.seed(42)
drug_treated = np.random.normal(loc=195, scale=25, size=100)
no_drug = np.random.normal(loc=190, scale=30, size=100)

# Performing a two-sample t-test
t_statistic, p_value = stats.ttest_ind(drug_treated, no_drug)

# Checking the results
if p_value < 0.05:
    print("Reject the null hypothesis: mean cholesterol levels differ.")
else:
    print("Fail to reject the null hypothesis: no significant difference detected.")

If the null hypothesis (the means are equal) is true, the p-value denotes the likelihood of observing a t-statistic as extreme as or more extreme than the one calculated. A low p-value implies that the observed difference between means is statistically significant, and the null hypothesis is rejected. So, if the p-value is less than 0.05, the null hypothesis is rejected, and I can conclude that the mean cholesterol levels of patients treated with the new medicine differ from those of patients who did not receive the treatment. If the p-value exceeds 0.05, the null hypothesis is not rejected, meaning there is not enough evidence to conclude that the mean cholesterol levels differ.

Example

Let’s try to do EDA on a healthcare dataset in this example. The dataset includes patient information such as age, height, weight, and blood pressure readings. I will walk through the main steps: generating the data, cleaning it, computing descriptive statistics, testing a hypothesis, and fitting a simple regression model.

Let’s start by generating our data:

import numpy as np

np.random.seed(0)
age = np.random.normal(loc=30, scale=10, size=1000)
height = np.random.normal(loc=180, scale=15, size=1000)
weight = np.random.normal(loc=80, scale=20, size=1000)
blood_pressure = np.random.normal(loc=120, scale=10, size=1000)

After generating the data, I clean it by checking for missing values and outliers. The data has no missing values in this case, so I only need to look for outliers:

from scipy import stats

z_scores_age = stats.zscore(age)
z_scores_height = stats.zscore(height)
z_scores_weight = stats.zscore(weight)
z_scores_blood_pressure = stats.zscore(blood_pressure)
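
The Z-scores above are computed but not yet applied. Here is a minimal sketch of how they could be combined to flag suspicious rows; the threshold of 3 is a common convention, and the rest of the example continues with the full arrays:

# Flag rows where any variable has an absolute Z-score above 3
outlier_mask = (
    (np.abs(z_scores_age) > 3)
    | (np.abs(z_scores_height) > 3)
    | (np.abs(z_scores_weight) > 3)
    | (np.abs(z_scores_blood_pressure) > 3)
)
print("Rows flagged as potential outliers:", outlier_mask.sum())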

Following that, I will use the Pandas package to generate a DataFrame from the data and compute some descriptive statistics:

import pandas as pd

df = pd.DataFrame(
   {
      'Age': age,
      'Height': height,
      'Weight': weight,
      'Blood Pressure': blood_pressure
   }
)

print(df.describe())

For each column in the DataFrame, the describe function will provide the following summary statistics: count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.

Next, I will use SciPy to run a one-sample t-test to see if patients’ mean age differs substantially from 30:

t_statistic, p_value = stats.ttest_1samp(df['Age'], 30)

print('t-statistic:', t_statistic)
print('p-value:', p_value)

So, if the p-value is less than 0.05, I can reject the null hypothesis and conclude that the mean age of the patients differs significantly from 30.

Finally, I will use statsmodels to perform a linear regression on the data:

import statsmodels.api as sm

X = df[['Age', 'Height', 'Weight']]
X = sm.add_constant(X)  # add an intercept term to the regression
y = df['Blood Pressure']

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

print(model.summary())

The code above imports and uses the statsmodels.api module to fit an ordinary least squares (OLS) regression model. OLS regression is a linear regression that is used to model the connection between one or more independent variables and a dependent variable. In this example, the dependent variable is blood pressure, and the independent variables are age, height, and weight.

The sm.OLS function creates a model object from the dependent variable (y) and the independent variables (X). The model object’s fit method is then called to fit the OLS regression to the data, and the predict method produces predictions from the fitted model and the independent variables.

Finally, the model.summary method is invoked to summarize the model’s statistical results, including the coefficients of the independent variables, goodness-of-fit metrics, and the statistical significance of each independent variable in predicting the dependent variable. The summary helps the data analyst evaluate the fit and performance of the OLS regression model.
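
Individual pieces of that summary are also available as attributes of the fitted model. As a minimal sketch, continuing with the model object from above:

# Pull individual results out of the fitted OLS model
print("Coefficients:\n", model.params)
print("R-squared:", model.rsquared)
print("p-values:\n", model.pvalues)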

Conclusion

EDA is a very important step in the data science process, enabling the discovery of patterns, relationships, and anomalies in data. It helps analysts gain insights into the data, understand the distribution and correlations of variables, and discover potential areas for further investigation.

However, I have to mention several drawbacks to EDA. One of the most significant pitfalls is the subjectivity of EDA, with outcomes influenced by the analyst’s background and personal biases. Furthermore, EDA may not provide definitive answers but rather a preliminary interpretation of the data.

Another point to consider is that EDA can be time-consuming and labor-intensive, especially when working with large and complex datasets. Additionally, EDA may not be suitable for all types of data, such as high-dimensional datasets.


Written by sagol | 22+ years of experience in creating software products in various positions.
Published by HackerNoon on 2023/05/15