Easy Data Visualization with AutoViz [Maybe Just a Quick One]

The “Maybe Just a Quick One” series title is inspired by my most common reply to “Fancy a drink?”, which may or may not end up in a long night. Likewise, these posts are intended to be short, but I sometimes get carried away, so apologies in advance.

What is Exploratory Data Analysis and why is it important?

So we know what data is and we know what analysis is. But what is the meaning of “exploratory” in a data science context? What kind of conclusions are we trying to reach? Well, there are various reasons that make this step a necessity in a data science project's lifecycle, helping us to:

Make a good judgement on the quality of our data. Identify any missing values, outliers, possible differences in measurement units etc.
Take a closer look at the characteristics of the variables like their types, distributions, variance as well as the correlations between them.
Summarise the data, providing an “at a glance” way to understand it. This benefits not just the data scientist or the developer but any stakeholder involved in the project. One of the most common ways to achieve this is by using graphs and plots, in other words, to visualize the data.
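To make these three goals a bit more concrete, here is a minimal sketch of the same checks done by hand with pandas. The toy dataframe and its column names are made up for illustration, and the 1.5 * IQR rule is just one common way to flag outliers.

```python
import pandas as pd

# Hypothetical toy data: one missing value and one suspicious outlier.
df = pd.DataFrame(
    {
        "rooms": [3, 4, 2, 5, 3],
        "area_m2": [70.0, 95.0, None, 120.0, 900.0],
        "price": [210, 300, 150, 410, 230],
    }
)

# Data quality: count the missing values in each column.
missing_per_column = df.isna().sum()

# Variable characteristics: column types and summary statistics.
column_types = df.dtypes
summary = df.describe()

# Relationships: pairwise Pearson correlations between the columns.
correlations = df.corr()

# Outliers: flag values outside 1.5 * IQR from the quartiles of one column.
q1, q3 = df["area_m2"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["area_m2"] < q1 - 1.5 * iqr) | (df["area_m2"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(missing_per_column)
print(column_types)
print(summary)
print(correlations)
print(outliers)
```

Each of these outputs is a table of numbers, which is exactly why plots are usually the next step.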

What is Autoviz?

Autoviz is a Python library that can massively speed up the visualization of our data, making it fully automated. Let’s jump straight into coding. I am a firm believer in learning by doing.

First things first. We will create a new Conda environment and install the necessary packages:

conda create -n autoviz python=3.8
conda activate autoviz
python -m pip install autoviz
conda install "scikit-learn<1.2"

The reason I included scikit-learn is to use some of its datasets to demonstrate the use of Autoviz. The version is pinned below 1.2 because load_boston, used in the next step, was removed in scikit-learn 1.2. You can of course download the dataset from other sources and skip this step.

Now, create a new notebook and start by importing the packages. I will use the infamous Boston House Prices dataset first.

from autoviz.AutoViz_Class import AutoViz_Class
from sklearn.datasets import load_boston, load_iris
import pandas as pd

# Load the dataset and put the features into a dataframe
boston = load_boston()
df_boston = pd.DataFrame(data=boston.data, columns=boston.feature_names)
# Add the target (median house value) as an extra column
df_boston["btarget"] = boston.target
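If you are on a newer scikit-learn release where load_boston is no longer available, the Iris dataset that ships with the library can stand in. This is only a sketch: the column name "itarget" is my own choice, mirroring "btarget" above, and as_frame=True assumes scikit-learn 0.23 or later.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; .frame holds the features
# plus a "target" column in a single dataframe.
iris = load_iris(as_frame=True)

# Rename the target column (hypothetical name, mirroring "btarget").
df_iris = iris.frame.rename(columns={"target": "itarget"})

print(df_iris.head())
```

The resulting df_iris can then be used wherever df_boston appears, with "itarget" as the target variable.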

We can now instantiate Autoviz and let it do its magic:

AV = AutoViz_Class()
filename = ""  # empty because the data comes from a dataframe
sep = ","
dft = AV.AutoViz(
    filename,
    sep=sep,
    depVar="btarget",
    dfte=df_boston,
    header=0,
    verbose=2,
    lowess=False,
    chart_format="svg",
    max_rows_analyzed=150000,
    max_cols_analyzed=30,
)

As you might have noticed, there are some arguments passed to AutoViz, but what do they mean? Let's see what the documentation says:

filename

- Make sure that you give filename as empty string ("") if there is no filename associated with this data and you want to use a dataframe, then use dfte to give the name of the dataframe. Otherwise, fill in the file name and leave dfte as empty string. Only one of these two is needed to load the data set.

sep

- this is the separator in the file. It can be comma, semi-colon or tab or any value that you see in your file that separates each column.

depVar

- target variable in your dataset. You can leave it as empty string if you don't have a target variable in your data.

dfte

- this is the input dataframe in case you want to load a pandas dataframe to plot charts. In that case, leave filename as an empty string.

header

- the row number of the header row in your file. If it is the first row, then this must be zero.

verbose

- it has 3 acceptable values: 0, 1 or 2. With zero, you get all charts but limited info. With 1 you get all charts and more info. With 2, you will not see any charts but they will be quietly generated and saved in your local current directory under the AutoViz_Plots directory which will be created. Make sure you delete this folder periodically, otherwise, you will have lots of charts saved here if you used verbose=2 option a lot.

lowess

- this option is very nice for small datasets where you can see regression lines for each pair of continuous variable against the target variable. Don't use this for large data sets (that is over 100,000 rows)

chart_format

- this can be SVG, PNG or JPG. You will get charts generated and saved in this format if you used verbose=2 option. Very useful for generating charts and using them later.

max_rows_analyzed

- limits the max number of rows that is used to display charts. If you have a very large data set with millions of rows, then use this option to limit the amount of time it takes to generate charts. We will take a statistically valid sample.

max_cols_analyzed

- limits the number of continuous vars that can be analyzed
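To get a feel for what a row cap like max_rows_analyzed means in practice, here is a rough sketch of sampling a large dataframe down to a fixed size with pandas. How AutoViz draws its own sample internally is not covered here; this is plain random sampling, and the dataframe is made up for illustration.

```python
import pandas as pd

# Hypothetical large frame standing in for a dataset with a million rows.
df_large = pd.DataFrame({"x": range(1_000_000)})
df_large["y"] = df_large["x"] % 7

max_rows = 150_000  # same cap as in the AutoViz call above

# Simple random sample without replacement; random_state makes it repeatable.
if len(df_large) > max_rows:
    df_sample = df_large.sample(n=max_rows, random_state=42)
else:
    df_sample = df_large

print(len(df_large), "->", len(df_sample))
```

Sampling like this trades a little precision for a lot of speed, which is usually a good deal for a first look at the data.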

Wait...that was it?

Yes. It is that simple. Using 2 as the verbose level had the charts generated in the AutoViz_Plots folder. Let's take a look at some of them:
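Since the charts are written to disk, a couple of small helpers can list them and clean the folder up afterwards. This is a sketch under assumptions: list_charts and clear_charts are hypothetical helper names of my own, and because I am not sure how the files are laid out inside the folder, the search is recursive.

```python
import shutil
from pathlib import Path


def list_charts(plot_dir, suffix=".svg"):
    """Return the saved chart files under plot_dir, including subfolders."""
    return sorted(Path(plot_dir).rglob(f"*{suffix}"))


def clear_charts(plot_dir):
    """Delete the plot directory and everything in it, if it exists."""
    shutil.rmtree(plot_dir, ignore_errors=True)


# AutoViz_Plots is the folder name mentioned in the documentation.
charts = list_charts("AutoViz_Plots")
print(f"{len(charts)} chart(s) found")
for chart in charts:
    print(chart)
```

Calling clear_charts("AutoViz_Plots") every now and then takes care of the periodic clean-up the documentation recommends.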

Violin plots, scatter plots, heatmaps and more. You get the gist of it.

That is cool. I won't need to do any data visualization myself anymore! (I hear you say)

Not quite. I believe that we need the best of both worlds. Having an automated data visualization tool like Autoviz to quickly generate some graphs for your data is a great first step. It can very quickly give you a good summary of it. However, you might need to dig deeper and create some plots yourself, depending on the task.
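As an example of digging deeper by hand, here is a minimal sketch of a custom scatter plot with matplotlib. The data, column names and output file name are all made up for illustration; in a real project you would plot columns from your own dataframe.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data standing in for two columns of interest.
df = pd.DataFrame(
    {
        "rooms": [2, 3, 3, 4, 5, 6],
        "price": [150, 210, 230, 300, 410, 480],
    }
)

# A hand-made scatter plot with explicit labels and title.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["rooms"], df["price"])
ax.set_xlabel("rooms")
ax.set_ylabel("price")
ax.set_title("Price vs. rooms")
fig.tight_layout()

# Save the figure next to the notebook (hypothetical file name).
fig.savefig("price_vs_rooms.png")
```

The point is not this particular chart but the control: once the automated plots show you where to look, you decide exactly what goes on the axes.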

Further reading:

Autoviz home page