Interactive Visualization for Exploratory Data Analysis in Jupyter Notebook

Phillip Peng
3 min read · Apr 11, 2020

Benefits of using interactive visualization

Data scientists and analysts often conduct exploratory data analysis (EDA) to find insights for reporting, modeling, or both. Visualization is a quick and easy way to find and present those insights. Interactive visualization makes this approach even more efficient and powerful: with a few lines of code, we can generate many charts and pack them into a concise format.

As a modeler, I usually conduct EDA through univariate analysis (checking the distribution of each variable) and bivariate analysis (checking how the distribution of each independent variable relates to the dependent variable). I am going to demo how interactive visualization helps me accomplish this.

Preparation

Before we start the visualization, we need to prepare our toolset. I use ipywidgets in Jupyter Notebook for the development.

Let’s first install ipywidgets in the working environment using pip or Conda.

With pip:

pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension

With Conda:

conda install -c conda-forge ipywidgets

You may need to run the commands above with administrator rights.

Once ipywidgets is installed, we can load all the packages we need, including ipywidgets for interactive visualization.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets

Code Demo

For illustration, we will use the public dataset ‘tips’ included in the seaborn package. We first load the dataset.

tips = sns.load_dataset("tips")
tips.head()

There are four categorical variables (sex, smoker, day, time) and three numerical variables: two floats (total_bill, tip) and one integer (size).
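
The variable counts above can be checked directly from the column dtypes. The sketch below is one way to do that. `split_columns` is a hypothetical helper name, and the three-row frame is a made-up stand-in with the same column types as tips, so the snippet runs without downloading anything. In the notebook you would pass `tips` instead.

```python
import pandas as pd


def split_columns(df):
    """Return (categorical, numerical) column names, using the same
    select_dtypes calls as the plots in this post."""
    categorical = list(df.select_dtypes(include="category").columns)
    numerical = list(df.select_dtypes(exclude="category").columns)
    return categorical, numerical


# Made-up stand-in rows with the same column types as tips (not real data).
sample = pd.DataFrame({
    "total_bill": [10.0, 20.5, 15.25],
    "tip": [1.5, 3.0, 2.0],
    "sex": pd.Categorical(["Female", "Male", "Male"]),
    "smoker": pd.Categorical(["No", "No", "Yes"]),
    "day": pd.Categorical(["Sun", "Sat", "Sun"]),
    "time": pd.Categorical(["Dinner", "Dinner", "Lunch"]),
    "size": [2, 3, 2],
})

categorical, numerical = split_columns(sample)
print(categorical)  # ['sex', 'smoker', 'day', 'time']
print(numerical)    # ['total_bill', 'tip', 'size']
```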

Univariate Analysis

We will use countplot to create a univariate count plot for each categorical variable.

Categorical variables

@ipywidgets.interact
def plot(col=tips.select_dtypes(include='category').columns):  # categorical univariate plot
    # interact turns the iterable default into a dropdown of column names
    sns.countplot(y=col, data=tips)  # y indicates horizontal plot

For numerical variables, we will use a distribution plot to check the distribution and a boxplot to detect outliers.

Numerical variables

@ipywidgets.interact
def plot(col=tips.select_dtypes(exclude='category').columns):  # numerical univariate plot
    f, ax = plt.subplots(figsize=(7, 3))
    # distplot is deprecated since seaborn 0.11; sns.histplot(tips[col], kde=True) is the closest replacement
    ax = sns.distplot(tips[col])  # visualize the distribution
    plt.show()
    ax = sns.boxplot(x=tips[col])  # detect outliers
    plt.show()
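
To put numbers on what the boxplot shows, the sketch below lists the points that fall outside the whiskers. It assumes the default whisker length of 1.5 times the interquartile range, which is what sns.boxplot uses unless whis is changed. `iqr_outliers` is a hypothetical helper, not part of seaborn, and the input values are toy numbers.

```python
import pandas as pd


def iqr_outliers(values, whis=1.5):
    """Return the values lying more than whis * IQR outside the quartiles."""
    values = pd.Series(values)
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


# Toy numbers for illustration only; in the notebook, pass tips[col].
print(iqr_outliers([1, 2, 3, 4, 5, 100]).tolist())  # [100]
```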

Bivariate analysis

In predictive classification models, the target variable is categorical. We can add the target variable as the hue layer to show the distribution within each target class. Here we treat ‘sex’ as the target for illustration. If a variable is predictive, we should see clearly different distributions across the classes of the target variable.

Categorical variables

@ipywidgets.interact
def plot(col=tips.select_dtypes(include='category').columns):  # categorical bivariate plot
    sns.countplot(y=col, hue='sex', data=tips)  # hue splits each bar by target class
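
The same comparison can be read as a table. The sketch below uses pd.crosstab with normalize="index" to show the share of each target class within every category. `class_shares` is a hypothetical helper and the rows are made up for illustration. Rows whose shares differ a lot from one another suggest the variable carries information about the target.

```python
import pandas as pd


def class_shares(df, col, target):
    """Share of each target class within every category of col (rows sum to 1)."""
    return pd.crosstab(df[col], df[target], normalize="index")


# Made-up rows for illustration only; in the notebook, pass tips.
sample = pd.DataFrame({
    "smoker": ["Yes", "Yes", "No", "No", "No", "No"],
    "sex": ["Male", "Female", "Male", "Male", "Male", "Female"],
})

shares = class_shares(sample, "smoker", "sex")
print(shares)
```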

For numerical variables, we first split the DataFrame into one subset per target class, then loop over the subsets and plot the distribution of each one.

Numerical variables

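@ipywidgets.interact
def plot(col_x=tips.select_dtypes(include='float').columns, col_y='sex'):  # numerical bivariate plot
    # col_y is a plain string, so interact renders it as a text box
    # build one sub-DataFrame per class of the target variable
    targets = [tips.loc[tips[col_y] == val] for val in tips[col_y].unique()]
    for target in targets:
        ax = sns.distplot(target[col_x])  # draw each class on the shared axes
    plt.show()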
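
A numeric summary per class can complement the distribution plots. The sketch below groups by the target and computes a few statistics. `summarize_by_class` is a hypothetical helper and the rows are made up for illustration. observed=True is passed so that, for a categorical target, only the classes present in the data are listed.

```python
import pandas as pd


def summarize_by_class(df, col_x, col_y):
    """Count, mean, and median of col_x within each class of col_y."""
    grouped = df.groupby(col_y, observed=True)[col_x]
    return grouped.agg(["count", "mean", "median"])


# Made-up rows for illustration only; in the notebook, pass tips.
sample = pd.DataFrame({
    "tip": [1.0, 2.0, 3.0, 4.0, 6.0],
    "sex": pd.Categorical(["Female", "Female", "Male", "Male", "Male"]),
})

summary = summarize_by_class(sample, "tip", "sex")
print(summary)
```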

Of course, you can use other types of plots for interactive visualization purposes.
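
As one example of another plot type, the sketch below pairs a numerical dropdown with a categorical dropdown and draws a boxplot for each combination. `plot_box` and `make_box_explorer` are hypothetical names chosen for this illustration. It uses ipywidgets.interactive, which builds the same controls as the @interact decorator but returns the widget instead of displaying it immediately.

```python
import ipywidgets
import matplotlib.pyplot as plt
import seaborn as sns


def plot_box(df, col_x, col_y):
    """Draw one horizontal box of col_x for every category of col_y."""
    ax = sns.boxplot(x=col_x, y=col_y, data=df)
    ax.set_title(f"{col_x} by {col_y}")
    return ax


def make_box_explorer(df):
    """Build (but do not display) a widget with one dropdown per argument."""
    num_cols = list(df.select_dtypes(include="number").columns)
    cat_cols = list(df.select_dtypes(include="category").columns)

    def show(col_x, col_y):
        plot_box(df, col_x, col_y)
        plt.show()

    # Lists are turned into dropdowns, as with the @interact decorator.
    return ipywidgets.interactive(show, col_x=num_cols, col_y=cat_cols)
```

In a notebook cell, ending the cell with make_box_explorer(tips) should display the two dropdowns and redraw the plot whenever a selection changes.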

My other posts:

Turn on These Six Power-User Features of Your Jupyter Notebook Now
