plot(): analyze distributions

Overview

The plot() function explores the distributions and statistics of a dataset. It generates a variety of visualizations and statistics that help the user understand the column distributions and their relationships. The following describes the functionality of plot() for a given dataframe df.

  1. plot(df): plots the distribution of each column and computes dataset statistics

  2. plot(df, col1): plots the distribution of column col1 in various ways, and computes its statistics

  3. plot(df, col1, col2): generates plots depicting the relationship between columns col1 and col2

The generated plots differ for numerical, categorical, geography and geopoint columns. The following table summarizes the output for each combination of column types.

col1         col2         Output
-----------  -----------  ------------------------------------------------------------------------------
None         None         dataset statistics, histogram or bar chart for each column
Numerical    None         column statistics, histogram, kde plot, qq-normal plot, box plot
Categorical  None         column statistics, bar chart, pie chart, word cloud, word frequencies
Geography    None         column statistics, bar chart, pie chart, word cloud, word frequencies, world map
Numerical    Numerical    scatter plot, hexbin plot, binned box plot
Numerical    Categorical  categorical box plot, multi-line chart
Categorical  Numerical    categorical box plot, multi-line chart
Categorical  Categorical  nested bar chart, stacked bar chart, heat map
Categorical  Geography    nested bar chart, stacked bar chart, heat map
Geography    Categorical  nested bar chart, stacked bar chart, heat map
Geopoint     Categorical  nested bar chart, stacked bar chart, heat map
Categorical  Geopoint     nested bar chart, stacked bar chart, heat map
Numerical    Geography    categorical box plot, multi-line chart, world map
Geography    Numerical    categorical box plot, multi-line chart, world map
Numerical    Geopoint     geo map
Geopoint     Numerical    geo map
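
As a reading aid, the table above can be expressed as a small lookup keyed on the column types. This is only a sketch of the documented mapping, not DataPrep's internal dispatch code: the names OUTPUTS and expected_output are hypothetical, and the type labels are plain strings copied from the table.

```python
# Illustrative lookup of the table above; NOT DataPrep's implementation.
# OUTPUTS and expected_output are hypothetical names used for this sketch.
NUM, CAT, GEO, POINT = "Numerical", "Categorical", "Geography", "Geopoint"

_WORDS = ["column statistics", "bar chart", "pie chart",
          "word cloud", "word frequencies"]
_BOX_LINE = ["categorical box plot", "multi-line chart"]
_BARS_HEAT = ["nested bar chart", "stacked bar chart", "heat map"]

# Each two-column pair is stored once; the table lists both orders
# with identical output, so the lookup below tries both orders.
OUTPUTS = {
    (None, None): ["dataset statistics",
                   "histogram or bar chart for each column"],
    (NUM, None): ["column statistics", "histogram", "kde plot",
                  "qq-normal plot", "box plot"],
    (CAT, None): _WORDS,
    (GEO, None): _WORDS + ["world map"],
    (NUM, NUM): ["scatter plot", "hexbin plot", "binned box plot"],
    (NUM, CAT): _BOX_LINE,
    (CAT, CAT): _BARS_HEAT,
    (CAT, GEO): _BARS_HEAT,
    (CAT, POINT): _BARS_HEAT,
    (NUM, GEO): _BOX_LINE + ["world map"],
    (NUM, POINT): ["geo map"],
}


def expected_output(col1_type=None, col2_type=None):
    """Return the outputs the table lists for the given column types."""
    if col1_type is None and col2_type is not None:
        raise ValueError("col2 can only be given together with col1")
    # Try the pair as given, then in swapped order.
    for key in ((col1_type, col2_type), (col2_type, col1_type)):
        if key in OUTPUTS:
            return list(OUTPUTS[key])
    raise KeyError(f"combination not listed in the table: "
                   f"{col1_type!r}, {col2_type!r}")


print(expected_output())            # plot(df)
print(expected_output(NUM))         # plot(df, col1) with a numerical col1
print(expected_output(CAT, NUM))    # plot(df, col1, col2)
```

Combinations that the table does not list (for example two Geography columns) raise a KeyError in this sketch rather than guessing an output.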

Next, we demonstrate the functionality of plot().

Load the dataset

dataprep.eda supports Pandas and Dask dataframes. Here, we will load the well-known adult dataset into a Pandas dataframe using the load_dataset function.

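[1]:
from dataprep.datasets import load_dataset
import numpy as np

df = load_dataset('adult')
# Replace the " ?" placeholder with NaN so it is treated as a missing value.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
df = df.replace(" ?", np.nan)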

Get an overview of the dataset with plot(df)

We start by calling plot(df), which computes dataset-level statistics, a histogram for each numerical column, and a bar chart for each categorical column. The number of bins in the histogram can be specified with the parameter bins, and the number of categories in the bar chart can be specified with the parameter ngroups. If a column contains missing values, the percentage of missing values is shown in the plot title, and the missing values are ignored when generating the plot.

[2]:
from dataprep.eda import plot
plot(df)
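
To make the roles of bins, ngroups and the missing-value percentage more concrete, the following standard-library sketch computes the same kinds of quantities by hand. It is an illustration under simplifying assumptions (equal-width bins, missing values represented as None or NaN), not DataPrep's implementation; the helper names and the sample values are hypothetical.

```python
# Illustrative only; NOT how DataPrep computes its plots.
# Helper names and sample data are hypothetical.
import math
from collections import Counter


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_percent(values):
    """Percentage of missing entries, as would be shown in a plot title."""
    if not values:
        return 0.0
    return 100.0 * sum(is_missing(v) for v in values) / len(values)


def histogram(values, bins=10):
    """Counts per equal-width bin, ignoring missing values."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return []
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant columns
    counts = [0] * bins
    for v in present:
        # The maximum value falls on the right edge; keep it in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def top_groups(values, ngroups=10):
    """The ngroups most frequent categories, ignoring missing values."""
    present = [v for v in values if not is_missing(v)]
    return Counter(present).most_common(ngroups)


ages = [25, 38, None, 52, 41, float("nan"), 30, 47]
jobs = ["Private", "Private", None, "State-gov",
        "Private", "Self-emp", "State-gov"]

print(missing_percent(ages))         # 25.0
print(histogram(ages, bins=3))       # [2, 2, 2]
print(top_groups(jobs, ngroups=2))   # [('Private', 3), ('State-gov', 2)]
```

A larger bins value gives a finer histogram, while a smaller ngroups value keeps only the most frequent categories in the bar chart.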