plot()
The function plot() explores the distributions and statistics of a dataset. It generates a variety of visualizations and statistics that help the user understand the column distributions and the relationships between columns. The following describes the functionality of plot() for a given dataframe df.
plot(df): plots the distribution of each column and computes dataset statistics
plot(df, col1): plots the distribution of column col1 in various ways, and computes its statistics
plot(df, col1, col2): generates plots depicting the relationship between columns col1 and col2
The generated plots are different for numerical, categorical and geography columns. The following table summarizes the output for the different column types.
| col1 | col2 | Output |
|---|---|---|
| None | None | dataset statistics, histogram or bar chart for each column |
| Numerical | None | column statistics, histogram, kde plot, qq-normal plot, box plot |
| Categorical | None | column statistics, bar chart, pie chart, word cloud, word frequencies |
| Geography | None | column statistics, bar chart, pie chart, word cloud, word frequencies, world map |
| Numerical | Numerical | scatter plot, hexbin plot, binned box plot |
| Numerical | Categorical | categorical box plot, multi-line chart |
| Categorical | Categorical | nested bar chart, stacked bar chart, heat map |
| Geography | Numerical | categorical box plot, multi-line chart, world map |
| Geopoint | Numerical | geo map |
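One way to read the summary above is as a lookup keyed on the detected column types, independent of argument order. The sketch below only illustrates that reading; it is not dataprep's implementation, the names OUTPUTS and expected_outputs are hypothetical, and it covers only the combinations listed above.

```python
# Hypothetical lookup mirroring the summary above; not part of dataprep.
# Keys are alphabetically sorted tuples of lower-case column types.
OUTPUTS = {
    (): ["dataset statistics", "histogram or bar chart for each column"],
    ("numerical",): ["column statistics", "histogram", "kde plot",
                     "qq-normal plot", "box plot"],
    ("categorical",): ["column statistics", "bar chart", "pie chart",
                       "word cloud", "word frequencies"],
    ("geography",): ["column statistics", "bar chart", "pie chart",
                     "word cloud", "word frequencies", "world map"],
    ("numerical", "numerical"): ["scatter plot", "hexbin plot", "binned box plot"],
    ("categorical", "numerical"): ["categorical box plot", "multi-line chart"],
    ("categorical", "categorical"): ["nested bar chart", "stacked bar chart",
                                     "heat map"],
    ("geography", "numerical"): ["categorical box plot", "multi-line chart",
                                 "world map"],
    ("geopoint", "numerical"): ["geo map"],
}


def expected_outputs(*column_types):
    """Return the outputs listed for the given column types, in either order."""
    key = tuple(sorted(t.lower() for t in column_types))
    return OUTPUTS[key]


# Argument order does not matter, matching plot(df, "age", "education")
# and plot(df, "education", "age").
print(expected_outputs("Numerical", "Categorical"))
print(expected_outputs("Categorical", "Numerical"))
```

Sorting the key makes the lookup symmetric, which reflects the fact that swapping col1 and col2 produces the same set of plots.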
Next, we demonstrate the functionality of plot().
dataprep.eda supports Pandas and Dask dataframes. Here, we will load the well-known adult dataset into a Pandas dataframe using the load_dataset function.
[1]:
from dataprep.datasets import load_dataset
import numpy as np

df = load_dataset('adult')
df = df.replace(" ?", np.nan)  # treat the " ?" placeholder as a missing value
We start by calling plot(df), which computes dataset-level statistics, a histogram for each numerical column, and a bar chart for each categorical column. The number of bins in the histogram can be specified with the parameter bins, and the number of categories in the bar chart can be specified with the parameter ngroups. If a column contains missing values, the percentage of missing values is shown in the plot title, and the missing values are ignored when generating the plots.
[2]:
from dataprep.eda import plot

plot(df)
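To make the roles of bins and ngroups concrete, here is a minimal pure-Python sketch of the summaries described above: the percentage of missing values, an equal-width histogram over the non-missing values, and the ngroups most frequent categories. It is an illustration under stated assumptions, not dataprep's code: the helper names missing_percent, histogram_counts and top_categories are hypothetical, the sample values are made up, and missing values are represented as None.

```python
from collections import Counter


def missing_percent(values):
    """Percentage of entries that are missing (None)."""
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)


def histogram_counts(values, bins):
    """Count the non-missing values in `bins` equal-width intervals."""
    present = [v for v in values if v is not None]
    counts = [0] * bins
    if not present:
        return counts
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    for v in present:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def top_categories(values, ngroups):
    """Return the `ngroups` most frequent non-missing categories with counts."""
    present = [v for v in values if v is not None]
    return Counter(present).most_common(ngroups)


ages = [22, 35, None, 41, 58, 63, None, 30]
education = ["HS-grad", "Bachelors", "HS-grad", None,
             "Masters", "HS-grad", "Bachelors"]

print(missing_percent(ages))         # 25.0
print(histogram_counts(ages, 2))     # [4, 2]
print(top_categories(education, 2))  # [('HS-grad', 3), ('Bachelors', 2)]
```

A larger bins value gives a finer histogram, while a smaller ngroups value keeps only the most common categories in the bar chart.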
After getting an overview of the dataset, we can thoroughly investigate a column of interest col1 using plot(df, col1). The output of plot(df, col1) is different for numerical and categorical columns.
When col1 is a numerical column, it computes column statistics, and generates a histogram, kde plot, box plot and qq-normal plot:
[3]:
plot(df, "age")
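A qq-normal plot compares the sorted sample with the quantiles of a normal distribution; points close to a straight line suggest approximately normal data. As a rough sketch of the points such a plot contains, the example below assumes a normal distribution fitted with the sample mean and standard deviation and plotting positions (i + 0.5) / n. These choices, the helper name qq_normal_points, and the sample values are assumptions for illustration; dataprep's exact computation may differ.

```python
from statistics import NormalDist, mean, stdev


def qq_normal_points(sample):
    """Return (theoretical quantile, observed value) pairs.

    Requires at least two values that are not all identical, because the
    fitted normal distribution needs a positive standard deviation.
    """
    data = sorted(sample)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(data)]


ages = [23, 25, 31, 38, 39, 42, 50, 52, 61]
for theoretical, observed in qq_normal_points(ages):
    print(f"{theoretical:6.2f}  {observed}")
```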
When col1 is a categorical column, it computes column statistics, and plots a bar chart, pie chart, word cloud, word frequency and word length:
[4]:
plot(df, "education")
When col1 is a geography column, it computes column statistics, and plots a bar chart, pie chart, word cloud, word frequency, word length and world map:
[5]:
df_geo = load_dataset('countries')
plot(df_geo, "Country")
Next, we can explore the relationship between columns col1 and col2 using plot(df, col1, col2). The output depends on the types of the columns.
When col1 and col2 are both numerical columns, it generates a scatter plot, hexbin plot and binned box plot:
[6]:
plot(df, "age", "hours-per-week")
When col1 and col2 are both categorical columns, it plots a nested bar chart, stacked bar chart and heat map:
[7]:
plot(df, "education", "marital-status")
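These three plots can be understood as different views of one quantity: the number of rows for each pair of categories. A minimal sketch of that count table follows. The helper name pair_counts and the sample rows are hypothetical, missing values are assumed to be None, and how dataprep computes the counts internally is not shown here.

```python
from collections import Counter


def pair_counts(col1, col2):
    """Count rows per (category1, category2) pair, skipping missing values."""
    return Counter(
        (a, b) for a, b in zip(col1, col2) if a is not None and b is not None
    )


education = ["Bachelors", "HS-grad", "Bachelors", "HS-grad", None, "Bachelors"]
marital = ["Married", "Married", "Never-married", "Married", "Divorced", "Married"]

counts = pair_counts(education, marital)
for (edu, status), n in sorted(counts.items()):
    print(f"{edu:10} {status:14} {n}")
```

Each count corresponds to one bar segment in the nested or stacked bar chart and to one cell in the heat map.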
When one of col1 and col2 is numerical and the other is categorical, it generates a box plot per category and a multi-line chart:
[8]:
plot(df, "age", "education") # or plot(df, "education", "age")
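A box plot per category summarizes the numerical column separately within each category. The sketch below computes the quartiles behind such a plot using only the standard library. The helper name box_stats_by_category, the sample values, and the choice of the "inclusive" quantile method are assumptions for illustration, not a description of dataprep's internals.

```python
from collections import defaultdict
from statistics import quantiles


def box_stats_by_category(numbers, categories):
    """Group numeric values by category and return (q1, median, q3) per group."""
    groups = defaultdict(list)
    for value, label in zip(numbers, categories):
        if value is not None and label is not None:
            groups[label].append(value)
    # quantiles() needs at least two values, so smaller groups are skipped.
    return {
        label: tuple(quantiles(values, n=4, method="inclusive"))
        for label, values in groups.items()
        if len(values) >= 2
    }


ages = [25, 30, 35, 40, 45, 28, 52, None, 40]
education = ["Bachelors"] * 5 + ["HS-grad"] * 4

stats = box_stats_by_category(ages, education)
for label, (q1, median, q3) in stats.items():
    print(label, q1, median, q3)
```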
When one of col1 and col2 is categorical and the other is a geopoint or geography column, it generates a box plot per category and a multi-line chart:
[9]:
from dataprep.eda.dtypes_v2 import LatLong

covid = load_dataset('covid19')
latlong = LatLong("Lat", "Long")  # create a geopoint type from the names of the latitude and longitude columns
plot(covid, latlong, "Country/Region")  # or plot(covid, "Country/Region", latlong)
plot(df_geo, "Country", "Region")  # or plot(df_geo, "Region", "Country")
When one of col1 and col2 is a geography column and the other is numerical, it generates a box plot per category, a multi-line chart and a world map:
[10]:
plot(df_geo, "Country", "Population") # or plot(df_geo, "Population", "Country")
When one of col1 and col2 is a geopoint column and the other is numerical, it generates a geo map:
[11]:
plot(covid, latlong, "2/16/2020") # or plot(covid, "2/16/2020", latlong)