`plot_diff()`: analyze differences¶

Overview¶

The function plot_diff() compares column distributions and statistics across multiple datasets.

The sections below demonstrate the functionality of plot_diff().

Load the dataset¶

dataprep.eda supports Pandas and Dask dataframes. Here, we load the training and test splits of the house prices dataset into two Pandas dataframes.

[1]:

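import numpy as np
from dataprep.datasets import load_dataset

# Load the training and test splits of the house prices dataset.
df1 = load_dataset("house_prices_train")
df2 = load_dataset("house_prices_test")

# Treat the " ?" placeholder as a missing value.
df1 = df1.replace(" ?", np.nan)
df2 = df2.replace(" ?", np.nan)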
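Before comparing two dataframes, it can help to check which columns they share. The overview further down reports 81 variables for df1 and 80 for df2, so one column exists in only one of them. The following is a minimal sketch using plain pandas on small stand-in frames; the frames and their column names are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Toy stand-ins for the training and test dataframes (illustrative data only).
train = pd.DataFrame(
    {"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]}
)
test = pd.DataFrame({"Id": [1461, 1462], "LotArea": [11622, 14267]})

# Columns present in one frame but not in the other.
only_in_train = sorted(set(train.columns) - set(test.columns))
only_in_test = sorted(set(test.columns) - set(train.columns))
shared = [c for c in train.columns if c in test.columns]

print("only in train:", only_in_train)
print("only in test:", only_in_test)
print("shared:", shared)
```

The same three lines can be applied to df1 and df2 once they are loaded.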

Get an overview of the dataset with `plot_diff([df1, df2])`¶

We start by calling plot_diff([df1, df2]), which computes dataset-level statistics, a histogram for each numerical column, and a bar chart for each categorical column across the two dataframes. The number of bins in each histogram can be set with the parameter bins, and the number of categories in each bar chart with the parameter ngroups. If a column contains missing values, the percentage of missing values is shown in the plot title, and the missing values are ignored when generating the plot.
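To make the roles of bins and ngroups concrete, here is a rough sketch in plain NumPy and pandas of the kind of per-column summary described above. It is not dataprep's implementation: the helper names shared_histograms and top_categories are hypothetical, and the arguments bins and ngroups only mirror the parameter names mentioned in the text.

```python
import numpy as np
import pandas as pd


def shared_histograms(a: pd.Series, b: pd.Series, bins: int = 10):
    """Histogram two numerical columns on common bin edges, ignoring NaNs."""
    a, b = a.dropna(), b.dropna()
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)  # bins intervals need bins + 1 edges
    counts_a, _ = np.histogram(a, bins=edges)
    counts_b, _ = np.histogram(b, bins=edges)
    return edges, counts_a, counts_b


def top_categories(col: pd.Series, ngroups: int = 10) -> pd.Series:
    """Counts of the ngroups most frequent categories (NaNs are not counted)."""
    return col.value_counts().head(ngroups)


# Toy columns; the NaN and None entries are skipped by both helpers.
num_a = pd.Series([0.0, 1.0, 2.0, np.nan])
num_b = pd.Series([2.0, 3.0, 4.0])
edges, counts_a, counts_b = shared_histograms(num_a, num_b, bins=2)

cat = pd.Series(["x", "y", "x", None, "z", "x", "y"])
top = top_categories(cat, ngroups=2)
```

Using the same bin edges for both columns is what makes the two histograms directly comparable.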

[2]:

from dataprep.eda import plot_diff
plot_diff([df1, df2])

[2]:

DataPrep.EDA Report

Stats

Difference Overview

	df1	df2
Number of Variables	81	80
Number of Rows	1460	1459
Missing Cells	6965	7000
Missing Cells (%)	5.9%	6.0%
Duplicate Rows	0	0
Duplicate Rows (%)	0.0%	0.0%
Total Size in Memory	924.0 KB	912.0 KB
Average Row Size in Memory	922.6 KB	910.6 KB
Variable Types	Numerical: 27 Categorical: 53 GeoGraphy: 1	Numerical: 26 Categorical: 53 GeoGraphy: 1

df1

df2
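Several rows of the overview table can be approximated by hand with pandas, which is useful for checking the report. The sketch below is an assumption about how such numbers can be derived, not dataprep's own code; overview_stats is a hypothetical helper and the input frame is toy data.

```python
import numpy as np
import pandas as pd


def overview_stats(df: pd.DataFrame) -> dict:
    """Dataset-level counts similar to those in the overview table."""
    n_rows, n_cols = df.shape
    total_cells = n_rows * n_cols
    missing = int(df.isna().sum().sum())
    return {
        "Number of Variables": n_cols,
        "Number of Rows": n_rows,
        "Missing Cells": missing,
        "Missing Cells (%)": 100.0 * missing / total_cells if total_cells else 0.0,
        "Duplicate Rows": int(df.duplicated().sum()),
    }


# Toy frame: two missing cells, and the last row repeats the first.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 1.0], "b": ["x", "y", None, "x"]})
stats = overview_stats(toy)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Calling overview_stats(df1) and overview_stats(df2) gives two dictionaries that can be placed side by side, in the same spirit as the table above.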

Set the customized label in the comparison¶

Sometimes we want to give our datasets more descriptive names. These can be specified with the parameter diff.label.

[3]:

plot_diff([df1, df2], config={"diff.label": ["train", "test"]})

[3]:

DataPrep.EDA Report

Stats

Difference Overview

	train	test
Number of Variables	81	80
Number of Rows	1460	1459
Missing Cells	6965	7000
Missing Cells (%)	5.9%	6.0%
Duplicate Rows	0	0
Duplicate Rows (%)	0.0%	0.0%
Total Size in Memory	924.0 KB	912.0 KB
Average Row Size in Memory	922.6 KB	910.6 KB
Variable Types	Numerical: 27 Categorical: 53 GeoGraphy: 1	Numerical: 26 Categorical: 53 GeoGraphy: 1

train

test
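When the list of datasets is built programmatically, it is easy to pass the wrong number of labels. The sketch below builds the config dictionary with a small check first. The helper label_config is hypothetical, and the rule that there should be exactly one distinct label per dataset is an assumption made here rather than documented behavior.

```python
from typing import Dict, List


def label_config(n_datasets: int, labels: List[str]) -> Dict[str, List[str]]:
    """Build a config dict for diff.label after basic sanity checks."""
    if len(labels) != n_datasets:
        raise ValueError(
            f"expected {n_datasets} labels, got {len(labels)}"
        )
    if len(set(labels)) != len(labels):
        raise ValueError("labels should be distinct")
    return {"diff.label": list(labels)}


cfg = label_config(2, ["train", "test"])
print(cfg)
# The result can then be passed as: plot_diff([df1, df2], config=cfg)
```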

Change the baseline dataset used for comparison¶

By default, the first dataset is used as the baseline for computing the distributions and statistics. To use a different dataset as the baseline, set the parameter diff.baseline.

Note that the baseline index is zero-based, whereas the default labels are numbered from 1 (df1, df2). Setting diff.baseline to 1 therefore selects the second dataset, df2.

[4]:

plot_diff([df1, df2], config={"diff.baseline": 1})

[4]:

DataPrep.EDA Report

Stats

Difference Overview

	df1	df2
Number of Variables	81	80
Number of Rows	1460	1459
Missing Cells	6965	7000
Missing Cells (%)	5.9%	6.0%
Duplicate Rows	0	0
Duplicate Rows (%)	0.0%	0.0%
Total Size in Memory	924.0 KB	912.0 KB
Average Row Size in Memory	922.6 KB	910.6 KB
Variable Types	Numerical: 27 Categorical: 53 GeoGraphy: 1	Numerical: 26 Categorical: 53 GeoGraphy: 1

df1

df2
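Because the baseline index is zero-based, looking it up from a label avoids off-by-one mistakes. The following sketch combines the two config keys shown so far in one dictionary; baseline_config is a hypothetical helper, and passing both keys together is an assumption not demonstrated above.

```python
from typing import Any, Dict, List


def baseline_config(labels: List[str], baseline_label: str) -> Dict[str, Any]:
    """Return a config that names the datasets and selects the baseline by label."""
    # list.index returns a zero-based position, which matches diff.baseline.
    return {
        "diff.label": list(labels),
        "diff.baseline": labels.index(baseline_label),
    }


cfg = baseline_config(["train", "test"], "test")
print(cfg)
# The result can then be passed as: plot_diff([df1, df2], config=cfg)
```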

Change to density plot¶

By default, numerical columns are compared using histograms. You can switch to a density plot with the parameter diff.density.

[5]:

plot_diff([df1, df2], config={"diff.density": True})

[5]:

DataPrep.EDA Report

Stats

Difference Overview

	df1	df2
Number of Variables	81	80
Number of Rows	1460	1459
Missing Cells	6965	7000
Missing Cells (%)	5.9%	6.0%
Duplicate Rows	0	0
Duplicate Rows (%)	0.0%	0.0%
Total Size in Memory	924.0 KB	912.0 KB
Average Row Size in Memory	922.6 KB	910.6 KB
Variable Types	Numerical: 27 Categorical: 53 GeoGraphy: 1	Numerical: 26 Categorical: 53 GeoGraphy: 1

df1

df2
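One reason to prefer densities over raw counts is that they remain comparable when the datasets differ in size. The sketch below illustrates this with NumPy's normalized histogram (density=True) on synthetic data. It only demonstrates the normalization idea; the estimator dataprep uses for its density plot may differ, for example a kernel density estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
big = rng.normal(loc=0.0, scale=1.0, size=1460)   # synthetic "large" sample
small = rng.normal(loc=0.0, scale=1.0, size=300)  # synthetic "small" sample

# Common bin edges covering both samples.
edges = np.linspace(min(big.min(), small.min()), max(big.max(), small.max()), 21)
widths = np.diff(edges)

# Raw counts scale with sample size ...
counts_big, _ = np.histogram(big, bins=edges)
counts_small, _ = np.histogram(small, bins=edges)

# ... while densities are normalized so that each histogram has unit area.
dens_big, _ = np.histogram(big, bins=edges, density=True)
dens_small, _ = np.histogram(small, bins=edges, density=True)

area_big = float((dens_big * widths).sum())
area_small = float((dens_small * widths).sum())
print(counts_big.sum(), counts_small.sum(), area_big, area_small)
```

With raw counts the larger sample dominates every bin, while the normalized curves can be overlaid directly.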
