plot_diff(): analyze differences


The function plot_diff() explores differences in column distributions and statistics across multiple datasets.

Next, we demonstrate the functionality of plot_diff().

Load the dataset

dataprep.eda supports Pandas and Dask dataframes. Here, we load the training and test splits of the house prices dataset into two Pandas dataframes.

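from dataprep.datasets import load_dataset
import numpy as np
# Load the training split and mark the " ?" placeholder as missing
df1 = load_dataset("house_prices_train")
df1 = df1.replace(" ?", np.nan)
# Load the test split and apply the same cleanup
df2 = load_dataset("house_prices_test")
df2 = df2.replace(" ?", np.nan)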

Get an overview of the dataset with plot_diff([df1, df2])

We start by calling plot_diff([df1, df2]), which computes dataset-level statistics, a histogram for each numerical column, and a bar chart for each categorical column across the two dataframes. The number of bins in each histogram can be set with the parameter bins, and the number of categories in each bar chart with the parameter ngroups. If a column contains missing values, the percentage of missing values is shown in the plot title, and the missing values are ignored when generating the plots.

from dataprep.eda import plot_diff
plot_diff([df1, df2])
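
To make the description above concrete, the following is a minimal, dependency-free sketch of the kind of per-column comparison it describes: the percentage of missing values, a histogram built on bin edges shared by both datasets, and the most frequent categories. This is not dataprep's implementation. The helper names (missing_percent, shared_histograms, top_categories) and the sample values are hypothetical, and the bins and ngroups arguments only mirror the parameter names mentioned in the text.

```python
import math
from collections import Counter


def _is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_percent(values):
    """Return the percentage of missing entries in a column."""
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if _is_missing(v)) / len(values)


def shared_histograms(columns, bins=10):
    """Bin the same numerical column from several datasets on common edges.

    Missing values are dropped before binning, so they do not affect counts.
    """
    cleaned = [[v for v in col if not _is_missing(v)] for col in columns]
    present = [v for col in cleaned for v in col]
    if not present:
        raise ValueError("no non-missing values to bin")
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
    edges = [lo + i * width for i in range(bins + 1)]
    counts = []
    for col in cleaned:
        col_counts = [0] * bins
        for v in col:
            # The maximum value belongs to the last bin, not a new one.
            idx = min(int((v - lo) / width), bins - 1)
            col_counts[idx] += 1
        counts.append(col_counts)
    return edges, counts


def top_categories(values, ngroups=10):
    """Return the ngroups most frequent non-missing categories with counts."""
    present = [v for v in values if not _is_missing(v)]
    return Counter(present).most_common(ngroups)


if __name__ == "__main__":
    # Hypothetical sample values standing in for one column of each dataframe.
    train = [1.0, 2.0, 2.5, None, 4.0]
    test = [0.0, 3.0, float("nan"), 4.0]
    print(missing_percent(train), missing_percent(test))
    print(shared_histograms([train, test], bins=4))
    print(top_categories(["a", "b", "a", None, "c", "a", "b"], ngroups=2))
```

Using shared bin edges is the design choice that makes the two histograms directly comparable: each bar covers the same value range in both datasets, so a difference in bar height reflects a difference in the data and not in the binning.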