EDA

This section introduces the Exploratory Data Analysis component of DataPrep.

Introduction to Exploratory Data Analysis and dataprep.eda

Exploratory Data Analysis (EDA) is the process of exploring a dataset to understand its main characteristics. The dataprep.eda package simplifies this process by exposing a small set of APIs for exploring those characteristics. Each API lets the user analyze the dataset at several levels of detail, from a dataset-wide overview down to individual columns, and from different perspectives. Specifically, dataprep.eda provides the following functionality:

  • Analyze column distributions with plot(). The function plot() explores the column distributions and statistics of the dataset. It detects each column's type and outputs the plots and statistics appropriate for that type. The user can optionally pass one or two columns of interest as parameters. If one column is passed, its distribution is plotted in several ways and column statistics are computed. If two columns are passed, plots depicting the relationship between the two columns are generated.

  • Analyze correlations with plot_correlation(). The function plot_correlation() explores the correlation between columns in several ways, using multiple correlation metrics. By default, it plots correlation matrices for several metrics. The user can optionally pass one or two columns of interest as parameters. If one column is passed, the correlation between this column and all other columns is computed and ranked. If two columns are passed, a scatter plot with a regression line is plotted.

  • Analyze missing values with plot_missing(). The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. By default, it generates plots that display the number of missing values in each column and any underlying patterns of missing values in the dataset. To understand the impact of the missing values in one column on the other columns, the user can pass the column name as a parameter. Then plot_missing() plots the distribution of each column with and without the rows in which the given column is missing.
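As a compact preview of the call forms described in the list above for plot_correlation() and plot_missing(), the sketch below gathers them in one helper. The helper name correlation_and_missing_reports and its parameter names are hypothetical and not part of dataprep.eda; only plot_correlation() and plot_missing() come from the package. It assumes that columns are passed positionally after the dataframe, and that x and y name numerical columns.

```python
def correlation_and_missing_reports(df, x, y, column_with_missing):
    """Run the call forms of plot_correlation() and plot_missing() described above.

    The function and parameter names are illustrative. `x` and `y` are assumed
    to name numerical columns of `df`; `column_with_missing` is assumed to name
    a column that contains missing values.
    """
    # Imported lazily so that dataprep is only required when the helper is called.
    from dataprep.eda import plot_correlation, plot_missing

    return {
        # Correlation matrices over the whole dataset.
        "correlation_overview": plot_correlation(df),
        # Correlation between `x` and all other columns, ranked.
        "correlation_ranked": plot_correlation(df, x),
        # Scatter plot with a regression line for `x` and `y`.
        "correlation_pair": plot_correlation(df, x, y),
        # Number and patterns of missing values across the dataset.
        "missing_overview": plot_missing(df),
        # Impact of the missing values in one column on the other columns.
        "missing_impact": plot_missing(df, column_with_missing),
    }
```

In a notebook, each returned report is typically rendered by leaving it as the last expression of a cell. For the house_prices_train dataset used in the following sections, a call such as correlation_and_missing_reports(df, "SalePrice", "LotArea", "LotFrontage") would be a plausible choice, assuming those column names exist as they do in the public house-prices data.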

The following sections give a simple demonstration of plot(), plot_correlation(), and plot_missing(), using an example dataset.

Analyze distributions with plot()

The function plot() explores the distributions and statistics of the dataset. The following describes the functionality of plot() for a given dataframe df.

  1. plot(df): plots the distribution of each column and calculates dataset statistics

  2. plot(df, x): plots the distribution of column x in various ways and calculates column statistics

  3. plot(df, x, y): generates plots depicting the relationship between columns x and y
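The three call forms listed above can be placed side by side as in the following sketch. The helper name plot_call_forms and its result keys are hypothetical; only plot() comes from dataprep.eda, and the columns are assumed to be passed positionally after the dataframe.

```python
def plot_call_forms(df, x, y):
    """Run the three call forms of plot() listed above on dataframe `df`.

    `x` and `y` are assumed to be names of columns in `df`. The function
    name and the dictionary keys are illustrative only.
    """
    # Imported lazily so that dataprep is only required when the helper is called.
    from dataprep.eda import plot

    return {
        # 1. Distribution of every column plus dataset statistics.
        "dataset": plot(df),
        # 2. Distribution of column `x` in several ways plus column statistics.
        "one_column": plot(df, x),
        # 3. Plots depicting the relationship between `x` and `y`.
        "two_columns": plot(df, x, y),
    }
```

Each entry holds whatever plot() returns for that call form, so a single report can be displayed in a notebook by evaluating, for example, plot_call_forms(df, x, y)["one_column"] at the end of a cell.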

The following shows an example of plot(df). It plots a histogram for each numerical column and a bar chart for each categorical column, and it computes dataset statistics.

[1]:
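from dataprep.eda import plot
from dataprep.datasets import load_dataset
import numpy as np

# Load the example house-prices training dataset.
df = load_dataset('house_prices_train')
# Plot the distribution of every column and compute dataset statistics.
plot(df)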
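The per-column choice described above (a histogram for a numerical column, a bar chart for a categorical one) can be pictured with a small, library-free sketch. This is not how dataprep.eda detects types internally; the functions infer_column_kind and choose_plots, the two-way classification, and the toy values are all assumptions made for illustration.

```python
from numbers import Number

# Default plot per column kind, as described in the text above.
DEFAULT_PLOT = {"numerical": "histogram", "categorical": "bar chart"}


def infer_column_kind(values):
    """Classify a column as 'numerical' or 'categorical' (simplified, hypothetical rule)."""
    # Ignore missing entries, represented here as None.
    present = [v for v in values if v is not None]
    # bool is a subclass of Number in Python, so it is excluded explicitly.
    if present and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    ):
        return "numerical"
    return "categorical"


def choose_plots(table):
    """Map each column name to the default plot for its inferred kind."""
    return {name: DEFAULT_PLOT[infer_column_kind(col)] for name, col in table.items()}


# Toy table with made-up values: one numerical and one categorical column.
toy = {"price": [208500, 181500, None], "zoning": ["RL", "RM", "RL"]}
print(choose_plots(toy))  # {'price': 'histogram', 'zoning': 'bar chart'}
```

A real implementation would also need to handle cases such as dates, free text, or numerical columns with very few distinct values, which this sketch deliberately leaves out.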