plot_missing(): analyze missing values

Overview

The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. The impact is the change in the dataset’s characteristics (e.g., the histogram of a numerical column or bar chart of a categorical column) after removing the rows with missing values from the dataset. The following describes the functionality of plot_missing() for a given dataframe df.

  1. plot_missing(df): plots the amount and position of missing values, and how missingness is related across columns

  2. plot_missing(df, col1): plots the impact of the missing values in column col1 on all other columns

  3. plot_missing(df, col1, col2): plots the impact of the missing values from column col1 on column col2 in various ways.

Next, we demonstrate the functionality of plot_missing().

Load the dataset

dataprep.eda supports Pandas and Dask dataframes. Here, we will load the well-known Titanic dataset into a Pandas dataframe.

[1]:
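from dataprep.datasets import load_dataset
import numpy as np

# Load the Titanic dataset into a Pandas dataframe.
df = load_dataset('titanic')
# Treat the " ?" placeholder as missing (np.nan; the np.NaN alias was removed in NumPy 2.0).
df = df.replace(" ?", np.nan)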
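
As a minimal sketch of what the replace step does, the following uses a small hypothetical frame (the column names and values are illustrative, not the real dataset) in which " ?" marks an unknown value. After the replacement, pandas counts those cells as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: " ?" stands in for an unknown value.
toy = pd.DataFrame({"Age": ["22", " ?", "35"], "Cabin": [" ?", "C85", " ?"]})

# Replace the exact string " ?" with NaN so pandas treats it as missing.
toy = toy.replace(" ?", np.nan)

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```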

Get an overview of the missing values with plot_missing(df)

plot_missing(df) generates five visualizations, each offering a different view of the missing values in the dataset:

  1. A statistics table. This table shows missing-value statistics for the entire dataframe. “Missing Cells” is the total number of missing cells in the dataframe, and “Missing Cells (%)” is the percentage of cells that are missing. “Missing Columns” and “Missing Rows” are the numbers of columns and rows that contain at least one missing cell. “Avg Missing Cells per Column” and “Avg Missing Cells per Row” are the average numbers of missing cells per column and per row.

  2. A bar chart depicting the amount of missing values in each column. An insight tab in the upper right-hand corner shows the names of the columns and rows containing the most missing values, as well as their missing rates.

  3. A missing spectrum plot. In this visualization, the dataset is divided into bins, and each bin corresponds to a rectangle in the plot. Each rectangle is shaded in gray scale according to the number of missing values in the bin: a light colour represents no or few missing values, and a dark colour represents many missing values.

  4. A nullity correlation heatmap. This visualization depicts how strongly the presence or absence of one variable affects the presence of another. From the Python library missingno: nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).

  5. A dendrogram. The fifth tab displays a dendrogram, which allows one to correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap. The dendrogram uses a hierarchical clustering algorithm to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree, the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.
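
To make the missing spectrum concrete, here is a rough sketch of the binning idea on a hypothetical toy frame. It is not DataPrep's implementation; the bin count and the equal-size contiguous binning are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 8 rows, two columns with gaps.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, np.nan, np.nan, 7.0, 8.0],
    "b": [np.nan, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, np.nan],
})

n_bins = 4  # assumed bin count for this toy example

# Assign each row position to one of n_bins contiguous bins of equal size.
bin_ids = np.arange(len(toy)) * n_bins // len(toy)

# Number of missing values per bin and column; in a spectrum plot,
# a larger count would be drawn as a darker rectangle.
spectrum = toy.isna().groupby(bin_ids).sum()
print(spectrum)
```

Each row of `spectrum` corresponds to one bin of consecutive rows, so a run of dark rectangles in a column indicates that its missing values are clustered in one region of the dataset.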

The following is an example:

[2]:
from dataprep.eda.missing import plot_missing
plot_missing(df)
[2]:
DataPrep.EDA Report

Missing Statistics

Missing Cells: 866
Missing Cells (%): 8.1%
Missing Columns: 3
Missing Rows: 708
Avg Missing Cells per Column: 72.17
Avg Missing Cells per Row: 0.97

Plot parameters shown in the report: 'height': 500 (height of the plot) and 'width': 500 (width of the plot) for each plot, and 'spectrum.bins': 20 (number of bins).
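
The quantities in the statistics table can be approximated with plain pandas. The sketch below follows the definitions given earlier on a hypothetical toy frame; it is an illustration of those definitions, not DataPrep's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows, 3 columns.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 7.0, 8.0],
    "c": [1, 2, 3, 4],
})

mask = toy.isna()  # True where a cell is missing

missing_cells = int(mask.to_numpy().sum())           # total missing cells
missing_cells_pct = 100 * missing_cells / mask.size  # share of all cells, in percent
missing_columns = int(mask.any(axis=0).sum())        # columns with at least one missing cell
missing_rows = int(mask.any(axis=1).sum())           # rows with at least one missing cell
avg_per_column = missing_cells / mask.shape[1]       # missing cells / number of columns
avg_per_row = missing_cells / mask.shape[0]          # missing cells / number of rows

print(missing_cells, missing_cells_pct, missing_columns, missing_rows)
print(avg_per_column, avg_per_row)
```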

Note that the nullity correlation heatmap will be empty if fewer than two columns are partially missing.
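
One way to see why at least two partially missing columns are needed is to compute a nullity correlation by hand. The sketch below is an assumption about how such a matrix can be derived with pandas (correlating the missingness indicators), not a description of DataPrep internals. A column that is fully present or fully missing has a constant indicator, so its correlation is undefined and it is dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],  # missing in the same rows as "a"
    "c": [np.nan, 2.0, np.nan, 4.0],  # missing exactly where "a" is present
    "d": [1, 2, 3, 4],                # complete column
})

mask = toy.isna()

# Keep only partially missing columns (some, but not all, values missing).
partial = mask.loc[:, mask.any() & ~mask.all()]

# Pearson correlation between the missingness indicators.
nullity_corr = partial.astype(float).corr()
print(nullity_corr)
```

Here "a" and "b" correlate at 1, "a" and "c" at -1, and the complete column "d" does not appear in the matrix at all.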

Understand the impact of the missing values in column col1 with plot_missing(df, col1)

After getting an overview of the missing values with plot_missing(df), we can analyze the impact of the missing values in a specific column col1 with plot_missing(df, col1). The impact of the missing values in column col1 is the change in the dataset’s characteristics after removing the rows where column col1’s values are missing. Here, we consider two types of characteristics: the histogram (for numerical columns) and the bar chart (for categorical columns). plot_missing(df, col1) plots the histogram or bar chart (for appropriate column types) for each column before and after removing the rows that contain missing values in column col1.

The following shows an example:

[3]:
plot_missing(df, "Age")
[3]:
DataPrep.EDA Report
Legend: Original data; After dropping missing values
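
The before-and-after comparison for a numerical column can be sketched with NumPy histograms. The data below are hypothetical and the bin count is an arbitrary choice; the key point is that both histograms share the same bin edges so that their counts are comparable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Age" has missing values, "Fare" is complete.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0, 2.0],
    "Fare": [5.0, 80.0, 10.0, 15.0, 50.0, 20.0],
})

before = toy["Fare"]
# Drop the rows where "Age" is missing, then look at "Fare" again.
after = toy.loc[toy["Age"].notna(), "Fare"]

# Compute the bin edges once, from the original data, and reuse them.
edges = np.histogram_bin_edges(before, bins=3)
counts_before, _ = np.histogram(before, bins=edges)
counts_after, _ = np.histogram(after, bins=edges)

print(edges)          # bin edges: 5, 30, 55, 80
print(counts_before)  # counts per bin: 4, 1, 1
print(counts_after)   # counts per bin: 3, 1, 0
```

In this toy case, dropping the rows with a missing "Age" removes the only value in the highest "Fare" bin, which is the kind of shift the two overlaid histograms are meant to reveal.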

Understand the impact of the missing values in column col1 on column col2 with plot_missing(df, col1, col2)

plot_missing(df, col1) only displays the frequency distribution of each column before and after removing the rows containing missing values in column col1. If the user is specifically concerned with the impact of the missing values in one column col1 on another column col2, they can call plot_missing(df, col1, col2), which plots this impact in different ways depending on the type of column col2.

If col2 is a numerical column, plot_missing(df, col1, col2) shows the impact as a histogram, PDF, CDF, and box plot. The following shows an example:

[4]:
plot_missing(df, "Age", "Fare")
[4]:
DataPrep.EDA Report

Plot parameters shown in the report (each plot also has 'height': 400 and 'width': 400, the height and width of the plot):
'hist.bins': 50 (number of bins in the histogram)
'hist.yscale': 'linear' (Y-axis scale, "linear" or "log")
'hist.color': '#aec7e8' (color)
'pdf.sample_size': 100 (number of evenly spaced samples between the minimum and maximum values to compute the pdf at)
'box.color': '#1f77b4' (color)
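
For a numerical col2, the CDF and box plot views can be approximated by hand. The sketch below uses hypothetical data and two hypothetical helpers, `ecdf` and `box_stats`; it evaluates an empirical CDF on evenly spaced points between the minimum and maximum (in the spirit of the sample-size parameter above) and computes the quartiles a box plot would draw.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Age" has missing values, "Fare" is complete.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0, 2.0],
    "Fare": [5.0, 80.0, 10.0, 15.0, 50.0, 20.0],
})

before = toy["Fare"].to_numpy()
after = toy.loc[toy["Age"].notna(), "Fare"].to_numpy()


def ecdf(values, grid):
    """Fraction of values less than or equal to each grid point."""
    return np.searchsorted(np.sort(values), grid, side="right") / len(values)


def box_stats(values):
    """First quartile, median, and third quartile."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return q1, median, q3


# Evenly spaced evaluation points between the minimum and maximum.
grid = np.linspace(before.min(), before.max(), 4)

cdf_before = ecdf(before, grid)
cdf_after = ecdf(after, grid)
print(cdf_before, cdf_after)
print(box_stats(before), box_stats(after))
```

A gap between the two CDF curves, or a shift in the median and quartiles, suggests that the rows with a missing col1 are not a random subset with respect to col2.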

If col2 is a categorical column, plot_missing(df, col1, col2) shows the impact as a bar chart. The following shows an example:

[5]:
plot_missing(df, "Age", "Sex")
[5]:
DataPrep.EDA Report

Plot parameters shown in the report:
'bar.bars': 10 (maximum number of bars to display)
'bar.sort_descending': True (whether to sort the bars in descending order)
'bar.yscale': 'linear' (Y-axis scale, "linear" or "log")
'bar.color': '#1f77b4' (color)
'height': 400 (height of the plot)
'width': 400 (width of the plot)
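
For a categorical col2, the same comparison reduces to category counts before and after dropping the rows where col1 is missing. The following is a minimal sketch on hypothetical data using pandas value_counts; it is not DataPrep's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Age" has missing values, "Sex" is complete.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0, 2.0],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

before = toy["Sex"].value_counts()
# Drop the rows where "Age" is missing, then count the categories again.
after = toy.loc[toy["Age"].notna(), "Sex"].value_counts()

# Align the two counts by category; a category that disappears gets 0.
comparison = (
    pd.DataFrame({"original": before, "after_drop": after})
    .fillna(0)
    .astype(int)
)
print(comparison)
```

In this toy frame both rows with a missing "Age" are "female", so the "female" count falls from 3 to 1 while the "male" count stays at 3.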