plot_missing(): analyze missing values

Overview

The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. The impact is the change in the dataset’s characteristics (e.g., the histogram of a numerical column or bar chart of a categorical column) after removing the rows with missing values from the dataset. The following describes the functionality of plot_missing() for a given dataframe df.

  1. plot_missing(df): plots the amount and position of missing values, and how missingness in one column relates to missingness in the others

  2. plot_missing(df, col1): plots the impact of the missing values in column col1 on all other columns

  3. plot_missing(df, col1, col2): plots the impact of the missing values in column col1 on column col2 in various ways
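The notion of "impact" used above can be illustrated by hand with plain Pandas. The following is a minimal sketch on a hypothetical toy dataframe (the column names and values are made up for illustration, and this is not how plot_missing() is implemented): it drops the rows where one column is missing and compares the summary statistics of another column before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" has missing values, "fare" is complete.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 54.0, 2.0],
    "fare": [7.25, 71.28, 8.05, 53.10, 51.86, 21.08],
})

# Keep only the rows in which "age" is present.
kept = toy[toy["age"].notna()]

# Compare the summary statistics of "fare" before and after the removal.
impact = pd.DataFrame({
    "before": toy["fare"].describe(),
    "after": kept["fare"].describe(),
})
print(impact)
```

If the two columns of the resulting table differ noticeably, removing the rows with a missing "age" changes the distribution of "fare", which is the kind of effect the col1 and col2 variants of plot_missing() visualize.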

Next, we demonstrate the functionality of plot_missing().

Load the dataset

dataprep.eda supports Pandas and Dask dataframes. Here, we will load the well-known Titanic dataset into a Pandas dataframe.

[1]:
from dataprep.datasets import load_dataset
import numpy as np
df = load_dataset('titanic')
# Convert " ?" placeholder strings, if any, to NaN (np.NaN was removed in NumPy 2.0)
df = df.replace(" ?", np.nan)
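
The replace step matters because a placeholder string such as " ?" is not counted as missing by Pandas. A minimal sketch on a hypothetical toy dataframe (not the Titanic data) shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that uses " ?" as its missing-value placeholder.
raw = pd.DataFrame({
    "age": ["22", " ?", "35"],
    "cabin": ["C85", " ?", " ?"],
})

# Before the replacement, Pandas sees no missing values at all.
print(raw.isna().sum().sum())

# After the replacement, the placeholders are counted as missing.
cleaned = raw.replace(" ?", np.nan)
print(cleaned.isna().sum())
```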

Get an overview of the missing values with plot_missing(df)

plot_missing(df) generates five visualizations, each giving a different view of the missing values in the dataset:

  1. A statistics table. This table shows missing-value statistics for the entire dataframe. “Missing Cell” is the total number of missing cells in the whole dataframe. “Missing Cell (%)” is the percentage of cells in the whole dataframe that are missing. “Missing Columns” and “Missing Rows” are the numbers of columns and rows that contain at least one missing cell. “Avg Missing Cells per Column” and “Avg Missing Cells per Row” are the average numbers of missing cells per column and per row.

  2. A bar chart depicting the number of missing values in each column. An insight tab in the upper right-hand corner shows the names of the columns and rows containing the most missing values, as well as their missing rates.

  3. A missing spectrum plot. In this visualization, the dataset is divided into bins, and each bin corresponds to a rectangle in the plot. Each rectangle is shaded in grayscale according to the number of missing values in its bin: a light colour represents few or no missing values, and a dark colour represents many missing values.

  4. A nullity correlation heatmap. This visualization depicts how strongly the presence or absence of one variable affects the presence of another. From the Python library missingno: Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).

  5. A dendrogram. It allows one to correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap. The dendrogram uses a hierarchical clustering algorithm to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree, the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.
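To make the statistics table and the nullity correlation concrete, here is a minimal sketch that computes comparable quantities with plain Pandas on a hypothetical toy dataframe. The formulas are one plausible reading of the descriptions above and are an assumption; they are not taken from the dataprep implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberately structured missingness.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],  # missing exactly where "a" is missing
    "c": [np.nan, 2.0, np.nan, 4.0],  # missing exactly where "a" is present
    "d": [1.0, 2.0, 3.0, 4.0],        # complete
})

mask = toy.isna()  # True wherever a cell is missing

# Assumed definitions of the entries in the statistics table.
stats = {
    "Missing Cell": int(mask.to_numpy().sum()),
    "Missing Cell (%)": 100 * mask.to_numpy().mean(),
    "Missing Columns": int(mask.any(axis=0).sum()),
    "Missing Rows": int(mask.any(axis=1).sum()),
    "Avg Missing Cells per Column": mask.sum(axis=0).mean(),
    "Avg Missing Cells per Row": mask.sum(axis=1).mean(),
}
print(stats)

# Nullity correlation: Pearson correlation between the 0/1 missingness
# indicators. Columns that are entirely complete or entirely missing have
# zero variance, so they are dropped before correlating.
varying = mask.loc[:, mask.any(axis=0) & ~mask.all(axis=0)]
nullity_corr = varying.astype(int).corr()
print(nullity_corr)
```

In this toy example, "a" and "b" are always missing together, so their nullity correlation is 1, while "a" and "c" are never missing together, so theirs is -1. The complete column "d" does not appear in the correlation matrix.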

The following is an example:

[2]:
from dataprep.eda.missing import plot_missing
plot_missing(df)
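
The clustering behind a nullity dendrogram can be approximated with SciPy. The sketch below is an illustration on a hypothetical toy dataframe, assuming average linkage and SciPy's default Euclidean distance on the 0/1 missingness patterns; the linkage method and distance used by plot_missing() itself may differ.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Hypothetical toy data with deliberately structured missingness.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],  # same missingness pattern as "a"
    "c": [np.nan, 2.0, np.nan, 4.0],  # opposite missingness pattern
    "d": [1.0, 2.0, 3.0, 4.0],        # complete
})

# One row per dataframe column, holding that column's 0/1 missingness pattern.
patterns = toy.isna().astype(int).to_numpy().T

# Agglomerative clustering of the columns (assumed: average linkage,
# Euclidean distance between missingness patterns).
merges = hierarchy.linkage(patterns, method="average")

# Compute the dendrogram layout without drawing it.
tree = hierarchy.dendrogram(merges, labels=list(toy.columns), no_plot=True)
print(tree["ivl"])  # leaf order along the dendrogram axis
print(merges)       # each row: the two clusters merged, their distance, new cluster size
```

Columns "a" and "b" share an identical missingness pattern, so they are merged first at distance zero, which corresponds to the lowest join in the dendrogram.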