`plot_missing()`: analyze missing values

Overview

The function plot_missing() analyzes the missing values in a dataset and their impact on it. Impact here means the change in the dataset’s characteristics (e.g., the histogram of a numerical column or the bar chart of a categorical column) after the rows with missing values are removed. For a given dataframe df, plot_missing() can be called in three ways:

plot_missing(df): plots the amount and position of missing values, and how missingness relates across columns
plot_missing(df, col1): plots the impact of the missing values in column col1 on all other columns
plot_missing(df, col1, col2): plots the impact of the missing values in column col1 on column col2 in various ways
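
To make the notion of "impact" concrete, the following is a minimal sketch in plain pandas, not the plot_missing() implementation. It compares summary statistics of one column before and after dropping the rows in which another column is missing. The toy dataframe and its column names (`age`, `fare`) are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" happens to be missing mostly for low fares.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 54.0, 41.0],
    "fare": [7.25, 8.05, 53.10, 7.90, 51.86, 30.00],
})

# Summary of "fare" over all rows.
before = toy["fare"].describe()

# Summary of "fare" after removing the rows where "age" is missing.
after = toy.dropna(subset=["age"])["fare"].describe()

# Side-by-side comparison; a large shift suggests the missing values
# in "age" are not spread evenly over the range of "fare".
impact = pd.DataFrame({"before": before, "after": after})
print(impact)
```

In this toy example the mean fare rises once the rows with a missing age are removed, which is the kind of shift the impact plots are meant to expose.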

Next, we demonstrate the functionality of plot_missing().

Load the dataset

dataprep.eda supports Pandas and Dask dataframes. Here, we will load the well-known Titanic dataset into a Pandas dataframe.

[1]:

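from dataprep.datasets import load_dataset
import numpy as np

# Load the Titanic dataset bundled with dataprep into a Pandas dataframe
df = load_dataset('titanic')
# Treat the " ?" placeholder string as a missing value
df = df.replace(" ?", np.nan)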

Get an overview of the missing values with `plot_missing(df)`

plot_missing(df) generates five visualizations, each giving a different view of the missing values in the dataset:

1. A statistics table. This table shows missing-value statistics for the entire dataframe. “Missing Cells” is the total number of missing cells in the whole dataframe. “Missing Cells (%)” is the percentage of cells that are missing. “Missing Columns” and “Missing Rows” are the numbers of columns and rows that contain at least one missing cell. “Avg Missing Cells per Column” and “Avg Missing Cells per Row” are the average numbers of missing cells per column and per row.
2. A bar chart depicting the amount of missing values in each column. An insight tab in the upper right-hand corner shows the names of the columns and rows containing the most missing values, as well as their missing rates.
3. A missing spectrum plot. The dataset is divided into bins, and each bin corresponds to a rectangle in the plot. Each rectangle is shaded in grayscale according to the number of missing values in its bin: a light colour represents no or few missing values, and a dark colour represents many.
4. A nullity correlation heatmap. This visualization depicts how strongly the presence or absence of one variable affects the presence of another. From the Python library missingno: Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).
5. A dendrogram. It allows one to correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap. The dendrogram uses a hierarchical clustering algorithm to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.

The following is an example:

[2]:

from dataprep.eda.missing import plot_missing
plot_missing(df)

[2]:

DataPrep.EDA Report

Stats Bar Chart Spectrum Heat Map Dendrogram

Missing Statistics

Missing Cells	866
Missing Cells (%)	8.1%
Missing Columns	3
Missing Rows	708
Avg Missing Cells per Column	72.17
Avg Missing Cells per Row	0.97
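
For intuition, the entries of the statistics table, the per-column counts behind the bar chart, and the per-bin counts behind the spectrum can be approximated with plain pandas. This is a rough sketch, not the code dataprep runs: the toy dataframe, its column names, and the `n_bins` variable are hypothetical, and the spectrum bins are assumed to be contiguous row ranges.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, None],
    "c": [10, 20, 30, 40],
})

mask = toy.isna()                 # True where a cell is missing
n_rows, n_cols = toy.shape
n_missing = int(mask.values.sum())

# Entries named after the rows of the statistics table.
stats = {
    "Missing Cells": n_missing,
    "Missing Cells (%)": 100 * n_missing / (n_rows * n_cols),
    "Missing Columns": int(mask.any(axis=0).sum()),   # columns with >= 1 missing cell
    "Missing Rows": int(mask.any(axis=1).sum()),      # rows with >= 1 missing cell
    "Avg Missing Cells per Column": n_missing / n_cols,
    "Avg Missing Cells per Row": n_missing / n_rows,
}

# Bar chart: number of missing values in each column.
per_column = mask.sum()

# Spectrum: split the rows into contiguous bins (an assumption) and
# count the missing values of each column within each bin.
n_bins = 2
bin_ids = np.arange(n_rows) * n_bins // n_rows
spectrum = mask.groupby(bin_ids).sum()

print(stats)
print(per_column)
print(spectrum)
```

Applied to the table above, the same definitions are consistent with each other: 866 missing cells at an average of 72.17 per column implies 12 columns.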

Plot configuration shown in the report: 'height': 500 (height of the plot), 'width': 500 (width of the plot), 'spectrum.bins': 20 (number of bins).

Note that the nullity correlation heatmap will be empty if fewer than two columns are partially missing.
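
The reason is that a correlation needs at least two columns whose missingness indicator actually varies. The sketch below illustrates this with plain pandas on a made-up dataframe; it approximates the nullity correlation as the Pearson correlation of the 0/1 missingness indicators, and the names `partial` and `has_heatmap` are hypothetical, not part of dataprep.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0],   # missing exactly when "a" is present
    "c": [np.nan, 5.0, 6.0, np.nan],   # missingness unrelated to "a"
    "d": [1, 2, 3, 4],                 # never missing
})

mask = toy.isna()

# Partially missing columns: at least one missing cell, but not all.
# Only these have a missingness indicator with non-zero variance.
partial = mask.columns[mask.any() & ~mask.all()]

# A pairwise heatmap needs at least two such columns.
has_heatmap = len(partial) >= 2

# Pearson correlation between the 0/1 missingness indicators.
nullity_corr = mask[partial].astype(int).corr()
print(nullity_corr)
```

Here `a` and `b` have a nullity correlation of -1, `a` and `c` have 0, and the fully present column `d` is excluded because its indicator is constant.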
