plot_missing()
The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. The impact is the change in the dataset’s characteristics (e.g., the histogram of a numerical column or bar chart of a categorical column) after removing the rows with missing values from the dataset. The following describes the functionality of plot_missing() for a given dataframe df.
plot_missing(df): plots the amount and position of missing values, and how missingness in one column relates to missingness in the others
plot_missing(df, col1): plots the impact of the missing values in column col1 on all other columns
plot_missing(df, col1, col2): plots the impact of the missing values from column col1 on column col2 in various ways.
Next, we demonstrate the functionality of plot_missing().
dataprep.eda supports Pandas and Dask dataframes. Here, we will load the well-known Titanic dataset into a Pandas dataframe.
[1]:
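from dataprep.datasets import load_dataset
import numpy as np

df = load_dataset('titanic')
# Treat the " ?" placeholder as missing (np.nan replaces the np.NaN alias, which NumPy 2.0 removed)
df = df.replace(" ?", np.nan)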
plot_missing(df) will generate five visualizations that lead to different understandings of the missing values in the dataset:

1. A statistics table. This table shows the missing-value statistics for the entire dataframe. “Missing Cell” represents the total number of missing cells in the whole dataframe. “Missing Cell (%)” represents the percentage of missing cells in the whole dataframe. “Missing Columns” and “Missing Rows” represent the number of columns/rows that contain at least one missing cell. “Avg Missing Cells per Column” and “Avg Missing Cells per Row” represent the average number of missing cells within one column/row.
2. A bar chart depicting the number of missing values in each column. There is an insight tab in the upper right-hand corner, which shows the names of the columns and rows containing the most missing values, as well as their missing rates.
3. A missing spectrum plot. In this visualization, the dataset is divided into bins, and each bin corresponds to a rectangle in the plot. Each rectangle is then shaded in grayscale according to the number of missing values in the bin. A light colour represents no or few missing values, and a dark colour represents many missing values.
4. A nullity correlation heatmap. This visualization depicts how strongly the presence or absence of one variable affects the presence of another. From the Python library missingno: nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).
5. A dendrogram. It allows one to correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap. The dendrogram uses a hierarchical clustering algorithm to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree, the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.
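To make the entries of the statistics table concrete, the following is a minimal sketch that computes the same quantities with plain pandas. It is not the code dataprep runs internally; the helper name missing_summary and the toy dataframe are hypothetical, and the sketch assumes a non-empty dataframe.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> dict:
    """Hypothetical helper: keys mirror the labels of the statistics table."""
    mask = df.isna()  # True where a cell is missing
    n_missing = int(mask.to_numpy().sum())
    n_rows, n_cols = df.shape  # assumes the dataframe is non-empty
    return {
        "Missing Cell": n_missing,
        "Missing Cell (%)": 100.0 * n_missing / (n_rows * n_cols),
        "Missing Columns": int(mask.any(axis=0).sum()),  # columns with >= 1 missing cell
        "Missing Rows": int(mask.any(axis=1).sum()),     # rows with >= 1 missing cell
        "Avg Missing Cells per Column": n_missing / n_cols,
        "Avg Missing Cells per Row": n_missing / n_rows,
    }


# Toy data for illustration only (not the Titanic dataset)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": [None, "C85", None, None],
    "fare": [7.25, 71.28, 8.05, 8.46],
})

summary = missing_summary(toy)
per_column = toy.isna().sum()  # the counts that the bar chart would display
print(summary)
print(per_column)
```

The per-column counts in per_column correspond to what the bar chart in the second visualization shows.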
The following is an example:
[2]:
from dataprep.eda.missing import plot_missing
plot_missing(df)
Note that the nullity correlation heatmap will be empty if fewer than two columns are partially missing.
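The reason for this restriction becomes clearer when nullity correlation is written out. The sketch below assumes the missingno-style definition, that is, the Pearson correlation between per-column presence indicators; the helper name nullity_correlation and the toy data are hypothetical, and dataprep's own implementation may differ in detail.

```python
import numpy as np
import pandas as pd


def nullity_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: correlation between presence indicators of columns."""
    present = df.notna()  # True where a value is present
    # A fully present or fully missing column has a constant indicator, so its
    # correlation is undefined; keep only the partially missing columns.
    partial = [c for c in df.columns if 0 < present[c].sum() < len(df)]
    return present[partial].astype(float).corr()


# Toy data for illustration only
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],  # missing in the same rows as "a"
    "c": [np.nan, 2.0, np.nan, 4.0],  # present exactly where "a" is missing
    "d": [1.0, 2.0, 3.0, 4.0],        # complete, so it is excluded
})

corr = nullity_correlation(toy)
print(corr)
```

Here "a" and "b" correlate at 1, "a" and "c" at -1, and the complete column "d" does not appear at all. With fewer than two partially missing columns there are no pairs left to correlate.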
After getting an overview of the missing values with plot_missing(df), we can analyze the impact of the missing values in a specific column col1 with plot_missing(df, col1). The impact of the missing values in column col1 is the change in the dataset’s characteristics after removing the rows where column col1’s values are missing. Here, we consider two types of characteristics: the histogram (for numerical columns) and the bar chart (for categorical columns). plot_missing(df, col1) plots the histogram or bar chart (for appropriate column types) for each column before and after removing the rows that contain missing values in column col1.
The following shows an example:
[3]:
plot_missing(df, "Age")
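The before/after comparison described above can be approximated by hand. The following sketch is an illustration under stated assumptions, not dataprep's implementation: the helper impact_of_missing, the bins parameter, and the toy dataframe are hypothetical. Numerical columns are summarised with histogram counts over shared bin edges, and all other columns with value counts.

```python
import numpy as np
import pandas as pd


def impact_of_missing(df: pd.DataFrame, col1: str, bins: int = 5) -> dict:
    """Hypothetical helper: per-column counts before and after dropping
    the rows in which col1 is missing."""
    kept = df[df[col1].notna()]  # rows that survive the removal
    result = {}
    for col in df.columns:
        if col == col1:
            continue  # only the other columns are compared
        if pd.api.types.is_numeric_dtype(df[col]):
            values = df[col].dropna()
            # Use the same bin edges for both histograms so they are comparable
            edges = np.histogram_bin_edges(values, bins=bins)
            before, _ = np.histogram(values, bins=edges)
            after, _ = np.histogram(kept[col].dropna(), bins=edges)
        else:
            before = df[col].value_counts()
            after = kept[col].value_counts().reindex(before.index, fill_value=0)
        result[col] = (before, after)
    return result


# Toy data for illustration only (not the Titanic dataset)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "fare": [7.0, 70.0, 8.0, 9.0, 50.0],
    "sex": ["male", "female", "female", "male", "male"],
})

impact = impact_of_missing(toy, "age", bins=2)
fare_before, fare_after = impact["fare"]
sex_before, sex_after = impact["sex"]
print(fare_before, fare_after)
print(sex_before, sex_after)
```

A large shift between the two sets of counts for a column suggests that the values in col1 are not missing at random with respect to that column.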
plot_missing(df, col1) only displays the frequency distribution of each column before and after removing the rows containing missing values in column col1. If the user is specifically concerned with the impact of the missing values in one column col1 on another column col2, they can call plot_missing(df, col1, col2), which plots the impact of the missing values in column col1 on column col2 in different ways depending on the type of column col2.
If col2 is a numerical column, plot_missing(df, col1, col2) shows the impact as a histogram, PDF, CDF, and box plot. The following shows an example:
[4]:
plot_missing(df, "Age", "Fare")
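For a numerical col2, the CDF and box plot views boil down to comparing the empirical distribution of col2 with and without the rows where col1 is missing. The sketch below illustrates that comparison with an empirical CDF and a five-number summary; the helpers ecdf and box_stats and the toy data are hypothetical, and the exact statistics that dataprep draws (for example, its whisker rule) may differ.

```python
import numpy as np
import pandas as pd


def ecdf(values: pd.Series):
    """Hypothetical helper: empirical CDF as sorted values and cumulative shares."""
    x = np.sort(values.dropna().to_numpy())
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


def box_stats(values: pd.Series) -> pd.Series:
    """Hypothetical helper: min, quartiles, and max (a five-number summary)."""
    return values.dropna().quantile([0.0, 0.25, 0.5, 0.75, 1.0])


# Toy data for illustration only (not the Titanic dataset)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "fare": [7.0, 70.0, 8.0, 9.0, 50.0],
})

before = toy["fare"]                          # all rows
after = toy.loc[toy["age"].notna(), "fare"]   # rows where "age" is present

x_after, y_after = ecdf(after)
stats_before = box_stats(before)
stats_after = box_stats(after)
print(stats_before)
print(stats_after)
```

In this toy example the median of fare moves from 9 to 8 once the rows with a missing age are dropped, which is the kind of shift the box plot makes visible.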
If col2 is a categorical column, plot_missing(df, col1, col2) shows the impact as a bar chart. The following shows an example:
[5]:
plot_missing(df, "Age", "Sex")