plot_correlation(): analyze correlations
Overview
The function plot_correlation() explores the correlations between columns in several ways, using multiple correlation metrics. The following describes the functionality of plot_correlation() for a given dataframe df.
plot_correlation(df): plots correlation matrices (correlations between all pairs of columns)
plot_correlation(df, col1): plots the columns most correlated with column col1
plot_correlation(df, col1, col2): plots the joint distribution of column col1 and column col2 and computes a regression line
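To illustrate what computing a regression line for two numerical columns involves, here is a minimal sketch of an ordinary least-squares fit in plain Python. The source does not state which fitting method plot_correlation() uses, so least squares is an assumption, and the helper name fit_line and the sample values are hypothetical.

```python
def fit_line(x, y):
    """Fit y = slope * x + intercept by ordinary least squares.

    Hypothetical helper for illustration; not part of dataprep.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values that lie exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The fitted line is what would be drawn over the scatter plot of the two columns.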
The following table summarizes the output plots for different settings of col1 and col2.


col1        | col2        | Output
------------|-------------|-------------------------------------------------------------------------------------------------
None        | None        | n*n correlation matrix, computed with Pearson, Spearman, and KendallTau correlation coefficients
Numerical   | None        | n*1 correlation matrix, computed with Pearson, Spearman, and KendallTau correlation coefficients
Categorical | None        | TODO
Numerical   | Numerical   | scatter plot with a regression line
Numerical   | Categorical | TODO
Categorical | Numerical   | TODO
Categorical | Categorical | TODO
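The three coefficients named in the table measure different things: Pearson captures linear association, while Spearman and Kendall's tau are rank-based and capture monotonic association. The following plain-Python sketch shows one common way to compute each for a pair of columns. It is not dataprep's implementation. The function names and sample values are hypothetical, and using the tau-b variant of Kendall's tau is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson coefficient: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson coefficient of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, adjusted for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both columns: counted nowhere
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))


# Made-up, monotonic but nonlinear data: y is the cube of x.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print("Pearson: ", pearson(x, y))
print("Spearman:", spearman(x, y))
print("Kendall: ", kendall_tau(x, y))
```

On data like this, the rank-based coefficients report a perfect monotonic relationship while Pearson stays below 1, which is one reason to inspect all three.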
Next, we demonstrate the functionality of plot_correlation().
Load the dataset
dataprep.eda supports Pandas and Dask dataframes. Here, we load the well-known wine quality dataset into a Pandas dataframe.
[1]:
from dataprep.datasets import load_dataset
df = load_dataset("wine-quality-red")
Get an overview of the correlations with plot_correlation(df)
We start by calling plot_correlation(df) to compute the statistics and correlation matrices using the Pearson, Spearman, and KendallTau correlation coefficients. The Stats tab lists four statistics for each of the three correlation coefficients. The other three tabs show the lower triangular correlation matrices, one per coefficient. In each matrix, a cell represents the correlation value between two columns. An “insight” tab (!) in the upper right-hand corner of each matrix shows insight information for that matrix. The following shows an example:
[2]:
from dataprep.eda import plot_correlation
plot_correlation(df)
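Because a correlation matrix is symmetric, the lower triangle carries all of its information. The sketch below builds such a triangle for the Pearson coefficient in plain Python. It is only an illustration of the idea, not dataprep's code: the function names, the toy columns a, b, and c, and the choice to keep the diagonal are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def lower_triangular_correlations(columns):
    """Map (row_name, col_name) to a correlation, on or below the diagonal only."""
    names = list(columns)
    cells = {}
    for i, row in enumerate(names):
        for col in names[: i + 1]:
            cells[(row, col)] = pearson(columns[row], columns[col])
    return cells


# Made-up toy columns; not taken from the wine quality dataset.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
cells = lower_triangular_correlations(columns)
for (row, col), value in cells.items():
    print(f"{row} vs {col}: {value:+.2f}")
```

For n columns this yields n * (n + 1) / 2 cells instead of n * n, since each mirrored pair above the diagonal is omitted.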