# plot_correlation(): analyze correlations

## Overview

The function plot_correlation() explores the correlations between columns in several ways, using multiple correlation metrics. The following describes what plot_correlation() does for a given dataframe df.

1. plot_correlation(df): plots correlation matrices (correlations between all pairs of columns)

2. plot_correlation(df, col1): plots the columns most correlated with column col1

3. plot_correlation(df, col1, col2): plots the joint distribution of column col1 and column col2 and computes a regression line

The following table summarizes the output plots for different settings of col1 and col2.

| col1 | col2 | Output |
|---|---|---|
| None | None | n*n correlation matrix, computed with Pearson, Spearman, and KendallTau correlation coefficients |
| Numerical | None | n*1 correlation matrix, computed with Pearson, Spearman, and KendallTau correlation coefficients |
| Categorical | None | TODO |
| Numerical | Numerical | scatter plot with a regression line |
| Numerical | Categorical | TODO |
| Categorical | Numerical | TODO |
| Categorical | Categorical | TODO |
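
To make the three coefficients named in the table more concrete, here is a minimal pure-Python sketch of how each one can be computed for a single pair of columns. It is an illustration only, not dataprep's implementation: the helper names (`pearson`, `ranks`, `spearman`, `kendall_tau`) and the sample data are assumptions made for this example, and the Kendall variant shown is tau-b, which may or may not match the variant the library uses.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-b: based on concordant and discordant pairs of rows."""
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both columns: ignored by tau-b
            if dx == 0:
                tied_x += 1
            elif dy == 0:
                tied_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denominator = sqrt(
        (concordant + discordant + tied_x) * (concordant + discordant + tied_y)
    )
    return (concordant - discordant) / denominator


# Hypothetical sample columns: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
print("Kendall: ", round(kendall_tau(x, y), 3))
```

On this sample the two rank-based coefficients reach 1 because the relationship is perfectly monotonic, while Pearson stays slightly below 1 (about 0.98) because the relationship is not linear. This difference is one reason to look at all three coefficients side by side.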

Next, we demonstrate the functionality of plot_correlation().

dataprep.eda supports Pandas and Dask dataframes. Here, we will load the well-known wine quality dataset into a Pandas dataframe.

[1]:

from dataprep.datasets import load_dataset

## Get an overview of the correlations with plot_correlation(df)

We start by calling plot_correlation(df) to compute statistics and correlation matrices using the Pearson, Spearman, and KendallTau correlation coefficients. The Stats tab lists four statistics for each of the three correlation coefficients. The other three tabs each show a lower triangular correlation matrix, one per coefficient. In each matrix, a cell represents the correlation value between two columns. There is an “insight” tab (!) in the upper right-hand corner of each matrix, which shows insights about the correlations. The following shows an example:

[2]:

from dataprep.eda import plot_correlation