create_report generates a profile report from a pandas DataFrame. It builds on the functionality of dataprep and formats dataprep's plots into a single report that provides the following information:
Overview: detect the types of columns in a dataframe
Variables: variable type, unique values, distinct count, missing values
Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Text analysis for length, sample and letter
Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
Missing Values: bar chart, heatmap and spectrum of missing values
Below, we walk through the report section by section.
First, we load the Titanic dataset into a pandas DataFrame and use it to demonstrate the functionality:
from dataprep.datasets import load_dataset
df = load_dataset("titanic")
After loading a dataset, we can generate the report object by calling create_report(df). For example:
from dataprep.eda import create_report
report = create_report(df, title='My Report')
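Once we have a report object, we can show it in the notebook by calling report.show().
To open the report in a browser instead, we call report.show_browser().
To save the report to a local file, we call report.save() with the destination file name.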
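The three output options above can be wrapped in a small helper when the same script has to run both interactively and unattended. The sketch below is only an illustration: deliver_report and its mode names are hypothetical, and it assumes the report object exposes show(), show_browser() and save(), with save() accepting the destination as its first positional argument. Check the signature of save() in your installed dataprep version, since it may differ between releases.

```python
def deliver_report(report, mode="notebook", path=None):
    """Send a report to one of three destinations.

    Hypothetical helper, not part of dataprep. It assumes `report`
    provides show(), show_browser() and save(destination).
    """
    if mode == "notebook":
        report.show()            # render inline in the notebook
    elif mode == "browser":
        report.show_browser()    # open the report in a browser tab
    elif mode == "file":
        if path is None:
            raise ValueError("path is required when mode='file'")
        report.save(path)        # write the report to a local file
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return mode
```

With the report created earlier, deliver_report(report, "file", "my_report") would save it, while deliver_report(report) would display it inline.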
You can see the full report here
In this section, we can see the types of columns and the statistics of the dataset.
In this section, we can see the statistics and plots for each variable in the dataset.
For a numerical variable, the report shows quantile statistics, descriptive statistics, a histogram, a KDE plot, a normal Q-Q plot and a box plot.
For a categorical variable, the report shows text analysis, a bar chart, a pie chart, a word cloud, word frequencies and word lengths.
For a datetime variable, the report shows a line chart.
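To make the numerical statistics concrete, here is a minimal sketch of how the quantile and descriptive values listed for a numerical variable can be computed in plain Python (3.8 or later). It is independent of dataprep: numeric_summary is a hypothetical helper, and the conventions chosen here (inclusive quantiles, sample standard deviation) are assumptions that may not match what the report uses internally.

```python
import statistics


def numeric_summary(values):
    """Quantile and descriptive statistics for a list of numbers.

    None entries are treated as missing and skipped.
    """
    data = sorted(v for v in values if v is not None)
    # Three cut points that split the data into four equal parts.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    mean = statistics.fmean(data)
    stdev = statistics.stdev(data)  # sample standard deviation
    # Median absolute deviation: median distance from the median.
    mad = statistics.median(abs(v - median) for v in data)
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
        "mean": mean,
        "stdev": stdev,
        "sum": sum(data),
        "mad": mad,
        # Coefficient of variation is undefined when the mean is zero.
        "cv": stdev / mean if mean else float("nan"),
    }
```

For a pandas column, something like numeric_summary(df["Age"].dropna().tolist()) would feed the helper, assuming the dataset has a numerical column with that name.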
In this section, the report shows an interactive plot. Use the dropdown menus above the plot to select the two variables to compare.
The plot contains a scatter plot of the two selected variables together with a regression line.
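As a rough illustration of what a regression line over a scatter plot represents, the sketch below fits a straight line by ordinary least squares in plain Python. The source does not state which fitting method the report uses, so ordinary least squares is an assumption here, and fit_line is a hypothetical helper rather than a dataprep function.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x is constant; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The fitted line is then y = slope * x + intercept, drawn across the range of x values in the scatter plot.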
In this section, we can see the correlations between variables as Spearman, Pearson and Kendall matrices.
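The three coefficients measure different things: Pearson captures linear association, while Spearman and Kendall are rank-based and capture monotonic association. The following plain-Python sketch computes each one for a single pair of variables. It is not how dataprep computes the matrices; the function names are hypothetical, and the Kendall variant shown is tau-a, which applies no correction for ties, so results on tied data can differ from library implementations.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For a relationship that is monotonic but not linear, such as y = x squared on positive x, the rank-based coefficients reach 1 while Pearson stays below 1, which is why the report shows all three.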
In this section, we can see the missing values in the dataset through a bar chart, a spectrum plot and a heatmap.
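The bar chart in this section is essentially a per-column count of missing entries. A minimal plain-Python sketch of that count follows, assuming rows are represented as dictionaries and that None and NaN both count as missing. The helper names and the sample column names in the usage note are illustrative only, not taken from dataprep.

```python
def is_missing(value):
    """Treat None and float NaN as missing (NaN is the only float != itself)."""
    return value is None or (isinstance(value, float) and value != value)


def missing_summary(rows):
    """Map each column to (missing count, missing fraction)."""
    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    total = len(rows)
    summary = {}
    for col in columns:
        # A column absent from a row counts as missing for that row.
        missing = sum(1 for row in rows if is_missing(row.get(col)))
        summary[col] = (missing, missing / total if total else 0.0)
    return summary
```

For example, calling missing_summary on a list of rows such as {"age": 22, "cabin": None} returns one (count, fraction) pair per column, which is the information the bar chart visualizes.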