Customize your output

Overview

DataPrep lets you customize the output of plot(), plot_missing(), plot_correlation() and create_report(). Two arguments control this: display and config.

  1. display is a list of names that controls which Tabs, Sections and Sessions are shown.

  2. config is a dictionary that maps the customizable parameters to the values you choose.

For convenience, the names passed to display can be copied directly from the Tabs. Displaying less content also saves computation.

For config, we developed a how-to guide to help you manage the frequently used parameters. Click the question mark icon in the upper right corner of each plot; the pop-up lists the customizable parameters for that plot, with a brief description and the default setting of each. Use the Copy All Parameters button to copy the parameters with their default settings as a dictionary, then customize the settings and pass the dictionary to the config argument.
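The copy-then-customize workflow can be sketched as follows. The keys and default values are the bar chart parameters listed in this guide; plot_with_config is a hypothetical wrapper added here only to show where the dictionary goes, and it imports DataPrep lazily so the dictionary handling can be tried on its own.

```python
# Defaults as they would be copied from the bar chart's how-to guide
# (keys and values are the ones listed in this guide).
copied_defaults = {
    'bar.bars': 10,
    'bar.sort_descending': True,
    'bar.yscale': 'linear',
    'bar.color': '#1f77b4',
    'height': 400,
    'width': 450,
}

# Override only the settings you care about; the copied dictionary stays untouched.
my_config = {**copied_defaults, 'bar.bars': 5, 'bar.yscale': 'log'}


def plot_with_config(df, column, config):
    """Hypothetical wrapper: forwards the dictionary to plot() through config."""
    from dataprep.eda import plot  # imported here so the code above runs without DataPrep
    return plot(df, column, config=config)
```

With DataPrep installed and a DataFrame df loaded, plot_with_config(df, 'Pclass', my_config) should be equivalent to calling plot(df, 'Pclass', config=my_config) directly.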

Global and local parameters

There are two types of parameters, global and local.

  1. Local parameters are plot-specific, and their names are separated by a period (.). The portion before the first . is the plot name and the portion after it is the parameter name, e.g. bar.bars.

  2. A global parameter applies to every plot that has that parameter. Its name is a single word, e.g. ngroups.

When a global and a local parameter are both given, the local parameter overrides the global one for that specific plot. You can find more details about parameters in parameter_configurations.
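The naming and precedence rules can be illustrated with a small sketch. This is not DataPrep's implementation: split_local_name and resolve are hypothetical helpers, and 'bar.height' is assumed here to be the local counterpart of the global 'height' purely for illustration.

```python
def split_local_name(name):
    """Split 'bar.bars' into ('bar', 'bars'); a global name yields (None, name)."""
    plot_name, sep, param = name.partition('.')  # splits at the first period only
    if not sep:
        return None, name
    return plot_name, param


def resolve(config, plot_name, param, default=None):
    """Return the value a plot would use: local first, then global, then default."""
    local_key = f'{plot_name}.{param}'
    if local_key in config:
        return config[local_key]
    return config.get(param, default)


# 'bar.height' is an assumed local parameter name, used for illustration only.
config = {'height': 400, 'bar.height': 300}
bar_height = resolve(config, 'bar', 'height')  # the local value wins for the bar plot
pie_height = resolve(config, 'pie', 'height')  # other plots fall back to the global value
```

Under these assumptions the bar plot gets a height of 300, while every other plot keeps the global height of 400.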

Example 1: Choose the Tabs, Sections and Sessions you want

[1]:
from dataprep.eda import plot, create_report
from dataprep.datasets import load_dataset

# Load the Titanic dataset and show only the listed tabs for the Pclass column
df = load_dataset('titanic')
plot(df, 'Pclass', display=['Stats', 'Bar Chart', 'Pie Chart'])
[1]:
DataPrep.EDA Report

Overview

Approximate Distinct Count: 3
Approximate Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory Size: 57.4 KB

Length

Mean: 1
Standard Deviation: 0
Median: 1
Minimum: 1
Maximum: 1

Sample

1st row: 3
2nd row: 1
3rd row: 3
4th row: 1
5th row: 3

Letter

Count: 0
Lowercase Letter: 0
Space Separator: 0
Uppercase Letter: 0
Dash Punctuation: 0
Decimal Number: 891

How-to guide parameters (bar)

'bar.bars': 10
    Maximum number of bars to display
'bar.sort_descending': True
    Whether to sort the bars in descending order
'bar.yscale': 'linear'
    Y-axis scale ("linear" or "log")
'bar.color': '#1f77b4'
    Color
'height': 400
    Height of the plot
'width': 450
    Width of the plot

How-to guide parameters (pie)

'pie.slices': 10
    Maximum number of pie slices to display
'pie.sort_descending': True
    Whether to sort the slices in descending order of frequency
'pie.colors': ['#1f77b4', '#aec7e8', '#ff7f0e']
    List of colors
'height': 400
    Height of the plot
'width': 450
    Width of the plot
[2]:
create_report(df, display=["Overview", "Interactions"])
[2]:
DataPrep Report

Overview

Dataset Statistics

Number of Variables 12
Number of Rows 891
Missing Cells 866
Missing Cells (%) 8.1%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 315.0 KB
Average Row Size in Memory 362.1 B
Variable Types
  • Numerical: 3
  • Categorical: 9

Dataset Insights

PassengerId is uniformly distributed Uniform
Age has 177 (19.87%) missing values Missing
Cabin has 687 (77.1%) missing values Missing
Fare is skewed Skewed
Name has a high cardinality: 891 distinct values High Cardinality
Ticket has a high cardinality: 681 distinct values High Cardinality
Cabin has a high cardinality: 147 distinct values High Cardinality
Survived has constant length 1 Constant Length
Pclass has constant length 1 Constant Length
SibSp has constant length 1 Constant Length
Parch has constant length 1 Constant Length
Embarked has constant length 1 Constant Length
Name has all distinct values Unique

Interactions

[3]:
plot(df, display=["Stats", "Insights"])
[3]:
DataPrep.EDA Report
Dataset Statistics
Number of Variables 12
Number of Rows 891
Missing Cells 866
Missing Cells (%) 8.1%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 315.0 KB
Average Row Size in Memory 362.1 B
Variable Types
  • Numerical: 3
  • Categorical: 9
Dataset Insights
PassengerId is uniformly distributed Uniform
Age has 177 (19.87%) missing values Missing
Cabin has 687 (77.1%) missing values Missing
Fare is skewed Skewed
Name has a high cardinality: 891 distinct values High Cardinality
Ticket has a high cardinality: 681 distinct values High Cardinality
Cabin has a high cardinality: 147 distinct values High Cardinality
Survived has constant length 1 Constant Length
Pclass has constant length 1 Constant Length
SibSp has constant length 1 Constant Length
Parch has constant length 1 Constant Length
Embarked has constant length 1 Constant Length
Name has all distinct values Unique

Example 2: Customize your plot

[4]:
plot(df, "Pclass", config={'bar.bars': 10, 'bar.sort_descending': True, 'bar.yscale': 'linear', 'height': 400, 'width': 450})
[4]:
DataPrep.EDA Report

Overview

Approximate Distinct Count: 3
Approximate Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory Size: 57.4 KB

Length

Mean: 1
Standard Deviation: 0
Median: 1
Minimum: 1
Maximum: 1

Sample

1st row: 3
2nd row: 1
3rd row: 3
4th row: 1
5th row: 3

Letter

Count: 0
Lowercase Letter: 0
Space Separator: 0
Uppercase Letter: 0
Dash Punctuation: 0
Decimal Number: 891

How-to guide parameters (bar)

'bar.bars': 10
    Maximum number of bars to display
'bar.sort_descending': True
    Whether to sort the bars in descending order
'bar.yscale': 'linear'
    Y-axis scale ("linear" or "log")
'bar.color': '#1f77b4'
    Color
'height': 400
    Height of the plot
'width': 450
    Width of the plot
  • The largest value (3) is over 2.27 times larger than the second largest value (1)

How-to guide parameters (pie)

'pie.slices': 10
    Maximum number of pie slices to display
'pie.sort_descending': True
    Whether to sort the slices in descending order of frequency
'pie.colors': ['#1f77b4', '#aec7e8', '#ff7f0e']
    List of colors
'height': 400
    Height of the plot
'width': 450
    Width of the plot
  • The top 2 categories (3, 1) take over 50.0%

How-to guide parameters (wordcloud)

'wordcloud.top_words': 30
    Maximum number of most frequent words to display
'wordcloud.stopword': True
    Whether to remove stopwords
'wordcloud.lemmatize': False
    Whether to lemmatize the words
'wordcloud.stem': False
    Whether to apply Porter stemming to the words
'height': 400
    Height of the plot
'width': 450
    Width of the plot

How-to guide parameters (wordfreq)

'wordfreq.top_words': 30
    Maximum number of most frequent words to display
'wordfreq.stopword': True
    Whether to remove stopwords
'wordfreq.lemmatize': False
    Whether to lemmatize the words
'wordfreq.stem': False
    Whether to apply Porter stemming to the words
'wordfreq.color': '#1f77b4'
    Color
'height': 400
    Height of the plot
'width': 450
    Width of the plot
  • The largest value (3) is over 2.27 times larger than the second largest value (1)
  • Pclass has words of constant length

How-to guide parameters (wordlen)

'wordlen.bins': 50
    Number of bins in the histogram
'wordlen.yscale': 'linear'
    Y-axis scale ("linear" or "log")
'wordlen.color': '#aec7e8'
    Color
'height': 400
    Height of the plot
'width': 450
    Width of the plot

How-to guide parameters (value_table)

'value_table.ngroups': 10
    The number of distinct values to show

Value  Count  Frequency (%)
3      491    55.1%
1      216    24.2%
2      184    20.7%

Example 3: Customize your Insights

[5]:
plot(df, config={'insight.missing.threshold': 20, 'insight.duplicates.threshold': 20})
[5]:
DataPrep.EDA Report
Dataset Statistics
Number of Variables 12
Number of Rows 891
Missing Cells 866
Missing Cells (%) 8.1%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 315.0 KB
Average Row Size in Memory 362.1 B
Variable Types
  • Numerical: 3
  • Categorical: 9
Dataset Insights
PassengerId is uniformly distributed Uniform
Cabin has 687 (77.1%) missing values Missing
Fare is skewed Skewed
Name has a high cardinality: 891 distinct values High Cardinality
Ticket has a high cardinality: 681 distinct values High Cardinality
Cabin has a high cardinality: 147 distinct values High Cardinality
Survived has constant length 1 Constant Length
Pclass has constant length 1 Constant Length
SibSp has constant length 1 Constant Length
Parch has constant length 1 Constant Length
Embarked has constant length 1 Constant Length
Name has all distinct values Unique
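
In the output above, Age (19.87% missing) no longer appears among the insights once the missing-value threshold is set to 20, while Cabin (77.1% missing) is still reported. The sketch below reproduces that effect with a hypothetical filter. It assumes the threshold is a percentage and that the comparison is inclusive; neither detail is confirmed by this guide, and this is not DataPrep's own code.

```python
# Missing-value percentages reported in the outputs of this guide.
missing_pct = {'Age': 19.87, 'Cabin': 77.1}


def columns_flagged_missing(missing_pct, threshold):
    """Hypothetical filter: keep columns whose missing percentage reaches the threshold."""
    return [col for col, pct in missing_pct.items() if pct >= threshold]


# With a threshold of 20, only Cabin is still flagged.
flagged = columns_flagged_missing(missing_pct, threshold=20)
```

Lowering the threshold below 19.87 in this sketch brings Age back into the list, which matches the earlier outputs where Age is reported as missing.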