Customize your output¶

Overview¶

DataPrep lets you customize the output of plot(), plot_missing(), plot_correlation() and create_report(). There are two main settings: display and config.

display is a list of names that controls which Tabs, Sections and Sessions are shown.
config is a dictionary that maps the customizable parameters to the values you want to use.

For convenience, the names used in display can be copied directly from the Tabs. Displaying less content also saves computation.

For config, the how-to guide helps you manage the frequently used parameters. Click the question mark icon in the upper right corner of a plot; the pop-up lists the customizable parameters for that plot, with a brief description and the default setting of each. Use the Copy All Parameters button to copy the parameters with their default settings as a dictionary, then adjust the settings and pass the dictionary to the config argument.
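
As a rough sketch of that workflow, the snippet below starts from the bar chart defaults listed in the pop-up (values as they appear in the example outputs of this guide), changes two of them, and prints what changed. The variable names are illustrative only, and the final plot() call is left as a comment because it needs DataPrep and a DataFrame.

```python
# Defaults as they would be pasted from the "Copy All Parameters" button
# for the bar chart (values taken from the pop-up shown in this guide).
bar_defaults = {
    'bar.bars': 10,
    'bar.sort_descending': True,
    'bar.yscale': 'linear',
    'bar.color': '#1f77b4',
    'height': 400,
    'width': 450,
}

# Copy the defaults and change only the settings you care about.
my_config = dict(bar_defaults)
my_config.update({'bar.bars': 5, 'bar.yscale': 'log'})

# List the entries that now differ from the defaults, as a quick check.
changed = {k: v for k, v in my_config.items() if bar_defaults[k] != v}
print(changed)  # {'bar.bars': 5, 'bar.yscale': 'log'}

# The dictionary is then passed to the config argument, e.g.
# plot(df, 'Pclass', config=my_config)
```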

Global and local parameter¶

There are two types of parameters, global and local.

Local parameters are plot-specific, and the parts of their names are separated by a period (.). The portion before the first . is the plot name and the portion after it is the parameter name, e.g. bar.bars.
Global parameters apply to every plot that has that parameter. Their names are a single word, e.g. ngroups.

When a global and a local parameter are both given, the local parameter overrides the global one for that specific plot. You can find more details about parameters in parameter_configurations.
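
The precedence rule can be illustrated with a small stand-alone sketch. The helper resolve_params below is hypothetical and not part of DataPrep, and the keys 'height' and 'bar.height' are chosen only for illustration; the sketch simply mimics the rule described above, where single-word keys are global and dotted keys apply to one plot.

```python
def resolve_params(config, plot_name):
    """Return the effective parameters for one plot (hypothetical helper)."""
    resolved = {}
    # First pass: global (single-word) parameters apply to every plot.
    for key, value in config.items():
        if "." not in key:
            resolved[key] = value
    # Second pass: local parameters for this plot override the globals.
    prefix = plot_name + "."
    for key, value in config.items():
        if key.startswith(prefix):
            resolved[key[len(prefix):]] = value
    return resolved


config = {"height": 400, "bar.height": 300, "pie.slices": 5}
bar_params = resolve_params(config, "bar")
pie_params = resolve_params(config, "pie")
print(bar_params)  # {'height': 300}
print(pie_params)  # {'height': 400, 'slices': 5}
```

Here the bar plot picks up the local height of 300, while the pie plot falls back to the global height of 400.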

Example 1: Choose the Tabs, Sections and Sessions you want¶

[1]:

from dataprep.eda import plot, create_report
from dataprep.datasets import load_dataset

df = load_dataset('titanic')
# Show only the Stats, Bar Chart and Pie Chart tabs for the Pclass column
plot(df, 'Pclass', display=['Stats', 'Bar Chart', 'Pie Chart'])

[1]:

DataPrep.EDA Report

Stats Bar Chart Pie Chart

Overview

Approximate Distinct Count	3
Approximate Unique (%)	0.3%
Missing	0
Missing (%)	0.0%
Memory Size	57.4 KB

Length

Mean	1
Standard Deviation	0
Median	1
Minimum	1
Maximum	1

Sample

1st row	3
2nd row	1
3rd row	3
4th row	1
5th row	3

Letter

Count	0
Lowercase Letter	0
Space Separator	0
Uppercase Letter	0
Dash Punctuation	0
Decimal Number	891

Parameter	Default	Description
'bar.bars'	10	Maximum number of bars to display
'bar.sort_descending'	True	Whether to sort the bars in descending order
'bar.yscale'	'linear'	Y-axis scale ("linear" or "log")
'bar.color'	'#1f77b4'	Color
'height'	400	Height of the plot
'width'	450	Width of the plot

Parameter	Default	Description
'pie.slices'	10	Maximum number of pie slices to display
'pie.sort_descending'	True	Whether to sort the slices in descending order of frequency
'pie.colors'	['#1f77b4', '#aec7e8', '#ff7f0e']	List of colors
'height'	400	Height of the plot
'width'	450	Width of the plot

[2]:

create_report(df, display=["Overview", "Interactions"])

[2]:

DataPrep Report

Overview

Dataset Statistics

Number of Variables	12
Number of Rows	891
Missing Cells	866
Missing Cells (%)	8.1%
Duplicate Rows	0
Duplicate Rows (%)	0.0%
Total Size in Memory	315.0 KB
Average Row Size in Memory	362.1 B
Variable Types	Numerical: 3 Categorical: 9

Dataset Insights

PassengerId is uniformly distributed	Uniform
Age has 177 (19.87%) missing values	Missing
Cabin has 687 (77.1%) missing values	Missing
Fare is skewed	Skewed
Name has a high cardinality: 891 distinct values	High Cardinality
Ticket has a high cardinality: 681 distinct values	High Cardinality
Cabin has a high cardinality: 147 distinct values	High Cardinality
Survived has constant length 1	Constant Length
Pclass has constant length 1	Constant Length
SibSp has constant length 1	Constant Length

Parch has constant length 1	Constant Length
Embarked has constant length 1	Constant Length
Name has all distinct values	Unique

Interactions

[3]:

plot(df, display=["Stats", "Insights"])

[3]:

DataPrep.EDA Report

Stats and Insights

Dataset Statistics

Number of Variables	12
Number of Rows	891
Missing Cells	866
Missing Cells (%)	8.1%
Duplicate Rows	0
Duplicate Rows (%)	0.0%
Total Size in Memory	315.0 KB
Average Row Size in Memory	362.1 B
Variable Types	Numerical: 3 Categorical: 9

Dataset Insights

PassengerId is uniformly distributed	Uniform
Age has 177 (19.87%) missing values	Missing
Cabin has 687 (77.1%) missing values	Missing
Fare is skewed	Skewed
Name has a high cardinality: 891 distinct values	High Cardinality
Ticket has a high cardinality: 681 distinct values	High Cardinality
Cabin has a high cardinality: 147 distinct values	High Cardinality
Survived has constant length 1	Constant Length
Pclass has constant length 1	Constant Length
SibSp has constant length 1	Constant Length

Parch has constant length 1	Constant Length
Embarked has constant length 1	Constant Length
Name has all distinct values	Unique

Example 2: Customize your plot¶

[4]:

plot(df, "Pclass", config={'bar.bars': 10, 'bar.sort_descending': True, 'bar.yscale': 'linear', 'height': 400, 'width': 450})

[4]:

DataPrep.EDA Report

Stats Bar Chart Pie Chart Word Cloud Word Frequency Word Length Value Table

Overview

Approximate Distinct Count	3
Approximate Unique (%)	0.3%
Missing	0
Missing (%)	0.0%
Memory Size	57.4 KB

Length

Mean	1
Standard Deviation	0
Median	1
Minimum	1
Maximum	1

Sample

1st row	3
2nd row	1
3rd row	3
4th row	1
5th row	3

Letter

Count	0
Lowercase Letter	0
Space Separator	0
Uppercase Letter	0
Dash Punctuation	0
Decimal Number	891

Parameter	Default	Description
'bar.bars'	10	Maximum number of bars to display
'bar.sort_descending'	True	Whether to sort the bars in descending order
'bar.yscale'	'linear'	Y-axis scale ("linear" or "log")
'bar.color'	'#1f77b4'	Color
'height'	400	Height of the plot
'width'	450	Width of the plot

The largest value (3) is over 2.27 times larger than the second largest value (1)

Parameter	Default	Description
'pie.slices'	10	Maximum number of pie slices to display
'pie.sort_descending'	True	Whether to sort the slices in descending order of frequency
'pie.colors'	['#1f77b4', '#aec7e8', '#ff7f0e']	List of colors
'height'	400	Height of the plot
'width'	450	Width of the plot

The top 2 categories (3, 1) take over 50.0%

Parameter	Default	Description
'wordcloud.top_words'	30	Maximum number of most frequent words to display
'wordcloud.stopword'	True	Whether to remove stopwords
'wordcloud.lemmatize'	False	Whether to lemmatize the words
'wordcloud.stem'	False	Whether to apply Porter stemming to the words
'height'	400	Height of the plot
'width'	450	Width of the plot

Parameter	Default	Description
'wordfreq.top_words'	30	Maximum number of most frequent words to display
'wordfreq.stopword'	True	Whether to remove stopwords
'wordfreq.lemmatize'	False	Whether to lemmatize the words
'wordfreq.stem'	False	Whether to apply Porter stemming to the words
'wordfreq.color'	'#1f77b4'	Color
'height'	400	Height of the plot
'width'	450	Width of the plot

The largest value (3) is over 2.27 times larger than the second largest value (1)
Pclass has words of constant length

Parameter	Default	Description
'wordlen.bins'	50	Number of bins in the histogram
'wordlen.yscale'	'linear'	Y-axis scale ("linear" or "log")
'wordlen.color'	'#aec7e8'	Color
'height'	400	Height of the plot
'width'	450	Width of the plot

Parameter	Default	Description
'value_table.ngroups'	10	The number of distinct values to show

Value	Count	Frequency (%)
3	491	55.1%
1	216	24.2%
2	184	20.7%

Example 3: Customize your Insights¶

[5]:

plot(df, config={'insight.missing.threshold': 20, 'insight.duplicates.threshold': 20})

[5]:

DataPrep.EDA Report

Stats and Insights

Dataset Statistics

Number of Variables	12
Number of Rows	891
Missing Cells	866
Missing Cells (%)	8.1%
Duplicate Rows	0
Duplicate Rows (%)	0.0%
Total Size in Memory	315.0 KB
Average Row Size in Memory	362.1 B
Variable Types	Numerical: 3 Categorical: 9

Dataset Insights

PassengerId is uniformly distributed	Uniform
Cabin has 687 (77.1%) missing values	Missing
Fare is skewed	Skewed
Name has a high cardinality: 891 distinct values	High Cardinality
Ticket has a high cardinality: 681 distinct values	High Cardinality
Cabin has a high cardinality: 147 distinct values	High Cardinality
Survived has constant length 1	Constant Length
Pclass has constant length 1	Constant Length
SibSp has constant length 1	Constant Length
Parch has constant length 1	Constant Length

Embarked has constant length 1	Constant Length
Name has all distinct values	Unique
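
Compared with the default output in Example 1, the Age insight (19.87% missing) is no longer listed once insight.missing.threshold is set to 20, while the Cabin insight (77.1% missing) still is. A minimal sketch of that filtering follows, assuming the threshold is a percentage of rows. The helper missing_insights is hypothetical and is not DataPrep's implementation; the exact comparison DataPrep uses may differ.

```python
# Row count and missing-value counts taken from the outputs shown in this guide.
n_rows = 891
missing_counts = {"Age": 177, "Cabin": 687}


def missing_insights(counts, n_rows, threshold):
    """Hypothetical helper: columns whose missing rate (%) reaches the threshold."""
    flagged = {}
    for column, n_missing in counts.items():
        rate = 100 * n_missing / n_rows
        if rate >= threshold:  # assumed comparison; DataPrep's rule may differ
            flagged[column] = round(rate, 2)
    return flagged


# With a threshold of 20, Age (about 19.87%) falls below it and is dropped.
print(missing_insights(missing_counts, n_rows, threshold=20))  # {'Cabin': 77.1}
# With a lower, purely illustrative threshold, both columns are reported.
print(missing_insights(missing_counts, n_rows, threshold=10))  # {'Age': 19.87, 'Cabin': 77.1}
```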
