This section introduces the insights supported by Dataprep.
[1]:
%reload_ext autoreload
%autoreload 2
from dataprep.datasets import load_dataset
from dataprep.eda import plot, plot_correlation, plot_missing
[2]:
df = load_dataset("titanic")
[3]:
df
891 rows × 12 columns
For example:
1. Click the “Show Stats and Insights” button.
2. The insights provided by Dataprep then appear in that section.
When using the plot(df, col) function, click the following button:
plot(df, col)
The following insights are then displayed:
The following table lists the insights that plot(df) can provide.
| insights | applied plots | type | threshold | description |
|---|---|---|---|---|
| Duplicates | Overview | int | 1 | Warn if the percent of duplicated values is above this threshold. |
| Negatives | | | | Warn if the percent of negatives is above this threshold. |
| Similar_distribution | | float | 0.05 | The significance level for the Kolmogorov–Smirnov test. |
| Uniform | Histogram | | 0.999 | The p-value threshold for the chi-square test. |
| Missing | | | | Warn if the percent of missing values is above this threshold. |
| Skewed | | | 1e-5 | The p-value threshold for scipy.stats.skewtest, which tests whether the skewness differs from that of a normal distribution. |
| Infinity | | | | Warn if the percent of infinite values is above this threshold. |
| Zeros | | | 5 | Flags columns whose percent of zero values is above this threshold. |
| Normal | | | 0.99 | The p-value threshold for the normality test, which is based on D’Agostino and Pearson’s test that combines skew and kurtosis to produce an omnibus test of normality. |
| High Cardinality | Bar Chart | | 50 | The threshold for the unique-value count; a count larger than the threshold yields high cardinality. |
| Constant | | | | The threshold for the unique-value count; a count equal to the threshold yields a constant value. |
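Several of the insights listed above are described as percent-based warnings. The sketch below shows what such a check could look like in plain Python. It is an illustration only, not Dataprep's implementation: the helper names (`is_missing`, `is_number`, `percent_matching`), the sample column, and the threshold values in `checks` are all assumptions made for this example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_number(value):
    """True for non-missing ints and floats."""
    return isinstance(value, (int, float)) and not is_missing(value)


def percent_matching(values, predicate):
    """Return the percentage (0 to 100) of values that satisfy predicate."""
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if predicate(v)) / len(values)


# Hypothetical column, not taken from the Titanic dataset.
column = [22.0, None, 0.0, 35.0, -1.0, float("inf"), 0.0, 54.0, None, 2.0]

# Each entry maps an insight name to (predicate, threshold in percent).
# The thresholds here are illustrative values only.
checks = {
    "Missing": (is_missing, 1),
    "Zeros": (lambda v: is_number(v) and v == 0, 5),
    "Negatives": (lambda v: is_number(v) and v < 0, 1),
    "Infinity": (lambda v: is_number(v) and math.isinf(v), 1),
}

triggered = {}
for name, (predicate, threshold) in checks.items():
    percent = percent_matching(column, predicate)
    if percent > threshold:  # "above this threshold"
        triggered[name] = percent

print(triggered)
```

Keeping the predicate and the threshold together in one mapping makes it easy to see which rule produced which warning.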
[4]:
plot(df)
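The Normal, Skewed, Uniform and Similar_distribution insights are described in terms of statistical tests. The sketch below runs the corresponding SciPy tests on synthetic data so the p-values can be compared with the thresholds by hand. It assumes NumPy and SciPy are installed. The samples, the bin count for the chi-square test, and the direction of each comparison (for example, that a column counts as skewed when the p-value falls below the threshold) are assumptions made for illustration and may not match how Dataprep applies these tests internally.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples (hypothetical data, not Titanic columns).
normal_sample = rng.normal(loc=30.0, scale=10.0, size=1000)
skewed_sample = rng.exponential(scale=10.0, size=1000)
uniform_sample = rng.uniform(low=0.0, high=1.0, size=1000)

# D'Agostino and Pearson's omnibus normality test.
normal_p = stats.normaltest(normal_sample).pvalue

# Test whether the skewness differs from that of a normal distribution.
skew_p = stats.skewtest(skewed_sample).pvalue

# Chi-square test of the binned counts against equal expected counts.
# The choice of 10 bins is an assumption for this sketch.
bin_counts, _ = np.histogram(uniform_sample, bins=10)
uniform_p = stats.chisquare(bin_counts).pvalue

# Two-sample Kolmogorov-Smirnov test between two columns.
ks_p = stats.ks_2samp(normal_sample, skewed_sample).pvalue

# Assumed comparison directions, using the thresholds from the table.
is_skewed = skew_p < 1e-5
is_similar = ks_p > 0.05

print(f"normal_p={normal_p:.3g} uniform_p={uniform_p:.3g}")
print(f"is_skewed={is_skewed} is_similar={is_similar}")
```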
The following insights can be produced by plot(df, x) when x is a continuous column.
Stats
Warn if the percent of zeros is above this threshold.
Histogram, Normal Q-Q Plot
Outliers
Box Plot
0
It shows how many outliers a column has.
[5]:
plot(df, "Age")
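The Outliers insight reports how many outliers a column has in its box plot. A common box plot convention treats values more than 1.5 times the interquartile range beyond the quartiles as outliers. The sketch below counts outliers under that convention using only the standard library. The function name `count_outliers`, the 1.5 multiplier, the quartile method, and the sample data are assumptions for this example; Dataprep may compute outliers differently.

```python
import statistics


def count_outliers(values, whisker=1.5):
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return sum(1 for v in values if v < lower or v > upper)


# Hypothetical ages, not taken from the Titanic dataset.
ages = [20, 22, 24, 26, 28, 30, 32, 34, 36, 95]

n_outliers = count_outliers(ages)

# Assumption: the insight is reported when the count exceeds the threshold.
has_outlier_insight = n_outliers > 0
print(n_outliers, has_outlier_insight)
```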
The following insights can be produced by plot(df, col) when col is a nominal column.
| insights | applied plots | type | threshold | description |
|---|---|---|---|---|
| High_cardinality | | | | |
| Outstanding_no1 | | | 1.5 | It measures the ratio of the largest category count to the second-largest category count. |
| Attribution | Pie Chart | | 0.5 | It measures the percentage of the top 2 categories. |
| High_word_cardinality | Word Cloud | | 1000 | The threshold for the high word cardinality insight, which measures the number of words of that category. |
| Outstanding_no1_word | | | | The threshold for the outstanding no1 word insight, which measures the ratio of the most frequent word count to the second most frequent word count. |
[6]:
plot(df, "Sex")
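The Outstanding_no1 and Attribution insights are both defined from the two most frequent categories, and Outstanding_no1_word applies the same ratio to words. The sketch below computes those quantities with `collections.Counter`. The function name `top_two_stats`, the sample data, and the whitespace-based word splitting are assumptions for illustration, not Dataprep's implementation.

```python
from collections import Counter


def top_two_stats(items):
    """Return (ratio of top count to second count, share of the top two)."""
    if not items:
        raise ValueError("items must not be empty")
    top = Counter(items).most_common(2)
    first = top[0][1]
    second = top[1][1] if len(top) > 1 else 0
    ratio = first / second if second else float("inf")
    share = (first + second) / len(items)
    return ratio, share


# Hypothetical nominal column.
categories = ["a"] * 6 + ["b"] * 3 + ["c"]
ratio, share = top_two_stats(categories)
print(ratio, share)

# The same ratio at word level, splitting on whitespace (an assumption).
texts = ["red apple", "green apple", "red pear", "red plum"]
words = " ".join(texts).split()
word_ratio, _ = top_two_stats(words)
print(word_ratio)
```

The resulting `ratio` can then be compared with the Outstanding_no1 threshold (1.5) and `share` with the Attribution threshold (0.5).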