Datasets

DataPrep provides a collection of datasets. You can load any of them with one line of code and use it to explore DataPrep's functionality.

List Available Datasets

You can list the names of all available datasets by calling get_dataset_names, as shown below.

[1]:

from dataprep.datasets import get_dataset_names
get_dataset_names()

[1]:

['covid19',
 'wine-quality-red',
 'iris',
 'waste_hauler',
 'countries',
 'patient_info',
 'house_prices_train',
 'adult',
 'house_prices_test',
 'titanic']
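Because datasets are selected by name, a mistyped name is an easy mistake to make. The sketch below checks a requested name against the available names and suggests close matches. The helper resolve_dataset_name is hypothetical and not part of DataPrep; the list is copied from the output above so the snippet runs without DataPrep installed.

```python
from difflib import get_close_matches

# Names copied from the get_dataset_names() output above; in a live session
# you would pass the list returned by get_dataset_names() instead.
AVAILABLE = [
    "covid19", "wine-quality-red", "iris", "waste_hauler", "countries",
    "patient_info", "house_prices_train", "adult", "house_prices_test",
    "titanic",
]


def resolve_dataset_name(name, available):
    """Return `name` if it is available, else raise with close suggestions.

    Hypothetical helper for illustration; not part of DataPrep.
    """
    if name in available:
        return name
    suggestions = get_close_matches(name, available, n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown dataset {name!r}.{hint}")


print(resolve_dataset_name("titanic", AVAILABLE))
```

Calling resolve_dataset_name("titanik", AVAILABLE) raises a ValueError whose message suggests titanic.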

Load Dataset

Once you know the available dataset names from get_dataset_names, you can load a dataset by calling load_dataset with its name.

[2]:

from dataprep.datasets import load_dataset
df = load_dataset("titanic")
df

[2]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns
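If you want to load several datasets in one go, one possible pattern is to loop over the names and record any that fail instead of stopping at the first error. This is a minimal sketch: load_many and demo are hypothetical names, not part of DataPrep, and the loader is passed in as an argument so the helper itself has no dependencies. The demo function assumes load_dataset returns a pandas DataFrame, as the output above suggests.

```python
def load_many(names, loader):
    """Load every dataset in `names` with `loader`.

    Returns (loaded, failed): a dict of name -> dataset, and a dict of
    name -> error message for names the loader rejected.
    Hypothetical helper for illustration; not part of DataPrep.
    """
    loaded, failed = {}, {}
    for name in names:
        try:
            loaded[name] = loader(name)
        except Exception as exc:  # keep going so one bad name does not stop the batch
            failed[name] = str(exc)
    return loaded, failed


def demo():
    # Requires DataPrep; the import is local so load_many stays dependency-free.
    from dataprep.datasets import load_dataset

    loaded, failed = load_many(["titanic", "iris"], load_dataset)
    for name, df in loaded.items():
        # Assumes each dataset is a pandas DataFrame, as in the output above.
        print(name, df.shape)
    for name, message in failed.items():
        print(f"could not load {name}: {message}")
```

Call demo() in an environment where DataPrep is installed to print the shape of each dataset that loaded.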

Analyze Dataset

Once the dataset is loaded, you can explore it with DataPrep. For example, you can create a profiling report of the dataset by calling create_report from dataprep.eda.

[3]:

from dataprep.eda import create_report
report = create_report(df)
report

[3]:

DataPrep Report

Overview

Dataset Statistics

Number of Variables	12
Number of Rows	891
Missing Cells	866
Missing Cells (%)	8.1%
Duplicate Rows	0
Duplicate Rows (%)	0.0%
Total Size in Memory	315.0 KB
Average Row Size in Memory	362.1 B
Variable Types	Numerical: 3 Categorical: 9

Dataset Insights

PassengerId is uniformly distributed	Uniform
Age has 177 (19.87%) missing values	Missing
Cabin has 687 (77.1%) missing values	Missing
Fare is skewed	Skewed
Name has a high cardinality: 891 distinct values	High Cardinality
Ticket has a high cardinality: 681 distinct values	High Cardinality
Cabin has a high cardinality: 147 distinct values	High Cardinality
Survived has constant length 1	Constant Length
Pclass has constant length 1	Constant Length
SibSp has constant length 1	Constant Length

Parch has constant length 1	Constant Length
Embarked has constant length 1	Constant Length
Name has all distinct values	Unique


Variables

PassengerId

numerical

Approximate Distinct Count	891
Approximate Unique (%)	100.0%
Missing	0
Missing (%)	0.0%
Infinite	0
Infinite (%)	0.0%
Memory Size	13.9 KB
Mean	446
Minimum	1
Maximum	891
Zeros	0
Zeros (%)	0.0%
Negatives	0
Negatives (%)	0.0%

PassengerId is uniformly distributed

Stats KDE Plot Normal Q-Q Plot Box Plot

Quantile Statistics

Minimum	1
5-th Percentile	45.5
Q1	223.5
Median	446
Q3	668.5
95-th Percentile	846.5
Maximum	891
Range	890
IQR	445

Descriptive Statistics

Mean	446
Standard Deviation	257.3538
Variance	66231
Sum	397386
Skewness	0
Kurtosis	-1.2
Coefficient of Variation	0.577

PassengerId is not normally distributed (p-value 7.259388077973426e-05)

Survived

categorical

Approximate Distinct Count	2
Approximate Unique (%)	0.2%
Missing	0
Missing (%)	0.0%
Memory Size	57.4 KB

The largest value (0) is over 1.61 times larger than the second largest value (1)

Stats Pie Chart Word Cloud Word Frequency Word Length

Length

Mean	1
Standard Deviation	0
Median	1
Minimum	1
Maximum	1

Sample

1st row	0
2nd row	1
3rd row	1
4th row	1
5th row	0

Letter

Count	0
Lowercase Letter	0
Space Separator	0
Uppercase Letter	0
Dash Punctuation	0
Decimal Number	891

The top 2 categories (0, 1) take over 50.0%

The largest value (0) is over 1.61 times larger than the second largest value (1)
Survived has words of constant length

Pclass

categorical

Approximate Distinct Count	3
Approximate Unique (%)	0.3%
Missing	0
Missing (%)	0.0%
Memory Size	57.4 KB

The largest value (3) is over 2.27 times larger than the second largest value (1)

Stats Pie Chart Word Cloud Word Frequency Word Length

Length

Mean	1
Standard Deviation	0
Median	1
Minimum	1
Maximum	1

Sample

1st row	3
2nd row	1
3rd row	3
4th row	1
5th row	3

Letter

Count	0
Lowercase Letter	0
Space Separator	0
Uppercase Letter	0
Dash Punctuation	0
Decimal Number	891

The top 2 categories (3, 1) take over 50.0%

The largest value (3) is over 2.27 times larger than the second largest value (1)
Pclass has words of constant length

Name

categorical

Approximate Distinct Count	891
Approximate Unique (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory Size	80.0 KB

Stats Pie Chart Word Cloud Word Frequency Word Length

Length

Mean	26.9652
Standard Deviation	9.2816
Median	25
Minimum	12
Maximum	82

Sample

1st row	Braund, Mr. Owen H...
2nd row	Cumings, Mrs. John...
3rd row	Heikkinen, Miss. L...
4th row	Futrelle, Mrs. Jac...
5th row	Allen, Mr. William...

Letter

Count	19091
Lowercase Letter	15446
Space Separator	2735
Uppercase Letter	3645
Dash Punctuation	13
Decimal Number	0

Name contains many words: 1522 words

The largest value (mr) is over 2.86 times larger than the second largest value (miss)

Sex

categorical

Approximate Distinct Count	2
Approximate Unique (%)	0.2%
Missing	0
Missing (%)	0.0%
Memory Size	60.7 KB

The largest value (male) is over 1.84 times larger than the second largest value (female)

Stats Pie Chart Word Cloud Word Frequency Word Length

Length

Mean	4.7048
Standard Deviation	0.956
Median	4
Minimum	4
Maximum	6

Sample

1st row	male
2nd row	female
3rd row	female
4th row	female
5th row	male

Letter

Count	4192
Lowercase Letter	4192
Space Separator	0
Uppercase Letter	0
Dash Punctuation	0
Decimal Number	0

The top 2 categories (male, female) take over 50.0%

The largest value (male) is over 1.84 times larger than the second largest value (female)

Age

numerical

Approximate Distinct Count	88
Approximate Unique (%)	12.3%
Missing	177
Missing (%)	19.9%
Infinite	0
Infinite (%)	0.0%
Memory Size	11.2 KB
Mean	29.6991
Minimum	0.42
Maximum	80
Zeros	0
Zeros (%)	0.0%
Negatives	0
Negatives (%)	0.0%

Age is skewed right (γ1 = 0.3883)

Stats KDE Plot Normal Q-Q Plot Box Plot

Quantile Statistics

Minimum	0.42
5-th Percentile	4
Q1	20.125
Median	28
Q3	38
95-th Percentile	56
Maximum	80
Range	79.58
IQR	17.875

Descriptive Statistics

Mean	29.6991
Standard Deviation	14.5265
Variance	211.0191
Sum	21205.17
Skewness	0.3883
Kurtosis	0.1686
Coefficient of Variation	0.4891

Age has 11 outliers

SibSp

categorical

Approximate Distinct Count	7
Approximate Unique (%)	0.8%
Missing	0
Missing (%)	0.0%
Memory Size	57.4 KB

The largest value (0) is over 2.91 times larger than the second largest value (1)

Stats Pie Chart Word Cloud Word Frequency Word Length

Length

Mean	1
Standard Deviation	0
Median	1
Minimum	1
Maximum	1

Sample

1st row	1
2nd row	1
3rd row	0
4th row	1
5th row	0

Letter

Count	0
Lowercase Letter	0
Space Separator	0
Uppercase Letter	0
Dash Punctuation	0
Decimal Number	891

The top 2 categories (0, 1) take over 50.0%

The largest value (0) is over 2.91 times larger than the second largest value (1)
SibSp has words of constant length

Parch

categorical

Approximate Distinct Count	7
Approximate Unique (%)	0.8%
Missing	0
Missing (%)	0.0%
Memory Size	57.4 KB

The largest value (0) is over 5.75 times larger than the second largest value (1)

Stats Pie Chart Word Cloud Word Frequency Word Length

Length

Mean	1
Standard Deviation	0
Median	1
Minimum	1
Maximum	1

Sample

1st row	0
2nd row	0
3rd row	0
4th row	0
5th row	0

Letter

Count	0
Lowercase Letter	0
Space Separator	0
Uppercase Letter	0
Dash Punctuation	0
Decimal Number	891

The top 2 categories (0, 1) take over 50.0%

The largest value (0) is over 5.75 times larger than the second largest value (1)
Parch has words of constant length

Ticket

categorical

Approximate Distinct Count	681
Approximate Unique (%)	76.4%
Missing	0
Missing (%)	0.0%
Memory Size	62.4 KB

Stats Pie Chart Word Cloud Word Frequency Word Length

Length

Mean	6.7508
Standard Deviation	2.7455
Median	6
Minimum	3
Maximum	18

Sample

1st row	A/5 21171
2nd row	PC 17599
3rd row	STON/O2. 3101282
4th row	113803
5th row	373450

Letter

Count	673
Lowercase Letter	21
Space Separator	239
Uppercase Letter	652
Dash Punctuation	0
Decimal Number	4808

Fare

numerical

Approximate Distinct Count	248
Approximate Unique (%)	27.8%
Missing	0
Missing (%)	0.0%
Infinite	0
Infinite (%)	0.0%
Memory Size	13.9 KB
Mean	32.2042
Minimum	0
Maximum	512.3292
Zeros	15
Zeros (%)	1.7%
Negatives	0
Negatives (%)	0.0%

Fare is skewed right (γ1 = 4.7793)

Stats KDE Plot Normal Q-Q Plot Box Plot

Quantile Statistics

Minimum	0
5-th Percentile	7.225
Q1	7.9104
Median	14.4542
Q3	31
95-th Percentile	112.0791
Maximum	512.3292
Range	512.3292
IQR	23.0896

Descriptive Statistics

Mean	32.2042
Standard Deviation	49.6934
Variance	2469.4368
Sum	28693.9493
Skewness	4.7793
Kurtosis	33.2043
Coefficient of Variation	1.5431

Fare is not normally distributed (p-value 5.925743764895219e-18)

Fare has 116 outliers

Cabin

categorical

Approximate Distinct Count	147
Approximate Unique (%)	72.1%
Missing	687
Missing (%)	77.1%
Memory Size	13.7 KB

Stats Pie Chart Word Cloud Word Frequency Word Length

Length

Mean	3.5882
Standard Deviation	2.0743
Median	3
Minimum	1
Maximum	15

Sample

1st row	C85
2nd row	C123
3rd row	E46
4th row	G6
5th row	C103

Letter

Count	238
Lowercase Letter	0
Space Separator	34
Uppercase Letter	238
Dash Punctuation	0
Decimal Number	460

Embarked

categorical

Approximate Distinct Count	3
Approximate Unique (%)	0.3%
Missing	2
Missing (%)	0.2%
Memory Size	57.3 KB

The largest value (S) is over 3.83 times larger than the second largest value (C)

Stats Pie Chart Word Cloud Word Frequency Word Length

Length

Mean	1
Standard Deviation	0
Median	1
Minimum	1
Maximum	1

Sample

1st row	S
2nd row	C
3rd row	S
4th row	S
5th row	S

Letter

Count	889
Lowercase Letter	0
Space Separator	0
Uppercase Letter	889
Dash Punctuation	0
Decimal Number	0

The top 2 categories (S, C) take over 50.0%

The largest value (s) is over 3.83 times larger than the second largest value (c)
Embarked has words of constant length
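The headline figures in the Dataset Statistics table can be cross-checked against the per-variable sections. The sketch below uses only numbers copied from the report above (Age, Cabin and Embarked are the only variables with a non-zero Missing count) and recomputes the total and the percentages. It is plain arithmetic on the reported values, not a description of how DataPrep computes them internally.

```python
# Figures copied from the report above.
n_rows, n_cols = 891, 12
missing_per_column = {"Age": 177, "Cabin": 687, "Embarked": 2}

# Total missing cells is the sum of the per-column missing counts.
missing_cells = sum(missing_per_column.values())

# Share of missing cells among all rows x columns cells in the table.
missing_pct = 100 * missing_cells / (n_rows * n_cols)

# Per-column missing percentage, relative to the number of rows.
column_pct = {
    col: round(100 * count / n_rows, 1)
    for col, count in missing_per_column.items()
}

print(missing_cells)          # 866
print(f"{missing_pct:.1f}%")  # 8.1%
print(column_pct)             # {'Age': 19.9, 'Cabin': 77.1, 'Embarked': 0.2}
```

The results agree with the report: 866 missing cells (8.1%), and 19.9%, 77.1% and 0.2% missing for Age, Cabin and Embarked respectively.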
