Datasets

DataPrep provides a collection of datasets. You can load any of them with one line of code and use it to explore DataPrep's functionality.

List Available Datasets

You can list the names of all available datasets by calling get_dataset_names, as shown below.

[1]:
from dataprep.datasets import get_dataset_names
get_dataset_names()
[1]:
['covid19',
 'wine-quality-red',
 'iris',
 'waste_hauler',
 'countries',
 'patient_info',
 'house_prices_train',
 'adult',
 'house_prices_test',
 'titanic']
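Because datasets are selected by name, a typo is an easy mistake to make. The sketch below is a hypothetical helper (resolve_dataset_name is not part of DataPrep) that checks a name against the available names and suggests close matches. The list is copied from the output above so that the snippet runs without DataPrep installed; in practice you would pass the result of get_dataset_names() instead.

```python
from difflib import get_close_matches

# Copied from the get_dataset_names() output above so this runs standalone.
# In practice, pass the live list returned by get_dataset_names().
AVAILABLE = [
    "covid19", "wine-quality-red", "iris", "waste_hauler", "countries",
    "patient_info", "house_prices_train", "adult", "house_prices_test",
    "titanic",
]


def resolve_dataset_name(name, available=AVAILABLE):
    """Hypothetical helper: return name if it is available, else raise."""
    if name in available:
        return name
    # Suggest up to three similar names to make typos easy to spot.
    suggestions = get_close_matches(name, available, n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown dataset {name!r}.{hint}")


print(resolve_dataset_name("titanic"))
```

Calling resolve_dataset_name("titanc") raises a ValueError whose message suggests "titanic".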

Load Dataset

Once you know the available dataset names from get_dataset_names, you can load a dataset by calling load_dataset.

[2]:
from dataprep.datasets import load_dataset
df = load_dataset("titanic")
df
[2]:
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
...  ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...
886  887          0         2       Montvila, Rev. Juozas                              male    27.0  0      0      211536            13.0000  NaN    S
887  888          1         1       Graham, Miss. Margaret Edith                       female  19.0  0      0      112053            30.0000  B42    S
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"           female  NaN   1      2      W./C. 6607        23.4500  NaN    S
889  890          1         1       Behr, Mr. Karl Howell                              male    26.0  0      0      111369            30.0000  C148   C
890  891          0         3       Dooley, Mr. Patrick                                male    32.0  0      0      370376            7.7500   NaN    Q

891 rows × 12 columns

Analyze Dataset

Once the dataset is loaded, you can explore it with DataPrep. For example, you can create a profiling report of the dataset with create_report from dataprep.eda.

[3]:
from dataprep.eda import create_report
report = create_report(df)
report
[3]:
DataPrep Report

Overview

Dataset Statistics

Number of Variables 12
Number of Rows 891
Missing Cells 866
Missing Cells (%) 8.1%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 315.0 KB
Average Row Size in Memory 362.1 B
Variable Types
  • Numerical: 3
  • Categorical: 9
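The overview figures can be cross-checked against the per-variable sections of the same report. The sketch below uses only numbers that appear in this report: the Missing counts for Age (177), Cabin (687) and Embarked (2), with every other variable reporting 0 missing values.

```python
# Missing counts per variable, as reported in the Variables section.
# All variables not listed here report 0 missing values.
missing_per_column = {"Age": 177, "Cabin": 687, "Embarked": 2}

n_rows, n_columns = 891, 12

missing_cells = sum(missing_per_column.values())
total_cells = n_rows * n_columns
missing_pct = 100 * missing_cells / total_cells

print(missing_cells)          # 866
print(f"{missing_pct:.1f}%")  # 8.1%
```

Both results agree with the Missing Cells and Missing Cells (%) rows above.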

Dataset Insights

PassengerId is uniformly distributed Uniform
Age has 177 (19.87%) missing values Missing
Cabin has 687 (77.1%) missing values Missing
Fare is skewed Skewed
Name has a high cardinality: 891 distinct values High Cardinality
Ticket has a high cardinality: 681 distinct values High Cardinality
Cabin has a high cardinality: 147 distinct values High Cardinality
Survived has constant length 1 Constant Length
Pclass has constant length 1 Constant Length
SibSp has constant length 1 Constant Length
Parch has constant length 1 Constant Length
Embarked has constant length 1 Constant Length
Name has all distinct values Unique

Variables

PassengerId

numerical

Approximate Distinct Count 891
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 13.9 KB
Mean 446
Minimum 1
Maximum 891
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • PassengerId is uniformly distributed

Quantile Statistics

Minimum 1
5-th Percentile 45.5
Q1 223.5
Median 446
Q3 668.5
95-th Percentile 846.5
Maximum 891
Range 890
IQR 445

Descriptive Statistics

Mean 446
Standard Deviation 257.3538
Variance 66231
Sum 397386
Skewness 0
Kurtosis -1.2
Coefficient of Variation 0.577
  • PassengerId is not normally distributed (p-value 7.259388077973426e-05)
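PassengerId has a minimum of 1, a maximum of 891, 891 distinct values and no missing values, which is consistent with the consecutive integers 1 to 891. Assuming that is the case, the statistics above can be reproduced with Python's standard library. The "inclusive" quantile method is an assumption chosen because it matches the reported quantiles; the report does not state which interpolation DataPrep uses.

```python
import statistics

# Assumption: PassengerId is exactly the integers 1..891.
ids = list(range(1, 892))

mean = statistics.mean(ids)
variance = statistics.variance(ids)  # sample variance (n - 1 denominator)
stdev = statistics.stdev(ids)

# Quartiles and 5th/95th percentiles with linear interpolation.
q1, median, q3 = statistics.quantiles(ids, n=4, method="inclusive")
ventiles = statistics.quantiles(ids, n=20, method="inclusive")
p5, p95 = ventiles[0], ventiles[-1]

print(mean, variance, round(stdev, 4))  # 446 66231 257.3538
print(p5, q1, median, q3, p95)          # 45.5 223.5 446.0 668.5 846.5
print(q3 - q1)                          # IQR: 445.0
print(round(stdev / mean, 3))           # coefficient of variation: 0.577
```

A skewness of 0 and a kurtosis of -1.2 are also what you would expect from evenly spaced values, which fits the "uniformly distributed" insight.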

Survived

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Memory Size 57.4 KB
  • The largest value (0) is over 1.61 times larger than the second largest value (1)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 0
2nd row 1
3rd row 1
4th row 1
5th row 0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 891
  • The top 2 categories (0, 1) take over 50.0%
  • The largest value (0) is over 1.61 times larger than the second largest value (1)
  • Survived has words of constant length

Pclass

categorical

Approximate Distinct Count 3
Approximate Unique (%) 0.3%
Missing 0
Missing (%) 0.0%
Memory Size 57.4 KB
  • The largest value (3) is over 2.27 times larger than the second largest value (1)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 3
2nd row 1
3rd row 3
4th row 1
5th row 3

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 891
  • The top 2 categories (3, 1) take over 50.0%
  • The largest value (3) is over 2.27 times larger than the second largest value (1)
  • Pclass has words of constant length

Name

categorical

Approximate Distinct Count 891
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Memory Size 80.0 KB

Length

Mean 26.9652
Standard Deviation 9.2816
Median 25
Minimum 12
Maximum 82

Sample

1st row Braund, Mr. Owen H...
2nd row Cumings, Mrs. John...
3rd row Heikkinen, Miss. L...
4th row Futrelle, Mrs. Jac...
5th row Allen, Mr. William...

Letter

Count 19091
Lowercase Letter 15446
Space Separator 2735
Uppercase Letter 3645
Dash Punctuation 13
Decimal Number 0
  • Name contains many words: 1522 words
  • The largest value (mr) is over 2.86 times larger than the second largest value (miss)

Sex

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Memory Size 60.7 KB
  • The largest value (male) is over 1.84 times larger than the second largest value (female)

Length

Mean 4.7048
Standard Deviation 0.956
Median 4
Minimum 4
Maximum 6

Sample

1st row male
2nd row female
3rd row female
4th row female
5th row male

Letter

Count 4192
Lowercase Letter 4192
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (male, female) take over 50.0%
  • The largest value (male) is over 1.84 times larger than the second largest value (female)
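The Length and Letter statistics above are enough to recover how many rows hold each value. Every row is either "male" (4 letters) or "female" (6 letters), there are 891 rows with none missing, and the values contain 4192 letters in total, which gives two linear equations in the two counts. The sketch below solves them; the variable names are illustrative.

```python
n_rows = 891          # rows, none missing
total_letters = 4192  # "Count" under Letter
len_male, len_female = len("male"), len("female")

# male + female = n_rows
# len_male * male + len_female * female = total_letters
female = (total_letters - len_male * n_rows) // (len_female - len_male)
male = n_rows - female

mean_length = total_letters / n_rows
ratio = male / female

print(male, female)           # 577 314
print(round(mean_length, 4))  # 4.7048
print(round(ratio, 2))        # 1.84
```

The mean length matches the Length block, and the ratio rounds to the 1.84 quoted in the insight.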

Age

numerical

Approximate Distinct Count 88
Approximate Unique (%) 12.3%
Missing 177
Missing (%) 19.9%
Infinite 0
Infinite (%) 0.0%
Memory Size 11.2 KB
Mean 29.6991
Minimum 0.42
Maximum 80
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Age is skewed right (γ1 = 0.3883)

Quantile Statistics

Minimum 0.42
5-th Percentile 4
Q1 20.125
Median 28
Q3 38
95-th Percentile 56
Maximum 80
Range 79.58
IQR 17.875

Descriptive Statistics

Mean 29.6991
Standard Deviation 14.5265
Variance 211.0191
Sum 21205.17
Skewness 0.3883
Kurtosis 0.1686
Coefficient of Variation 0.4891
  • Age has 11 outliers
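Several of the Age figures are derived from others in the same block, which allows a quick sanity check. The sketch below recomputes the range, IQR and coefficient of variation from the reported values. It also computes the 1.5 × IQR fences that are commonly used to flag outliers; whether DataPrep applies this exact rule for its "11 outliers" insight is an assumption, not something the report states.

```python
# Values copied from the Age statistics above.
minimum, maximum = 0.42, 80
q1, q3 = 20.125, 38
mean, stdev = 29.6991, 14.5265

value_range = maximum - minimum
iqr = q3 - q1
cv = stdev / mean

# Assumption: Tukey's 1.5 * IQR rule; DataPrep's rule may differ.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(round(value_range, 2))     # 79.58
print(iqr)                       # 17.875
print(round(cv, 4))              # 0.4891
print(lower_fence, upper_fence)  # -6.6875 64.8125
```

Under that assumption, only ages above 64.8125 would count as outliers, since the lower fence is below the minimum age.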

SibSp

categorical

Approximate Distinct Count 7
Approximate Unique (%) 0.8%
Missing 0
Missing (%) 0.0%
Memory Size 57.4 KB
  • The largest value (0) is over 2.91 times larger than the second largest value (1)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 1
2nd row 1
3rd row 0
4th row 1
5th row 0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 891
  • The top 2 categories (0, 1) take over 50.0%
  • The largest value (0) is over 2.91 times larger than the second largest value (1)
  • SibSp has words of constant length

Parch

categorical

Approximate Distinct Count 7
Approximate Unique (%) 0.8%
Missing 0
Missing (%) 0.0%
Memory Size 57.4 KB
  • The largest value (0) is over 5.75 times larger than the second largest value (1)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 0
2nd row 0
3rd row 0
4th row 0
5th row 0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 891
  • The top 2 categories (0, 1) take over 50.0%
  • The largest value (0) is over 5.75 times larger than the second largest value (1)
  • Parch has words of constant length

Ticket

categorical

Approximate Distinct Count 681
Approximate Unique (%) 76.4%
Missing 0
Missing (%) 0.0%
Memory Size 62.4 KB

Length

Mean 6.7508
Standard Deviation 2.7455
Median 6
Minimum 3
Maximum 18

Sample

1st row A/5 21171
2nd row PC 17599
3rd row STON/O2. 3101282
4th row 113803
5th row 373450

Letter

Count 673
Lowercase Letter 21
Space Separator 239
Uppercase Letter 652
Dash Punctuation 0
Decimal Number 4808

Fare

numerical

Approximate Distinct Count 248
Approximate Unique (%) 27.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 13.9 KB
Mean 32.2042
Minimum 0
Maximum 512.3292
Zeros 15
Zeros (%) 1.7%
Negatives 0
Negatives (%) 0.0%
  • Fare is skewed right (γ1 = 4.7793)

Quantile Statistics

Minimum 0
5-th Percentile 7.225
Q1 7.9104
Median 14.4542
Q3 31
95-th Percentile 112.0791
Maximum 512.3292
Range 512.3292
IQR 23.0896

Descriptive Statistics

Mean 32.2042
Standard Deviation 49.6934
Variance 2469.4368
Sum 28693.9493
Skewness 4.7793
Kurtosis 33.2043
Coefficient of Variation 1.5431
  • Fare is not normally distributed (p-value 5.925743764895219e-18)
  • Fare has 116 outliers

Cabin

categorical

Approximate Distinct Count 147
Approximate Unique (%) 72.1%
Missing 687
Missing (%) 77.1%
Memory Size 13.7 KB

Length

Mean 3.5882
Standard Deviation 2.0743
Median 3
Minimum 1
Maximum 15

Sample

1st row C85
2nd row C123
3rd row E46
4th row G6
5th row C103

Letter

Count 238
Lowercase Letter 0
Space Separator 34
Uppercase Letter 238
Dash Punctuation 0
Decimal Number 460
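The Approximate Unique (%) figures in this report are consistent with dividing the distinct count by the number of non-missing values rather than by the total number of rows. The sketch below checks this reading against three variables using only numbers from the report; it is an observation about these figures, not a statement of how DataPrep computes them.

```python
n_rows = 891

# (distinct count, missing count, reported unique %) from the report.
reported = {
    "Age": (88, 177, 12.3),
    "Ticket": (681, 0, 76.4),
    "Cabin": (147, 687, 72.1),
}

# Unique % relative to the rows that actually hold a value.
computed = {
    name: round(100 * distinct / (n_rows - missing), 1)
    for name, (distinct, missing, _) in reported.items()
}

for name, pct in computed.items():
    print(name, pct, "reported:", reported[name][2])
```

For Cabin the difference is large: 147 distinct values out of 204 non-missing rows gives 72.1%, whereas dividing by all 891 rows would give about 16.5%.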

Embarked

categorical

Approximate Distinct Count 3
Approximate Unique (%) 0.3%
Missing 2
Missing (%) 0.2%
Memory Size 57.3 KB
  • The largest value (S) is over 3.83 times larger than the second largest value (C)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row S
2nd row C
3rd row S
4th row S
5th row S

Letter

Count 889
Lowercase Letter 0
Space Separator 0
Uppercase Letter 889
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (S, C) take over 50.0%
  • The largest value (s) is over 3.83 times larger than the second largest value (c)
  • Embarked has words of constant length

Interactions

Correlations

Missing Values