EDA Case Study: Titanic

1. Task Description

Titanic is a classic Kaggle competition. The task is to predict which passengers survived the Titanic shipwreck. For more details, refer to https://www.kaggle.com/c/titanic/overview/description.

2. Goal of this notebook

As it is a famous competition, there are many excellent analyses of how to do EDA and how to build models for this task; see https://www.kaggle.com/startupsci/titanic-data-science-solutions for a reference. In this notebook, we show how dataprep.eda can simplify the EDA process to a few lines of code.

3. Load data

[1]:
from dataprep.eda import *
from dataprep.datasets import load_dataset
train_df = load_dataset("titanic")
train_df
[1]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

4. Glimpse of the data

The first thing we need to do is to roughly understand the data: how many columns are available, which columns are categorical, which columns are numerical, and which columns contain missing values. In dataprep.eda, all of these questions can be answered with just one line of code!

[2]:
plot(train_df)
[2]:
DataPrep.EDA Report
Dataset Statistics
Number of Variables 12
Number of Rows 891
Missing Cells 866
Missing Cells (%) 8.1%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 315.0 KB
Average Row Size in Memory 362.1 B
Variable Types
  • Numerical: 3
  • Categorical: 9
Dataset Insights
PassengerId is uniformly distributed Uniform
Age has 177 (19.87%) missing values Missing
Cabin has 687 (77.1%) missing values Missing
Fare is skewed Skewed
Name has a high cardinality: 891 distinct values High Cardinality
Ticket has a high cardinality: 681 distinct values High Cardinality
Cabin has a high cardinality: 147 distinct values High Cardinality
Survived has constant length 1 Constant Length
Pclass has constant length 1 Constant Length
SibSp has constant length 1 Constant Length
Parch has constant length 1 Constant Length
Embarked has constant length 1 Constant Length
Name has all distinct values Unique

plot(df) shows the distribution of each column. For a categorical column, it shows a bar chart (in blue); for a numerical column, it shows a histogram (in gray). Currently, the column type (i.e., categorical or numerical) is inferred from the column dtype in the input dataframe. Hence, if a column type is wrongly identified, you can change its dtype in the dataframe. For example, calling df[col] = df[col].astype("object") makes col be treated as a categorical column.
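For example, a minimal sketch (the SibSp column is used here purely for illustration, on a copy of the data, so nothing in the rest of this notebook changes):

[ ]:
# Hypothetical sketch: force SibSp to be treated as a categorical column by plot().
df2 = train_df.copy()
df2["SibSp"] = df2["SibSp"].astype("object")
df2.dtypes  # SibSp now shows as object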

From the output of plot(df), we know:

1. All columns: there is 1 label column, Survived, and 11 feature columns: PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
2. Categorical columns: Survived, PassengerId, Pclass, Name, Sex, Ticket, Cabin, Embarked.
3. Numerical columns: Age, SibSp, Parch, Fare.
4. Missing values: from the figure titles, we can see that 3 columns have missing values: Age (19.9%), Cabin (77.1%), and Embarked (0.2%).
5. Label balance: from the distribution of Survived, we see that the positive and negative training examples are not very balanced; about 38% of the rows have Survived = 1.
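As a quick cross-check of the label balance with plain pandas (a minimal sketch; the same information is already in the dataprep.eda report):

[ ]:
# Fraction of rows per label: roughly 62% with Survived = 0 and 38% with Survived = 1.
train_df["Survived"].value_counts(normalize=True)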

5. Identify useful features

Now that we roughly understand the data, we next want to understand how each feature correlates with the label column.

5.1 Age, Cabin, and Embarked: features with missing values

We first take a look at the features with missing values: Age, Cabin, and Embarked. To understand the missing values, we first call plot_missing(df) to see whether they have any underlying pattern.

[3]:
plot_missing(train_df)
[3]:
DataPrep.EDA Report

Missing Statistics

Missing Cells 866
Missing Cells (%) 8.1%
Missing Columns 3
Missing Rows 708
Avg Missing Cells per Column 72.17
Avg Missing Cells per Row 0.97

plot_missing(df) shows how missing values are distributed in the input data. From the output, we know that the missing values are distributed roughly uniformly across records and there is no underlying pattern. Next, we decide how to handle the missing values: should we remove the feature, remove the rows that contain missing values, or fill in the missing values? We first analyze whether these features are correlated with Survived; if they are, we may not want to remove them. We analyze the correlation between two columns by calling plot(df, x, y).
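Before doing so, we can quickly cross-check the per-column missing counts with plain pandas (a minimal sketch; the dataprep.eda report above already summarizes this):

[ ]:
# Count and percentage of missing cells per column, restricted to columns that have any.
missing = train_df.isna().sum()
missing = missing[missing > 0]
print(missing)
print((missing / len(train_df) * 100).round(1))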

[4]:
for feature in ['Age', 'Cabin', 'Embarked']:
    plot(train_df, feature, 'Survived').show()
[DataPrep.EDA reports: a box plot and a line chart of Age grouped by Survived, followed by nested bar charts, stacked bar charts, and heatmaps of Cabin vs. Survived and Embarked vs. Survived]

From the output, we find that:

1. The Age feature is correlated with Survived: younger passengers were more likely to survive.
2. The Embarked feature is correlated with Survived: passengers with Embarked = C were more likely to survive.
3. The correlation of Cabin with Survived is unclear, since Cabin contains many missing values (77.1%) and many distinct values (147), so each distinct value covers only a few useful rows.

Hence, we can safely remove the Cabin column. For the Age and Embarked features, we should fill in the missing values.

Result: keep Age and Embarked and fill in their missing values; remove Cabin.
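One way to implement this decision is sketched below. The concrete imputation strategy (median age and most frequent embarkation port) is our own assumption here, not something prescribed by the analysis above.

[ ]:
# Assumed strategy: drop Cabin, fill Age with the median and Embarked with the most frequent value.
cleaned_df = train_df.drop(columns=["Cabin"])
cleaned_df["Age"] = cleaned_df["Age"].fillna(cleaned_df["Age"].median())
cleaned_df["Embarked"] = cleaned_df["Embarked"].fillna(cleaned_df["Embarked"].mode()[0])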

5.3 PassengerId, Name, SibSp, Parch, Ticket: the remaining features

We now analyze the remaining features.

PassengerId: it is just an ID for each passenger, so we can drop it.

Name: every value is distinct (891 distinct values) and it does not look correlated with the survival rate, so we may drop it.

Ticket: it contains many duplicate values and does not look correlated with the survival rate, so we may drop it.

SibSp: it is unclear whether it is correlated with Survived.

Parch: it is unclear whether it is correlated with Survived.

Hence, we check whether SibSp and Parch are correlated with Survived.

[6]:
for feature in ['SibSp', 'Parch']:
    plot(train_df, feature, 'Survived').show()
[DataPrep.EDA reports: nested bar charts, stacked bar charts, and heatmaps of SibSp vs. Survived and Parch vs. Survived]

From the output, we find that the distributions of SibSp and Parch differ between Survived = 0 and Survived = 1; hence they are correlated with the label, and we keep them as features.
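As a numeric cross-check (a small pandas sketch, not part of the report above), we can compare the mean survival rate across the values of each feature:

[ ]:
# Mean survival rate per SibSp / Parch value; differing rates indicate correlation with the label.
for feature in ["SibSp", "Parch"]:
    print(train_df.groupby(feature)["Survived"].mean())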

Result: keep SibSp and Parch; remove PassengerId, Name and Ticket

5.4 Result

After this processing, we are left with the following features that are useful for predicting Survived: Age, Embarked, Pclass, Sex, Fare, SibSp, and Parch.
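A minimal sketch of selecting these columns (the variable name feature_df is hypothetical, and categorical features such as Sex and Embarked would still need encoding before model training):

[ ]:
# Keep only the label and the selected feature columns.
selected = ["Survived", "Age", "Embarked", "Pclass", "Sex", "Fare", "SibSp", "Parch"]
feature_df = train_df[selected]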

6. Identify Correlated Features

So far, we have identified the useful features one by one and removed the useless ones. Although each remaining feature is useful for predicting Survived, when we consider them together we may not want correlated features. Hence, we first identify correlated features. This can be done by simply calling plot_correlation(df).

[7]:
plot_correlation(train_df)
[7]:
DataPrep.EDA Report
                              Pearson   Spearman   KendallTau
Highest Positive Correlation  0.415     0.45       0.425
Highest Negative Correlation  -0.549    -0.688     -0.574
Lowest Correlation            0.002     0.001      0.001
Mean Correlation              -0.024    -0.001     0.0

For all three correlation matrices (Pearson, Spearman, and KendallTau):
  • Most positive correlated: (SibSp, Parch)
  • Most negative correlated: (Pclass, Fare)
  • Least correlated: (PassengerId, Parch)

From the output, we know that:

1. The most correlated columns are Parch and SibSp, with a Pearson correlation of 0.415.
2. No two columns are highly correlated, so we do not need to worry much about correlated features. However, since Parch and SibSp are slightly correlated both numerically and semantically, we may want to construct a new feature named Family from Parch and SibSp, which counts the total number of family members for each passenger.

Result: Construct a new feature Family based on Parch and SibSp.
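A minimal sketch of this construction (assuming Family is simply SibSp + Parch; some analyses also add 1 to count the passenger themselves):

[ ]:
# Total number of family members aboard for each passenger.
train_df["Family"] = train_df["SibSp"] + train_df["Parch"]
train_df[["SibSp", "Parch", "Family"]].head()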