House Prices is a classic Kaggle competition. The task is to predict the final price of each house. For more details, refer to https://www.kaggle.com/c/house-prices-advanced-regression-techniques/.
As it is a famous competition, there are lots of excellent analyses of how to do EDA and how to build models for this task. See https://www.kaggle.com/khandelwallaksya/house-prices-eda for a reference. In this notebook, we will show how dataprep.eda can simplify the EDA process using a few lines of code.
In summary, we will:

* Understand the problem. We’ll look at each variable and do a philosophical analysis of its meaning and importance for this problem.
* Univariable study. We’ll just focus on the dependent variable (‘SalePrice’) and try to know a little bit more about it.
* Multivariate study. We’ll try to understand how the dependent variable and independent variables relate.
* Basic cleaning. We’ll clean the dataset and handle the missing data, outliers and categorical variables.
[1]:
from dataprep.eda import plot
from dataprep.eda import plot_correlation
from dataprep.eda import plot_missing
from dataprep.datasets import load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid", color_codes=True)
sns.set(font_scale=1)
[2]:
houses = load_dataset("house_prices_train")
houses.head()
5 rows × 81 columns
[3]:
houses_test = load_dataset("house_prices_test")
houses_test.head()
5 rows × 80 columns
[4]:
houses.shape
(1460, 81)
There are 1460 tuples in total; each tuple contains 80 features and 1 target value.
[5]:
houses_test.shape
(1459, 80)
[6]:
plot(houses)
We can get the following information:

* Variable: the variable name.
* Type: there are 43 categorical columns and 38 numerical columns.
* Missing values: how many missing values each column contains. For instance, Fence contains 80.8% * 1460 = 1180 missing tuples. Some models, such as SVM, cannot handle input data that contains missing values, so we have to clean the data before we use it.
* Target value: the distribution of the target value (SalePrice). From it we can see that the target value is numerical and its distribution is close to normal, so we are not confronted with an imbalanced-classes problem, which is really great.
* Guess: judging from the column names, we reckon GrLivArea, YearBuilt and OverallQual are likely to be correlated with the target value (SalePrice).
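These numbers are easy to cross-check with plain pandas. The sketch below (assuming houses is the training DataFrame loaded above, and using the pandas dtypes as a rough proxy for categorical vs. numerical) reproduces the column-type counts and the per-column missing ratios:

# Approximate the categorical/numerical split from the pandas dtypes.
n_categorical = houses.select_dtypes(include="object").shape[1]
n_numerical = houses.select_dtypes(include="number").shape[1]
print(n_categorical, "categorical columns,", n_numerical, "numerical columns")

# Missing-value ratio per column; Fence should be missing in roughly 80.8% of rows.
missing_ratio = houses.isna().mean().sort_values(ascending=False)
print(missing_ratio.head(10))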
[7]:
plot_correlation(houses, "SalePrice")
[8]:
plot_correlation(houses, "SalePrice", value_range=[0.5, 1])
OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd have more than 0.5 Pearson correlation with SalePrice.
OverallQual, GrLivArea, GarageCars, YearBuilt, GarageArea, FullBath, TotalBsmtSF, GarageYrBlt, 1stFlrSF, YearRemodAdd, TotRmsAbvGrd and Fireplaces have more than 0.5 Spearman correlation with SalePrice.
OverallQual, GarageCars, GrLivArea and FullBath have more than 0.5 KendallTau correlation with SalePrice.
EnclosedPorch and KitchenAbvGr have a slight negative correlation with the target variable.
These can prove to be important features to predict SalePrice.
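If you want to reproduce these lists without dataprep, a minimal pandas sketch (restricted to the numeric columns and using the same 0.5 cutoff) looks like this:

numeric = houses.select_dtypes(include="number")
for method in ("pearson", "spearman", "kendall"):
    corr = numeric.corr(method=method)["SalePrice"].drop("SalePrice")
    strong = corr[corr > 0.5].sort_values(ascending=False)
    print(method, list(strong.index))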
[9]:
plot_correlation(houses)
In my opinion, this heatmap is the best way to get a quick overview of features’ relationships.
At first sight, there are two red colored squares that get my attention. The first one refers to the ‘TotalBsmtSF’ and ‘1stFlrSF’ variables, and the second one refers to the ‘GarageX’ variables. Both cases show how significant the correlation is between these variables. Actually, this correlation is so strong that it can indicate a situation of multicollinearity. If we think about these variables, we can conclude that they give almost the same information, so multicollinearity really occurs. Heatmaps are great for detecting this kind of situation, and in problems dominated by feature selection, like ours, they are an essential tool.
Another thing that got my attention was the ‘SalePrice’ correlations. We can see our well-known ‘GrLivArea’, ‘TotalBsmtSF’, and ‘OverallQual’, but we can also see many other variables that should be taken into account. That’s what we will do next.
[10]:
plot_correlation(houses[["SalePrice","OverallQual","GrLivArea","GarageCars", "GarageArea","GarageYrBlt","TotalBsmtSF","1stFlrSF","FullBath", "TotRmsAbvGrd","YearBuilt","YearRemodAdd"]])
As we saw above, the heatmap shows a few features with high multicollinearity. Let's focus on the red squares along the diagonal and a few off to the sides:
SalePrice and OverallQual
GarageArea and GarageCars
TotalBsmtSF and 1stFlrSF
GrLivArea and TotRmsAbvGrd
YearBuilt and GarageYrBlt
We have to create a single feature from them before we use them as predictors.
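As a rough illustration (the combined column name TotalSF and the choice of which column to drop are assumptions for this sketch, not part of the original analysis), one way to collapse such a pair into a single predictor is:

houses_fe = houses.copy()

# Merge the strongly correlated area columns into one size feature.
houses_fe["TotalSF"] = houses_fe["TotalBsmtSF"] + houses_fe["1stFlrSF"]

# For GarageArea/GarageCars, keep only one column of the pair.
houses_fe = houses_fe.drop(columns=["TotalBsmtSF", "1stFlrSF", "GarageArea"])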
[11]:
plot_correlation(houses, value_range=[0.5, 1])
[12]:
plot_correlation(houses, k=30)
Attribute Pair Correlation
7 (GarageArea, GarageCars) 0.882475
11 (GarageYrBlt, YearBuilt) 0.825667
15 (GrLivArea, TotRmsAbvGrd) 0.825489
18 (1stFlrSF, TotalBsmtSF) 0.819530
19 (2ndFlrSF, GrLivArea) 0.687501
9 (BedroomAbvGr, TotRmsAbvGrd) 0.676620
0 (BsmtFinSF1, BsmtFullBath) 0.649212
2 (GarageYrBlt, YearRemodAdd) 0.642277
24 (FullBath, GrLivArea) 0.630012
8 (2ndFlrSF, TotRmsAbvGrd) 0.616423
1 (2ndFlrSF, HalfBath) 0.609707
4 (GarageCars, OverallQual) 0.600671
16 (GrLivArea, OverallQual) 0.593007
23 (YearBuilt, YearRemodAdd) 0.592855
22 (GarageCars, GarageYrBlt) 0.588920
12 (OverallQual, YearBuilt) 0.572323
5 (1stFlrSF, GrLivArea) 0.566024
25 (GarageArea, GarageYrBlt) 0.564567
6 (GarageArea, OverallQual) 0.562022
17 (FullBath, TotRmsAbvGrd) 0.554784
13 (OverallQual, YearRemodAdd) 0.550684
14 (FullBath, OverallQual) 0.550600
3 (GarageYrBlt, OverallQual) 0.547766
10 (GarageCars, YearBuilt) 0.537850
27 (OverallQual, TotalBsmtSF) 0.537808
20 (BsmtFinSF1, TotalBsmtSF) 0.522396
21 (BedroomAbvGr, GrLivArea) 0.521270
26 (2ndFlrSF, BedroomAbvGr) 0.502901
This shows multicollinearity. In regression, "multicollinearity" refers to features that are correlated with other features. Multicollinearity occurs when your model includes multiple features that are correlated not just with your target variable, but also with each other.
Problem:
Multicollinearity increases the standard errors of the coefficients. That means multicollinearity can make some variables statistically insignificant when they should be significant.
To avoid this we can do 3 things:
* Completely remove those variables.
* Make a new feature by adding them or by some other operation.
* Use PCA, which will reduce the feature set to a small number of non-collinear features.

Reference: http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis
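As a sketch of the PCA option (assuming scikit-learn is installed; the column list and the 95% variance target are arbitrary choices for illustration):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

collinear_cols = ["GarageArea", "GarageCars", "TotalBsmtSF", "1stFlrSF",
                  "GrLivArea", "TotRmsAbvGrd", "YearBuilt", "GarageYrBlt"]
X = houses[collinear_cols].fillna(0)

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain about 95% of the variance.
pca = PCA(n_components=0.95)
components = pca.fit_transform(X_scaled)
print(components.shape, pca.explained_variance_ratio_)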
How is a single variable distributed over its numeric range? What is its statistical summary? Is it positively or negatively skewed?
[13]:
plot(houses, "SalePrice")
[14]:
plot_correlation(houses, "OverallQual", "SalePrice")
[15]:
plot(houses, "OverallQual", "SalePrice")
[16]:
plot(houses, "GarageCars", "SalePrice")
[17]:
plot(houses, "Fireplaces", "SalePrice")
[18]:
plot(houses, "GrLivArea", "SalePrice")
[19]:
plot(houses, "TotalBsmtSF", "SalePrice")
[20]:
plot(houses, "YearBuilt", "SalePrice")
Based on the above analysis, we can conclude that:
* ‘GrLivArea’ and ‘TotalBsmtSF’ seem to be linearly related with ‘SalePrice’. Both relationships are positive, which means that as one variable increases, the other also increases. In the case of ‘TotalBsmtSF’, we can see that the slope of the linear relationship is particularly high.
* ‘OverallQual’ and ‘YearBuilt’ also seem to be related with ‘SalePrice’. The relationship seems to be stronger in the case of ‘OverallQual’, where the box plot shows how sale prices increase with the overall quality.

We just analysed four variables, but there are many others that we should analyse. The trick here seems to be the choice of the right features (feature selection) and not the definition of complex relationships between them (feature engineering).
That said, let’s separate the wheat from the chaff.
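One simple (hypothetical) way to do that is to keep only the numeric features most correlated with SalePrice, for example:

numeric = houses.select_dtypes(include="number")
corr_with_target = numeric.corr()["SalePrice"].drop("SalePrice").abs()

# Keep, say, the ten features with the highest absolute Pearson correlation;
# the cutoff of ten is an arbitrary choice.
selected = corr_with_target.nlargest(10).index.tolist()
print(selected)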
Missing values in the training data set can negatively affect a model's prediction or classification performance.
Also, some machine learning algorithms cannot accept missing data, e.g. SVM and neural networks.
Filling missing values with the mean/median/mode, or using another predictive model to impute them, is itself a prediction that may not be 100% accurate. Alternatively, you can use models like decision trees and random forests, which handle missing values very well.
Some of this part is based on this kernel: https://www.kaggle.com/bisaria/house-prices-advanced-regression-techniques/handling-missing-data
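For reference, a rough sketch of the simple imputation strategies mentioned above (median for a numeric column, mode for a categorical one; LotFrontage and Electrical are just example columns with missing values, not part of the cleaning done below):

houses_imp = houses.copy()

# Numeric example: fill LotFrontage with the column median.
houses_imp["LotFrontage"] = houses_imp["LotFrontage"].fillna(
    houses_imp["LotFrontage"].median())

# Categorical example: fill Electrical with its most frequent value (mode).
houses_imp["Electrical"] = houses_imp["Electrical"].fillna(
    houses_imp["Electrical"].mode()[0])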
[21]:
plot_missing(houses)
[22]:
# plot_missing(houses, "BsmtQual") basement_cols=['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','BsmtFinSF1','BsmtFinSF2'] houses[basement_cols][houses['BsmtQual'].isnull()==True]
All categorical variables contain NaN whereas the continuous ones have 0, which means there is no basement for those houses. We can replace the NaN values with ‘None’.
[23]:
for col in basement_cols:
    if 'FinSF' not in col:
        houses[col] = houses[col].fillna('None')
[24]:
# plot_missing(houses, "FireplaceQu") houses["FireplaceQu"] = houses["FireplaceQu"].fillna('None') pd.crosstab(houses.Fireplaces, houses.FireplaceQu)
[25]:
garage_cols = ['GarageType','GarageQual','GarageCond','GarageYrBlt','GarageFinish','GarageCars','GarageArea']
houses[garage_cols][houses['GarageType'].isnull()==True]
81 rows × 7 columns
All garage-related features are missing values in the same rows, which means we can replace the categorical variables with ‘None’ and the continuous ones with 0.
[26]:
# Object-dtype columns are categorical here: fill them with 'None'; fill the numeric ones with 0.
for col in garage_cols:
    if houses[col].dtype == object:
        houses[col] = houses[col].fillna('None')
    else:
        houses[col] = houses[col].fillna(0)