EDA Case Study: House Price

Task Description

House Prices is a classic Kaggle competition. The task is to predict the final sale price of each house. For more detail, refer to https://www.kaggle.com/c/house-prices-advanced-regression-techniques/.

Goal of this notebook

Because this is a well-known competition, there are many excellent analyses of how to do EDA and build models for this task. See https://www.kaggle.com/khandelwallaksya/house-prices-eda for a reference. In this notebook, we will show how dataprep.eda can simplify the EDA process using a few lines of code.

The analysis follows these steps:

* Understand the problem. We'll look at each variable and consider its meaning and importance for this problem.
* Univariate study. We'll focus on the dependent variable ('SalePrice') and try to learn a little more about it.
* Multivariate study. We'll try to understand how the dependent variable and the independent variables relate.
* Basic cleaning. We'll clean the dataset and handle missing data, outliers and categorical variables.

Import libraries

[1]:
from dataprep.eda import plot
from dataprep.eda import plot_correlation
from dataprep.eda import plot_missing
from dataprep.datasets import load_dataset

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
# Set all seaborn options in one call; a second sns.set() call would reset style to its default.
sns.set(style="whitegrid", color_codes=True, font_scale=1)

Load data

[2]:
houses = load_dataset("house_prices_train")
houses.head()
[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

[3]:
houses_test = load_dataset("house_prices_test")
houses_test.head()
[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

[4]:
houses.shape
[4]:
(1460, 81)

There are 1460 rows in total. Each row has 81 columns: an Id, 79 features and the target value, SalePrice.

[5]:
houses_test.shape
[5]:
(1459, 80)
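
The two shapes differ by one column. One way to confirm which column the test set lacks is to compare the column sets. The sketch below is a minimal illustration: `extra_columns` is a hypothetical helper (not part of dataprep), and the small frames are stand-ins built from a few values shown in the previews above so that the block runs on its own.

```python
import pandas as pd

def extra_columns(train: pd.DataFrame, test: pd.DataFrame) -> list:
    """Return the columns present in `train` but absent from `test`, in train order."""
    return [col for col in train.columns if col not in test.columns]

# Stand-in frames with only three columns; with the real data, pass houses and houses_test.
train_demo = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
test_demo = pd.DataFrame({"Id": [1461, 1462], "LotArea": [11622, 14267]})

print(extra_columns(train_demo, test_demo))
```

If the test set only omits the target, `extra_columns(houses, houses_test)` would be expected to return just `SalePrice`.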

Variable identification

[6]:
plot(houses)
[6]:
DataPrep.EDA Report
Dataset Statistics
Number of Variables 81
Number of Rows 1460
Missing Cells 6965
Missing Cells (%) 5.9%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 3.9 MB
Average Row Size in Memory 2.7 KB
Variable Types
  • Numerical: 27
  • Categorical: 53
  • GeoGraphy: 1
Dataset Insights
Id is uniformly distributed Uniform
LowQualFinSF and 3SsnPorch have similar distributions Similar Distribution
LowQualFinSF and MiscVal have similar distributions Similar Distribution
3SsnPorch and MiscVal have similar distributions Similar Distribution
ScreenPorch and MiscVal have similar distributions Similar Distribution
LotFrontage has 259 (17.74%) missing values Missing
Alley has 1369 (93.77%) missing values Missing
BsmtQual has 37 (2.53%) missing values Missing
BsmtCond has 37 (2.53%) missing values Missing
BsmtExposure has 38 (2.6%) missing values Missing
BsmtFinType1 has 37 (2.53%) missing values Missing
BsmtFinType2 has 38 (2.6%) missing values Missing
FireplaceQu has 690 (47.26%) missing values Missing
GarageType has 81 (5.55%) missing values Missing
GarageYrBlt has 81 (5.55%) missing values Missing
GarageFinish has 81 (5.55%) missing values Missing
GarageQual has 81 (5.55%) missing values Missing
GarageCond has 81 (5.55%) missing values Missing
PoolQC has 1453 (99.52%) missing values Missing
Fence has 1179 (80.75%) missing values Missing
MiscFeature has 1406 (96.3%) missing values Missing
MSSubClass is skewed Skewed
LotFrontage is skewed Skewed
LotArea is skewed Skewed
OverallQual is skewed Skewed
YearBuilt is skewed Skewed
YearRemodAdd is skewed Skewed
MasVnrArea is skewed Skewed
BsmtFinSF1 is skewed Skewed
BsmtFinSF2 is skewed Skewed
TotalBsmtSF is skewed Skewed
2ndFlrSF is skewed Skewed
LowQualFinSF is skewed Skewed
TotRmsAbvGrd is skewed Skewed
WoodDeckSF is skewed Skewed
OpenPorchSF is skewed Skewed
EnclosedPorch is skewed Skewed
3SsnPorch is skewed Skewed
ScreenPorch is skewed Skewed
MiscVal is skewed Skewed
MoSold is skewed Skewed
Street has constant length 4 Constant Length
Alley has constant length 4 Constant Length
LotShape has constant length 3 Constant Length
LandContour has constant length 3 Constant Length
Utilities has constant length 6 Constant Length
LandSlope has constant length 3 Constant Length
OverallCond has constant length 1 Constant Length
ExterQual has constant length 2 Constant Length
ExterCond has constant length 2 Constant Length
BsmtQual has constant length 2 Constant Length
BsmtCond has constant length 2 Constant Length
BsmtExposure has constant length 2 Constant Length
BsmtFinType1 has constant length 3 Constant Length
BsmtFinType2 has constant length 3 Constant Length
HeatingQC has constant length 2 Constant Length
CentralAir has constant length 1 Constant Length
BsmtFullBath has constant length 1 Constant Length
BsmtHalfBath has constant length 1 Constant Length
FullBath has constant length 1 Constant Length
HalfBath has constant length 1 Constant Length
BedroomAbvGr has constant length 1 Constant Length
KitchenAbvGr has constant length 1 Constant Length
KitchenQual has constant length 2 Constant Length
Fireplaces has constant length 1 Constant Length
FireplaceQu has constant length 2 Constant Length
GarageFinish has constant length 3 Constant Length
GarageCars has constant length 1 Constant Length
GarageQual has constant length 2 Constant Length
GarageCond has constant length 2 Constant Length
PavedDrive has constant length 1 Constant Length
PoolQC has constant length 2 Constant Length
MiscFeature has constant length 4 Constant Length
YrSold has constant length 4 Constant Length
MasVnrArea has 861 (58.97%) zeros Zeros
BsmtFinSF1 has 467 (31.99%) zeros Zeros
BsmtFinSF2 has 1293 (88.56%) zeros Zeros
BsmtUnfSF has 118 (8.08%) zeros Zeros
2ndFlrSF has 829 (56.78%) zeros Zeros
LowQualFinSF has 1434 (98.22%) zeros Zeros
GarageArea has 81 (5.55%) zeros Zeros
WoodDeckSF has 761 (52.12%) zeros Zeros
OpenPorchSF has 656 (44.93%) zeros Zeros
EnclosedPorch has 1252 (85.75%) zeros Zeros
3SsnPorch has 1436 (98.36%) zeros Zeros
ScreenPorch has 1344 (92.05%) zeros Zeros
MiscVal has 1408 (96.44%) zeros Zeros
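
The "Missing" and "Zeros" insights in the report can be cross-checked with plain pandas. The sketch below is one possible way to do it, not how dataprep computes its insights: `missing_and_zero_summary` is a hypothetical helper, and the demo frame contains made-up values so that the block is self-contained.

```python
import numpy as np
import pandas as pd

def missing_and_zero_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column count and percentage of missing values and of exact zeros."""
    n_rows = len(df)
    missing = df.isna().sum()
    # Zeros are only counted in numeric columns; other columns get a count of 0.
    numeric = df.select_dtypes(include="number")
    zeros = (numeric == 0).sum().reindex(df.columns, fill_value=0)
    summary = pd.DataFrame({
        "missing": missing,
        "missing_pct": (100 * missing / n_rows).round(2),
        "zeros": zeros,
        "zeros_pct": (100 * zeros / n_rows).round(2),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("missing", ascending=False)

# Made-up demo data; with the real data, call missing_and_zero_summary(houses).
demo = pd.DataFrame({
    "Fence": ["MnPrv", np.nan, np.nan, np.nan],
    "MiscVal": [0, 12500, 0, 0],
})
print(missing_and_zero_summary(demo))
```

The percentages follow the same arithmetic as the report: 1179 missing values out of 1460 rows is 80.75%, which matches the Fence line.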

Overview of the data

From the report we can read off the following information:

* Variable: the name of each variable.
* Type: the report classifies 27 columns as numerical, 53 as categorical and 1 as geography. By pandas dtype, the split is 43 object columns and 38 numeric columns.
* Missing values: how many missing values each column contains. For instance, Fence is missing in 1179 of the 1460 rows (80.75%). Some models, such as SVMs, do not accept missing values as input, so the data must be cleaned before it is used.
* Target value: the distribution of the target value (SalePrice). The target is numerical, so this is a regression task and class imbalance is not a concern. Its distribution is unimodal and roughly bell-shaped.
* Guess: judging by the column names, GrLivArea, YearBuilt and OverallQual are likely to be correlated with the target value (SalePrice).
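
The type counts and the shape of the target can also be inspected directly in pandas. The sketch below is a rough cross-check under stated assumptions: `dtype_split` is a hypothetical helper, and the demo frame holds a few values taken from the preview above. Note that a pandas dtype is only a proxy for "numerical" versus "categorical". The report appears to treat some integer-coded columns (for example OverallCond, which shows up under the "Constant Length" insights) as categorical, which is presumably why its counts differ from a plain dtype count.

```python
import pandas as pd

def dtype_split(df: pd.DataFrame) -> dict:
    """Count numeric columns versus all other columns by pandas dtype."""
    n_numeric = df.select_dtypes(include="number").shape[1]
    return {"numeric": n_numeric, "non_numeric": df.shape[1] - n_numeric}

# Three columns from the first five rows; with the real data, pass houses.
demo = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RL", "RL", "RL"],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

print(dtype_split(demo))
# Sample skewness of the target: values near 0 suggest symmetry,
# positive values a longer right tail.
print(demo["SalePrice"].skew())
```

Five rows are far too few to judge a distribution; on the full data, `houses["SalePrice"].skew()` gives a quick numeric check of how symmetric the target is.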

Correlation in data

[7]:
plot_correlation(houses, "SalePrice")
[7]:
DataPrep.EDA Report
[8]:
plot_correlation(houses, "SalePrice", value_range=[0.5, 1])
[8]:
DataPrep.EDA Report
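
To make the effect of the filter concrete, here is a rough pandas equivalent. It assumes that `value_range=[0.5, 1]` keeps only the coefficients that fall inside that interval, and it uses Pearson correlation as an assumption. `correlated_with` is a hypothetical helper and the demo data is made up; this is a sketch of the idea, not dataprep's implementation.

```python
import pandas as pd

def correlated_with(df: pd.DataFrame, target: str,
                    low: float = 0.5, high: float = 1.0,
                    method: str = "pearson") -> pd.Series:
    """Correlation of each numeric column with `target`, kept only if in [low, high]."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr(method=method)[target].drop(target)
    return corr[(corr >= low) & (corr <= high)].sort_values(ascending=False)

# Made-up data: one strongly positive, one weak, one strongly negative column.
demo = pd.DataFrame({
    "strong": [1, 2, 3, 4, 6],
    "weak": [3, 1, 4, 1, 5],
    "negative": [5, 4, 3, 2, 0],
    "target": [10, 20, 30, 40, 50],
})

print(correlated_with(demo, "target"))
```

With the real data, `correlated_with(houses, "SalePrice")` would list the numeric columns whose correlation with the target lies between 0.5 and 1. Strongly negative correlations are excluded by this range, so a range such as `[-1, -0.5]` would be needed to see them.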