House Prices is a classic Kaggle competition. The task is to predict the final price of each house. For more details, refer to https://www.kaggle.com/c/house-prices-advanced-regression-techniques/.
As it is a famous competition, there are lots of excellent analyses of how to do EDA and how to build models for this task. See https://www.kaggle.com/khandelwallaksya/house-prices-eda for a reference. In this notebook, we will show how dataprep.eda can simplify the EDA process using a few lines of code.
In summary, we will:

* Understand the problem. We’ll look at each variable and do a philosophical analysis of its meaning and importance for this problem.
* Univariable study. We’ll just focus on the dependent variable (‘SalePrice’) and try to know a little bit more about it.
* Multivariate study. We’ll try to understand how the dependent variable and independent variables relate.
* Basic cleaning. We’ll clean the dataset and handle the missing data, outliers and categorical variables.
[1]:
from dataprep.eda import plot
from dataprep.eda import plot_correlation
from dataprep.eda import plot_missing
from dataprep.datasets import load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid", color_codes=True)
sns.set(font_scale=1)
[2]:
houses = load_dataset("house_prices_train")
houses.head()
5 rows × 81 columns
[3]:
houses_test = load_dataset("house_prices_test")
houses_test.head()
5 rows × 80 columns
[4]:
houses.shape
(1460, 81)
There are 1460 tuples in total; each tuple contains 80 features and 1 target value.
[5]:
houses_test.shape
(1459, 80)
[6]:
plot(houses)
We can get the following information:

* Variable: the variable name.
* Type: there are 43 categorical columns and 38 numerical columns.
* Missing values: how many missing values each column contains. For instance, Fence contains 80.8% * 1460 = 1180 missing tuples. Some models, such as SVM, cannot handle input data that contains missing values, so we have to clean the data before we use it.
* Target value: the distribution of the target value (SalePrice). From it we can see that the target value is numerical and its distribution is close to normal, so we are not confronted with an imbalanced-classes problem, which is really great.
* Guess: judging from the column names, we reckon GrLivArea, YearBuilt and OverallQual are likely to be correlated with the target value (SalePrice).
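These numbers are easy to cross-check with plain pandas. The sketch below (assuming houses is the training DataFrame loaded above, and using the pandas dtypes as a rough proxy for categorical vs. numerical) reproduces the column-type counts and the per-column missing ratios:

# Approximate the categorical/numerical split from the pandas dtypes.
n_categorical = houses.select_dtypes(include="object").shape[1]
n_numerical = houses.select_dtypes(include="number").shape[1]
print(n_categorical, "categorical columns,", n_numerical, "numerical columns")

# Missing-value ratio per column; Fence should be missing in roughly 80.8% of rows.
missing_ratio = houses.isna().mean().sort_values(ascending=False)
print(missing_ratio.head(10))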
[7]:
plot_correlation(houses, "SalePrice")
[8]:
plot_correlation(houses, "SalePrice", value_range=[0.5, 1])
OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd have more than 0.5 Pearson correlation with SalePrice.
OverallQual, GrLivArea, GarageCars, YearBuilt, GarageArea, FullBath, TotalBsmtSF, GarageYrBlt, 1stFlrSF, YearRemodAdd, TotRmsAbvGrd and Fireplaces have more than 0.5 Spearman correlation with SalePrice.
OverallQual, GarageCars, GrLivArea and FullBath have more than 0.5 KendallTau correlation with SalePrice.
EnclosedPorch and KitchenAbvGr have a slight negative correlation with the target variable.
These can prove to be important features to predict SalePrice.
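If you want to reproduce these lists without dataprep, a minimal pandas sketch (restricted to the numeric columns and using the same 0.5 cutoff) looks like this:

numeric = houses.select_dtypes(include="number")
for method in ("pearson", "spearman", "kendall"):
    corr = numeric.corr(method=method)["SalePrice"].drop("SalePrice")
    strong = corr[corr > 0.5].sort_values(ascending=False)
    print(method, list(strong.index))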
[9]:
plot_correlation(houses)
In my opinion, this heatmap is the best way to get a quick overview of features’ relationships.
At first sight, there are two red colored squares that get my attention. The first one refers to the ‘TotalBsmtSF’ and ‘1stFlrSF’ variables, and the second one refers to the ‘GarageX’ variables. Both cases show how significant the correlation is between these variables. Actually, this correlation is so strong that it can indicate a situation of multicollinearity. If we think about these variables, we can conclude that they give almost the same information, so multicollinearity really occurs. Heatmaps are great for detecting this kind of situation, and in problems dominated by feature selection, like ours, they are an essential tool.
Another thing that got my attention was the ‘SalePrice’ correlations. We can see our well-known ‘GrLivArea’, ‘TotalBsmtSF’, and ‘OverallQual’, but we can also see many other variables that should be taken into account. That’s what we will do next.
[10]:
plot_correlation(houses[["SalePrice","OverallQual","GrLivArea","GarageCars", "GarageArea","GarageYrBlt","TotalBsmtSF","1stFlrSF","FullBath", "TotRmsAbvGrd","YearBuilt","YearRemodAdd"]])
As we saw above, the heatmap shows a few features with high multicollinearity. Let's focus on the red squares along the diagonal and a few off to the sides:
SalePrice and OverallQual
GarageArea and GarageCars
TotalBsmtSF and 1stFlrSF
GrLivArea and TotRmsAbvGrd
YearBuilt and GarageYrBlt
We have to create a single feature from them before we use them as predictors.
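As a rough illustration (the combined column name TotalSF and the choice of which column to drop are assumptions for this sketch, not part of the original analysis), one way to collapse such a pair into a single predictor is:

houses_fe = houses.copy()

# Merge the strongly correlated area columns into one size feature.
houses_fe["TotalSF"] = houses_fe["TotalBsmtSF"] + houses_fe["1stFlrSF"]

# For GarageArea/GarageCars, keep only one column of the pair.
houses_fe = houses_fe.drop(columns=["TotalBsmtSF", "1stFlrSF", "GarageArea"])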
[11]:
plot_correlation(houses, value_range=[0.5, 1])
[12]:
plot_correlation(houses, k=30)
Attribute Pair Correlation
7 (GarageArea, GarageCars) 0.882475
11 (GarageYrBlt, YearBuilt) 0.825667
15 (GrLivArea, TotRmsAbvGrd) 0.825489
18 (1stFlrSF, TotalBsmtSF) 0.819530
19 (2ndFlrSF, GrLivArea) 0.687501
9 (BedroomAbvGr, TotRmsAbvGrd) 0.676620
0 (BsmtFinSF1, BsmtFullBath) 0.649212
2 (GarageYrBlt, YearRemodAdd) 0.642277
24 (FullBath, GrLivArea) 0.630012
8 (2ndFlrSF, TotRmsAbvGrd) 0.616423
1 (2ndFlrSF, HalfBath) 0.609707
4 (GarageCars, OverallQual) 0.600671
16 (GrLivArea, OverallQual) 0.593007
23 (YearBuilt, YearRemodAdd) 0.592855
22 (GarageCars, GarageYrBlt) 0.588920
12 (OverallQual, YearBuilt) 0.572323
5 (1stFlrSF, GrLivArea) 0.566024
25 (GarageArea, GarageYrBlt) 0.564567
6 (GarageArea, OverallQual) 0.562022
17 (FullBath, TotRmsAbvGrd) 0.554784
13 (OverallQual, YearRemodAdd) 0.550684
14 (FullBath, OverallQual) 0.550600
3 (GarageYrBlt, OverallQual) 0.547766
10 (GarageCars, YearBuilt) 0.537850
27 (OverallQual, TotalBsmtSF) 0.537808
20 (BsmtFinSF1, TotalBsmtSF) 0.522396
21 (BedroomAbvGr, GrLivArea) 0.521270
26 (2ndFlrSF, BedroomAbvGr) 0.502901
This shows multicollinearity. In regression, "multicollinearity" refers to features that are correlated with other features. Multicollinearity occurs when your model includes multiple features that are correlated not just with your target variable, but also with each other.
Problem:
Multicollinearity increases the standard errors of the coefficients. That means multicollinearity can make some variables statistically insignificant when they should be significant.
To avoid this we can do 3 things:
* Completely remove those variables.
* Make a new feature by adding them or by some other operation.
* Use PCA, which will reduce the feature set to a small number of non-collinear features.

Reference: http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis
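As a sketch of the PCA option (assuming scikit-learn is installed; the column list and the 95% variance target are arbitrary choices for illustration):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

collinear_cols = ["GarageArea", "GarageCars", "TotalBsmtSF", "1stFlrSF",
                  "GrLivArea", "TotRmsAbvGrd", "YearBuilt", "GarageYrBlt"]
X = houses[collinear_cols].fillna(0)

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain about 95% of the variance.
pca = PCA(n_components=0.95)
components = pca.fit_transform(X_scaled)
print(components.shape, pca.explained_variance_ratio_)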
How is a single variable distributed over its numeric range? What is its statistical summary? Is it positively or negatively skewed?
[13]:
plot(houses, "SalePrice")
[14]:
plot_correlation(houses, "OverallQual", "SalePrice")
[15]:
plot(houses, "OverallQual", "SalePrice")
[16]:
plot(houses, "GarageCars", "SalePrice")
[17]:
plot(houses, "Fireplaces", "SalePrice")
[18]:
plot(houses, "GrLivArea", "SalePrice")
[19]:
plot(houses, "TotalBsmtSF", "SalePrice")
[20]:
plot(houses, "YearBuilt", "SalePrice")
Based on the above analysis, we can conclude that:
* ‘GrLivArea’ and ‘TotalBsmtSF’ seem to be linearly related with ‘SalePrice’. Both relationships are positive, which means that as one variable increases, the other also increases. In the case of ‘TotalBsmtSF’, we can see that the slope of the linear relationship is particularly high.
* ‘OverallQual’ and ‘YearBuilt’ also seem to be related with ‘SalePrice’. The relationship seems to be stronger in the case of ‘OverallQual’, where the box plot shows how sale prices increase with the overall quality.

We just analysed four variables, but there are many others that we should analyse. The trick here seems to be the choice of the right features (feature selection) and not the definition of complex relationships between them (feature engineering).
That said, let’s separate the wheat from the chaff.
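One simple (hypothetical) way to do that is to keep only the numeric features most correlated with SalePrice, for example:

numeric = houses.select_dtypes(include="number")
corr_with_target = numeric.corr()["SalePrice"].drop("SalePrice").abs()

# Keep, say, the ten features with the highest absolute Pearson correlation;
# the cutoff of ten is an arbitrary choice.
selected = corr_with_target.nlargest(10).index.tolist()
print(selected)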
Missing values in the training data set can negatively affect a model's prediction or classification performance.
Also, some machine learning algorithms cannot accept missing data, e.g. SVM and neural networks.
Filling missing values with the mean/median/mode, or using another predictive model to impute them, is itself a prediction that may not be 100% accurate. Alternatively, you can use models like decision trees and random forests, which handle missing values very well.
Some of this part is based on this kernel: https://www.kaggle.com/bisaria/house-prices-advanced-regression-techniques/handling-missing-data
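For reference, a rough sketch of the simple imputation strategies mentioned above (median for a numeric column, mode for a categorical one; LotFrontage and Electrical are just example columns with missing values, not part of the cleaning done below):

houses_imp = houses.copy()

# Numeric example: fill LotFrontage with the column median.
houses_imp["LotFrontage"] = houses_imp["LotFrontage"].fillna(
    houses_imp["LotFrontage"].median())

# Categorical example: fill Electrical with its most frequent value (mode).
houses_imp["Electrical"] = houses_imp["Electrical"].fillna(
    houses_imp["Electrical"].mode()[0])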
[21]:
plot_missing(houses)
[22]:
# plot_missing(houses, "BsmtQual") basement_cols=['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','BsmtFinSF1','BsmtFinSF2'] houses[basement_cols][houses['BsmtQual'].isnull()==True]
All categorical variables contain NaN whereas the continuous ones have 0, which means there is no basement for those houses. We can replace the NaN values with ‘None’.
[23]:
for col in basement_cols:
    if 'FinSF' not in col:
        houses[col] = houses[col].fillna('None')
[24]:
# plot_missing(houses, "FireplaceQu") houses["FireplaceQu"] = houses["FireplaceQu"].fillna('None') pd.crosstab(houses.Fireplaces, houses.FireplaceQu)
[25]:
garage_cols = ['GarageType','GarageQual','GarageCond','GarageYrBlt','GarageFinish','GarageCars','GarageArea']
houses[garage_cols][houses['GarageType'].isnull()==True]
81 rows × 7 columns
All garage-related features are missing values in the same rows, which means we can replace the categorical variables with ‘None’ and the continuous ones with 0.
[26]:
# Object-dtype columns are categorical here: fill them with 'None'; fill the numeric ones with 0.
for col in garage_cols:
    if houses[col].dtype == object:
        houses[col] = houses[col].fillna('None')
    else:
        houses[col] = houses[col].fillna(0)