House Prices is a classic Kaggle competition. The task is to predict the final sale price of each house. For more detail, refer to https://www.kaggle.com/c/house-prices-advanced-regression-techniques/.
Because it is a well-known competition, there are many excellent analyses of how to do EDA and how to build a model for this task. See https://www.kaggle.com/khandelwallaksya/house-prices-eda for a reference. In this notebook, we will show how dataprep.eda can simplify the EDA process using a few lines of code.
In summary, the analysis follows these steps:
* Understand the problem. We’ll look at each variable and do a philosophical analysis about their meaning and importance for this problem.
* Univariate study. We’ll just focus on the dependent variable (‘SalePrice’) and try to know a little bit more about it.
* Multivariate study. We’ll try to understand how the dependent variable and independent variables relate.
* Basic cleaning. We’ll clean the dataset and handle the missing data, outliers and categorical variables.
[1]:
from dataprep.eda import plot
from dataprep.eda import plot_correlation
from dataprep.eda import plot_missing
from dataprep.datasets import load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
sns.set(font_scale=1)
[2]:
houses = load_dataset("house_prices_train")
houses.head()
5 rows × 81 columns
[3]:
houses_test = load_dataset("house_prices_test")
houses_test.head()
5 rows × 80 columns
[4]:
houses.shape
(1460, 81)
There are 1460 rows (tuples) in total; each row contains 80 feature columns and 1 target value (SalePrice).
[5]:
houses_test.shape
(1459, 80)
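The test set has one column fewer than the training set, which is presumably the missing target. A minimal sketch of how that could be checked with pandas follows. The two small frames and their values are hypothetical stand-ins for `houses` and `houses_test`, not the real data.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real train and test sets.
train = pd.DataFrame(
    {"Id": [1, 2], "GrLivArea": [1710, 1262], "SalePrice": [208500, 181500]}
)
test = pd.DataFrame({"Id": [1461, 1462], "GrLivArea": [896, 1329]})

# Columns present in the training frame but absent from the test frame.
only_in_train = train.columns.difference(test.columns).tolist()
print(only_in_train)
```

Running the same two lines on `houses.columns` and `houses_test.columns` should show which column accounts for the difference between 81 and 80.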
[6]:
plot(houses)
From this output we can get the following information:
* Variable: the name of each variable.
* Type: there are 43 categorical columns and 38 numerical columns.
* Missing value: how many missing values each column contains. For instance, Fence is missing in 80.8% of rows, that is, about 0.808 × 1460 ≈ 1180 tuples. Some models, such as SVM, do not accept input data that contains missing values, so we have to clean the data before we use it.
* Target value: the distribution of the target value (SalePrice). From the plot we can see that the target is numerical and that its distribution is roughly bell-shaped, with a longer right tail. This is therefore a regression task, and we are not confronted with an imbalanced-classes problem.
* Guess: judging by the column names, we reckon that GrLivArea, YearBuilt and OverallQual are likely to be correlated with the target value (SalePrice).
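The missing-value figures above can be cross-checked with plain pandas. The sketch below is an illustration on a small hypothetical frame (the values are made up); on the real data, `df` would be replaced by `houses`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; on the real data, use `houses` instead of `df`.
df = pd.DataFrame(
    {
        "Fence": ["MnPrv", np.nan, np.nan, np.nan, "GdWo"],
        "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
        "SalePrice": [208500, 181500, 223500, 140000, 250000],
    }
)

# Count and fraction of missing entries per column.
summary = pd.DataFrame(
    {"missing_count": df.isna().sum(), "missing_rate": df.isna().mean()}
)

# Keep only columns that have missing values, most affected first.
summary = summary[summary["missing_count"] > 0].sort_values(
    "missing_count", ascending=False
)
print(summary)
```

Multiplying `missing_rate` by the number of rows recovers the count, which is the same arithmetic as 80.8% of 1460 rows for Fence.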
[7]:
plot_correlation(houses, "SalePrice")
[8]:
plot_correlation(houses, "SalePrice", value_range=[0.5, 1])
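As a rough cross-check of the filtered plot, the same idea can be sketched in pandas: compute each numerical column's correlation with SalePrice and keep those that fall in the range [0.5, 1]. This assumes Pearson correlation and uses a small hypothetical frame with made-up values; it is not a description of how `plot_correlation` works internally.

```python
import pandas as pd

# Hypothetical toy frame; on the real data, use the numerical columns of `houses`.
df = pd.DataFrame(
    {
        "OverallQual": [5, 6, 7, 8, 9, 10],
        "GrLivArea": [1000, 1500, 1400, 2000, 2200, 3000],
        "MoSold": [3, 11, 1, 12, 2, 6],
        "SalePrice": [120000, 150000, 180000, 240000, 300000, 420000],
    }
)

# Pearson correlation of every column with the target, excluding the target itself.
corr = df.corr(method="pearson")["SalePrice"].drop("SalePrice")

# Keep only correlations inside the range [0.5, 1], strongest first.
strong = corr[(corr >= 0.5) & (corr <= 1.0)].sort_values(ascending=False)
print(strong)
```

On the real data, `houses.select_dtypes("number")` could be used in place of `df` so that categorical columns are left out of the correlation.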