Case Study on Kaggle’s Ames Housing Prediction Project

Fnu Parshant
Jul 24, 2021

1. Overview

Henry, who recently moved to Ames, wants to buy a house. Buying a property means weighing many parameters, such as location, area of the property, number of bedrooms, living space, garage, kitchen area, garden, and swimming pool.

It is not an easy decision, because he has to analyze all of these parameters to avoid losing money. He collects data on housing properties and approaches his friend Mike for a solution. Mike, a data scientist, works out the relationship between the features and the sale price and gives Henry useful information.

2. Context & Challenge

This section describes the context that led to the project. It has three parts:

a) Background & Description

Realtors determine house values by looking at various features, such as the location and condition of the property, the year it was built, its area in square feet, heating and AC, garage type, the total number of bedrooms and bathrooms, and many more. They then look at similar houses in the neighborhood that sold recently to get a better estimate of the property’s value (from Mateus Realty).

b) Problem

Henry approaches Mike, his friend and a data scientist, with data on housing properties in Ames. The data consists of various features of each property and its sale price. Mike has to help Henry buy a property without taking a loss.

c) Goals & Objectives

● To clean the datasets and make them ready for analysis.

● To explore the data and extract useful information that could help Henry reach a decision.

● To visualize the data in order to find relationships between the features and to support the findings.

3. Process & Insight

Mike uses a Jupyter notebook to code, analyze, and visualize the data. His work starts with importing the required datasets.

● Mike was given 80 features, so he followed a simple approach: he chose the 32 features most highly correlated with SalePrice to predict the price of a property. To judge the relationship, he used boxplots of SalePrice against each of the other features.

● In the following plots, the x-axis represents a feature of our dataset, and the y-axis represents the SalePrice of a property. Each box covers the middle half of the SalePrice distribution (the interquartile range) for one value of the feature. The dots represent outliers.

● Here the x-axis represents the number of half bathrooms in the basement, and the y-axis represents the SalePrice of a property. The four boxes show the bulk of the SalePrice distribution for each number of half bathrooms. The dots represent outliers.

● All four boxes cover roughly the same SalePrice range, and there are many outliers. This means the number of half bathrooms in the basement hardly affects the SalePrice of a property, so it is best to drop this feature.
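The boxplot check described above can be sketched numerically. This is only an illustration under stated assumptions: the data is synthetic (not the Ames dataset), the column names `Bsmt Half Bath` and `Overall Qual` are assumed spellings, and `box_summary` and `median_spread` are hypothetical helpers, not Mike’s actual code. The idea is that a feature whose per-value medians barely move is a candidate to drop.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (made-up values, NOT the Ames dataset).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Bsmt Half Bath": rng.integers(0, 3, size=n),  # assumed column name
    "Overall Qual": rng.integers(1, 11, size=n),   # assumed column name
})
# In this toy data the price depends on quality but not on basement half baths.
df["SalePrice"] = (
    50_000 + 20_000 * df["Overall Qual"] + rng.normal(0, 15_000, size=n)
)

def box_summary(frame, feature, target="SalePrice"):
    """Quartiles of `target` per feature value: the numbers a boxplot draws."""
    return frame.groupby(feature)[target].quantile([0.25, 0.5, 0.75]).unstack()

def median_spread(frame, feature, target="SalePrice"):
    """Gap between highest and lowest group median, relative to the overall median."""
    medians = frame.groupby(feature)[target].median()
    return (medians.max() - medians.min()) / frame[target].median()

for feature in ["Bsmt Half Bath", "Overall Qual"]:
    print(feature, "median spread:", round(median_spread(df, feature), 2))
    print(box_summary(df, feature).round(0))
```

A small median spread corresponds to boxes sitting in the same SalePrice range, as in the basement half bathroom plot. With matplotlib installed, `df.boxplot(column="SalePrice", by=feature)` draws the matching picture.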

4. Solution

Now we have a cleaned dataset. The next step is to build a model on this dataset that predicts the sale price of a housing property. Mike created three different machine learning models:

● Linear Regression Model

● LassoCV Model

● RidgeCV Model

All three models are fitted on the training data and evaluated with cross-validation. The linear regression model has the smallest RMSE (root mean squared error), which means it fits the data better than the other two models.
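The comparison can be sketched with scikit-learn as follows. This is a minimal illustration, not the original notebook: the data comes from `make_regression` rather than the Ames dataset, and the 5-fold split, the alpha grid, and the scaling step are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the cleaned housing features.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=8, noise=10.0, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "LassoCV": LassoCV(cv=5, random_state=0),
    "RidgeCV": RidgeCV(alphas=np.logspace(-3, 3, 13)),  # assumed alpha grid
}

rmse = {}
for name, model in models.items():
    # Scale features so the regularized models penalize all coefficients evenly.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(
        pipeline, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    rmse[name] = -scores.mean()  # scores are negated RMSE, so flip the sign

best = min(rmse, key=rmse.get)
for name, value in rmse.items():
    print(f"{name}: cross-validated RMSE = {value:.2f}")
print("Smallest RMSE:", best)
```

Which model wins depends on the data. On the real housing features, the article reports that plain linear regression had the smallest RMSE.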

5. The Results

● Out of the given 80 features, the following 32 columns were kept as highly correlated with the SalePrice of a property (the list includes the target, SalePrice, itself): [‘MS SubClass’, ‘MS Zoning’, ‘Street’, ‘Land Contour’, ‘Neighborhood’, ‘Condition 1’, ‘Overall Qual’, ‘Year Built’, ‘Year Remod/Add’, ‘Mas Vnr Area’, ‘Exter Qual’, ‘Exter Cond’, ‘Foundation’, ‘BsmtFin SF 1’, ‘Total Bsmt SF’, ‘Heating QC’, ‘Central Air’, ‘Electrical’, ‘Gr Liv Area’, ‘Full Bath’, ‘Half Bath’, ‘Kitchen Qual’, ‘TotRms AbvGrd’, ‘Functional’, ‘Fireplaces’, ‘Garage Type’, ‘Garage Finish’, ‘Garage Cars’, ‘Garage Area’, ‘Paved Drive’, ‘Sale Type’, ‘SalePrice’]

● The feature Gr Liv Area has the largest positive coefficient. This suggests that the above-grade (ground) living area adds the most value to a home’s price.

● Kitchen Qual_Po has the largest negative coefficient. This tells us that poor kitchen quality hurts the value of a home the most.
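Coefficients like these can be read off a fitted linear model. The sketch below is illustrative only: the data is synthetic with made-up effects, one-hot encoding via `pd.get_dummies` and standardizing the numeric column are assumptions about the preprocessing, and only the column names `Gr Liv Area` and `Kitchen Qual` come from the list above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with made-up effects (NOT the Ames dataset).
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "Gr Liv Area": rng.normal(1500, 400, size=n),
    "Kitchen Qual": rng.choice(["Ex", "Gd", "TA", "Fa", "Po"], size=n),
})
effect = {"Ex": 30_000, "Gd": 10_000, "TA": 0, "Fa": -15_000, "Po": -40_000}
df["SalePrice"] = (
    60 * df["Gr Liv Area"]
    + df["Kitchen Qual"].map(effect)
    + rng.normal(0, 5_000, size=n)
)

# One-hot encode the categorical column; names become "Kitchen Qual_<level>".
# drop_first=True removes one level ("Ex") to serve as the baseline.
X = pd.get_dummies(
    df.drop(columns="SalePrice"),
    columns=["Kitchen Qual"],
    drop_first=True,
    dtype=float,
)
# Standardize the numeric feature so its coefficient is per standard deviation.
X["Gr Liv Area"] = (X["Gr Liv Area"] - X["Gr Liv Area"].mean()) / X["Gr Liv Area"].std()

model = LinearRegression().fit(X, df["SalePrice"])
coefs = pd.Series(model.coef_, index=X.columns).sort_values()

print(coefs)
print("Most negative:", coefs.index[0])
print("Most positive:", coefs.index[-1])
```

Coefficient sizes are only comparable when the features share a scale, which is why the numeric column is standardized here. Each dummy coefficient is measured relative to the dropped baseline level.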
