Predicting House Prices with the Ames Dataset

This problem is fairly simple to describe… given a bunch of variables that describe the various aspects of a house along with its sale price, is it possible to come up with a mathematical model that can be used to predict future prices based on different values of the same variables? It is based on housing sales data from Ames, Iowa between 2006 and 2010. (I am not sure if it would work for Toronto sales data, where it's common to see huge sold-over-asking amounts, which would make it difficult to predict sale prices.)

If I had taken any econometrics courses, I would have learned that this problem can be considered as a Hedonic pricing model; from Investopedia:

Hedonic pricing is a model that identifies price factors according to the premise that price is determined both by internal characteristics of the good being sold and the external factors affecting it.

In finding a suitable model, there are a few issues to deal with, including:

  • There are 79 features, which may be more than necessary; that is, some features may provide more value than others when making a prediction
  • How to deal with missing values? Drop them, or fill them in with something reasonable (and what counts as reasonable)?
  • How to deal with outliers?
  • How to treat categorical data? What type of encoding should be used?

Features…

Again, there are 79 features in the dataset. Below is a brief description of each (taken from the data dictionary with a few added notes):

  • MSSubClass: Categorical variable that identifies the type of dwelling involved in the sale. An example value is 60 which refers to a ‘2-Story 1946 & Newer’ house 
  • MSZoning: Identifies the general zoning classification of the sale. Another categorical variable; ex. ‘RL’ refers to ‘Residential Low Density’
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access to property. Categorical variable with two values: ‘Grvl’ and ‘Pave’ for Gravel and Paved
  • Alley: Type of alley access to property. Categorical variable with three values: ‘Grvl’, ‘Pave’, and NA for Gravel, Paved, and No Alley access
  • LotShape: General shape of property. Categorical with options for regular to various shades of irregular
  • LandContour: Flatness of the property. Another categorical variable with options for level, banked, hillside, or depression 
  • Utilities: Type of utilities available. Either all, all with no sewer, all with no sewer and water, or only electricity
  • LotConfig: Lot configuration. Categorical describing where on the street the house is located; for ex. corner lot
  • LandSlope: Slope of property. Categorical; ex. moderate slope
  • Neighborhood: Physical locations within Ames city limits. String with up to 7 letters indicating the neighbourhood name. Note that ‘NAmes’ corresponds to North Ames
  • Condition1: Proximity to various conditions such as adjacent to a major street, railroad, or park. Again categorical
  • Condition2: Proximity to various conditions if more than one is present
  • BldgType: Type of dwelling. Choices are in 5 categories including: single family detached, two-family conversion, duplex, townhouse end unit, and townhouse inside unit
  • HouseStyle: Style of dwelling. Eight categories including one story, two story, and split level
  • OverallQual: Rates the overall material and finish of the house. Ranges from 1 for very poor to 10 for very excellent
  • OverallCond: Rates the overall condition of the house. Same range as above, that is 1 for very poor to 10 for very excellent
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
  • RoofStyle: Type of roof. Categorical with the following options: Flat, Gable, Gambrel, Hip, Mansard, Shed
  • RoofMatl: Roof material. Categorical with eight options: Clay tile, Standard shingle, Membrane, Metal, Roll, Gravel & Tar, Wood shakes, and Wood Shingles
  • Exterior1st: Exterior covering on house. Categorical with 17 options; some include: Brick face, Stone, Stucco, and Wood Shingles.
  • Exterior2nd: Exterior covering on house if more than one material is present. Same options as those listed above.
  • MasVnrType: Masonry veneer type. Categorical with 5 options: Brick common, brick face, cinder block, none, and stone
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Evaluates the quality of the material on the exterior. Five options from poor to excellent.
  • ExterCond: Evaluates the present condition of the material on the exterior. Again, five options from poor to excellent.
  • Foundation: Type of foundation. Categorical variable with options such as Cinder block, poured concrete, and slab
  • BsmtQual: Evaluates the height of the basement. The height of the basement has been grouped into five bins, along with one for no basement.
  • BsmtCond: Evaluates the general condition of the basement. Again 5 categories to describe the condition (from excellent to poor) along with 1 for no basement.
  • BsmtExposure: Refers to walkout or garden level walls. Categorical with 4 options along with 1 for No basement
  • BsmtFinType1: Rating of basement finished area. Seven categories… 5 for quality of living quarters (from good to low quality), 1 for unfinished, and 1 for no basement.
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Rating of basement finished area (if multiple types). Same categories as those listed above.
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating. Categorical with 6 options including floor furnace and gas forced warm air furnace among others
  • HeatingQC: Heating quality and condition. Five options from poor to excellent
  • CentralAir: Central air conditioning. Either yes or no
  • Electrical: Electrical system. Five options including: standard circuit breakers & romex, fuse box over 60 amp and all romex wiring, 60 amp fuse box and mostly romex wiring, 60 amp fuse box and mostly knob and tube wiring, and mixed
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms. Integer value
  • BsmtHalfBath: Basement half bathrooms. Again value is an integer
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • Bedroom: Bedrooms above grade (does NOT include basement bedrooms)
  • Kitchen: Kitchens above grade
  • KitchenQual: Kitchen quality. Five options including: excellent, good, average, fair, and poor.
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: How functional the home is (assume typical unless deductions are warranted). Options include minor deductions, moderate deductions, major deductions, severely damaged, and salvage only 
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality. Five levels of quality (from excellent to poor), and one option for no fireplace
  • GarageType: Garage location. Options include: attached to home, basement garage, built-in, car port, detached from home, more than one type, and no garage
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage. Options include: finished, rough finished, unfinished, and no garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet. Highly correlated with the garage cars feature
  • GarageQual: Garage quality. Five options for quality (from excellent to poor), and one option for no garage
  • GarageCond: Garage condition. Five options for condition (from excellent to poor), and one option for no garage
  • PavedDrive: Paved driveway. Three options: paved, partial pavement, and dirt/gravel
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality. Four options for condition (from excellent to fair), and one option for no pool
  • Fence: Fence quality. Five options including: good privacy, minimum privacy, good wood, minimum wood/wire, and no fence 
  • MiscFeature: Miscellaneous feature not covered in other categories. These include: elevator, second garage, shed, tennis court, other, and none
  • MiscVal: Dollar value of miscellaneous feature
  • MoSold: Month Sold (MM)
  • YrSold: Year Sold (YYYY)
  • SaleType: Type of sale. Ten options including: warranty deed – conventional, warranty deed – cash, warranty deed – va loan, home just constructed and sold, court officer deed/estate, contract 15% down payment regular terms, contract low down payment and low interest, contract low interest, contract low down, and other
  • SaleCondition: Condition of sale. Six options including: normal sale, abnormal sale (such as a trade or foreclosure), allocation (two linked properties), family (sale between family members), and partial (home was not yet finished)

Data Cleanup

Overall, after the NAs were dealt with, a combination of the following steps was performed to clean up the data (such as dropping features, dropping outliers, and creating new features).

Dealing with NAs

First and foremost, we should check to see if any null values exist and what exactly to do with them.

Recall that one way to list the number of NAs for a feature within a dataframe is:

NumberNAs = DF[feature].isna().sum()   # 'feature' is the name of the column of interest
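
To get the counts for every feature at once, a small sketch like the following can be used (assuming the dataframe is named DF as above):

TotalNAs = DF.isna().sum().sum()                    # grand total of missing values
NAsPerFeature = DF.isna().sum()                     # per-feature counts
print(NAsPerFeature[NAsPerFeature > 0].sort_values(ascending=False))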

Using the above code, there are a total of 13965 NAs in the training set…

  MSZoning      -->     4   Categorical
  LotFrontage   -->   486   Numerical
  Alley         -->  2721   Categorical
  Utilities     -->     2   Categorical
  Exterior1st   -->     1   Categorical
  Exterior2nd   -->     1   Categorical
  MasVnrType    -->    24   Categorical
  MasVnrArea    -->    23   Numerical
  BsmtQual      -->    81   Categorical
  BsmtCond      -->    82   Categorical
  BsmtExposure  -->    82   Categorical
  BsmtFinType1  -->    79   Categorical
  BsmtFinSF1    -->     1   Numerical
  BsmtFinType2  -->    80   Categorical
  BsmtFinSF2    -->     1   Numerical
  BsmtUnfSF     -->     1   Numerical
  TotalBsmtSF   -->     1   Numerical
  Electrical    -->     1   Categorical
  BsmtFullBath  -->     2   Numerical
  BsmtHalfBath  -->     2   Numerical
  KitchenQual   -->     1   Categorical
  Functional    -->     2   Categorical
  FireplaceQu   -->  1420   Categorical
  GarageType    -->   157   Categorical
  GarageYrBlt   -->   159   Numerical
  GarageFinish  -->   159   Categorical
  GarageCars    -->     1   Numerical
  GarageArea    -->     1   Numerical
  GarageQual    -->   159   Categorical
  GarageCond    -->   159   Categorical
  PoolQC        -->  2909   Categorical
  Fence         -->  2348   Categorical
  MiscFeature   -->  2814   Categorical
  SaleType      -->     1   Categorical

Going through each of these features is a rather painful process (and was done in a separate Jupyter notebook). From that exploration, a few items were noticed:

  • For some features NAs refer to ‘None’. So for Alley, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, and MiscFeature, all NAs were replaced with the string ‘None’
  • For most categorical features the most frequently occurring value was used for the NA replacement. This was the case for: MSZoning, Utilities, Exterior1st, Exterior2nd, MasVnrType, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType2, Electrical, KitchenQual, GarageCars, and SaleType
  • Depending on the condition of the data, either the mean, median, or the mode were used for: LotFrontage, MasVnrArea, TotalBsmtSF, and GarageYrBlt
  • For certain features, the values were almost entirely split between two or three options. For such cases, a function was written to randomly select from the most common values, weighted by the frequency with which they occur (a sketch of this is shown after the list)
  • Some features are dependent on others and in such cases, some code was written to use the mean value based on the parent feature. This included: BsmtFinType1, BsmtFinType2, BsmtFullBath, BsmtHalfBath, and GarageArea
  • And finally, BsmtUnfSF was calculated using the means of TotalBsmtSF, BsmtFinSF1, and BsmtFinSF2
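
As a rough illustration of the frequency-weighted replacement mentioned above, a helper along these lines could be used (the function name and exact behaviour are my own sketch, not the original notebook code):

import numpy as np

def fill_na_by_frequency(df, feature):
    # Sample replacement values from the feature's observed distribution,
    # weighted by how often each value actually occurs
    freqs = df[feature].value_counts(normalize=True)
    missing = df[feature].isna()
    df.loc[missing, feature] = np.random.choice(freqs.index, size=missing.sum(), p=freqs.values)
    return df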

Checking Feature Variation

Some features are almost entirely constant. Hence, they would provide very little information to the pricing model, so they are very good candidates to be dropped. The following is a list of such features along with the fraction of rows taken up by their single most common value:

  Street           0.99588900
  Alley            0.93216855
  Utilities        0.99965742
  LandSlope        0.95169579
  Condition2       0.98972251
  RoofMatl         0.98526893
  BsmtCond         0.92086331
  Heating          0.98458376
  CentralAir       0.93285372
  Electrical       0.91538198
  LowQualFinSF     0.98629668
  BsmtHalfBath     0.93970538
  KitchenAbvGr     0.95409387
  Functional       0.93148338
  GarageCond       0.90921548
  PavedDrive       0.90476190
  3SsnPorch        0.98732443
  ScreenPorch      0.91229873
  PoolArea         0.99554642
  PoolQC           0.99657417
  MiscFeature      0.96402878
  MiscVal          0.96471394

As one can see, Utilities, Street, Condition2, Heating, 3SsnPorch, PoolArea, and PoolQC are dominated by one value and are good candidates to be dropped.
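
For reference, a quick sketch of how these fractions can be computed (the 0.90 cut-off is just an example threshold, and DF is assumed to hold the cleaned data):

# Fraction of rows taken up by each feature's single most common value
dominance = DF.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(dominance[dominance > 0.90].sort_values(ascending=False))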

Checking the Correlation to the Sales Price

Some features have little impact on the sales price and hence can possibly be dropped. These are easiest to spot using a heat map where the correlation is calculated between all of the numeric features, as seen below.

From the heat map, the main item of interest is which features have the strongest relationship with SalePrice. Considering both positive and negative correlation, the features that have the least correlation with SalePrice are: MSSubClass, LotArea, OverallCond, BsmtFinSF2, BsmtUnfSF, 2ndFlrSF, LowQualFinSF, BsmtFullBath, BsmtHalfBath, HalfBath, BedroomAbvGr, KitchenAbvGr, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, MiscVal, MoSold, and YrSold. All of these features are possible candidates to be dropped.
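
A small sketch of how the heat map and the SalePrice correlations can be produced (assuming seaborn is available and the target column is SalePrice):

import seaborn as sns

numeric = DF.select_dtypes(include='number')
sns.heatmap(numeric.corr(), cmap='coolwarm')        # draws the heat map

corr = numeric.corr()['SalePrice'].drop('SalePrice')
print(corr.abs().sort_values().head(20))            # weakest relationships first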

Feature Engineering

It is possible to create new features from existing ones that have more correlation with the sales price and thus produce a better model. Some of the created features included the square feet per room (SqFtPerRoom), an overall condition/quality score, the total number of baths, and the square feet above grade. They were calculated using:

DF['SqFtPerRoom'] = DF['GrLivArea'] / (DF['TotRmsAbvGrd'] +
                                           DF['FullBath'] +
                                           DF['HalfBath'] +
                                           DF['KitchenAbvGr'] )

DF['OverallCondQual'] = DF['OverallQual'] * DF['OverallCond']

DF['TotalBaths'] = ( DF['FullBath'] + 
                     DF['BsmtFullBath'] +
                     (0.5 * DF['HalfBath']) +
                     (0.5 * DF['BsmtHalfBath']) )

DF['AboveGradeSF'] = DF['1stFlrSF'] + DF['2ndFlrSF']

Note that an option was added to drop the features that each new engineered feature was built from (that is, its dependencies).

Categorical Variable Encoding

Approximately half of the features are categorical, and in order to deal with them effectively they need to be label or one-hot encoded. An option was added so that either method could be used and evaluated during hyperparameter tuning.
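
A minimal sketch of how the two options might be toggled (the use_one_hot flag is hypothetical):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

cat_cols = DF.select_dtypes(include='object').columns

if use_one_hot:                      # hypothetical flag set during tuning
    DF = pd.get_dummies(DF, columns=cat_cols)
else:
    for col in cat_cols:
        DF[col] = LabelEncoder().fit_transform(DF[col].astype(str))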

Dropping Outliers

Certain numerical features contained some extreme outliers. An outlier was defined as a value more than two standard deviations away from the feature's mean. Note that during hyperparameter tuning it was possible to adjust whether outliers should be dropped along with the number of standard deviations to use.
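
A rough sketch of the outlier filter described above (the helper name is my own, not from the original notebook):

def drop_outliers(df, feature, num_std=2):
    # Keep only rows whose value for the feature lies within
    # num_std standard deviations of the feature's mean
    mean, std = df[feature].mean(), df[feature].std()
    return df[(df[feature] - mean).abs() <= num_std * std]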

Scaling the Data

Since a few different models were used, the data could be scaled using either min/max or standard scaling, or left unscaled. Again, this option could be adjusted during hyperparameter tuning.
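
A small sketch of this option (the scaling variable and its values are assumptions):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

if scaling == 'minmax':              # hypothetical option name used during tuning
    X = MinMaxScaler().fit_transform(X)
elif scaling == 'standard':
    X = StandardScaler().fit_transform(X)
# otherwise the data is left unscaled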

Taking the Log of SalePrice

The sale price is somewhat skewed. One way to deal with this problem is to take the log of the price before fitting the model, then apply the exponential function after a prediction is made. Note that this option is user controllable so that the quality of the results can be compared with and without the log transform.
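
In code this roughly looks like the following (whether log or log1p was used in the original notebook is an assumption, as are the variable names):

import numpy as np

y_train = np.log1p(train['SalePrice'])             # fit the model on log prices
# ... fit the model on (X_train, y_train) ...
predictions = np.expm1(model.predict(X_test))      # convert back to dollars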

Models

A variety of models were used with the hope that one of them would generalize best to the data.

Linear Regression

Recall that a linear model can be written mathematically as:

\ell = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n

where \beta_0 is the intercept (or bias) and \beta_i (for 1 \leq i \leq n) is the weight (or slope) that controls how much a particular feature contributes.

One of the most common ways to prevent overfitting is to use some type of regularization, the most popular being L1 (lasso), L2 (ridge), and elastic net (a combination of L1 and L2).
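
For illustration, the three regularized variants are available directly in scikit-learn (the alpha values below are placeholders, not the tuned values):

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                           # L2 penalty
lasso = Lasso(alpha=0.001)                         # L1 penalty
enet  = ElasticNet(alpha=0.001, l1_ratio=0.5)      # blend of L1 and L2

ridge.fit(X_train, y_train)                        # y_train = log(SalePrice) in the best run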

By trying many combinations of data and model parameters, it was determined that the following configuration produced the best results:

  • Ridge regression
  • One-hot encoding
  • No scaling
  • Take log of sales price
  • Drop features with little variation
  • Keep features with little correlation to the sales price
  • Do not make new features
  • Drop outliers

This model obtained a Kaggle score of 0.13574.

Random Forest

The random forest algorithm combines multiple decision trees to solve classification and regression problems; this grouping of models is commonly called ensemble learning.

During training both the data parameters (listed above) and model parameters were adjusted to produce a model that made the best predictions on a validation set (using cross fold validation). The model parameters that were adjusted included: 

  • Number of estimators. Possible values included: 200, 400, 600, 800, 1000
  • Maximum number of levels in tree. Possible values included: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, or none for max depth until all leaves are ‘pure’
  • Minimum number of samples required to split a node. Possible values included: 2, 5, 10, 15
  • Minimum number of samples required at each leaf node. Possible values: 1, 2, 4, 8, 16 
  • Bootstrap… whether bootstrap samples are used when building trees. If false, the whole dataset is used to build the tree (taken from SKLearn documentation) 
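
A minimal sketch of a randomized search over the values listed above (the n_iter, cv, and scoring choices are assumptions):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'n_estimators':      [200, 400, 600, 800, 1000],
    'max_depth':         [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, None],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf':  [1, 2, 4, 8, 16],
    'bootstrap':         [True, False],
}

search = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_grid,
                            n_iter=50, cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)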

During training, the following combination of model and data parameters was found to produce a good-fitting model (via cross-fold validation):

  • Label encoding
  • Standard scaling
  • Take log of sale prices
  • Keep features with little variation or little correlation to the sales price
  • Do not make any new features
  • Drop outliers
  • Number of estimators: 1000
  • Min samples per split: 10
  • Min samples per leaf: 2
  • Max depth: 120
  • Bootstrap: False

On Kaggle this model obtained a score of 0.14021, which is slightly worse than the linear regression.

XGBoost

XGBoost stands for extreme gradient boosting, and it became a very popular technique after being used in many winning solutions for various online machine learning competitions. It can be run on a single machine or distributed across multiple machines (using Hadoop).

Compared to random forests, there are fewer hyperparameters that can be tuned. The following are some of the possible values used for a few of the hyperparameters:

  • Number of estimators: possible values include 100, 200, …, 1100
  • Maximum number of levels in tree: possible values include: 3, 4, … 11, and none
  • Number of features (columns) used in each tree
  • Column sample by tree: the subsample ratio of columns used when constructing each tree. Subsampling occurs once for every tree constructed (from the docs); possible values included: 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0
  • Subsample: the ratio of training instances used. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, which helps prevent overfitting (from the docs). Possible values were: 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0
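
A comparable sketch for XGBoost (0.0 is omitted from the sampling ratios since XGBoost requires values greater than zero; the search settings are assumptions):

from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'n_estimators':     list(range(100, 1200, 100)),      # 100, 200, ..., 1100
    'max_depth':        [3, 4, 5, 6, 7, 8, 9, 10, 11],
    'colsample_bytree': [0.2, 0.4, 0.6, 0.8, 1.0],
    'subsample':        [0.2, 0.4, 0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(XGBRegressor(objective='reg:squarederror'), param_grid,
                            n_iter=50, cv=5)
search.fit(X_train, y_train)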

After many iterations of hyperparameter tuning, the parameters for the best-behaving model were:

  • Label encoding
  • Standard scaler
  • Do not take the log of the sales price
  • Keep features with little variation or little correlation to the sales price
  • Make new features and drop their dependencies
  • Drop outliers
  • Number of estimators: 300
  • Max depth: 3
  • Sub sample: 0.6
  • Column sample by Tree: 0.4

On Kaggle this model produced a score of 0.14685 which is the worst so far. Perhaps there was some overfitting with this model.

CatBoost

CatBoost is a “high-performance open source library for gradient boosting on decision trees” (taken from the official website). It has been designed to have very good default values so that an untuned model will still have good performance. From my brief online explorations, it seems to be taking over the space once occupied by XGBoost. 

Since my system isn’t the strongest, only data parameters were explored during hyperparameter tuning; that is the default model parameters were used. Therefore, the following parameters were used:

  • Label encoding
  • Standard scaler
  • Take log of sale prices
  • Keep features with little variation or correlation to the sales price
  • Make new features
  • Drop any outliers
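
With the default model parameters, fitting CatBoost is essentially the following (a sketch; the variable names are assumptions):

import numpy as np
from catboost import CatBoostRegressor

model = CatBoostRegressor(verbose=0)        # default hyperparameters
model.fit(X_train, y_train)                 # y_train = log(SalePrice)
preds = np.expm1(model.predict(X_test))     # undo the log transform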

Overall, the CatBoost model had a Kaggle score of 0.12642, which was the best so far.

Neural Network

Neural networks really shine with larger homogeneous datasets (for example, an image's pixel values). Since there is a fair amount of variety among the features in the dataset, perhaps a neural network is not the best solution for this problem.

The network used was ‘shallow’ with one linear output layer; it was created with the following code:

import tensorflow as tf
from tensorflow import keras

model = keras.models.Sequential()

# Four fully connected hidden layers with ReLU activations, followed by a
# single linear unit that outputs the predicted (log) sale price.
# inputShape holds the number of input features and is defined elsewhere.
model.add(keras.layers.Dense(128, kernel_initializer='normal', activation='relu', input_shape=inputShape))
model.add(keras.layers.Dense(256, kernel_initializer='normal', activation='relu'))
model.add(keras.layers.Dense(256, kernel_initializer='normal', activation='relu'))
model.add(keras.layers.Dense(256, kernel_initializer='normal', activation='relu'))
model.add(keras.layers.Dense(1, kernel_initializer='normal', activation='linear'))

# Regression setup: minimize MSE, track MAE, optimize with Nadam
model.compile(loss='mean_squared_error', optimizer='nadam', metrics=['mean_absolute_error'])
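
A hypothetical training call might then look like this (the epoch count, batch size, and validation split are assumptions, not values from the original notebook):

history = model.fit(X_train, y_train, validation_split=0.2,
                    epochs=200, batch_size=32, verbose=0)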

The data was tweaked to the following state:

  • One-Hot encoding of categorical features
  • Min/Max scaling
  • The log of the sales price was taken
  • Both the features with little variation and low correlation to the sales price were dropped
  • New features were created (and their dependencies were kept)
  • And outliers were dropped

The score reported by Kaggle was the worst yet at 0.15101, which is still a respectable score.

Overall Results

Overall, the CatBoost-based model produced the best results (the lowest score is best, since it represents the least amount of error):

  Model                        Kaggle Score
  Linear Regression (Ridge)    0.13574
  Neural Network               0.15101
  Random Forest                0.14021
  XGBoost                      0.14685
  CatBoost                     0.12642

Feature Importance

Finally, the models can be scrutinized to determine which features are the most and least important in predicting the sales price and which do not contribute at all. Thus, the features that sell a house can be determined.

This article is a great summary of how to determine the feature importance of a model. For the most part, the feature importance can be determined in two different ways:

  • Model based
    • For linear regression, this is simply the weights (or coefficients) from the fitted model. After the model has been fit, the weights are available via model.coef_
    • For Random Forest, XGBoost, and CatBoost models, the importance values are all available via model.feature_importances_
  • Analyzing permutations of the data
    • SKLearn contains permutation_importance, a function that determines the feature importance by measuring how much the model score is impacted by randomly shuffling a feature's values. Permutation importance can be used on any model, including neural networks.
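
A brief sketch of using permutation importance (assuming a held-out X_valid/y_valid pair, with X_valid still a dataframe):

import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X_valid.columns)
print(importances.sort_values(ascending=False).head(10))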

Below are graphs of the top ten most important features for the linear regression, random forest, XGBoost, and CatBoost models. It should be noted that the degree of importance and the selected features differ among the models.

Sample Results

Below are some predictions made by the CatBoost model on the training dataset. Note that most features were used during model training but only the ten most important (found in the previous section) have been listed.

From the table, we can see that some predictions are very close to the actual value. A small percentage of the predictions are off but are still within an acceptable range.

Code

All of the code is available here.

 
