Housing Price Prediction Model with Machine Learning

Jul 10, 2024

The housing market is a complex and ever-changing landscape, influenced by various economic, social, and environmental factors. Understanding the dynamics of housing prices is crucial for real estate professionals, homeowners, and aspiring buyers. Recent advances in data analysis and machine learning make it possible to estimate housing prices directly from property attributes. However, commercial tools for such predictions are often expensive and require significant technical expertise.

Introduction

This project aims to provide a comprehensive analysis of housing price prediction using a housing dataset from Kaggle. By leveraging data-driven insights, we explore the intricacies of the housing market, uncover patterns, and identify key features driving property prices. Through rigorous model evaluation, we aim to develop a robust predictive model for estimating housing prices effectively.

Data Overview

The dataset used for housing price prediction contains 545 entries with 13 columns. There are 6 numeric columns (price, area, bedrooms, bathrooms, stories, parking) and 7 categorical columns (main road, guestroom, basement, hot water heating, air conditioning, preferred area, furnishing status). The dataset has no missing values.
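
The column breakdown and the missing-value check described above can be reproduced with a few pandas calls. The snippet below is a minimal sketch: the original file name is not given here, so it builds a small stand-in frame (`housing_sample`, with made-up rows and only five of the 13 columns) rather than loading the real data.

```python
import pandas as pd

# Made-up stand-in rows for illustration only; the real Kaggle dataset
# has 545 rows and 13 columns.
housing_sample = pd.DataFrame({
    "price": [13300000, 12250000, 4340000],
    "area": [7420, 8960, 4600],
    "bedrooms": [4, 4, 3],
    "mainroad": ["yes", "yes", "no"],
    "furnishingstatus": ["furnished", "furnished", "semi-furnished"],
})

# Separate numeric columns from categorical (non-numeric) ones
numeric_columns = housing_sample.select_dtypes(include="number").columns.tolist()
categorical_columns = housing_sample.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column
missing_per_column = housing_sample.isna().sum()

print("shape:", housing_sample.shape)
print("numeric:", numeric_columns)
print("categorical:", categorical_columns)
print("total missing values:", int(missing_per_column.sum()))
```

Run against the full dataset, the same calls should report 6 numeric columns, 7 categorical columns, and no missing values.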

Descriptive statistics of the dataset

        price          area           bedrooms     bathrooms    stories      parking

count   5.450000e+02   545.000000     545.000000   545.000000   545.000000   545.000000
mean    4.766729e+06   5150.541284    2.965138     1.286239     1.805505     0.693578
std     1.870440e+06   2170.141023    0.738064     0.502470     0.867492     0.861586
min     1.750000e+06   1650.000000    1.000000     1.000000     1.000000     0.000000
25%     3.430000e+06   3600.000000    2.000000     1.000000     1.000000     0.000000
50%     4.340000e+06   4600.000000    3.000000     1.000000     2.000000     0.000000
75%     5.740000e+06   6360.000000    3.000000     2.000000     2.000000     1.000000
max     1.330000e+07   16200.000000   6.000000     4.000000     4.000000     3.000000

The descriptive statistics provide a summary of the numerical attributes in the dataset. The average price of houses is approximately 4.77 million, with a standard deviation of 1.87 million, indicating significant price variation. The houses range in price from 1.75 million to 13.3 million. The average house area is approximately 5150.54 square feet, with a standard deviation of 2170.14 square feet. The number of bedrooms ranges from 1 to 6, with an average of 2.97. The number of bathrooms ranges from 1 to 4, with an average of 1.29. The number of stories ranges from 1 to 4, with an average of 1.81. The average number of parking spaces available is approximately 0.69, with a standard deviation of 0.86.

The average price, area, and number of bedrooms and bathrooms give an idea of the typical house characteristics. The standard deviations indicate the degree of variability in these attributes. The range of values for each attribute helps understand the distribution and spread of data. This information can be useful for real estate professionals, buyers, and sellers in determining market trends, setting prices, and making informed decisions. The statistics highlight the diversity and range of options available in the housing market.

Exploratory Data Analysis (EDA)

EDA helps us understand the relationships between different features and the target variable (price). We use scatter plots, box plots, and histograms to visualize these relationships, uncover correlations, identify potential outliers, and gain valuable insights.

Correlation Matrix

We calculate the correlation matrix to understand the linear relationships between variables.

The correlation matrix measures the strength and direction of the linear association between pairs of variables. In this case:

Price has a moderate positive correlation with area (0.54) and bathrooms (0.52), indicating that larger houses and those with more bathrooms tend to have higher prices.
The correlation between price and bedrooms is the weakest of the five (0.37), suggesting that bedroom count on its own is a weaker linear predictor of price.
The correlation between price and stories is moderate (0.42), indicating that houses with more stories tend to be somewhat more expensive.
The correlation between price and parking (0.38) is close to that for bedrooms, suggesting that houses with more parking spaces tend to have slightly higher prices.
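
As a rough illustration of how such a matrix is obtained, the sketch below computes Pearson correlations with pandas and ranks the features by their correlation with price. The `toy` frame and its values are made up for the example; only the `DataFrame.corr` call reflects the technique described above.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Kaggle dataset.
toy = pd.DataFrame({
    "price": [3.0, 4.5, 5.0, 7.5, 9.0],
    "area": [2000, 3500, 3000, 6000, 7000],
    "bedrooms": [2, 3, 3, 4, 3],
    "mainroad": ["yes", "no", "yes", "yes", "yes"],
})

# numeric_only=True skips text columns such as 'mainroad'
corr_matrix = toy.corr(numeric_only=True)

# Correlation of every numeric feature with price, strongest first
price_corr = corr_matrix["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Pearson correlation only captures linear association, so a low value does not rule out a nonlinear relationship with price.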

The distribution of house prices is visualized using a histogram. It shows the count of houses at different price levels. The distribution appears slightly right-skewed.
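
The right skew can be cross-checked without the plot, using only the summary statistics from the descriptive table. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. That measure is a choice made for this example, not something taken from the original analysis.

```python
# Summary statistics for price, copied from the descriptive statistics table
price_mean = 4.766729e06
price_median = 4.340000e06
price_std = 1.870440e06

# Pearson's second skewness coefficient: positive values indicate a right skew
pearson_skew = 3 * (price_mean - price_median) / price_std
print(f"Pearson median skewness of price: {pearson_skew:.2f}")
```

The result is about 0.68, a positive value that agrees with the right-skewed shape of the histogram.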

A scatter plot is used to visualize the relationship between price and area. It shows how the price of houses varies with the area. The plot suggests a positive linear relationship, indicating that larger houses tend to have higher prices.

The box plot shows that price tends to rise with the number of bedrooms. The highest prices were observed for houses with four or more bedrooms, while the lowest prices were associated with houses having two bedrooms or fewer.
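
The numbers behind a box plot like this one can be tabulated with a group-by. The following is a small sketch on made-up rows (the `toy` frame is hypothetical), showing the count, minimum, median, and maximum price per bedroom count.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Kaggle dataset.
toy = pd.DataFrame({
    "bedrooms": [2, 2, 3, 3, 3, 4, 4],
    "price": [2.8e6, 3.5e6, 4.2e6, 4.9e6, 5.6e6, 6.3e6, 9.1e6],
})

# One row per bedroom count, summarizing the price distribution in that group
price_by_bedrooms = toy.groupby("bedrooms")["price"].agg(["count", "min", "median", "max"])
print(price_by_bedrooms)
```

Including the count column helps flag groups with very few houses, where the median is less reliable.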

The plot of furnishing status shows that semi-furnished houses make up a large share of the dataset, which may reflect a preference among homeowners.

The histogram plots provide insights into the distribution of values for each numeric column.

Area: The distribution is roughly bell-shaped with a peak around 3000–4000 square feet, although the summary statistics (mean above the median, maximum of 16,200) point to a long right tail.
Bedrooms: Most properties have 3 bedrooms, with a smaller number having 1 or 6 bedrooms.
Bathrooms: The majority of properties have 1 or 2 bathrooms, with a few having 3 or 4 bathrooms.
Stories: The distribution of the number of stories shows that two-story houses are the most common, followed by one-story houses.

Model Training and Evaluation

We transformed the dataset to make it suitable for modeling. The six yes/no categorical variables were converted into single binary indicator columns using one-hot encoding with the first level dropped, while the three-level ‘furnishingstatus’ column was one-hot encoded separately into three indicator columns. The dataset was then split into features and the target variable (‘price’). A train-test split was performed, allocating 80% for training and 20% for testing. The resulting preprocessed dataset contains the five numeric features plus nine encoded indicator columns, 14 features in total.

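import pandas as pd
from sklearn.model_selection import train_test_split

# `housing` is assumed to be the Kaggle dataset, already loaded into a DataFrame

# Convert the yes/no categorical variables to binary indicator columns;
# drop_first=True keeps only the '*_yes' column for each variable
categorical_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']
housing = pd.get_dummies(housing, columns=categorical_cols, drop_first=True)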

# Convert 'furnishingstatus' column using one-hot encoding
furnishingstatus_encoded = pd.get_dummies(housing['furnishingstatus'], prefix='furnishingstatus')
housing = pd.concat([housing, furnishingstatus_encoded], axis=1)
housing.drop('furnishingstatus', axis=1, inplace=True)

# Split the dataset into features (X) and target variable (y)
X = housing.drop('price', axis=1)
y = housing['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=50)
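
To make the two encoding steps above concrete, here is a small sketch on a three-row frame (`toy`, made up for the example). It shows that `drop_first=True` leaves a single `mainroad_yes` column, while encoding `furnishingstatus` without it produces one column per level.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# Yes/no column: drop_first=True removes 'mainroad_no' and keeps 'mainroad_yes'
binary_encoded = pd.get_dummies(toy, columns=["mainroad"], drop_first=True)

# Three-level column: no level is dropped, so three indicator columns result
full_encoded = pd.get_dummies(toy["furnishingstatus"], prefix="furnishingstatus")

print(binary_encoded.columns.tolist())
print(full_encoded.columns.tolist())
```

Keeping all three furnishing columns is redundant for a linear model but harmless for a tree-based model such as a Random Forest.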

A Random Forest regressor with 100 decision trees and a random state of 42 was trained on the training set to predict price and then evaluated on the held-out test set. Mean Squared Error (MSE) and R-squared (R²) were used as evaluation metrics. The model achieved an MSE of 876,119,313,542.7317, which is the average squared difference between predicted and true prices. The R² score of 0.7424 means that the model explains approximately 74.24% of the variance in the test-set prices. This is moderate predictive performance and leaves room for further optimization.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Create a Random Forest regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 876119313542.7317
R-squared: 0.7423835119987388
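
Because the MSE is expressed in squared price units, it is hard to read directly. Taking its square root gives the root mean squared error (RMSE) in the same units as price. The sketch below does that arithmetic with the reported MSE and, as a rough yardstick, compares the result with the mean price from the descriptive statistics. Note that this mean covers the full dataset, not only the test set.

```python
import math

# Reported test-set MSE and the full-dataset mean price from the summary table
mse = 876119313542.7317
mean_price = 4.766729e06

# RMSE is the square root of the MSE, in the same units as price
rmse = math.sqrt(mse)
relative_error = rmse / mean_price

print(f"RMSE: {rmse:,.0f}")
print(f"RMSE relative to the mean price: {relative_error:.1%}")
```

This puts the typical prediction error at roughly 936,000, a little under 20% of the average house price.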

Feature Importance

Feature importance analysis was then conducted on the trained Random Forest regressor to determine the relative importance of each feature in predicting price. The values come from scikit-learn's impurity-based importance (mean decrease in impurity). For a regressor, the impurity is the variance of the target (squared error) rather than the Gini index used in classification, and the importances are normalized to sum to 1.

The results indicate that the “area” feature has the highest importance with a value of 0.4449, suggesting that it is the most influential feature in predicting the target variable. The “bathrooms” feature follows with an importance value of 0.1514, indicating its significant contribution to the model’s predictions. Other features such as “parking,” “stories,” and “airconditioning_yes” also demonstrate notable importance, with importance values of 0.0686, 0.0603, and 0.0602, respectively.

On the other hand, features like “furnishingstatus_furnished,” “mainroad_yes,” and “guestroom_yes” exhibit relatively lower importance values. These features have importance values of 0.0148, 0.0122, and 0.0175, respectively, suggesting they have less impact on the model’s predictions.

# Feature Importance
importances = rf_regressor.feature_importances_
feature_names = X.columns

# Create a DataFrame to display feature importance
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
print(feature_importance_df)

                            Feature  Importance
0                              area    0.444878
2                         bathrooms    0.151366
4                           parking    0.068603
3                           stories    0.060251
9               airconditioning_yes    0.060172
1                          bedrooms    0.039073
13     furnishingstatus_unfurnished    0.034381
10                     prefarea_yes    0.032572
7                      basement_yes    0.029101
8               hotwaterheating_yes    0.020117
6                     guestroom_yes    0.017482
12  furnishingstatus_semi-furnished    0.015017
11       furnishingstatus_furnished    0.014771
5                      mainroad_yes    0.012215
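
One caveat when reading this table is that one-hot encoding spreads the importance of `furnishingstatus` across three columns. A common heuristic, sketched below with the values copied from the output above, is to add the dummy columns back together before ranking. This grouping step is an addition for illustration and was not part of the original analysis.

```python
# Importances copied from the printed table above
importances = {
    "area": 0.444878, "bathrooms": 0.151366, "parking": 0.068603,
    "stories": 0.060251, "airconditioning_yes": 0.060172, "bedrooms": 0.039073,
    "furnishingstatus_unfurnished": 0.034381, "prefarea_yes": 0.032572,
    "basement_yes": 0.029101, "hotwaterheating_yes": 0.020117,
    "guestroom_yes": 0.017482, "furnishingstatus_semi-furnished": 0.015017,
    "furnishingstatus_furnished": 0.014771, "mainroad_yes": 0.012215,
}

# Sum the three furnishingstatus dummy columns into one entry
ONE_HOT_PREFIX = "furnishingstatus"
grouped = {}
for name, value in importances.items():
    key = ONE_HOT_PREFIX if name.startswith(ONE_HOT_PREFIX + "_") else name
    grouped[key] = grouped.get(key, 0.0) + value

# Rank the grouped features, most important first
ranked = sorted(grouped.items(), key=lambda item: item[1], reverse=True)
total = sum(importances.values())

for name, value in ranked:
    print(f"{name:<22}{value:.4f}")
print(f"sum of all importances: {total:.4f}")
```

Taken together, the three furnishing columns account for about 0.064, which would place furnishing status just ahead of stories and air conditioning. The importances also sum to 1 up to rounding, as expected for normalized impurity-based values.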

Results

Key Findings

Price Correlations: The price of a house is moderately positively correlated with the area and number of bathrooms. The number of stories and parking spaces also show moderate correlations with the price.
EDA Insights: Larger houses tend to have higher prices, and houses with more bedrooms also tend to be pricier.
Model Performance: The RandomForestRegressor model achieved an MSE of 876,119,313,542.7317 and an R² score of 0.7424, indicating moderate predictive performance.
Feature Importance: The area of the house is the most influential feature in predicting prices, followed by the number of bathrooms and parking spaces.

Conclusion

This project provides a comprehensive analysis of housing price prediction using machine learning techniques. By understanding the relationships and factors influencing housing prices, real estate professionals, homeowners, and aspiring buyers can make informed decisions. The insights gained from this analysis can aid in setting appropriate prices, identifying market trends, and guiding investment strategies. Leveraging advanced machine learning models like the RandomForestRegressor can enhance the accuracy of price predictions and contribute to a better understanding of the dynamic housing market.