bedrooms show long-tailed distributions with
potential outliers. Latitude and longitude variables
have dispersed distributions, indicating that property
data originates from multiple geographic areas.
After analysing the histograms, Figure 2 presents a
heatmap of the correlation matrix for the numerical
variables. A heatmap uses color shades to represent
correlation strength and is commonly used to examine
linear relationships between variables.
Figure 2: Heatmap of the Correlation Matrix. (Picture
credit: Original)
In Figure 2, the strength of the positive or negative
correlation is indicated by the intensity of the red or
blue in the figure, respectively. Lighter colors
indicate weaker linear relationships between the
variables. The heatmap indicates that the number of
bathrooms and the square footage have a strong
positive correlation with the target variable. This
suggests that larger houses with more bathrooms tend
to command higher prices, which aligns with
prevailing market logic in the real estate sector. In
contrast, the correlation
between the number of bedrooms and house price is
relatively weak, suggesting that while the number of
bedrooms may influence house price, its impact is
less significant than that of house size and the number
of bathrooms. This may be because house size and the
number of bathrooms can more directly reflect the
comfort and market value of the house. It is
noteworthy that the geographical variables exhibit a
low correlation with price, suggesting that while housing price
is influenced by location, the linear relationship is not
readily apparent. In practice, the impact of location on
housing prices is often non-linear, as evidenced by the
significant variation in housing prices between city
centers and suburbs, which can be challenging to
measure through a simple linear correlation.
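As a minimal sketch of how such a correlation matrix can be computed, the snippet below builds a synthetic stand-in for the housing data (the column names and distributions are hypothetical, not the paper's actual dataset) and calls pandas' `DataFrame.corr`; the heatmap in Figure 2 is simply a color-coded rendering of this matrix:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(500, 5000, n)
df = pd.DataFrame({
    "sqft_living": sqft,
    "bathrooms": np.clip((sqft / 1200 + rng.normal(0, 0.5, n)).round(), 1, 6),
    "bedrooms": rng.integers(1, 6, n),            # generated independently of price
    "price": sqft * 300 + rng.normal(0, 50_000, n),
})

# Pearson correlation matrix; a heatmap is a color-coded view of it,
# e.g. seaborn.heatmap(corr, cmap="coolwarm", annot=True)
corr = df.corr()
print(corr.round(2))
```

In this toy data, `sqft_living` correlates strongly with `price` while `bedrooms` does not, mirroring the pattern the heatmap reveals.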
3.2 Model results and discussion
Table 2 shows the MAE, RMSE, and R-squared
values for the models MLR, RF, and XGBoost. The
evaluation shows that RF and XGBoost perform
better than MLR. The R-squared value for MLR is
0.5911, and the MAE and RMSE are 592,927.3 and
1,064,716.9. This indicates that MLR does not
effectively capture the complex patterns in the data.
XGBoost and RF achieve R-squared
values of 0.7887 and 0.8007, respectively. XGBoost
has the lowest MAE of 286,103.7, showing its
effectiveness in reducing mean error. RF has the
lowest RMSE of 743,287.8, showing its slightly
better ability to control larger errors.
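The three evaluation metrics can be computed with scikit-learn; the predictions below are illustrative values, not the paper's actual model outputs:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative targets and predictions only; not the paper's model outputs.
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 190_000.0, 310_000.0])

mae = mean_absolute_error(y_true, y_pred)           # mean absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
r2 = r2_score(y_true, y_pred)                       # coefficient of determination
print(mae, rmse, r2)
```

MAE penalizes all errors linearly, while RMSE weights larger errors more heavily, which is why the two metrics can rank RF and XGBoost differently.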
GridSearchCV is a tool for hyperparameter
optimization that systematically searches for the
optimal hyperparameter combination to improve the
performance of machine learning models. Alemerien
et al. used GridSearchCV to optimize XGBoost and
Random Forest and achieved the best classification
accuracy in the cardiovascular disease prediction task
(Alemerien, 2024). As MLR does not involve
hyperparameter tuning, the optimization focuses
exclusively on RF and XGBoost to enhance
their predictive capabilities. The tuned
results are reported in the latter half of Table 2,
including the MAE, RMSE, and R-squared scores.
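A GridSearchCV run for RF can be sketched as follows; the parameter grid and data here are hypothetical, since the paper does not report its actual search space:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the housing features.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (120, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 120)

# A small, illustrative grid; the paper's actual search space is not given.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

GridSearchCV exhaustively cross-validates every combination in the grid and refits the best one, which is what makes it systematic but potentially expensive for large grids.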
Prior to the implementation of optimization
techniques, both RF and XGBoost exhibited superior
performance in distinct error metrics. However, after
the optimization process, RF has surpassed XGBoost
in all error metrics, substantiating its enhanced
capacity for generalization. Specifically, the MAE of
RF has decreased from 296,429.0 to 281,715.4,
RMSE has dropped from 743,287.8 to 717,681.5, and
R-squared has improved to 0.8142, indicating further
enhancement in its fitting capabilities. Similarly, the
MAE of XGBoost has declined to 282,458.4, RMSE
has decreased to 763,109.8, and R-squared has
increased to 0.7899. Despite these improvements, the
overall error has remained higher than that of RF.
Table 2: Single Model Result Comparison.
Model | MAE Before | RMSE Before | R Square Before | MAE After