predict hypothyroidism and compare them to find the
best predictive model (Prusty, 2022). Khan et al. also
conducted a comparative study of machine learning
algorithms for detecting breast cancer, affirming the
potential of algorithms such as XGBoost (Khan, 2021).
Given that AI has demonstrated strong predictive
performance across a wide range of tasks, this paper
likewise applies several machine learning algorithms
to house price forecasting and analysis.
To address the problem of housing price prediction,
this study uses datasets sourced from Kaggle and
implements several machine learning models,
comparing the performance of each to evaluate its
effectiveness in predicting housing prices.
The results demonstrate the efficacy
of machine learning models in forecasting,
highlighting the strengths and limitations of each
approach in the context of real estate analysis.
2 METHOD
2.1 Dataset Preparation
The dataset used in this study is sourced from Kaggle
and contains 1,460 data points with 81 features
related to housing prices. The features include details
such as street, area, number of rooms, and other
characteristics of the houses. The primary goal of this
dataset is to predict house prices, making this a
regression task.
The data preprocessing steps include separating the
35 numerical and 43 non-numerical columns, handling
missing values, applying one-hot encoding to the
categorical columns, and normalizing the numerical
features. In addition,
the proportions for the train-test split were carefully
chosen to ensure a balanced and accurate model
evaluation. The specific methods and code
implementations used for data preprocessing are
based on industry-standard practices and tailored to
the requirements of this study.
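The preprocessing steps above can be sketched with scikit-learn. This is a minimal illustration on a tiny synthetic stand-in for the Kaggle data: column names such as LotArea and Street are illustrative, and the 80/20 split is an assumed, common choice rather than the study's documented one.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the Kaggle housing data (column names illustrative).
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, None, 14260, 10000],
    "OverallQual": [7, 6, 7, 7, 8, 5],
    "Street": ["Pave", "Pave", "Grvl", "Pave", None, "Grvl"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})
X, y = df.drop(columns="SalePrice"), df["SalePrice"]

# Separate numerical and non-numerical columns.
num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# Impute missing values, one-hot encode categoricals, normalize numericals.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

# Hold out a test set; an 80/20 split is a common, balanced choice.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
X_train_t = preprocess.fit_transform(X_train)   # fit on the training set only
X_test_t = preprocess.transform(X_test)         # then apply to the test set
```

Fitting the transformers on the training split only, then applying them to the test split, avoids leaking test-set statistics into the model.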
2.2 Linear Regression-Based Prediction
Linear regression is a method used to analyze the
relationship between one dependent variable and one
or more independent variables (Su, 2012;
Montgomery, 2021; James, 2023), often used for
prediction. Its principle is to find the line that best fits
the relationship between the independent and dependent
variables; the resulting equation can then be used to
estimate the dependent variable for new values of the
independent variables. In this study, the method is used
to predict house prices from housing features.
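As an illustration, a minimal sketch of fitting a line to synthetic area/price data with scikit-learn; the coefficients, noise level, and feature are invented for the example, not taken from the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: price depends linearly on area, plus noise (numbers illustrative).
rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=(100, 1))                     # area in m^2
price = 3000 * area[:, 0] + 50000 + rng.normal(0, 10000, 100)  # "true" line + noise

# Fit the best-fitting line and use it to estimate the price of an unseen house.
model = LinearRegression().fit(area, price)
pred = model.predict(np.array([[120.0]]))
```

The fitted slope recovers the underlying relationship, and `predict` evaluates the line at a new independent-variable value.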
2.3 Decision Tree-based Prediction
A decision tree is a common machine learning model
used for both classification and regression problems
(De, 2013; Song, 2015). It represents the decision
process as a tree-like structure, where each node
corresponds to a test on a feature that guides the
input toward the most likely outcome.
The decision tree algorithm utilizes a hierarchical
structure to perform classifications. It consists of
several essential components: the root node that holds
all the samples, internal nodes that test specific
feature attributes, and leaf nodes that provide the
decision outcomes. In the prediction phase, the
algorithm examines an internal node's attribute value
to decide the path towards a leaf node, where it
delivers the final prediction. This supervised
learning method operates on if-then-else logic, with
the decision rules learned from the training data
rather than constructed manually.
Indeed, the decision tree is one of the most
straightforward machine learning algorithms, known
for its ease of implementation, clarity, and alignment
with human reasoning. It is widely applicable across
various fields. However, the inherent nature of
decision trees can lead to the creation of overly
complex models. This complexity often results in
poor generalization to unseen data, a problem
commonly referred to as overfitting.
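A minimal sketch contrasting an unrestricted regression tree with a depth-limited one on synthetic data; the data and parameters are illustrative, and limiting `max_depth` is one common way to curb the overfitting described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic area/price data (illustrative numbers only).
rng = np.random.default_rng(1)
X = rng.uniform(50, 250, size=(200, 1))
y = 3000 * X[:, 0] + rng.normal(0, 15000, 200)

# An unrestricted tree memorizes the training set, noise included (overfitting);
# a shallow tree averages over coarser regions and generalizes better.
deep = DecisionTreeRegressor(random_state=0).fit(X, y)
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
pred = shallow.predict(np.array([[120.0]]))
```

The unrestricted tree attains near-perfect training accuracy precisely because it has fit the noise, which is the failure mode the text warns about.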
2.4 XGBoost-Based Prediction
XGBoost is a highly efficient and scalable machine
learning algorithm that implements gradient boosting
for decision trees (Chen, 2016; Nielsen, 2016; Torlay,
2017). It is optimized for performance, supporting
parallel and distributed computing, and is designed
to be fast, scalable, and portable, which makes it
well suited to large datasets and distributed
computing environments.
XGBoost is widely used in data science for tasks
like classification, regression, and ranking. Its key
features include native handling of missing data,
regularization techniques that mitigate overfitting,
and integration with distributed environments such as
Hadoop and MPI.