Mao and Yao conducted a similar study using
the King County House Sales dataset to investigate
how geographic features impact housing prices. In that
research, they applied multiple linear regression
(MLR) combined with 10-fold cross-validation to
assess model performance. The findings highlight
that factors including the number of bedrooms,
latitude, and longitude significantly affect housing prices. Their
predictive summary is both methodologically sound
and interpretable, offering detailed views into the
relationship between these variables and property
values (Mao and Yao, 2020). Additionally, Lau
analysed how population change and
homeownership rates affected California housing
prices from 2020 to 2022, potentially using county-level
data. This study applied time-series analysis and
correlation theory to measure relationships between
these variables, with results indicating positive,
negative, or neutral associations (Lau, 2024).
Yu likewise applied linear regression analysis, a
common statistical method, to predict the target
variable as a linear combination of feature
variables. She evaluated these models with
different metrics, such as mean squared error, to
determine model accuracy, an approach that is
both understandable and precise (Yu,
2024). Similarly, Yan used Excel to build
a multiple regression model to examine
comprehensively the factors influencing housing values in New York
(Yan, 2022). Other researchers
have adopted different strategies to identify
these potential causes. Zoppi et al.
employed hedonic models to analyze the environmental
and structural features that affect the housing market
in Cagliari (Zoppi et al., 2015). Huang et al. built
models to collect and study data on unusual
variation in housing prices (Huang et al., 2010). In
summary, this study combines two strategies,
correlation analysis and linear regression, to
investigate the features that potentially
influence changes in housing prices in
California.
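As a rough illustration of the two strategies named above, the sketch below computes a Pearson correlation coefficient, fits a one-variable least-squares line, and scores the fit with mean squared error. It uses only the Python standard library. The function names and the toy income and price values are hypothetical; they are not taken from the dataset or from the cited studies.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: one feature (income) and one target (price).
income = [1.0, 2.0, 3.0, 4.0, 5.0]
price = [1.1, 1.9, 3.2, 3.8, 5.0]

r = pearson_r(income, price)
intercept, slope = fit_line(income, price)
predictions = [intercept + slope * v for v in income]
mse = mean_squared_error(price, predictions)
```

With several features, the same idea extends to multiple linear regression, where a library implementation would normally replace the hand-written fit.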
2 METHODS
2.1 Data Source
This research adopts a housing price dataset
from the Kaggle website. This dataset
offers a practical starting point for exploring machine
learning techniques. It covers housing information
from the 1990 California census, describing
homes across different districts. This dataset includes
region-level statistics such as population, median
income and housing characteristics, providing an
understandable and clear background for building and
evaluating machine learning models. The
original dataset contained several missing values in
the total_bedrooms column. To deal with this issue,
the missing entries were filled with the column's median
value. From the full set of inputs, a random sample of
20,640 records was selected for this research.
The updated dataset includes nine features:
longitude, latitude, housing median age, total rooms,
total bedrooms, population, households, median
income and ocean proximity, along with the target
variable, median house value.
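The median imputation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in this study. The helper name `fill_with_median` and the sample values are hypothetical, and missing entries are assumed to be represented as `None`.

```python
from statistics import median


def fill_with_median(values):
    """Return a copy of values with None replaced by the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical excerpt of a total_bedrooms column with two missing entries.
total_bedrooms = [129.0, None, 190.0, 235.0, None, 280.0]
imputed = fill_with_median(total_bedrooms)
```

The median is used rather than the mean because it is less sensitive to the few very large districts in a skewed column. With pandas, the same step is typically written with `Series.fillna` and `Series.median`.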
2.2 Variables Explanation
Before presenting the model, the study first
lists the independent variables and the
dependent variable investigated in the dataset.
These variables cover longitude, latitude, housing
age, total rooms, total bedrooms, population,
households, income level, and the house value.
The variables are summarized
in Table 1.
Table 1: Variables introduction
Variables Description
Lon