Random Forest has gained increasing popularity
in recent literature for its robustness and ability to
model non-linear relationships. Valarmathi et al.
achieved an R² of 0.9621, MSE of 0.044, and RMSE
of 0.2096 using RF on a multi-brand car dataset
containing 19 features (Valarmathi et al., 2023). The
model’s internal structure enables it to capture
complex variable interactions and automatically
assess feature importance, making it less sensitive to
irrelevant inputs. Chen et al. reported that, in a
universal model with over 100,000 samples and 19
features, Random Forest achieved a Normalized
Mean Squared Error of 0.052, compared with 0.26 for
Linear Regression, an approximately
80% reduction in prediction error (Chen et al., 2017).
However, RF sacrifices interpretability and carries a
higher computational cost, which may limit its
application in real-time or user-facing environments.
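The figures quoted in this paragraph can be cross-checked with simple arithmetic. The sketch below uses only the values reported above; the variable names are illustrative and are not taken from the cited studies.

```python
import math

# Values as quoted in the text (Chen et al., 2017; Valarmathi et al., 2023).
nmse_rf = 0.052   # Random Forest, normalized MSE
nmse_lr = 0.26    # Linear Regression, normalized MSE

# Relative reduction in error when moving from LR to RF.
relative_reduction = 1 - nmse_rf / nmse_lr

# RMSE is the square root of MSE, so the two reported values should agree
# up to rounding of the published MSE.
mse_reported = 0.044
rmse_reported = 0.2096
rmse_from_mse = math.sqrt(mse_reported)

print(f"Relative NMSE reduction: {relative_reduction:.0%}")
print(f"sqrt(MSE) = {rmse_from_mse:.4f} vs reported RMSE = {rmse_reported}")
```

The ratio gives a reduction of 80%, and the square root of the reported MSE (about 0.2098) matches the reported RMSE to three decimal places, which is consistent with the MSE having been rounded for publication.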
K-Nearest Neighbors, though less common, has
been explored as a non-parametric alternative that
performs well in small, well-prepared datasets. Das
Adhikary et al. applied KNN with careful feature
normalization and reported an RMSE of 6.72, noting
that the model achieved reasonable accuracy despite
its simplicity (Adhikary et al., 2022). Other studies
comparing KNN with LR found that KNN performed
better for certain vehicle segments with more
localized feature patterns but degraded in high-
dimensional spaces due to its reliance on distance
metrics (Samruddhi and Kumar, 2020). KNN also
lacks internal mechanisms for feature selection,
making it highly sensitive to unscaled or noisy inputs.
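The sensitivity of KNN to unscaled inputs can be illustrated with a minimal sketch. The rows below are invented for illustration, and min-max scaling is assumed as the normalization step; the cited studies do not specify which scheme they used.

```python
import math

def fit_min_max(rows):
    """Return (min, max) for each feature column of the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_min_max(row, bounds):
    """Rescale one row to [0, 1] per feature using the training bounds."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, bounds)]

def knn_predict(train_x, train_y, query, k=1):
    """Average the targets of the k training rows nearest to query (Euclidean)."""
    ranked = sorted(zip((math.dist(x, query) for x in train_x), train_y))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy rows, invented for illustration: [mileage in miles, age in years] -> price in GBP.
train_x = [[20000, 10], [21000, 2], [60000, 2], [90000, 9]]
train_y = [3000, 12000, 9000, 2500]
query = [20100, 2]  # a nearly new, low-mileage car

# Unscaled: mileage differences (hundreds to thousands) swamp age differences (single digits).
raw_pred = knn_predict(train_x, train_y, query, k=1)

# Scaled: both features contribute on a comparable [0, 1] range.
bounds = fit_min_max(train_x)
scaled_x = [apply_min_max(row, bounds) for row in train_x]
scaled_pred = knn_predict(scaled_x, train_y, apply_min_max(query, bounds), k=1)

print(raw_pred, scaled_pred)
```

On the raw features the nearest neighbour is the ten-year-old car, because a 100-mile difference in mileage outweighs an eight-year difference in age. After scaling, the nearest neighbour is the two-year-old car with similar mileage, which is the more plausible match.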
Across these studies, feature engineering
consistently emerged as a critical factor influencing
model success. Pudaruth and others emphasized that
preprocessing steps, such as log transformation of the
target variable, imputation of missing values, and
encoding of categorical data, significantly improved
model accuracy, especially for LR and KNN
(Pudaruth, 2014; Adhikary et al., 2022). In contrast,
RF demonstrated greater robustness to imperfect
features due to its ensemble structure, which reduces
the impact of noise and redundancy (Valarmathi et
al., 2023; Chen et al., 2017).
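The three preprocessing steps named above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the listings are invented, the `fuel` column is hypothetical, and median imputation and one-hot encoding are assumed choices rather than the methods of the cited studies.

```python
import math
from statistics import median

# Toy listings, invented for illustration; None marks a missing value.
listings = [
    {"price": 5000, "mileage": 60000, "fuel": "Petrol"},
    {"price": 12000, "mileage": None, "fuel": "Diesel"},
    {"price": 8000, "mileage": 30000, "fuel": "Petrol"},
]

# 1. Impute missing numeric values with the median of the observed ones.
observed = [r["mileage"] for r in listings if r["mileage"] is not None]
mileage_median = median(observed)

# 2. One-hot encode the categorical column using the categories seen in the data.
fuel_levels = sorted({r["fuel"] for r in listings})

features, targets = [], []
for r in listings:
    mileage = r["mileage"] if r["mileage"] is not None else mileage_median
    one_hot = [1 if r["fuel"] == level else 0 for level in fuel_levels]
    features.append([mileage] + one_hot)
    # 3. Log-transform the target; this compresses large prices more than small ones.
    targets.append(math.log(r["price"]))

# Predictions made on the log scale are mapped back to GBP with exp().
recovered = [round(math.exp(t)) for t in targets]
```

In practice the imputation value and the category list would be derived from the training split only and then reused on the test split, so that no information from the test data leaks into the preprocessing.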
Additionally, dataset scale and complexity were
found to influence model choice. RF maintained high
performance as dataset size and feature interactions
increased, while LR and KNN were more effective in
smaller, cleaner datasets with linear relationships
(Asghar et al., 2021; Samruddhi and Kumar, 2020).
These findings underscore the importance of aligning
model selection with data characteristics and
analytical goals.
In summary, while each model presents distinct
strengths and limitations, Random Forest generally
excels in handling complex, high-dimensional data,
Linear Regression offers interpretability under clean
linear conditions, and KNN provides a simple yet
effective solution in small, structured contexts.
Although the existing literature provides valuable
insights into used car price forecasting, research has
focused primarily on broader or non-UK markets,
leaving significant knowledge gaps regarding the
unique dynamics of the UK used car market.
Furthermore, interpretability remains a practical
challenge given the prevalence of complex
algorithms. Therefore, this study adopts
linear regression and machine learning methods to
balance prediction accuracy with
interpretability, investigate key influencing factors,
and improve the reliability of price predictions within
the UK. Ultimately, this study aims to support the
Sustainable Development Goals by improving market
transparency and facilitating informed consumer
decision-making.
2 METHODOLOGY
2.1 Data Source
The dataset used in this research was sourced from
Kaggle, where it is publicly available under the CC0
Public Domain License and had been downloaded
2,631 times. It contains detailed records of 3,685 used
cars listed in the UK, with attributes
covering technical specifications, vehicle features,
and market indicators. The 14 variables and their
descriptions are summarized in Table 1.
Table 1: Summary of Original Variables in the Used Car Dataset.
Variable Name | Description | Type
Unnamed: 0 | Index column (not used) | Numerical
title | Full title of the car listing (e.g., SKODA Fabia, Vauxhall Corsa...) | Categorical
Price | Selling price of the vehicle in GBP | Numerical
Mileage(miles) | Total distance the vehicle has traveled | Numerical