Car Price Prediction Based on Multiple Machine Learning Models

Hangzhi Chen

Wuhan Experimental Foreign Languages School, Wuhan, Hubei, China

Keywords: Car Price Prediction, Linear Regression, Random Forest, XGBoost.

Abstract: This article compares three machine learning models for car price prediction. In recent decades, cars have become increasingly necessary for companies and families around the world. As the demand for cars grows, so do concerns about purchasing decisions, in which price is a key factor. Car price prediction is therefore a meaningful topic. In this paper, three machine learning models, Linear Regression, Random Forest and XGBoost, are compared. The models are applied to a training dataset and a test dataset, and their performance is evaluated by Root Mean Squared Error (RMSE) and accuracy. On the training dataset, the RMSE values for Linear Regression, Random Forest and XGBoost are 534,865.15, 155,706.68 and 302,861.88, and the accuracies are 61.64%, 96.75% and 87.70%, respectively. On the test dataset, the RMSE values are 555,802.12, 338,065.9 and 337,698.16, and the accuracies are 58.44%, 84.62% and 84.66%, respectively. The overall conclusion is that Random Forest fitted the training data better than XGBoost, but the two models performed almost the same when predicting car prices on the test data. Linear Regression performed the worst on both datasets.

1 INTRODUCTION

Humans rely on a wide range of means of transportation, and cars are among the most popular. Invented in 1886, cars have become an increasingly indispensable part of people's lives. More companies and families now own vehicles of different types, and many own two or more. Cars are often regarded as dependable tools for work and as reliable companions for travellers. They have greatly increased the efficiency of mobility and expanded the range of human activities on land. In recent years, the demand for cars has surged. When purchasing a car, price is a vital factor, and it can differ greatly depending on a variety of aspects. After the pandemic and dramatic changes in energy resources, wide price swings are widely expected to be common in the car market in the near future. More advanced methods are therefore needed to predict car prices and thereby support vehicle purchase decisions. Such methods would also help vehicle companies price their automobiles and define their future development direction.

ORCID: https://orcid.org/0009-0006-6242-0598

With the popularization of Artificial Intelligence (AI) and related technologies (Roll, 2016; Chen, 2022; Patil, 1988), algorithms play vital roles in many settings. Many representative algorithms have appeared, including random forest, decision tree and logistic regression. Machine learning algorithms have been widely applied in fields such as chemistry, biomedicine and especially business analytics. For instance, Diamantaras et al. used eight machine learning models to predict airfare prices (Diamantaras, 2017). Karthikeyan et al. used ensemble-based machine learning techniques to predict gold prices (Karthikeyan, 2019). Alipanahi et al. discussed how machine learning can be used to determine how variations in the Deoxyribonucleic Acid (DNA) of individuals affect the risk of different diseases (Alipanahi, 2015). Car price prediction is another important application. For instance, Gegic et al. used an artificial neural network, a support vector machine and a random forest to predict car prices (Gegic, 2019). These studies indicate that machine learning can produce accurate predictions. Therefore, this paper uses machine learning models to predict car prices from various features of the cars and analyzes the importance of the contributing factors.

Chen, H.
Car Price Prediction Based on Multiple Machine Learning Models.
DOI: 10.5220/0013509000004619
In Proceedings of the 2nd International Conference on Data Analysis and Machine Learning (DAML 2024), pages 92-95
ISBN: 978-989-758-754-2

The dataset used in the study is taken from Kaggle (Kaggle, 2021). The study uses three machine learning models to predict car prices: Linear Regression, Random Forest and XGBoost. The implementation followed a notebook published on Kaggle (Kaggle, 2019). The metrics used to compare the models are Root Mean Squared Error (RMSE) and accuracy. The feature importance function of Random Forest was also used to evaluate the importance of the contributing factors.

2 METHOD

2.1 Dataset Preparation

The dataset used in this study is taken from Kaggle

(Kaggle, 2021). The original dataset includes

information of 6, 1500 vehicles and 22 variables.

Before the dataset was input to the models, 11 of the variables, such as 'url' and 'city', were dropped because of NaN values in those columns. In addition, the 'Year' and 'Odometer' features were converted to integer type.

The dataset was then split into a training dataset and a test dataset. The training dataset included 525,839 rows and the test dataset included 34,440 rows.
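
The preparation steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study. The toy records, the helper names `prepare_rows` and `split_rows`, the 25% test fraction and the random seed are all assumptions made for the example.

```python
import random


def prepare_rows(rows, drop_columns, int_columns):
    """Drop unwanted columns and cast the listed columns to int."""
    prepared = []
    for row in rows:
        kept = {key: value for key, value in row.items() if key not in drop_columns}
        for column in int_columns:
            # Values such as "2015.0" are parsed as float first, then truncated to int.
            kept[column] = int(float(kept[column]))
        prepared.append(kept)
    return prepared


def split_rows(rows, test_fraction, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy records; the real dataset has far more rows and columns.
raw_rows = [
    {"url": f"listing-{i}", "city": "city-a", "Year": f"{2010 + i}.0",
     "Odometer": f"{90000 - 5000 * i}.0", "price": 5000 + 1000 * i}
    for i in range(8)
]

rows = prepare_rows(raw_rows, drop_columns={"url", "city"}, int_columns=["Year", "Odometer"])
train_rows, test_rows = split_rows(rows, test_fraction=0.25)
print(len(train_rows), len(test_rows))
```

The sketch assumes that every remaining value is present and numeric; rows with missing values would need separate handling before the integer conversion.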

2.2 Machine Learning Models-based Prediction

This study used Linear Regression, Random Forest

and XGBoost for car price prediction. The models

were imported from sklearn. The evaluation metrics

applied in this study included RMSE and accuracy.

2.2.1 Linear Regression

Linear Regression is a widely utilized machine

learning model, commonly employed in statistical

analysis to explore and predict relationships between

variables. It operates on the principle of fitting a

straight line to a set of data points in a manner that

best captures the underlying relationship between two

features, typically used for continuous target

variables.

In practice, Linear Regression analyzes paired

data to determine a direct correlation, making it

especially valuable in scenarios where the

relationship between the independent variable (or

predictor) and the dependent variable (or outcome) is

expected to be linear. This model constructs a linear

equation based on the input dataset, plotting a line in

a two-dimensional coordinate system aimed at best

fitting the observed data points. This linear

representation enables predictions regarding future

outcomes by projecting the dependent variable's

values based on current trends.

However, the real-world application of Linear

Regression acknowledges that the model cannot

perfectly fit every data point. There will invariably be

deviations between the actual values and the values

predicted by the model, which are termed as

residuals. The presence of these residuals indicates

the error inherent in the model predictions, reflecting

the nuances and variability in real-world data that

cannot be fully captured by a linear model.
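
To make the idea of a fitted line and its residuals concrete, the following sketch fits a least-squares line to one predictor in plain Python. It is an illustration only: the study used a library implementation with several features, and the toy numbers and the function name `fit_line` are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: x could be years since a reference year,
# y the price in thousands.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

slope, intercept = fit_line(xs, ys)
predictions = [intercept + slope * x for x in xs]
# Residual = actual value minus value predicted by the fitted line.
residuals = [y - p for y, p in zip(ys, predictions)]
print(slope, intercept, residuals)
```

With an intercept in the model, the least-squares residuals sum to zero up to floating-point error, even though individual residuals are not zero.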

To evaluate the performance of a Linear

Regression model, a metric known as Root Mean

Square Error (RMSE) is used. RMSE measures the

average magnitude of the residuals, providing insight

into how much error is typically involved in the

predictions made by the linear model. A lower RMSE

value indicates a better fit of the model to the data,

suggesting that the linear equation is effective at

capturing the primary trends and reducing the

prediction errors.
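
The two evaluation metrics can be sketched as below. RMSE follows its standard definition. The paper does not state how accuracy is computed for a regression task, so the sketch assumes it is the coefficient of determination (R squared) expressed as a percentage; this is an assumption, and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_percent(y_true, y_pred):
    """R squared as a percentage (assumed here to be the 'accuracy' metric)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 100.0 * (1.0 - ss_res / ss_tot)


# Hypothetical toy values.
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

error = rmse(y_true, y_pred)
accuracy = r2_percent(y_true, y_pred)
print(error, accuracy)
```

RMSE is expressed in the same unit as the target, so its magnitude depends on the price scale, whereas R squared is unit-free.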

Despite its simplicity, Linear Regression has

profound implications across various fields, from

economics and finance to healthcare and social

sciences. It aids in understanding and quantifying the

strength of relationships between variables,

facilitating decision-making processes by providing a

clear, quantifiable link between different variables.

For instance, businesses often use Linear Regression

to predict sales based on advertising spend, or

healthcare providers might use it to predict patient

outcomes based on treatment protocols.

2.2.2 Random Forest

Random Forest is an ensemble learning method that

builds on the simplicity of decision trees by

combining multiple such trees to form a more robust

and accurate model. This technique leverages the

strength of numerous decision trees, each serving as

an estimator, to improve predictive accuracy and

control over-fitting, which is a common problem in

single decision tree models. The core idea behind

Random Forest is to create a 'forest' of trees where

each tree is slightly different from the others.

The model begins by using a technique called

Bootstrap Aggregating, or bagging, to create different

subsets of the original dataset, with replacement.

Each subset is used to train a separate decision tree.

Because each subset is drawn at random, the trees

in the forest are not trained on identical data, which

helps in making the ensemble of trees less sensitive

to the specifics of the training data, thereby enhancing

the generalization capabilities of the model.

During prediction, Random Forest takes an

average of the outputs from all the decision trees for

regression problems, or a majority vote for

classification, to improve predictive accuracy. By

aggregating the results, the model not only reduces

the variance but also combines the individual trees’

predictions to produce a more accurate and stable

estimation than any single tree could provide.
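
The bagging and averaging steps can be illustrated with a small pure-Python sketch. It uses depth-1 regression trees (stumps) on a single feature, so the per-split feature sampling that a full Random Forest also performs is omitted. The function names, toy data, number of trees and seed are assumptions for the example and do not reproduce the library model used in the study.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: choose the threshold with the lowest squared error."""
    best = None
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical: fall back to predicting the mean.
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        indices = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in indices], [ys[i] for i in indices]))
    return forest


def predict_forest(forest, x):
    """Regression output: the average of the individual tree predictions."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


# Hypothetical toy data with a step in the target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 10, 10, 10, 20, 20, 20, 20]

forest = fit_bagged_stumps(xs, ys)
low_prediction = predict_forest(forest, 1)
high_prediction = predict_forest(forest, 8)
print(low_prediction, high_prediction)
```

Because every stump sees a different bootstrap sample, the individual thresholds vary, and averaging smooths out that variation.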

Despite its many benefits, such as high accuracy,

the ability to handle a large number of input variables,

and robustness to overfitting, the Random Forest

model does have some drawbacks. One significant

limitation is the loss of the interpretability that a

single decision tree offers. The complex

structure of multiple trees makes it difficult to

visualize and understand the specific reasons behind

predictions, thus complicating the explanation of

model decisions to stakeholders who are not technical

experts.

Furthermore, while Random Forest is generally

good at handling overfitting, it can still perform

poorly in the presence of noisy classification or

regression problems. This occurs because noise can

lead to overly complex trees, which, even when

averaged, result in an overfitted model. Therefore,

while Random Forest is a powerful tool in the arsenal

of machine learning techniques, it requires careful

tuning of parameters like the number of trees in the

forest and the depth of each tree to optimize its

performance and prevent overfitting. This ensemble

method, hence, offers a balance between accuracy

and complexity, providing a reliable predictive tool

across various applications in fields ranging from

finance to healthcare.

2.2.3 XGBoost

XGBoost is an abbreviation for eXtreme Gradient Boosting. It is a machine learning model based on boosted trees that supports massively parallel computation. The overall model is additive, made up of multiple base models. When growing a tree, split points can be found either by an exact greedy algorithm, which searches for the best split, or by an approximate algorithm suited to massive data. The advantages of XGBoost include high accuracy and flexibility, shrinkage of the learning rate, built-in regularization, column sampling and support for parallel computation. It also has disadvantages: the whole dataset has to be traversed during split finding, and the pre-sorting process requires relatively high space complexity to store the indices of the gradient statistics for the corresponding samples.
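
The additive structure of boosting can be sketched in plain Python. The example below is generic gradient boosting for squared error with depth-1 trees and a shrinkage factor; it is not XGBoost itself, which additionally uses second-order gradient information, a regularization term and column sampling. All names, the toy data, the number of rounds and the learning rate are assumptions.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: choose the threshold with the lowest squared error."""
    best = None
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical: fall back to predicting the mean.
        mean = sum(targets) / len(targets)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Additive model: start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrinkage: add only a fraction of each new tree's output.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(stump, x) for stump in stumps)


# Hypothetical toy data with a step in the target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 10, 10, 10, 20, 20, 20, 20]

model = fit_boosted_stumps(xs, ys)
low_prediction = predict_boosted(model, 1)
high_prediction = predict_boosted(model, 8)
print(low_prediction, high_prediction)
```

Unlike bagging, where trees are trained independently, each boosting round depends on the predictions accumulated in the previous rounds.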

3 RESULTS AND DISCUSSION

The prediction results are shown in Table 1. On the training dataset, Random Forest reached the highest accuracy of 96.75%, XGBoost performed fairly well with 87.70%, and Linear Regression was the worst of the three with 61.64%. The same ranking holds for the training RMSE: Random Forest had the lowest value of 155,706.68, followed by XGBoost with 302,861.88, and Linear Regression had the highest value of 534,865.15.

Table 1. The performance of different models

Model name          Train Accuracy   Train RMSE    Test Accuracy   Test RMSE
Linear Regression   61.64%           534,865.15    58.44%          555,802.12
Random Forest       96.75%           155,706.68    84.62%          338,065.9
XGBoost             87.70%           302,861.88    84.66%          337,698.16

However, on the test dataset, the accuracies of Random Forest and XGBoost were almost the same, at 84.62% and 84.66% respectively, while Linear Regression again performed the worst with an accuracy of 58.44%. The same pattern emerged for RMSE. The RMSE values of Random Forest and XGBoost were about the same, 338,065.9 and 337,698.16 respectively, and Linear Regression remained the largest with a value of 555,802.12, close to its training RMSE.

Overall, Random Forest fitted the training dataset noticeably better than XGBoost. However, according to the test dataset results, Random Forest predicted only as accurately as XGBoost did. Linear Regression performed the worst on both datasets.
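
The difference between training and test performance can be made explicit with a short calculation on the values reported in Table 1. The sketch below only rearranges those published numbers; the dictionary layout and the function name `generalization_gaps` are assumptions.

```python
# Values copied from Table 1 (accuracy in percent).
table_1 = {
    "Linear Regression": {"train_acc": 61.64, "train_rmse": 534865.15,
                          "test_acc": 58.44, "test_rmse": 555802.12},
    "Random Forest": {"train_acc": 96.75, "train_rmse": 155706.68,
                      "test_acc": 84.62, "test_rmse": 338065.9},
    "XGBoost": {"train_acc": 87.70, "train_rmse": 302861.88,
                "test_acc": 84.66, "test_rmse": 337698.16},
}


def generalization_gaps(table):
    """Per model: accuracy drop (percentage points) and test/train RMSE ratio."""
    gaps = {}
    for model, metrics in table.items():
        gaps[model] = {
            "accuracy_drop": metrics["train_acc"] - metrics["test_acc"],
            "rmse_ratio": metrics["test_rmse"] / metrics["train_rmse"],
        }
    return gaps


gaps = generalization_gaps(table_1)
for model, gap in gaps.items():
    print(f"{model}: accuracy drop {gap['accuracy_drop']:.2f} points, "
          f"test/train RMSE ratio {gap['rmse_ratio']:.2f}")
```

On these numbers, Random Forest shows the largest drop from training to test (about 12 percentage points), while the other two models drop by about 3 points. This is consistent with, though it does not prove, some overfitting of Random Forest to the training data.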

One possible reason is that Linear Regression fits a single model in one pass, whereas Random Forest and XGBoost each build multiple trees. Moreover, both Random Forest and XGBoost generate multiple intermediate results and combine them, while Linear Regression generates only one.

Notably, some of these explanations are based only on an understanding of the algorithms and lack further supporting data. However, in a study by Vishal Khandare and Manish Pandey, who used the same three models to predict electric car prices, the results were almost the same (Khandare, 2022). In that study, Random Forest performed the best, reaching an accuracy of 90.38%, while XGBoost reached 89% and Linear Regression achieved 67%. The results of this paper are therefore credible to a certain extent. However, the reasons and logic behind them still need further research.

4 CONCLUSIONS

This paper compared the performance of three machine learning models: Linear Regression, Random Forest and XGBoost. The metrics used to assess the results were Root Mean Squared Error (RMSE) and accuracy. The results show that Random Forest fitted the training data better than XGBoost, but the two performed almost the same in predicting car prices on the test data. Linear Regression performed the worst on both datasets. One possible reason is that Linear Regression fits a single model in one pass and produces only one result, far fewer than the other two models, which combine many trees.

This article supports the feasibility of using machine learning models for car price prediction and discusses why their performance differed. The reasons given above reflect the author's own understanding and require more data to support them. However, they can still serve as a starting point for analyzing different machine learning models and selecting the best one for car price prediction.

REFERENCES

Alipanahi, B., Leung, M. K., Delong, A., & Frey, B. J.

2015. Machine learning in genomic medicine: a review

of computational problems and data sets. Proceedings

of the IEEE, 104(1), 176-197.

Chen, X., Zou, D., Xie, H., Cheng, G., & Liu, C. 2022. Two

decades of artificial intelligence in education.

Educational Technology & Society, 25(1), 28-47.

Diamantaras, K. I., Tziridis, K., Kalampokas, T., &

Papakostas, G. A. 2017. Airfare prices prediction using

machine learning techniques. 2017 25th European

Signal Processing Conference (EUSIPCO), 1036-1039.

IEEE.

Gegic, E., Isakovic, B., Keco, D., Masetic, Z., & Kevric, J.

2019. Car price prediction using machine learning

techniques. TEM Journal, 8(1), 113.

Kaggle. 2019. Used cars price prediction by 15 models.

Retrieved from https://www.kaggle.com/code/vbmokin/used-cars-price-prediction-by-15-models/notebook

Kaggle. 2021. Craigslist carstrucks data. Retrieved from

https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data

Karthikeyan, P., & Manjula, K. A. 2019. Gold price

prediction using ensemble based machine learning

techniques. 2019 3rd International Conference on

Trends in Electronics and Informatics (ICOEI), 1360-1364. IEEE.

Khandare, V., & Pandey, M. 2022. Electric Car Price

Prediction using Machine Learning Techniques.

International Journal of Engineering Research in

Computer Science and Engineering (IJERCSE), 9(11).

Patil, R. S., Szolovits, P., & Schwartz, W. B. 1988.

Artificial intelligence in medical diagnosis. Annals of

internal medicine, 108(1), 80-87.

Roll, I., & Wylie, R. 2016. Evolution and revolution in

artificial intelligence in education. International journal

of artificial intelligence in education, 26, 582-599.
