Predicting Used Car Price Based on Machine Learning

Jiayi Lin

The Dorothy and George Hennings College of Science, Mathematics and Technology,

Wenzhou-Kean University, Wenzhou, Zhejiang, China

Keywords: Price Prediction, Used Car, Machine Learning.

Abstract: With used car sales surpassing new car sales in many countries, the automobile industry holds an important and firmly established position in the global economy. Accurately predicting used car prices helps different interested parties, including consumers, car sellers, and financial institutions, make sound decisions. This paper compares three regression models, Linear Regression (LR), Ridge Regression (RR), and Random Forest (RF), to determine the most reliable method for predicting used car prices. The dataset is sourced from CarDekho and has been preprocessed, including handling of missing values, feature engineering, and outlier detection. RF outperforms the other models, indicating higher prediction accuracy. However, limitations such as the small sample size and potential overfitting indicate the need for further model tuning and data expansion. To increase prediction accuracy and model robustness, future research should concentrate on enhancing data quality, investigating new features, and implementing more sophisticated encoding techniques.

1 INTRODUCTION

The automotive industry is the driving force behind

the economies of almost all industrialized countries,

whose cars account for over 70% of the total global

automobile manufacturing (Onat, 2007). This

dominance highlights the industry’s crucial role in the

global economy. The used car market often faces issues of trust and information asymmetry, because sellers know more about the vehicle than buyers. Despite this, the importance of the market for used cars is still growing (Eckhardt et al., 2022). The sales in this market often exceed the sales of new cars,

particularly in the US, stressing its crucial role in the

modern economy. With the expansion of the market

and the diversification of consumer demand, it has

become an important supplement to new car sales.

This growth highlights the necessity of accurate price

forecasting to support informed decision-making by

both buyers and sellers, solving the inherent

challenges of trust and information gaps. For buyers,

it helps to buy a car at a reasonable price. As for

sellers, accurate price forecasting helps to develop

effective pricing strategies and improve profit

margins. Accurate price forecasts also help financial institutions make more informed decisions on loan evaluations and manage risk effectively, reducing the chance of non-performing loans.

ORCID: https://orcid.org/0009-0005-5842-6594

To address this problem, researchers have developed an iterative framework that blends deep residual networks with extreme gradient boosting (XGBoost) and light gradient boosting machines (LightGBM). This approach improves forecasting accuracy by integrating the outputs of the deep learning model with the original data features, offering a promising solution for improving used car price predictions (Cui et al., 2022). Recent research also stresses the accuracy of artificial neural networks (ANN) in forecasting used car prices compared with conventional techniques like Linear Regression (LR) and Random Forests (RF). An ANN model trained on 140,000 used cars performed better than these methods, with a mean absolute percentage error (MAPE) of 11% and an $R^2$ of 0.96, which demonstrates the potential of ANN to improve price prediction in the context of rising automotive costs and market fluctuations (Pillai, 2022). The rapid growth of the mobile Internet has led to the decline of traditional offline used car trading models, which gave rise to online platforms at the

Lin, J.

Predicting Used Car Price Based on Machine Learning.

DOI: 10.5220/0013270500004568

In Proceedings of the 1st International Conference on E-commerce and Artificial Intelligence (ECAI 2024), pages 553-560

ISBN: 978-989-758-726-9


same time. Precise forecasting of used car prices is important for fair dealing. To improve on the conventional BP neural network (BPNN), Liu et al. (2022) proposed a PSO-GRA-BPNN model that combines particle swarm optimization (PSO) with grey relational analysis (GRA) for feature selection. Their results show that the PSO-GRA-BPNN model outperforms the other models considered, with a mean absolute percentage error (MAPE) of 3.936%, significantly reducing prediction errors. The approach offers a new way of evaluating used car prices and provides more reliable market pricing (Liu et al., 2022). Forecasting used car prices is also vital because of growing market demand. Samruddhi and Kumar (2020) applied K-Nearest Neighbor (KNN) regression, a supervised machine learning model, to predict used car prices. The model was trained on Kaggle data and evaluated under various train-test splits. Its accuracy was about 85%, making it a useful tool for used car price prediction, and K-Fold cross-validation was used to check the robustness and reliability of the model. Recent research has also shown that models such as Artificial Neural Networks (ANN) and RF can greatly improve the prediction of used car prices. By accounting for factors such as bias and data quality, these models outperform simpler methods like LR (Varshitha et al., 2022).

This study uses LR, RR, and RF to predict the price of used cars and identify the best model. Precise forecasting can help different interested parties improve pricing strategies, leading to better decisions and supporting the growth of the used car market.

2 DATASETS

The dataset was collected from the CarDekho website, a popular online platform for buying and selling cars in India. It provides a collection of car-related data that is useful for various analysis and predictive modeling tasks.

2.1 Data Description

The dataset car_data.csv contains information on used cars listed on the CarDekho platform, with a total of 9 columns. Each car's name, year of manufacture, selling price, and other information are included in the dataset. The columns are described in Table 1.

Table 1: Columns of the dataset.

Primary Columns    Explanation
Car_Name           The name of the car, including the make and model.
Year               The year of manufacture.
Selling_Price      The price at which the car is being sold (in INR).
Present_Price      The current price of the car when new (in INR).
Kms_Driven         The total kilometers driven by the car.
Fuel_Type          The type of fuel the car uses (e.g., Petrol, Diesel, CNG).
Seller_Type        The type of seller (e.g., individual, dealer).
Transmission       The type of transmission (e.g., Manual, Automatic).
Owner              The number of previous owners.

2.2 Data Preprocessing

After loading the vehicle sales dataset, data cleaning and feature engineering were performed. Missing values were addressed by either interpolating them or removing the affected rows. Next, a new feature, Age, was created by calculating the difference between 2020 and the production year, which made the Year column redundant, so it was dropped. Finally, the Selling_Price and Present_Price columns were renamed to Selling_Price (lacs) and Present_Price (lacs), and the Owner column was renamed to Past_Owners.
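The feature engineering steps above can be sketched with pandas as follows. This is a minimal illustration, not the author's code: the three rows are hypothetical toy values, and dropping rows with missing values is only one of the two strategies the text mentions.

```python
import pandas as pd

# Hypothetical toy rows that mimic the column layout of car_data.csv.
df = pd.DataFrame({
    "Car_Name": ["car_a", "car_b", "car_c"],
    "Year": [2014, 2013, 2017],
    "Selling_Price": [3.0, 4.5, 7.0],
    "Present_Price": [5.5, 9.5, 10.0],
    "Kms_Driven": [27000, 43000, 7000],
    "Owner": [0, 1, 0],
})

REFERENCE_YEAR = 2020  # reference year stated in the text

# Remove rows with missing values (interpolation would be the alternative).
df = df.dropna()

# Age replaces the production year, so Year becomes redundant.
df["Age"] = REFERENCE_YEAR - df["Year"]
df = df.drop(columns=["Year"])

# Rename the price and owner columns as described above.
df = df.rename(columns={
    "Selling_Price": "Selling_Price (lacs)",
    "Present_Price": "Present_Price (lacs)",
    "Owner": "Past_Owners",
})
```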

To ensure data quality, outlier detection was performed by filtering records whose values exceeded the 99th percentile for numerical features such as Present_Price (lacs), Selling_Price (lacs), and Kms_Driven. This approach identified potential outliers, which were then addressed through data-cleaning processes. As shown in Figures 1 to 4, some outliers are clearly visible.

ECAI 2024 - International Conference on E-commerce and Artiﬁcial Intelligence


Figure 1: Boxplot of Selling_Price (lacs) (Photo/Picture credit : Original).

Figure 2: Boxplot of Present_Price (lacs) (Photo/Picture credit : Original).

Figure 3: Boxplot of Kms_Driven (Photo/Picture credit : Original).

Figure 4: Boxplot of Age (Photo/Picture credit: Original).

Subsequently, a correlation analysis was

conducted using a heatmap to examine the

relationships among numerical features. This analysis

revealed linear relationships between features and

their correlation levels with the target variable,

Selling_Price (lacs). The correlation coefficient

matrix helped identify features with strong

correlations to the predicted target, guiding feature

selection and optimization. As shown in Figure 5, it

can be seen that the target variable has a strong

correlation with Present_Price (lacs) and

Past_Owners.
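As a rough illustration of the correlation analysis, the sketch below computes the Pearson correlation matrix that a heatmap would display and ranks the features by the absolute value of their correlation with the target. The numbers are hypothetical toy values and do not reproduce Figure 5.

```python
import pandas as pd

# Hypothetical numerical features; values are illustrative only.
df = pd.DataFrame({
    "Selling_Price (lacs)": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Present_Price (lacs)": [2.0, 4.0, 6.0, 8.0, 10.0],
    "Age": [9, 8, 4, 3, 1],
})
target = "Selling_Price (lacs)"

# Pearson correlation matrix (the quantity shown in a heatmap).
corr = df.corr()

# Rank the remaining features by the strength of their linear
# relationship with the target, ignoring the sign.
ranked = corr[target].drop(target).abs().sort_values(ascending=False)
```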


Figure 5: Heatmap (Photo/Picture credit : Original).

In the next step, after verifying that the Car_Name

feature had limited contribution and led to overfitting

in the prediction model, it was discarded. One-hot

encoding was then applied to the categorical features

using the pd.get_dummies function to convert them

into numerical form. This process included encoding

Fuel_Type, Seller_Type, Transmission, and

Past_Owners, with drop_first = True to avoid the

dummy variable trap and enhance model

performance.
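The encoding step can be sketched as below, using pd.get_dummies with drop_first=True as named in the text. The rows are hypothetical; with drop_first=True the first category of each feature (in sorted order) is dropped and acts as the baseline.

```python
import pandas as pd

# Hypothetical rows with the categorical features named in the text.
df = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Seller_Type": ["Dealer", "Individual", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
    "Past_Owners": [0, 1, 0, 0],
    "Age": [6, 7, 3, 9],
})

categorical_cols = ["Fuel_Type", "Seller_Type", "Transmission", "Past_Owners"]

# drop_first=True removes one dummy column per feature, which avoids
# perfectly collinear columns (the dummy variable trap).
encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
```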

In order to facilitate model training and

evaluation, the dataset was finally split into training

and test sets. The feature set X included all processed

features, while the target variable y was Selling_Price

(lacs). The data was divided into 80% training and

20% testing sets, allowing for a thorough evaluation

of the model's performance.
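An 80/20 split of this kind can be written with scikit-learn's train_test_split, as in the sketch below. The arrays are placeholders for the processed features and target, and the random_state value is an assumption added only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix X and target y (20 samples, 2 features).
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)

# 80% of the rows go to training and 20% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # random_state is an assumed value
)
```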

3 EXPERIMENT METHODS

3.1 Linear Regression

Linear Regression (LR) is a basic regression algorithm that models the linear relationship between the independent and dependent variables (Maulud & Abdulazeez, 2020). It assumes that the target variable $y$ is a linear function of the features $X$, expressed as:

$$y = X\beta + \epsilon \tag{1}$$

where $\beta$ represents the regression coefficients and $\epsilon$ is the error term.

The model estimates the coefficients $\beta$ by minimizing the residual sum of squares (i.e., the least squares method):

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{2}$$

where $y_i$ is the actual value of the i-th observation and $\hat{y}_i$ is the value predicted by the model.
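To make the least squares idea concrete, the sketch below fits the coefficients with NumPy and evaluates the residual sum of squares. The data are hypothetical and noise-free, generated from y = 1 + 2*x1 + 3*x2, so the fit should recover those coefficients; a column of ones is added so that the first coefficient acts as the intercept.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares estimate of the coefficients.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Residual sum of squares of the fitted model.
y_hat = X_design @ beta
rss = float(np.sum((y - y_hat) ** 2))
```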

3.2 Ridge Regression

Ridge Regression (RR) is an extension of LR that addresses multicollinearity (high correlation among features). To keep the model from overfitting, a regularization term is added to the objective function (Dorugade, 2014):

$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2 \right\} \tag{3}$$

where, for $n$ observations and $p$ features, $\hat{\beta}$ represents the estimated values of the regression coefficients, $y_i$ is the actual output for the i-th observation, $\beta_0$ is the intercept, $\beta_j$ denotes the regression coefficients, $x_{ij}$ represents the j-th feature value for the i-th observation, and $\alpha$ is the regularization parameter that controls the influence of the regularization term on the model.
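The effect of the penalty can be illustrated with a small closed-form solver. This is a sketch under stated assumptions: ridge_fit is a hypothetical helper, the data are toy values, and the intercept is left unpenalized by centering the data first, which matches the objective above where the penalty sums over the feature coefficients only.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept (hypothetical helper)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean  # centering removes the intercept from the problem
    yc = y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) coef = Xc^T yc.
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

b0_ols, b_ols = ridge_fit(X, y, alpha=0.0)       # alpha = 0 reduces to LR
b0_ridge, b_ridge = ridge_fit(X, y, alpha=10.0)  # alpha > 0 shrinks coefficients
```

With alpha set to zero the solver reproduces the ordinary least squares solution, while a positive alpha pulls the coefficient vector toward zero.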

3.3 Random Forest

To increase accuracy and robustness, Random Forest (RF) constructs several decision trees and averages their predictions (Biau & Scornet, 2016). The procedure draws multiple bootstrap subsets from the training data, trains a decision tree on each subset, and, for regression tasks, averages the predictions of all trees. Important model parameters include n_estimators, the number of trees in the forest; max_depth, the maximum depth of each tree; min_samples_split, the minimum number of samples needed to split an internal node; min_samples_leaf, the minimum number of samples required at a leaf node; and max_features, the maximum number of features considered when splitting a node. Hyperparameters were optimized with Randomized Search CV over the following ranges: n_estimators from 500 to 1000 in steps of 100; max_depth from 4 to 8; min_samples_split from 4 to 8 in steps of 2; min_samples_leaf in {1, 2, 5, 7}; and max_features set to either 'auto' or 'sqrt'.
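A search over these ranges could be set up with scikit-learn's RandomizedSearchCV roughly as follows. Several choices here are assumptions rather than details from the paper: the synthetic data, n_iter, cv, scoring, and random_state. In addition, 'auto' is replaced by 1.0 (use all features), because recent scikit-learn releases no longer accept 'auto' for max_features and, for regression forests, it previously meant all features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Search space built from the ranges listed above.
param_distributions = {
    "n_estimators": list(range(500, 1001, 100)),
    "max_depth": list(range(4, 9)),
    "min_samples_split": list(range(4, 9, 2)),
    "min_samples_leaf": [1, 2, 5, 7],
    "max_features": [1.0, "sqrt"],  # 1.0 stands in for the older 'auto'
}

# Hypothetical synthetic data standing in for the processed training set.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(60, 3))
y_train = 2.0 * X_train[:, 0] + X_train[:, 1] ** 2 + rng.normal(0, 0.5, size=60)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=2,                          # assumed; kept tiny so the sketch runs fast
    cv=2,                              # assumed number of folds
    scoring="neg_mean_squared_error",  # assumed metric
    random_state=0,
)
search.fit(X_train, y_train)
best_params = search.best_params_
```

In practice n_iter and cv would be set larger; the small values here only keep the example quick to run.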

4 RESULTS AND DISCUSSION

Figure 6, Figure 7, and Figure 8 show the residual

plots for the LR, RR, and RF, respectively, depicting

the distribution of prediction errors in the training

samples.

Figure 6: Residual plot of train samples of LR (Photo/Picture credit : Original).

Figure 7: Residual plot of train samples of RR. (Photo/Picture credit : Original).

Figure 8: Residual plot of train samples of RF (Photo/Picture credit: Original).


The residual distributions for both LR and RR

primarily fall within the range of -2.5 to 2.5, with the

maximum frequency observed at the 0.0 position.

This indicates that most residuals are close to zero,

suggesting that both models generally provide a good

fit with small prediction errors for the majority of the

data. However, the presence of residuals with larger

magnitudes, though less frequent, still indicates some

variability in prediction accuracy.
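The kind of summary read off a residual plot can be sketched numerically as below. The values are hypothetical, and the sign convention (actual minus predicted) is an assumption, since the text does not state which convention the figures use.

```python
import numpy as np

# Hypothetical actual and predicted prices (in lacs).
y_true = np.array([3.0, 5.5, 7.25, 1.2, 10.0, 4.0])
y_pred = np.array([3.2, 5.1, 7.0, 1.5, 13.0, 4.1])

# Assumed convention: residual = actual - predicted.
residuals = y_true - y_pred

# Share of residuals inside the -2.5 to 2.5 band discussed above.
share_within = float(np.mean(np.abs(residuals) <= 2.5))

# Histogram over the band; the tallest bin shows where residuals concentrate.
counts, edges = np.histogram(residuals, bins=5, range=(-2.5, 2.5))
peak_bin = (float(edges[counts.argmax()]), float(edges[counts.argmax() + 1]))
```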

In contrast, the RF’s residual distribution shows a

higher concentration of smaller residuals around zero

compared to the other models, with a wider overall

spread. While this model fits the training data well

and avoids large errors, the frequent residuals to the

left of zero suggest a tendency to under-predict the

actual values. This consistent bias towards lower

predictions may indicate potential issues such as

overfitting or less accurate predictions in specific

instances.

Based on the images, it can be observed that LR

and RR provide stable predictions with smaller

residuals for most cases, but can struggle with larger

residuals due to linear assumptions or outliers. The

RF, while achieving high accuracy with smaller

residuals overall, may suffer from overfitting or bias

towards under-prediction, which can affect its

generalization to new data.

Figure 9: Predicted vs actual values plot of LR (Photo/Picture credit : Original).

Figure 10: Predicted vs actual values plot of RR (Photo/Picture credit : Original).

Figure 11: Predicted vs actual values plot of RF (Photo/Picture credit : Original).


Figure 9, Figure 10, and Figure 11 display the scatter plots of y_test versus y_pred_test for LR, RR, and RF, respectively, comparing the actual and predicted values of the test samples.

The predicted vs actual values plot is essential for assessing model accuracy. It shows the relationship between the actual test values and the predicted values. An ideal model would show data points closely aligned with the diagonal line where y_test equals y_pred_test. Points above the line indicate under-predictions, while points below suggest over-predictions.
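The same comparison can be summarized numerically. The sketch below, on hypothetical values, counts under-predictions (predicted below actual) and over-predictions (predicted above actual) and computes the coefficient of determination as one possible measure of how closely the points follow the diagonal; the paper itself does not report this metric.

```python
import numpy as np

# Hypothetical test targets and predictions (in lacs).
y_test = np.array([2.0, 4.0, 6.0, 8.0])
y_pred_test = np.array([2.5, 3.5, 6.0, 7.0])

# Under-prediction: the model predicts less than the actual value.
n_under = int(np.sum(y_pred_test < y_test))
# Over-prediction: the model predicts more than the actual value.
n_over = int(np.sum(y_pred_test > y_test))

# Coefficient of determination; 1.0 means every point lies on the diagonal.
ss_res = float(np.sum((y_test - y_pred_test) ** 2))
ss_tot = float(np.sum((y_test - y_test.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
```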

The plots for both LR and RR show that the

majority of data points are closely aligned with the

diagonal line. This alignment indicates that both

models are generally accurate in their predictions.

Although there are some deviations, most predictions

show steady performance, coming in fairly near to the

actual values.

The RF plot shows that most points are also

aligned with the diagonal line. However, there are a

few points above the line, which indicate under-

predictions where the model’s predicted values are

lower than the actual values. This suggests that while

the RF performs well overall, it occasionally

underestimates car prices, reflecting some variability

in its predictions.

In the context of regression analysis, LR and RR are known for their stability and interpretability. These properties make them particularly useful for understanding how the model makes its predictions, since they provide clear explanations of the relationship between the features and the target variable. On the other hand, RF is known for its high prediction accuracy. However, achieving strong performance with this model typically requires careful hyperparameter tuning, which improves the model's ability to generalize to new data and reduces the risk of overfitting.

5 LIMITATIONS AND OUTLOOK

In this research, different regression models were

compared to predict used car prices. However, several

limitations may affect the accuracy of the models.

Firstly, the quantity and quality of the data present

constraints. The dataset used may have a small

sample size, which could limit the effectiveness and

predictive ability of model training. Additionally,

although missing values were addressed, some

outliers may still be present, introducing noise and

affecting prediction accuracy. Furthermore, feature

selection presents challenges. While basic car

information was used, other important factors

influencing car prices, such as specific car

configurations and changes in market demand, may

have been overlooked. The feature engineering

approach also has limitations; for categorical features

like Fuel_Type and Seller_Type, simple One-Hot

encoding may not fully capture their latent

information.

To address these issues, several improvements are

suggested. Expanding the dataset is crucial for

enhancing model performance. Increasing the sample

size can improve the stability of model training and

predictions, particularly by collecting data from

various sources. Improving data quality control,

especially in managing missing values and outliers,

will enhance the reliability of the models.

Additionally, employing more advanced feature

processing techniques can further improve model

performance. Finally, using complex encoding

methods and different transformation techniques

could also contribute to better model performance and

accuracy.

6 CONCLUSIONS

Accurate forecasting of used car prices is important

for consumers to achieve reasonable purchases,

dealers to set effective prices and manage inventory,

and financial institutions to manage risks better. This study evaluated and compared three regression models, LR, RR, and RF, for used car price prediction, and found that RF demonstrated the best performance on the training and test datasets. This suggests that RF is better at capturing the complex nonlinear relationships in the data and provides more accurate predictions. However, there were some limitations due to the small dataset, which may impact model accuracy. Outliers and feature selection issues also affected model performance. Future research should concentrate on improving data quality control and exploring additional features. Using advanced

feature engineering and encoding techniques could

further enhance model performance. Overall, this

research provides insights into forecasting used car

prices and highlights the relative advantages and

disadvantages of different regression models at the

same time. By improving data processing procedures

and model training methods, more reliable

predictions can be achieved in practical applications.


REFERENCES

Biau, G., Scornet, E., 2016. A random forest guided tour.

Test, 25, 197-227.

Cui, B., Ye, Z., Zhao, H., Renqing, Z., Meng, L., Yang, Y.,

2022. Used car price prediction based on the iterative

framework of XGBoost+ LightGBM. Electronics,

11(18), 2932.

Dorugade, A. V., 2014. New ridge parameters for ridge

regression. Journal of the Association of Arab

Universities for Basic and Applied Sciences, 15, 94-99.

Eckhardt, S., Sprenkamp, K., Zavolokina, L., Bauer, I.,

Schwabe, G., 2022. Can artificial intelligence help

used-car dealers survive in a data-driven used-car

market. In International Conference on Design Science

Research in Information Systems and Technology (pp.

115-127). Cham: Springer International Publishing.

Liu, E., Li, J., Zheng, A., Liu, H., Jiang, T., 2022. Research

on the prediction model of the used car price in view of

the PSO-GRA-BP neural network. Sustainability, 14(15),

8993.

Maulud, D., Abdulazeez, A. M., 2020. A review on linear

regression comprehensive in machine learning. Journal

of Applied Science and Technology Trends, 1(2), 140-

147.

Onat, M. G., 2007. Otomotiv sektöründe oranlar yöntemi

aracılığı ile finansal Analiz (Master's thesis, Marmara

Universitesi (Turkey)).

Pillai, A. S., 2022. A Deep Learning Approach for Used Car

Price Prediction. Journal of Science & Technology,

3(3), 31-50.

Samruddhi, K., Kumar, R. A., 2020. Used car price

prediction using K-nearest neighbor based model. Int.

J. Innov. Res. Appl. Sci. Eng. (IJIRASE), 4(3), 2020-

686.

Varshitha, J., Jahnavi, K., Lakshmi, C., 2022. Prediction of

used car prices using artificial neural networks and

machine learning. In 2022 international conference on

computer communication and informatics (ICCCI) (pp.

1-4). IEEE.
