Price Prediction of Ford Cars Applying Multiple Machine Learning Methods

Yicheng Wang

School of Computer, Nanjing University of Information Science & Technology, Jiangsu, China

Keywords: Price prediction, Ford Motors, Machine Learning, eXtreme Gradient Boosting.

Abstract: The growing supply of used cars and lower prices compared with their counterparts make the used car market

highly competitive. Such a dynamic and sophisticated market underscores the necessity for accurate price

prediction, which is crucial for both buyers and sellers. Nowadays, the popularisation of online databases and

advanced machine learning techniques have made price prediction based on machine learning models an

obvious trend in the used car trading industry. This study delves into the performance evaluation of multiple

machine learning methods for predicting used Ford car prices. Utilizing a comprehensive dataset from Kaggle,

encompassing 17,966 entries with nine distinct features, the researcher employed a battery of regression

models, including linear regression (LR), decision tree (DT), random forest (RF) and eXtreme Gradient

Boosting (XGBoost). The approach involved rigorous feature engineering, model training and cross-validation evaluation, employing Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared ($R^2$) as key performance indicators. The results indicate that XGBoost and RF surpass traditional models in predictive accuracy, with XGBoost demonstrating the highest $R^2$ value of 0.9347. This study compares the

performance of several widespread models and offers practical implications for stakeholders seeking to

enhance transactional outcomes through data-driven pricing strategies.

1 INTRODUCTION

Ford Motors is the second largest automobile manufacturer in the United States. Since 1903, the company has launched many iconic models, such as the Mustang, Bronco, Cortina and Transit. As the first motor manufacturer to implement assembly lines and vertical integration in industrial production, Ford has contributed substantially to the modern car industry (Mead & Brinkley, 2003). Trusted quality and a moderate depreciation rate make used Ford cars popular in the market.

Over the past few decades, the flourishing second-

hand car market has become a vibrant sector within

the automotive industry, offering consumers a

reasonably-priced alternative to a brand-new

counterpart. According to Edmunds, an online

resource that provides car buying information and

vehicle sales data, approximately 39 million used

vehicles were sold in the U.S. throughout the year

2021. Manheim, a global provider of vehicle remarketing services, introduced the Manheim Used Vehicle Value Index to reflect the fluctuation of America’s used car prices (Manheim, 2023). The index showed an upward trend in price over the last five years.

https://orcid.org/0009-0006-7569-6219

Accurate price prediction for used cars is not only

crucial for buyers seeking value but also essential for

sellers aiming to maximize their profits. As the

volume of transactions continues to grow, the

precision of used car pricing has garnered

considerable interest, prompting researchers to

pursue better performance of various predictive

models.

The complexity of the used car market requires

advanced and sophisticated analytical tools to

evaluate the fair value of vehicles based on multi-

dimensional factors. The manufacturer who produces

the vehicle, mileage, transmission and fuel economy

all contribute to the training process of machine

learning models. The application of data-driven

models has transformed the way prices are estimated,

moving from heuristic methods to evidence-based

approaches. Cui et al. (2022) discussed the integration of machine learning algorithms in predicting used car prices, which is vital in enhancing the accuracy and efficiency of predictions.

Wang, Y.
Price Prediction of Ford Cars Applying Multiple Machine Learning Methods.
DOI: 10.5220/0013214800004568
In Proceedings of the 1st International Conference on E-commerce and Artificial Intelligence (ECAI 2024), pages 280-285
ISBN: 978-989-758-726-9

Previous scholarly endeavours in the domain of

used car price prediction have primarily leveraged

models such as linear regression (LR), decision tree

(DT), and neural networks. LR models, despite their

simplicity, offer a baseline for understanding the

relationship between car attributes and their actual

price (Dahal, 2023). Jin (2021) constructed DT

models, including ensemble methods such as random

forest (RF) in their research. Ensemble models were

also widely applied due to their ability to deal with

feature interactions under non-linear correlations and

provided robust predictions. Neural networks,

particularly deep learning architectures, have

demonstrated remarkable potential in capturing

complex patterns, especially when applied to datasets

on a large scale (Huang et al., 2023).

The evolution of these models reflects the

growing need for more nuanced and dynamic pricing

strategies. However, the heterogeneity of used cars

and the rapid pace of technological advancements in

the automotive industry present ongoing challenges.

The quest for more accurate and efficient predictive

models continues, with recent studies exploring

ensemble methods and hybrid approaches that

combine the strengths of multiple algorithms.

Previous studies have shown the necessity of combining multiple regressors to better evaluate used cars’ prices. This paper aims to evaluate the performance of different regressors and identify the critical factors that influence used cars’ value. The paper first introduces the dataset used in the study and then discusses the main methodologies and techniques applied. Finally, it offers an overview of the research and discusses limitations and possible optimization methods.

2 DATASETS AND METHODS

2.1 Data Collection and Description

The dataset used in the research is obtained through

Kaggle. It highlights 9 features that might influence

the pricing procedure of used Ford vehicles. Those

features include the specific model of the car and its

production year, both of which depict the overall

price section of certain models. However, prices of

the same models vary significantly as each car has

different mileage, engine size, transmission, fuel

consumption and tax. The dataset collects detailed

information on 17,966 used Ford vehicles priced in

U.S. dollars. The characteristic properties used in the research are listed in Table 1.

Table 1: Basic characteristic properties of the data.

ID  Features      Type
1   model         object
2   year          int64
3   price         int64
4   transmission  object
5   mileage       int64
6   fuelType      object
7   tax           int64
8   mpg           float64
9   engineSize    float64

Among the features listed in Table 1, transmission indicates the type of gearbox fitted to the car: manual, semi-auto or automatic. Mpg stands for miles per gallon, an indicator of the vehicle’s fuel economy. Engine size gives the engine’s swept volume in litres.

Before modelling, the author conducted an Exploratory Data Analysis (EDA). The dataset contains no null values and only a small number of duplicate records. However, the distribution of samples across model, transmission and fuel type is highly imbalanced. To improve the performance of the models applied, the paper handles these rare categories through feature engineering.

2.2 Feature Engineering

Before model training, it is vital to ensure data quality through data pre-processing. To optimize the performance of the selected models, the author processes the data in the following steps. The first step is to divide the characteristic properties into two groups, numerical and categorical. The grouping differs slightly from the usual convention: engine size and tax are treated as categorical features because they cluster at a few discrete values.

After grouping, the researcher focuses on rare values among the categorical features. To reduce the disturbance caused by those records, a threshold of 0.05 is set: any category that accounts for less than 5% of the samples of its feature is merged into a new category called “Other”. Figure 1 shows the distribution of values for four features: model, engine size, fuel type and tax. All transmission options meet the 0.05 threshold, so its bar chart needs no update.
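The grouping rule above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the author's code: the function name `group_rare_categories` and the toy column are hypothetical, and only the 0.05 threshold and the "Other" label come from the text.

```python
from collections import Counter


def group_rare_categories(values, threshold=0.05, other_label="Other"):
    """Replace categories whose relative frequency is below `threshold`."""
    counts = Counter(values)
    total = len(values)
    rare = {cat for cat, n in counts.items() if n / total < threshold}
    return [other_label if v in rare else v for v in values]


# Hypothetical toy column (not taken from the dataset): 100 cars, 5 models.
models = ["Fiesta"] * 50 + ["Focus"] * 40 + ["Kuga"] * 8 + ["Puma"] + ["Ka+"]
grouped = group_rare_categories(models)
print(Counter(grouped))  # Puma and Ka+ (1% each) are merged into "Other"
```

The same function can be applied to each categorical column in turn, including engine size and tax once they are treated as categories.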


Figure 1: Distribution of values across selected features. (a) Distribution of Model types, (b) Distribution of Fuel types, (c)

Distribution of Tax, (d) Distribution of Engine Size (Photo/Picture credit: Original).

Each categorical feature f is then one-hot encoded. The encoding converts the feature into a series of binary columns, one per category, so that each sample has a 1 in the column of its own category and 0s in all others. After the encoding procedure, the author checked the correlation of each encoded option with the price of the vehicle. After sorting them in descending order, the study dropped the three low-correlation features whose correlation coefficient was below 0.05. As for the numerical features, none of the three follows a normal distribution according to their respective distribution plots. Scaling is therefore applied to bring them onto a common range, which also keeps the magnitude of the evaluation metrics small in the following tests. This paper uses MinMaxScaler to process the corresponding values.
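As a rough illustration of the two transformations described above, the sketch below implements one-hot encoding and min-max scaling with the standard library only. The helper names and toy values are hypothetical. In practice, library routines such as the MinMaxScaler named in the text, or `pandas.get_dummies` for the encoding, would normally be used instead.

```python
def one_hot(values):
    """Return (categories, rows); each row has a single 1 for its category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


def min_max_scale(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy values, not taken from the dataset.
fuel = ["Petrol", "Diesel", "Petrol", "Other"]
mileage = [10000, 25000, 40000, 70000]

cats, encoded = one_hot(fuel)
scaled = min_max_scale(mileage)
print(cats)     # ['Diesel', 'Other', 'Petrol']
print(encoded)  # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(scaled)   # [0.0, 0.25, 0.5, 1.0]
```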

2.3 Linear Regression

Linear regression (LR) is a statistical method used to model the relationship between the variable to be predicted and a set of independent variables. It searches for the best-fitting line through the data points under the assumption that the variables are linearly related. The fitted model is usually expressed as an equation, and the coefficient of determination is used to gauge its strength. LR is extensively employed in various domains for forecasting, inference, and prediction.
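For reference, the standard ordinary least squares formulation of this model can be written as follows. The symbols $\beta_j$ (coefficients), $x_{ij}$ (value of feature $j$ for sample $i$) and $p$ (number of features) are notation introduced here for illustration; $Y_i$, $\hat{Y}_i$ and $n$ follow the usage in Section 3.1.

```latex
\hat{Y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2
```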

2.4 Decision Tree & Random Forest

Decision Tree (DT) is a machine-learning algorithm that uses a tree-like model to predict continuous

outcomes. It works by splitting the data into subsets

based on feature values, creating a tree of decisions

that ultimately yield a prediction. This method is

robust, easy to interpret, and capable of handling non-

linear relationships and feature interactions.

Random Forest (RF) is a versatile ensemble

learning method that effectively addresses regression

challenges by combining the predictions of multiple

decision trees (Fawagreh et al., 2014). It harnesses the

power of bagging (Bootstrap Aggregating) to

enhance the stability and accuracy of the model.


The training procedure of each tree involves

different bootstrap samples of the data and the model

makes its final prediction by averaging the individual

tree predictions, thereby reducing the variance and

preventing overfitting.

The algorithm incorporates randomness by

selecting a random subset of features to consider at

each split in the trees, which promotes diversity

among the trees and improves the model's

generalization capabilities. RF is particularly useful

for high-dimensional datasets and is known for its

robustness to outliers and noise.
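The bagging step described above (bootstrap sampling followed by averaging) can be sketched as below. This is a deliberately simplified illustration, not a full random forest: each base learner is a one-split regression tree on a single feature, so the random feature selection at each split is omitted. All names and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single numeric feature."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds, largest value excluded
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: predict the mean
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_bagged_stumps(xs, ys, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(s(x) for s in stumps)


# Hypothetical toy data: production year against a scaled price.
years = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
prices = [0.10, 0.14, 0.19, 0.25, 0.32, 0.40, 0.49, 0.60]
predict = fit_bagged_stumps(years, prices)
print(predict(2013), predict(2020))
```

Because every stump sees a different bootstrap sample, the split points differ between stumps, and the averaged prediction is smoother than that of any single stump.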

In the research, the author applied early stopping to select a suitable value for n_estimators, the number of decision trees in the forest. Generally, adding more trees enables the model to achieve better performance. However, once n_estimators reaches the performance saturation point, adding more trees to the ensemble no longer leads to significant improvements in the model's predictive performance. Early stopping manages the trade-off between computation time and model performance by halting the search when the improvement falls below a significance threshold set before the evaluation.
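A minimal sketch of such a stopping rule is given below. The defaults mirror the values reported in Section 3.2 (an improvement threshold of 0.1% and 5 consecutive rounds), with two assumptions: the 0.1% is read as an absolute gain of 0.001 in the score, and `score_fn` is a hypothetical callback that returns the cross-validation score of a forest with n trees. The saturating curve used in the example is made up and is not data from the study.

```python
def select_n_estimators(score_fn, max_trees=200, min_gain=0.001, patience=5):
    """Return the last tree count that improved the score by at least min_gain.

    The search stops once `patience` consecutive rounds fail to improve on
    the previous round by min_gain.
    """
    prev = score_fn(1)
    chosen, stale = 1, 0
    for n in range(2, max_trees + 1):
        score = score_fn(n)
        if score - prev >= min_gain:
            chosen, stale = n, 0  # meaningful gain: accept n and reset counter
        else:
            stale += 1
        prev = score
        if stale >= patience:
            break
    return chosen


def toy_score(n):
    """Hypothetical saturating score curve used only for demonstration."""
    return 0.92 * (1 - 0.5 ** n)


print(select_n_estimators(toy_score))  # 9
```

On the toy curve the gain from one round to the next halves each time, so it drops below 0.001 at the tenth tree and the search stops five rounds later.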

2.5 Extreme Gradient Boosting

The eXtreme Gradient Boosting (XGBoost) is

another efficient ensemble method designed for

regression and classification tasks (Chen & Guestrin,

2016). It builds models in an additive manner by

focusing on minimizing a loss function through a

gradient-boosting framework. XGBoost employs a

tree-based approach, enhancing the predictive

accuracy by scaling the weak learners and adding

them to the final prediction. It also introduces

regularization to control overfitting, making it robust

for a wide range of data sets (Yang et al., 2023). With

its user-friendly interface and high performance,

XGBoost has become a go-to tool for data scientists

and machine learning practitioners in various

competitions and industries.
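The additive idea behind gradient boosting can be illustrated with the sketch below. It is plain gradient boosting with squared-error loss and one-split trees, not XGBoost itself: the regularization and second-order terms that XGBoost adds are omitted. The function names, the learning rate and the toy data are assumptions made for illustration.

```python
from statistics import mean


def fit_stump(xs, targets):
    """Fit a one-split regression tree on a single numeric feature."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: predict the mean
        m = mean(targets)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Add scaled weak learners one at a time, each fitted to the residuals."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Hypothetical toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 1.9, 3.1, 3.0, 4.2, 5.1, 5.9]
predict = fit_boosted_stumps(xs, ys)

baseline = mean(ys)
mse_baseline = mean((y - baseline) ** 2 for y in ys)
mse_boosted = mean((y - predict(x)) ** 2 for x, y in zip(xs, ys))
print(mse_baseline, mse_boosted)
```

Each round fits a weak learner to what the current ensemble still gets wrong, and the learning rate scales its contribution before it is added to the running prediction.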

3 RESULTS AND DISCUSSION

3.1 Evaluation Metrics

The performance of selected models in this study was assessed using Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared ($R^2$).

MSE is the average squared difference between

the estimated values and the actual values. The

formula for MSE is shown in Equation (1).

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \qquad (1)$$

In Equation (1), $Y_i$ stands for the actual value while $\hat{Y}_i$ means the predicted value. The character $n$ is the number of observations involved.

MAE measures the average magnitude of errors

among predictions without considering their

direction. Therefore, the figure is always non-

negative. The formula for MAE is shown in Equation

(2).

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right| \qquad (2)$$

$R^2$ is a metric that indicates how effectively the model is likely to predict future outcomes. It demonstrates the proportion of the variance in the dependent variable which can be predicted from the independent variables. According to Chicco et al. (2021), $R^2$ is capable of overcoming the interpretability limitations of MSE and MAE. The formula for $R^2$ is shown in Equation (3).

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2} \qquad (3)$$

In Equation (3), $\bar{Y}$ is the mean of the actual values.
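The three metrics translate directly into code. The sketch below is a dependency-free illustration with hypothetical function names and toy values; it follows the standard definitions of MSE, MAE and $R^2$, with the mean of the actual values in the denominator of $R^2$.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy values: three perfect predictions and one error of 2.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

print(mse(y_true, y_pred))  # 1.0
print(mae(y_true, y_pred))  # 0.5
print(r2(y_true, y_pred))   # approximately 0.2
```

The single large error contributes 4 to the squared-error sum but only 2 to the absolute-error sum, which is why MSE penalizes outlying predictions more heavily than MAE.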

3.2 Model Training

Before fitting the DT, LR is applied to set a baseline, achieving an $R^2$ of 0.7779. The DT significantly improves on this baseline, with an MSE of 8.81e-4, an MAE of 0.0201 and an $R^2$ of 0.8832.

To further optimize the performance of the model, this study applies Bootstrap Aggregating, increasing the number of DTs used during prediction. The approach has proved effective in strengthening the model’s robustness and preventing overfitting. To find a suitable number of trees, the researcher uses an iterative algorithm that compares the cross-validation scores of two consecutive rounds. The improvement threshold is set at 0.1%, and the algorithm stops once the model fails to reach the threshold for 5 consecutive rounds. The cross-validation score is calculated from the model’s $R^2$. Eventually, the performance saturated after 22 trees, at which point $R^2$ reached 0.9222. Table 2 shows the performance evaluation of the three regression models trained so far.


Table 2: Performance evaluation of the three regression models.

ID  Model              MSE      MAE     $R^2$
1   Linear Regression  1.68e-3  0.0283  0.7779
2   Decision Tree      8.81e-4  0.0201  0.8832
3   Random Forest      5.86e-4  0.0166  0.9222

Table 3: Performance evaluation of the four regression models.

ID  Model    MSE      MAE     $R^2$   CV Score
1   LR       1.68e-3  0.0283  0.7779  79.1941
2   DT       8.81e-4  0.0201  0.8832  87.9012
3   RF       5.86e-4  0.0166  0.9222  91.7158
4   XGBoost  4.93e-4  0.0152  0.9347  93.0520

The researcher then generates importance values for the different features using functions provided by the DT model. Among all the features, the top seven show the strongest association with the price of used vehicles. Figure 2 shows their importance values.

Figure 2: Top 7 features that influence vehicle pricing

(Photo/Picture credit: Original).

Figure 2 suggests that production year is the most influential factor in the price of used vehicles, followed by an engine size of 2.0 and mpg. The study also finds that consumers may prefer an engine size of 2.0 to 1.2 or other options. The Ford Kuga generally has a higher price according to both Figure 2 and the original dataset.

The study then trains the XGBoost model and lets it continue to optimize its performance based on the results of previous models. However, the improvement eventually reaches a bottleneck where further optimization yields little gain but remains time-consuming. To avoid this, an early-stopping algorithm is applied with the threshold set at 5 rounds. The final iteration result indicates that the model scores 0.9347 in $R^2$. Figure 3 shows the change of the Root Mean Squared Error (RMSE) as the number of iterations grows.

Figure 3: RMSE value as the iteration grows (Photo/Picture

credit: Original).

Table 3 compares the performance of the four selected models. CV Score shows their cross-validation accuracy as judged by $R^2$.

From the results shown above, XGBoost outperforms its counterparts at price prediction on the given dataset, closely followed by RF. Despite its simplicity, LR still achieves a moderate performance.

4 CONCLUSIONS

Tens of thousands of used vehicles are traded every

day around the world, making timely and accurate


price predictions essential to boost the efficiency and profit of the industry. Therefore, applying machine learning methods to assist participants with decision-making is promising. This study offers insight into used car price prediction. It not only identifies the key features that might affect the price of cars but also shows customers' preferences for those features. For example, the importance values of various features indicate that a 2.0-litre engine affects the price more than its 1.2-litre counterpart. This study also demonstrates the effectiveness of used car price prediction based on machine learning methodology, with XGBoost and RF showing superior performance. The cross-validation scores indicate that these models offer a high level of accuracy and are capable of providing reliable price predictions for used cars. However, the research has some limitations at the present stage. Firstly, the car market is a sophisticated system and detailed information on models involves complicated terms. The dataset selected focuses on a single manufacturer, Ford, which simplifies the problem. More work on data pre-processing is required to predict data of a much larger scale, such as annual domestic used car trading. In addition, involving more features will inevitably lead to the multicollinearity problem, in which case additional machine learning methods need to be implemented to handle it.

REFERENCES

Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree

boosting system. arXiv preprint arXiv:1603.02754.

Chicco, D., Warrens, M. J., Jurman, G., 2021. The

coefficient of determination R-squared is more

informative than SMAPE, MAE, MAPE, MSE and

RMSE in regression analysis evaluation. PeerJ

Computer Science, 7, e623.

Cui, B., Ye, Z., Zhao, H., Renqing, Z., Meng, L., Yang, Y.,

2022. Used car price prediction based on the iterative

framework of XGBOOST + LightGBM. Electronics,

11(18), 2932.

Dahal, R., 2023. Used car price prediction using Linear regression. https://www.researchgate.net/publication/375697258_Used_car_price_prediction_using_Linear_regression.

Fawagreh, K., Gaber, M. M., Elyan, E., 2014. Random

forests: from early developments to recent

advancements. Systems Science & Control

Engineering, 2(1), 602–609.

Huang, J., Saw, S. N., Feng, W., Jiang, Y., Yang, R., Qin,

Y., Seng, L. S., 2023. A Latent Factor-Based Bayesian

Neural Networks Model in Cloud Platform for Used

Car Price Prediction. IEEE Transactions on

Engineering Management, 1–11.

Jin, C., 2021. Price prediction of used cars using machine

learning. In 2021 IEEE International Conference on

Emergency Science and Information Technology

(ICESIT) (pp. 223-230). Chongqing, China.

https://doi.org/10.1109/ICESIT53460.2021.9696839.

Manheim, 2023. Summary Methodology for Manheim Used Vehicle Value Index. https://site.manheim.com/wp-content/uploads/sites/2/2023/07/Used-Vehicle-Summary-Methodology.pdf.

Mead, W. R., Brinkley, D., 2003. Wheels for the World:

Henry Ford, his Company, and A Century of Progress,

1903-2003. Foreign Affairs, 82(5), 176.

Yang, Q., He, K., Zheng, L., Wu, C., Yu, Y., Zou, Y., 2023.

Forecasting crude oil futures prices using Extreme

Gradient Boosting. Procedia Computer Science, 221,

920–926.
