machine learning algorithms such as random forest
are used to predict the disease outcome by using
patients’ data. For example, the algorithm is used to
predict disease recurrence by using patient genetic
and clinical data (Sumwiza, 2023). In Business
prediction, like Credit scoring, random forests are
used to create the credit scoring model by assessing
the likelihood of loan default based on customers’
financial history and transaction history (Zhou,
2023). This helps financial institutions to make
decisions. Moreover, in sales volume prediction, a
study used random forest to predict the retail sales,
which used historical sales data, promotional
activities and seasonal trending to make business
strategies and do management in the company. To
provide a clear overview of trends in EV sales for
more people, and to help more people to make
decisions on their choosing cars, this paper aims to
think about how the sales of EV will develop in the
future by modelling the approximate trends for it. The
dataset is found in Kaggle and processed by random
forest, gradient boosting and linear regression. Then,
the study compares three models’ properties to get
result figures. Finally, Experimental results
demonstrate the effectiveness of the methods.
2 METHOD
2.1 Dataset Preparation
This study got the dataset from Kaggle “Global EV
Sales: 2010-2024” by Patrick L Ford (Ford, 2024).
The dataset shows the EV sales situation from 2010
to 2024 all over the world, in every country and every
region. It contains 12,654 entries and includes 8
features: region, category, parameter, mode,
powertrain, year, unit, and value. The datasets are not
only about EV sales, but include various aspects of
EV adoption, such as market share and stock levels
across different regions and years.
To get the trends of EV sales, this study handles it
as a regression problem. The target feature here is
“value” in eight features corresponding to EV sales.
However, the dataset includes various parameters, so
it is important to filter the dataset and get relevant data
before starting to model the trends. The initial step is
to make the “parameter” just related to EV sales, also
“unit” (eg. Vehicles) just have absolute value rather
than percentages or other metrics.
Handling the missing value: Despite the fact that
the dataset seems to have entire entries, any missing
values were managed via either elimination of
incomplete facts or imputation the usage of statistical
techniques along with imply or median imputation.
This step changed into essential to ensure the dataset's
integrity before similarly processing.
Normalization: Given the wide variety of values
inside the dataset, in particular in the value column,
normalization becomes implemented to scale the
features, ensuring that everyone variable contributes
proportionately to the model's predictions. This step
is vital in stopping features with larger scales from
dominating the model training procedure.
Categorical Encoding: The dataset consists of
numerous categorical variables, such as region,
category, mode, and powertrain. These have been
transformed into numerical form using one-hot
encoding, a way that creates binary columns for every
class, permitting the machine getting to know
algorithms to interpret these functions efficiently.
Train-Test Split: to assess the version's
performance correctly, the dataset changed into
divided right into a training set and a take a look at
set, with 70% of the statistics allotted for schooling
and 30% reserved for trying out. This division
guarantees that the model can generalize well to
unseen records, thereby imparting reliable
predictions.
2.2 Machine Learning-based Prediction
This study used linear regression, random forest and
gradient boosting to model the trend about EV sales,
by writing codes for each model, inserting the data
after filtering the dataset to each model. Finally, this
paper used Mean square errors (MSE) and Coefficient
of Determination (R-squared) to compare each model
and figure out their relationship.
2.2.1 Linear Regression
Linear Regression is a broadly used statistical
technique that fashions the relationship among a
structured variable and one or greater impartial
variables by means of becoming a linear equation to
the located facts (Su, 2012; Montgomery, 2021).
Linear Regression become used as a baseline version
to establish the fundamental relationship among the
functions (together with location, year, and
powertrain) and the target variable (EV sales). The
model was carried out the usage of the sklearn library,
and its performance turned into evaluated the use of
R-squared and MSE, which degree the accuracy of
the model's predictions.
𝑦 = 𝛽0 + 𝛽1𝑋 + 𝜖 (1)