performance, as the model may become skewed
towards the majority class (non-fraudulent
transactions), potentially leading to reduced
effectiveness in detecting fraudulent activity.
Balancing the dataset, either by undersampling the
majority class or oversampling the minority class
(e.g., using SMOTE), gives the model a better chance
of learning the characteristics of both classes
effectively.
Figure 3: Equally Distributed Classes (Photo/Picture
credit: Original).
The visualization in Figure 3 illustrates the
process of balancing the dataset, which is a crucial
step in improving the performance of machine
learning models for fraud detection. By ensuring an
equal distribution of classes, the model can learn to
distinguish between fraudulent and non-fraudulent
transactions more accurately, leading to better
detection rates and fewer false positives/negatives.
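Before and after balancing, the class distribution can be checked with a simple count. The sketch below uses hypothetical labels (0 = non-fraudulent, 1 = fraudulent) and made-up counts, not the actual dataset figures:

```python
from collections import Counter

# Hypothetical labels: 0 = non-fraudulent, 1 = fraudulent.
labels = [0] * 950 + [1] * 50

counts = Counter(labels)
print(counts)  # Counter({0: 950, 1: 50})

# Share of the minority (fraud) class.
fraud_ratio = counts[1] / len(labels)
print(fraud_ratio)  # 0.05
```

A ratio this far from 0.5 signals that balancing is needed before training.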
2.2.3 Preprocessing Steps Based on
Distribution Plots
The distribution plots for the V14, V12, and V10
features for fraudulent transactions provide valuable
insights that inform the preprocessing steps. These
steps include normalization, data cleaning, balancing
the dataset, applying SMOTE, and splitting the data
into training and testing sets.
Normalization: Normalization is crucial to ensure
that all features have an equal influence during model
training, preventing any single feature from
dominating the learning process due to differences in
scale. The skewed nature of some features, as
observed in the distribution plots, indicates the need
for scaling.
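The scaling described above can be sketched as z-score normalization; the feature values here are illustrative, not taken from the dataset:

```python
import statistics

def z_score_normalize(values):
    """Scale a feature to zero mean and unit variance (z-score)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / stdev for v in values]

feature = [10.0, 20.0, 30.0, 40.0]
scaled = z_score_normalize(feature)
# After scaling, the feature has mean 0 and unit variance,
# so no single feature dominates training due to scale.
```

For heavily skewed features, a robust scaler (based on median and interquartile range) is a common alternative.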
Cleaning: Data cleaning involves handling
missing values, duplicates, and irrelevant features.
Based on initial observations, this study assumes the
dataset contains no missing values, but it does drop
irrelevant columns after transformation.
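A minimal cleaning sketch on a list of row dictionaries; the column names ("Time", "Amount", "Class") are illustrative placeholders, not a claim about the study's exact schema:

```python
# Toy rows with a deliberate duplicate.
rows = [
    {"Time": 0, "Amount": 149.62, "Class": 0},
    {"Time": 0, "Amount": 149.62, "Class": 0},  # exact duplicate
    {"Time": 1, "Amount": 2.69, "Class": 0},
]

# Drop exact duplicates while preserving row order.
seen, deduped = set(), []
for row in rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Drop a column deemed irrelevant after transformation (e.g. raw "Time").
cleaned = [{k: v for k, v in r.items() if k != "Time"} for r in deduped]
print(len(cleaned))  # 2
```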
Balancing the Dataset: Given the highly
imbalanced nature of the dataset, balancing is crucial.
This study used undersampling and SMOTE to
address this issue. Undersampling reduces the
number of non-fraudulent transactions, while
SMOTE generates synthetic samples for the minority
class.
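The two balancing strategies above can be sketched as follows. The data here is synthetic, and the oversampling step is a simplified SMOTE-style interpolation: real SMOTE interpolates toward one of the k nearest minority neighbours, whereas this sketch picks any other minority point:

```python
import random

random.seed(0)

majority = [[random.gauss(0, 1)] for _ in range(100)]  # non-fraudulent
minority = [[random.gauss(3, 1)] for _ in range(10)]   # fraudulent

# Random undersampling: keep only as many majority samples as minority ones.
under = random.sample(majority, len(minority))

# SMOTE-style oversampling sketch: create synthetic minority points by
# interpolating between two randomly chosen minority samples.
synthetic = []
while len(minority) + len(synthetic) < len(majority):
    a, b = random.sample(minority, 2)
    lam = random.random()  # interpolation factor in [0, 1)
    synthetic.append([a[0] + lam * (b[0] - a[0])])

balanced_over = minority + synthetic  # now matches the majority count
```

In practice, a library such as imbalanced-learn provides production-grade implementations of both techniques.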
Train-Test Split: To evaluate the model's
performance, this paper splits the dataset into training
and testing sets. An 80-20 split ratio is commonly
used.
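The 80-20 split can be sketched without any external library; shuffling before splitting avoids ordering bias, and a fixed seed keeps the split reproducible:

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle and split rows into train/test portions (80-20 by default)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

For imbalanced data, a stratified split (preserving the class ratio in both portions, as scikit-learn's `train_test_split` offers via `stratify`) is usually preferable.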
2.3 Machine Learning-Based Prediction
2.3.1 Introduction of the Machine Learning
Workflow
The machine learning workflow is a structured
process for building models that yield accurate
predictions. It begins with data collection, followed
by data preprocessing, which involves cleaning,
normalization, and addressing data imbalances. After
preprocessing, the data is split into training and
testing sets. Different machine learning algorithms
are then applied to the training set to create predictive
models, which are subsequently evaluated on the
testing set. This structured approach ensures robust
and reliable machine learning solutions (Witten et al.,
2016).
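The stages described above can be sketched as a linear pipeline of stage functions. Everything here is a toy stand-in (the "model" is a fixed threshold, not a trained one), intended only to show how each stage's output feeds the next:

```python
# Hypothetical pipeline: each stage returns input for the next stage.
def collect():
    # Stand-in for gathering labelled data: [feature, label] pairs.
    return [[0.1, 0], [0.9, 1], [0.2, 0], [0.8, 1]]

def preprocess(data):
    # Placeholder for cleaning / normalization.
    return [[round(x, 1), y] for x, y in data]

def split(data):
    # Simple split into training and testing portions.
    return data[:3], data[3:]

def train(train_set):
    # Toy "model": a fixed decision threshold, not an actual learner.
    return lambda x: 1 if x > 0.5 else 0

def evaluate(model, test_set):
    # Fraction of test samples classified correctly.
    return sum(model(x) == y for x, y in test_set) / len(test_set)

train_set, test_set = split(preprocess(collect()))
model = train(train_set)
accuracy = evaluate(model, test_set)
```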
Data Collection: Data collection is the initial
phase of the machine learning workflow, involving
the aggregation of relevant data from various sources.
The quality and quantity of the collected data are
critical, as they directly influence the model's
effectiveness and overall performance. This data can
be structured or unstructured and is typically gathered
from databases, APIs, web scraping, or manual entry.
High-quality data collection ensures that the
subsequent steps in the workflow are built on a solid
foundation (Zhang et al., 2019).
Model Building: Model building involves
selecting the appropriate machine learning algorithms
that will be used to create the predictive model. This
step includes defining the model architecture,
choosing the type of model (e.g., regression,
classification), and setting hyperparameters.
Common algorithms used in fraud detection include
Logistic Regression, Decision Trees, Random
Forests, and Support Vector Machines (SVM) (Zhou
et al., 2018).
Model Training: Model training involves feeding
preprocessed data into the chosen algorithm, enabling
the model to learn patterns from the data. During this
phase, the model adjusts its parameters to minimize