2.2 Data Validation and Model Preparation
One of the main challenges in fraud detection is the
class imbalance problem, where there are far fewer
instances of fraudulent transactions than legitimate
ones. This imbalance can significantly distort the
performance of predictive models, leading to a high
number of false negatives. To address this issue, the
study uses the Synthetic Minority Over-sampling
Technique (SMOTE), illustrated in Figure 1. This
technique, as described in (Lipton et al., 2016),
involves generating synthetic samples from the
minority class (fraudulent transactions) to balance
the dataset.
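As a minimal sketch of this rebalancing step, the following example applies the imbalanced-learn implementation of SMOTE to a synthetic toy dataset; the class ratio and parameter values are illustrative assumptions, not the study's actual settings.

```python
# Minimal SMOTE sketch using imbalanced-learn; the toy dataset and
# all parameter values are illustrative assumptions.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy stand-in for a transaction dataset: 1% of samples are "fraud" (class 1).
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99, 0.01], random_state=42
)
print("Before SMOTE:", Counter(y))

# Generate synthetic minority-class samples by interpolating between
# each fraud case and its k nearest minority-class neighbours.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))
```

Note that resampling is typically applied only to the training portion of the data, so that no synthetic samples leak into the evaluation folds.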
Before training the models, it is crucial to ensure
that they perform well across different scenarios and
datasets. To achieve this, the study makes extensive
use of cross-validation, a technique in which the data
is divided into several subsets (folds): in each round,
the model is trained on all but one of the folds, while
the held-out fold is used for
testing. This approach not only helps to assess the
robustness and stability of the models, but also to
avoid overfitting, which can occur when a model is
fitted too closely to a limited set of data points.
Cross-validation ensures that the models are generalisable
and perform consistently across different data
samples. These steps precede the actual model
training, setting the stage for effective learning and
accurate fraud prediction.
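A minimal sketch of such a fold-wise evaluation is given below, assuming stratified five-fold splitting in scikit-learn; the fold count, baseline model, and scoring metric are illustrative assumptions.

```python
# Stratified k-fold cross-validation sketch; fold count, baseline model,
# and metric are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=5_000, n_features=20, weights=[0.97, 0.03], random_state=0
)

# Stratification keeps the fraud/legitimate ratio stable in every fold,
# which matters when the positive class is rare.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    GradientBoostingClassifier(random_state=0), X, y, cv=cv, scoring="roc_auc"
)
print(f"ROC-AUC per fold: {scores.round(3)}, mean = {scores.mean():.3f}")
```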
2.3 Machine Learning Models
Gradient Boosting Machines (GBMs) are an advanced
ensemble learning technique known for their
effectiveness in classification tasks.
GBMs operate by incrementally constructing an
ensemble of decision trees. Each tree in the sequence
is fitted to correct the residual errors of the ensemble
built so far, thereby enhancing the overall accuracy
of the model with each iteration. This approach is
referred to as 'boosting', whereby the outputs of weak
learners are combined to create a robust predictive
model. The sequential addition of models enables
GBMs to adaptively refine their predictions,
rendering them particularly effective in addressing
the complex and dynamic nature of financial fraud.
The decision tree base learners enable them to model
non-linear relationships in a natural manner.
Furthermore, gradient descent is employed to
optimise a loss function, thereby focusing intensively
on the most challenging cases for classification.
GBMs also incorporate regularisation techniques
such as subsampling and tree complexity bounds,
which help to avoid overfitting and maintain the
generalisability of the model across different datasets.
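The boosting and regularisation behaviour described above can be sketched with scikit-learn's GradientBoostingClassifier; all hyperparameter values below are illustrative assumptions rather than the study's tuned settings.

```python
# Regularised gradient-boosting sketch; all hyperparameter values are
# illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5_000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

gbm = GradientBoostingClassifier(
    n_estimators=300,    # trees added sequentially, each fitting residual errors
    learning_rate=0.05,  # shrinks each tree's contribution (slower, safer boosting)
    max_depth=3,         # bounds tree complexity to curb overfitting
    subsample=0.8,       # stochastic boosting: each tree sees 80% of the rows
    random_state=0,
).fit(X_train, y_train)

proba = gbm.predict_proba(X_test)[:, 1]
print(f"Held-out ROC-AUC: {roc_auc_score(y_test, proba):.3f}")
```

Lowering the learning rate while adding more trees, and subsampling rows for each tree, are the standard levers for trading bias against variance in boosted ensembles.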
Neural networks (NNs), and in particular those
configured for deep learning, are highly adept at
modelling intricate and
high-dimensional data patterns, rendering them an
optimal choice for the detection of sophisticated fraud
schemes. These networks comprise multiple layers of
neurons, with each layer learning a distinct aspect of
the data. As data progresses through these layers, it is
transformed by means of weights and biases that are
adjusted through backpropagation during the training
phase. This enables the network to learn both detailed
and abstract representations of the input data. Deep
neural networks are highly effective at learning from
complex and voluminous datasets due to their deep
architectures, which enable them to uncover hidden
and non-obvious patterns in the data. Activation
functions such as ReLU or sigmoid are essential for
introducing the requisite non-linearities, thereby
enabling the network to learn and adapt to complex
and non-linear data relationships, which are typical in
fraud detection scenarios where fraudulent and
normal transactions may not be easily distinguishable
by linear models. This sophisticated configuration
allows neural networks to identify complex
fraudulent transactions that are less readily
discernible by simpler models, thereby providing a
robust tool for security systems in financial
environments.
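As a minimal sketch, a feed-forward network with ReLU hidden layers can be expressed using scikit-learn's MLPClassifier; the layer sizes and training settings are assumptions for illustration, not the study's architecture.

```python
# Feed-forward network sketch via scikit-learn's MLPClassifier; layer
# sizes and training settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=5_000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Two hidden layers with ReLU activations; weights and biases are adjusted
# via backpropagation. Scaling the inputs helps gradient-based optimisation.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                  max_iter=500, random_state=0),
).fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print(f"Held-out ROC-AUC: {roc_auc_score(y_test, proba):.3f}")
```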
Finally, the study employs a rigorous
hyperparameter tuning process to maximise the
performance of the models. This encompasses
the modification of parameters such as the learning
rate and the number of layers in NNs, as well as the
depth and number of trees in GBMs. The tuning
process is informed by the model's performance on
the validation set, thereby ensuring that each model is
meticulously calibrated to capture the intricacies of
financial fraud detection.
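A minimal sketch of such validation-driven tuning is shown below as a grid search over an assumed, illustrative search space for the GBM; the grid and scoring metric are not the study's actual configuration.

```python
# Validation-driven hyperparameter tuning sketch; the search space and
# metric are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(
    n_samples=3_000, n_features=20, weights=[0.95, 0.05], random_state=0
)

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
    "n_estimators": [100, 300],
}

# Each candidate is scored by cross-validated ROC-AUC; the best
# combination is then refit on the full training data.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid, cv=3, scoring="roc_auc",
).fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV ROC-AUC: {search.best_score_:.3f}")
```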
3 TRAINING
3.1 Training Gradient Boosting Machines (GBMs)
In the training of GBMs, the depth of the decision
trees and the learning rate represent critical
hyperparameters that require precise tuning. The
depth of the trees determines the degree of granularity
with which the data is split. Deeper trees facilitate the
learning of more detailed data patterns, which is
advantageous for the identification of complex fraud