rule induction algorithm that extracts a low False
Positive Rate (FPR) rule set directly from decision
trees in fraud detection (Martins et al., 2024). Lu et
al. evaluate the applicability of Kolmogorov-Arnold
Networks (KAN) to fraud detection. They propose a
rapid decision rule based on Principal Component
Analysis (PCA) for judging whether KAN is
appropriate, introduce a heuristic method for
hyperparameter tuning, and find that the effectiveness
of KAN is context-dependent (Lu et al., 2024).
However, few studies aim to improve fraud detection
accuracy while also protecting user privacy.
Therefore, this paper examines fraud detection under
the premise of safeguarding user privacy. A credit
card fraud detection dataset from Kaggle is employed.
The study first uses the K-means algorithm to cluster
the dataset, and then simulates Non-Independent and
Identically Distributed (non-iid) data by assigning
all data from the same cluster to a single client. It
also simulates Independent and Identically
Distributed (iid) data by evenly distributing the data
of each cluster across all clients. This setup is used
to compare the performance of the FedAvg algorithm
with an embedded Logistic Regression model on
these two types of data distribution.
2 METHODS
2.1 Data Preparation
The dataset comes from Kaggle (Elgiriyewithana,
2023) and contains more than 550,000 credit card
transactions made by European cardholders in 2023.
Its main features are V1-V28, which the dataset
author had already processed with dimensionality
reduction methods. The label (Class) is binary,
indicating whether a transaction is fraudulent (1) or
not (0).
For data preprocessing, the id and label columns are
first removed from the feature set. Since the Amount
feature and the other features (V1-V28) differ
significantly in their value ranges, this study applies
standardization to the other features. The dataset is
then split into training and test sets in an 8:2 ratio.
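The preprocessing steps above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in this study: the function names `standardize_columns` and `train_test_split`, the fixed random seed, and the choice to compute the statistics over all rows passed in are assumptions made for the example. Which columns are passed to the standardization step is left to the caller.

```python
import random
import statistics

def standardize_columns(rows):
    """Scale every column of `rows` to zero mean and unit variance (z-score)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; constant columns fall back to 1.0
    # so that the division below is always defined.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def train_test_split(features, labels, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them 8:2 (by default) into train and test."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [features[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)
```

With the default `test_fraction=0.2`, a dataset of five samples yields four training samples and one test sample.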
To investigate the impact of non-iid data on the
federated learning algorithm, K-means clustering is
applied to the training set, dividing it into clusters
corresponding to the number of clients. Each cluster
is then assigned to a corresponding client, simulating
the non-iid nature of the data. To simulate iid data,
each cluster is evenly distributed among the clients.
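The two partitioning schemes can be illustrated with a short sketch. It assumes that cluster assignments (for example, from K-means) are already available as a list `cluster_ids` with one entry per training sample. The function names and the shuffled round-robin allocation used for the iid case are assumptions made for this example, since the text only states that each cluster is distributed evenly.

```python
import random
from collections import defaultdict

def partition_non_iid(cluster_ids):
    """Assign each cluster to exactly one client: client k holds all of cluster k."""
    clients = defaultdict(list)
    for sample_idx, cluster in enumerate(cluster_ids):
        clients[cluster].append(sample_idx)
    return [clients[c] for c in sorted(clients)]

def partition_iid(cluster_ids, num_clients, seed=0):
    """Spread the samples of every cluster evenly over all clients."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for sample_idx, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(sample_idx)
    clients = [[] for _ in range(num_clients)]
    for cluster in sorted(by_cluster):
        members = by_cluster[cluster]
        rng.shuffle(members)
        # Round-robin allocation gives each client an (almost) equal share
        # of this cluster.
        for position, sample_idx in enumerate(members):
            clients[position % num_clients].append(sample_idx)
    return clients
```

Both functions return one list of sample indices per client. In the non-iid case each client sees a single cluster, whereas in the iid case each client receives a similar mixture of all clusters.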
This paper investigates the performance of three ML
algorithms (Decision Trees, Random Forests, and
Logistic Regression) integrated into a federated
learning framework for a binary classification task.
After multiple rounds of training and communication
between clients (the number of rounds is set by the
variable ‘num_communications’), a global model is
obtained by combining the models from the clients.
This global model is then used to make predictions
on the test set. Performance is assessed by accuracy,
computed as the ratio of correctly classified samples
to the total number of samples.
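The accuracy metric described above can be written as a small helper. The function name `accuracy` and the error handling for empty or mismatched inputs are assumptions made for this sketch.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("accuracy is undefined for an empty sample set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

For example, three correct predictions out of four samples give an accuracy of 0.75.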
2.2 Federated Learning-based Machine
Learning Models
Federated Learning, also known as Federated
Machine Learning, is a method proposed to address
privacy issues during joint model training (Li et al.,
2020; Mammen, 2021). In this approach, each
organization trains its own model locally. After
completing the training, each organization uploads its
model parameters to a central server (or it can be peer-
to-peer). The central server combines the parameters
from different organizations (this can be done by
uploading gradients or updated parameters) and
recalculates new parameters (e.g., through weighted
averaging, a process known as federated
aggregation). These new parameters are then
distributed back to each organization, which deploys
them into their models to continue further training.
This process can be repeated iteratively until the
model converges or other predefined conditions are
met. This study focuses mainly on the relationship
between the variable ‘num_clients’ (the number of
clients in federated learning) and accuracy on the
test data. Experiments are conducted with
‘num_clients’ set to 2, 4, 6, and 8. The other
hyperparameters are fixed: ‘learning_rate’ (the step
size) is 0.01, ‘num_communications’ (the number of
communication rounds) is 10, and ‘num_local_steps’
(the number of local steps each client takes in each
communication round) is 8.
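The training loop described in this section can be sketched for the Logistic Regression case as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in this study: the function names, the zero initialization, the use of full-batch gradient descent for the local steps, and the weighting of each client by its sample count during aggregation are all assumptions. The parameter names and default values follow the hyperparameters listed above.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, x):
    """weights[0] is the bias; the remaining entries pair with the features."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def local_update(weights, features, labels, learning_rate, num_local_steps):
    """Run `num_local_steps` full-batch gradient steps on one client's data."""
    w = list(weights)
    n = len(features)
    for _ in range(num_local_steps):
        grad = [0.0] * len(w)
        for x, y in zip(features, labels):
            error = predict_proba(w, x) - y
            grad[0] += error
            for j, v in enumerate(x):
                grad[j + 1] += error * v
        w = [wj - learning_rate * gj / n for wj, gj in zip(w, grad)]
    return w

def fed_avg(client_data, num_features, learning_rate=0.01,
            num_communications=10, num_local_steps=8):
    """client_data: list of (features, labels) pairs, one pair per client."""
    global_w = [0.0] * (num_features + 1)
    total = sum(len(labels) for _, labels in client_data)
    for _ in range(num_communications):
        # Each client starts from the current global model and trains locally.
        local_models = [
            local_update(global_w, X, y, learning_rate, num_local_steps)
            for X, y in client_data
        ]
        # Federated aggregation: average weighted by local sample count.
        global_w = [
            sum(len(y) * w[j] for w, (_, y) in zip(local_models, client_data)) / total
            for j in range(len(global_w))
        ]
    return global_w
```

Only model parameters are exchanged between the clients and the aggregation step, so the raw transaction data never leaves a client. The experiment over different client counts corresponds to calling `fed_avg` with `client_data` lists of length 2, 4, 6, and 8.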