Optimizing Credit Card Fraud Detection with Multi-Armed Bandit Algorithms

Rongjun Gao

The University of Michigan-Shanghai Jiao Tong University Joint Institute, Shanghai Jiao Tong University,

No. 800, Dongchuan Road, Minhang District, Shanghai, China

Keywords: Credit Card Fraud Detection, UCB1, Thompson Sampling.

Abstract: In today's world, the importance of credit card fraud detection cannot be overstated, as it is crucial for the

security of financial transactions. To optimize cost-efficiency, automated algorithms have been developed to

pinpoint the transactions that are most likely to be fraudulent. Despite their potential, multi-armed bandit

(MAB) algorithms have not been widely adopted in fraud detection. This paper introduces two models that

apply the Upper Confidence Bound 1 and Thompson sampling algorithms to the task of fraud detection,

categorizing transactions into 52 segments based on the amount and type. The performance of these

algorithms is evaluated against several metrics, including cumulative regret, the reward generated, the ratio

of optimal arm selection, and overall efficiency. The findings suggest that the Thompson sampling algorithm

surpasses the UCB1 in performance, achieving lower standard errors and computational complexity. It proves

to be more effective in swiftly and accurately identifying the most suspicious transactions, thus pinpointing

the optimal choice with greater speed.

1 INTRODUCTION

In modern society, credit cards are widely used for transactions because of their convenience and simplicity. However, this also creates opportunities for fraud, owing to the rapidly changing nature of financial services and the potential monetary gain (Ferreira and Meidutė-Kavaliauskienė, 2019). Credit card fraud occurs frequently and, without early detection, can easily cause large financial losses for both individuals and corporations. Fraud has therefore drawn wide attention in areas such as business and commerce (Bernard et al., 2019). Financial institutions also face high risks and uncertainties caused by these losses (Sariannidis et al., 2019). At the same time, human experts can examine only a limited number of credit card transactions in a fixed period of time (e.g. 1000 every month) (Soemers et al., 2018). There is therefore an urgent need for a model that can identify the transactions with the highest probability of being fraudulent.

Typically, pre-trained machine learning models

are used for fraud detection. Existing models and

algorithms used for detecting potential frauds consist

of the following types. In Logistic Regression, the

sigmoid function is mainly applied to predict and

judge the probability of a transaction being

fraudulent. The transaction is considered suspicious if

the sigmoid output value is over 0.5 and legitimate

otherwise (Awoyemi et al., 2017). Together with

modified gradient ascent optimization, the classifier

updates new data gradually instead of all at once to

calculate the best-fit parameters (Awoyemi et al.,

2017). In Naïve Bayes, prior samples and conditional

probabilities are taken advantage of to make decisions

with the highest Bayesian probability (Awoyemi et

al., 2017). Grounded on the Bayesian classification

rule, the binary classification of fraudulent

transactions and legitimate transactions is performed

with the assumption that all data features are

conditionally independent (Awoyemi et al., 2017). In

the K-Nearest Neighbours algorithm (KNN),

traditional distance functions such as the Euclidean

distance and Minkowski distance are applied to

classify K data points with the lowest distances into

one group (Awoyemi et al., 2017). KNN exhibits the best performance among the three algorithms on measures such as classification accuracy, balanced rate, and Matthews correlation coefficient (MCC) (Awoyemi et al., 2017), but it also has drawbacks, including a requirement for sufficient


Gao, R.

Optimizing Credit Card Fraud Detection with Multi-Armed Bandit Algorithms.

DOI: 10.5220/0012956000004508

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 1st International Conference on Engineering Management, Information Technology and Intelligence (EMITI 2024), pages 474-478

ISBN: 978-989-758-713-9

Proceedings Copyright © 2024 by SCITEPRESS – Science and Technology Publications, Lda.

samples, overfitting, weak generalization, and

computational complexity (Zhang, 2022). On the

other hand, multi-armed bandit (MAB) algorithms are

seldom applied to credit card fraud detection due to

difficulties in forming suitable arms based on

excessively imbalanced data. In addition, fraudsters constantly look for ways to evade detection models by making fraud appear legitimate, a phenomenon known as concept drift (Dornadula and Geetha, 2019). Properly handling concept drift in non-stationary data streams is therefore a challenge (Soemers et al., 2018).

This paper applies two traditional MAB algorithms, UCB1 and Thompson sampling, to construct two fraud detection models and examines their performance in terms of the cumulative regret after 100,000 rounds. All code is implemented in Python 3.11.4. The dataset used in this article records real credit card transactions from around the globe. The arms are constructed from 4 transaction amount intervals and 13 transaction types. The model receives a reward whenever it successfully identifies a fraud, and the evaluation measure, cumulative regret, is defined as the gap between the maximum reward and the reward actually obtained. The paper gives a comparative analysis of the UCB1 and Thompson sampling algorithms, focusing on cumulative regret, generated reward, optimal arm selection ratio, and algorithm efficiency. The results show that the Thompson sampling algorithm outperforms the UCB1 algorithm.

The rest of the paper is organized as follows: Section 2 describes the dataset used, the arm classification standard, and the definition of regret, and introduces the UCB1 and Thompson sampling algorithms. Section 3 compares the performance of the UCB1 and Thompson sampling algorithms in terms of cumulative regret, optimal arm selection ratio, etc. Section 4 concludes this comparative study, outlines areas for future research, and provides some suggestions.

2 METHOD

2.1 Dataset

The dataset used in this article contains 555719 real credit card transaction records from around the globe, covering June to December 2020. The dataset has 22 attributes (transaction date, transaction amount, customer identification number, etc.), and 2 of them (transaction amount and transaction category) are chosen as the basis for classifying transactions into arms. The transaction amount ranges from $1.00 to $22768.11, and the transaction category includes 14 transaction types (“personal care”, “health fitness”, etc.). There are 2145 fraudulent transactions in the dataset in total. Notably, the dataset is highly skewed: 99% of transaction amounts lie in the interval $1.00~$519.85, and only 1% lie in $519.85~$22768.11. An overview of the dataset is given in Table 1.

Table 1: Dataset overview.

| Total dataset | Fraud | Not fraud | Label not fraud | Label fraud |
|---|---|---|---|---|
| 555719 | 2145 | 553574 | 0 | 1 |
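As a quick check on the class imbalance, the share of fraudulent transactions follows directly from the counts stated in this section (2145 fraudulent transactions among 555719 records):

\[
\frac{2145}{555719} \approx 0.0039 \quad (\text{about } 0.39\%)
\]

Fewer than 4 in every 1000 transactions are fraudulent, which illustrates how strongly the data are imbalanced.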

2.2 Arm Classification Standard and

Regret Definition

Among the 22 attributes of the dataset, the transaction amount and the transaction category are used to classify transactions into arms. The transactions are divided into four parts based on the transaction amount so that each part contains a quarter of all transactions. As previously stated, the transaction category includes 14 transaction types. For convenience, the “grocery point of sale” type and the “grocery net” type are combined into one type called “grocery”, leaving 13 types. Along these two dimensions, all transactions can be classified into 52 types (4×13), and each type represents one arm in the MAB model.
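The arm construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the helper names (`merge_category`, `quartile_edges`, `amount_bin`, `arm_index`), the placeholder category labels, and the use of `statistics.quantiles` to obtain the quartile cut points are all choices made here for illustration.

```python
from bisect import bisect_right
from statistics import quantiles

# Hypothetical labels: 12 placeholder categories plus the merged "grocery" type.
CATEGORIES = [f"category_{c}" for c in range(12)] + ["grocery"]

def merge_category(raw):
    """Map both grocery types onto the single 'grocery' label (labels assumed)."""
    return "grocery" if raw in ("grocery point of sale", "grocery net") else raw

def quartile_edges(amounts):
    """Three cut points that split the amounts into four equally sized parts."""
    return quantiles(amounts, n=4)

def amount_bin(amount, edges):
    """Index 0..3 of the amount interval that contains `amount`."""
    return bisect_right(edges, amount)

def arm_index(amount, category, edges, categories):
    """Arm number in 0..(4 * len(categories) - 1) for one transaction."""
    return amount_bin(amount, edges) * len(categories) + categories.index(
        merge_category(category)
    )

# Toy amounts, invented for illustration only.
edges = quartile_edges([1, 2, 3, 4, 5, 6, 7, 8])
print(arm_index(8, "grocery net", edges, CATEGORIES))
```

With 4 amount intervals and 13 category labels, `arm_index` takes 52 distinct values, one per arm.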

The set of all arms is denoted by \(\mathcal{A}\), the horizon is denoted by \(n\), and each individual arm is labeled \(i\), where \(i = 1, \ldots, 52\). An arbitrary round is denoted by \(t\). The total number of transactions in each arm is denoted by \(s_i\), and the transaction amount of one particular transaction is \(s_{i,p}\), where \(i\) indicates which arm the transaction belongs to and \(p\) indicates that it is the \(p\)th transaction of this arm (\(1 \le p \le s_i\)). The mean transaction amount of each arm is given by

\[
m_i = \frac{1}{s_i} \sum_{k=1}^{s_i} s_{i,k} \tag{1}
\]

and the total mean reward of each arm is

\[
r_i = \frac{1}{s_i} \sum_{k=1}^{s_i} b_{i,k}\, s_{i,k} \tag{2}
\]

where \(b_{i,p} = 1\) if the \(p\)th transaction of arm \(i\) is a fraud, and \(b_{i,p} = 0\) otherwise. For simplicity, the model assumes that the optimal arm is unique (denoted by \(*\)), and its mean reward is \(r_* = \max\{r_1, r_2, \ldots, r_{52}\}\). The actual random reward in round \(t\) is defined as

\[
\hat{x}_i(t) =
\begin{cases}
\dfrac{s_{i,p}\, b_{i,p}}{m_i} & \text{if the } p\text{th transaction of arm } i \text{ is picked} \\
0 & \text{otherwise}
\end{cases}
\]

and the actual mean reward of arm \(i\) until round \(t\) is given by

\[
\hat{r}_i(t) = \frac{1}{z_i(t)} \sum_{k=1}^{t} \hat{x}_i(k) \tag{3}
\]

where \(z_i(t)\) denotes the number of times arm \(i\) has been played until round \(t\). The total regret until round \(t\) is defined as

\[
R_t = t\, r_* - \sum_{i=1}^{52} z_i(t)\, \hat{r}_i(t) \tag{4}
\]

and the goal of this article is to minimize the total

regret so that the model can identify as many

fraudulent cases as possible within its horizon.
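The quantities defined in this section can be illustrated with a short sketch. It is not the author's implementation: the function names and the toy numbers are invented here, and the regret helper follows the verbal definition of regret (maximum achievable reward minus the reward actually collected), assuming that the benchmark per-round reward passed in by the caller is expressed on the same scale as the collected rewards.

```python
from itertools import accumulate

def arm_statistics(amounts, fraud_flags):
    """Return (m_i, r_i) for one arm from its amounts and 0/1 fraud flags."""
    count = len(amounts)
    mean_amount = sum(amounts) / count                                      # Eq. (1)
    mean_reward = sum(b * s for b, s in zip(fraud_flags, amounts)) / count  # Eq. (2)
    return mean_amount, mean_reward

def normalized_reward(amount, fraud_flag, mean_amount):
    """Reward for picking one transaction: amount * flag / mean amount of its arm."""
    return amount * fraud_flag / mean_amount

def cumulative_regret(best_mean_reward, rewards):
    """Regret after each round: t * best_mean_reward minus the reward collected so far."""
    collected = accumulate(rewards)
    return [t * best_mean_reward - total for t, total in enumerate(collected, start=1)]

# Toy arm with two transactions, the second of which is fraudulent.
m_i, r_i = arm_statistics([10.0, 30.0], [0, 1])
print(m_i, r_i, normalized_reward(30.0, 1, m_i))
print(cumulative_regret(1.0, [1.0, 0.0, 1.0]))
```

A legitimate transaction always yields a reward of 0, so regret grows by the full benchmark reward in every round where no fraud is found.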

2.3 UCB1 Algorithm

In UCB1, the UCB index for an arm \(i\) in round \(t\) is defined as

\[
\mathrm{UCB}_i(t) = \hat{r}_i(t) + B \sqrt{\frac{\alpha \ln(n)}{2 z_i(t)}} \tag{5}
\]

where \(B\) represents the gap between the maximum and minimum reward value, and \(\alpha\) is a parameter. The algorithm initially pulls each arm once to calculate the initial UCB indices for all arms. Subsequently, in every round, the algorithm pulls the arm with the highest UCB index. In this way, the algorithm applies the principle of optimism: each arm is treated as if it could give higher rewards than it has in the past (Lattimore and Szepesvári, 2020). The gap-dependent regret upper bound is \(O\left(\frac{K \ln(n)}{\Delta}\right)\), and the gap-independent regret upper bound is \(O\left(\sqrt{K n \ln(n)}\right)\), where \(K = 52\) in this article, \(\Delta_i = r_* - r_i\), and \(\Delta = \min_{i:\, \Delta_i > 0} \Delta_i\) (Mukherjee et al., 2018).
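To make the role of the exploration term concrete, the sketch below evaluates a UCB index for an arm with a fixed empirical mean and an increasing number of pulls. It assumes the index has the form "empirical mean plus \(B \sqrt{\alpha \ln(n) / (2 z)}\)"; the function name `ucb_index`, the value 1.0 for the reward range \(B\), and the empirical mean of 0.5 are illustrative assumptions.

```python
import math

def ucb_index(mean_reward, pulls, horizon, alpha=1.0, reward_range=1.0):
    """Empirical mean plus the exploration bonus B * sqrt(alpha * ln(n) / (2 * z))."""
    bonus = reward_range * math.sqrt(alpha * math.log(horizon) / (2 * pulls))
    return mean_reward + bonus

# The bonus shrinks as the arm is pulled more often (horizon n = 100000, alpha = 1).
for pulls in (1, 100, 10_000):
    print(pulls, round(ucb_index(0.5, pulls, horizon=100_000), 4))
```

Because the bonus decays like \(1/\sqrt{z}\), rarely pulled arms keep a high index and are revisited, while frequently pulled arms are judged almost entirely by their empirical mean.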

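Algorithm 1: UCB1 (Lattimore and Szepesvári, 2020).
Input: Time horizon \(n\), reward value gap \(B\)
Pull each arm once
for \(t = 53, \ldots, n\) do
    Pull arm \(i = \arg\max_{1 \le j \le 52} \mathrm{UCB}_j(t-1)\)
    Update parameters:
        \(z_i(t) := z_i(t-1) + 1\)
        \(\hat{r}_i(t) := \frac{1}{z_i(t)} \sum_{k=1}^{t} \hat{x}_i(k)\)
        \(\mathrm{UCB}_i(t) := \hat{r}_i(t) + B \sqrt{\frac{\alpha \ln(n)}{2 z_i(t)}}\)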
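The loop in Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than the author's code: rewards are simulated as 0/1 draws with hypothetical per-arm fraud rates instead of the normalized transaction amounts used in the paper, only 3 arms are used, the reward range \(B\) is assumed to be 1, and the name `run_ucb1` is invented here.

```python
import math
import random

def run_ucb1(fraud_rates, horizon, alpha=1.0, reward_range=1.0, seed=0):
    """Run UCB1 on toy arms that pay 1 with the given probability, else 0.

    Returns the number of pulls per arm after `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(fraud_rates)
    pulls = [0] * k      # z_i(t)
    totals = [0.0] * k   # running sum of rewards per arm

    def pull(arm):
        reward = 1.0 if rng.random() < fraud_rates[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward

    for arm in range(k):          # pull each arm once
        pull(arm)
    for _ in range(k, horizon):   # remaining rounds
        indices = [
            totals[a] / pulls[a]
            + reward_range * math.sqrt(alpha * math.log(horizon) / (2 * pulls[a]))
            for a in range(k)
        ]
        pull(max(range(k), key=indices.__getitem__))
    return pulls

print(run_ucb1([0.1, 0.2, 0.8], horizon=5000))
```

The running sums make each round cost a constant amount of work per arm, instead of recomputing the mean from the full reward history.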

2.4 Thompson Sampling Algorithm

In the Thompson sampling algorithm, the model picks an arm based on randomization and Bayesian analysis (Lattimore and Szepesvári, 2020). The algorithm first pulls each arm once and then assigns each arm a posterior \(\mathcal{N}\big(\hat{r}_i(t-1), \,\cdot\,\big)\). In each subsequent round, the algorithm first samples \(v_i \sim \mathcal{N}\big(\hat{r}_i(t-1), \,\cdot\,\big)\) from the posterior of each arm and then picks arm \(i = \arg\max_{1 \le j \le 52} v_j\). Finally, the algorithm updates each arm’s posterior according to the obtained reward. Because arms are picked by sampling from the posteriors, the algorithm can explore suboptimal arms while still exploiting the optimal arm as much as possible. The regret of this algorithm is proved to be \(O\left(\sum_{i:\, \Delta_i > 0} \frac{\ln(n)}{\Delta_i}\right)\) when the actual probability distribution of each arm is Gaussian (Lattimore and Szepesvári, 2020).

Algorithm 2: Thompson sampling algorithm (Lattimore and Szepesvári, 2020).
Input: Time horizon \(n\), reward value gap \(B\)
Pull each arm once
for \(t = 53, \ldots, n\) do
    for \(i = 1, \ldots, 52\) do
        Sample \(v_i \sim \mathcal{N}\big(\hat{r}_i(t-1), \,\cdot\,\big)\)
    Pull arm \(i = \arg\max_{1 \le j \le 52} v_j\)
    Update parameters:
        \(z_i(t) := z_i(t-1) + 1\)
        \(\hat{r}_i(t) := \frac{1}{z_i(t)} \sum_{k=1}^{t} \hat{x}_i(k)\)
        Update the posterior \(\mathcal{N}\big(\hat{r}_i(t), \,\cdot\,\big)\)
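A Gaussian Thompson sampling loop of this kind can be sketched as follows. Several points are assumptions of this sketch and are not taken from the paper: the posterior variance is set to \(1 / z_i\) (a common choice for Gaussian Thompson sampling), rewards are simulated as 0/1 draws with hypothetical fraud rates, only 3 arms are used, and the name `run_thompson` is invented here.

```python
import math
import random

def run_thompson(fraud_rates, horizon, seed=0):
    """Run Gaussian Thompson sampling on toy arms that pay 1 with the given probability.

    Returns the number of pulls per arm after `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(fraud_rates)
    pulls = [0] * k      # z_i(t)
    totals = [0.0] * k   # running sum of rewards per arm

    def pull(arm):
        reward = 1.0 if rng.random() < fraud_rates[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward

    for arm in range(k):          # pull each arm once
        pull(arm)
    for _ in range(k, horizon):   # remaining rounds
        # Assumed posterior: mean = empirical mean, variance = 1 / pulls.
        samples = [
            rng.gauss(totals[a] / pulls[a], math.sqrt(1.0 / pulls[a]))
            for a in range(k)
        ]
        pull(max(range(k), key=samples.__getitem__))
    return pulls

print(run_thompson([0.1, 0.2, 0.8], horizon=5000))
```

Under this assumed variance, the posterior of a frequently pulled arm narrows around its empirical mean, so exploration of the other arms becomes rarer over time without ever being switched off explicitly.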


3 RESULTS

3.1 UCB1 Algorithm Performance

The model parameter \(\alpha\) and the horizon \(n\) are chosen to be 1 and 100000, respectively. The result is averaged over 100 random experiments. As shown in Table 2, out of 100000 rounds, the optimal arm is picked 96971.49 times on average, far more than the 3028.51 times that all other arms are picked combined. The reward generated by the optimal arm is 96971.23, far exceeding the total reward generated by the other arms (37.84 on average).

Table 2: UCB1 algorithm performance overview.

| | Count of selection | Percentage count | Reward generated |
|---|---|---|---|
| Optimal arm | 96971.49 | 96.97% | 96971.23 |
| Other arms | 3028.51 | 3.03% | 37.84 |

Figure 1 visualizes the overall performance of the UCB1 algorithm. The average cumulative regret is marked every 4000 rounds using blue crosses, and one standard error is shaded in light blue. As shown in Figure 1, the average cumulative regret grows logarithmically with the number of rounds. A significant loss from failing to choose the optimal arm is incurred in the initial stage, and less loss is incurred after the exploration stage, so the accuracy of identifying potential credit card fraud increases over time. The average regret after 100000 rounds is 2990.93 with a relatively small standard error of 38.84, which indicates that the UCB1 algorithm is stable. The running time of the algorithm is 135.65 seconds, which suggests that it is computationally efficient.

Figure 1: Cumulative regret using UCB1 algorithm with \(\alpha = 1\).
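Averaging over repeated experiments, as done for the reported regret, can be sketched as below. The sketch assumes that "standard error" means the standard error of the mean (sample standard deviation divided by the square root of the number of runs); the paper does not spell out the formula, and the function name and the toy values are invented for illustration.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Sample mean and standard error of the mean (sample stdev / sqrt(count))."""
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error

# Toy final-regret values from five hypothetical runs (not the paper's data).
print(mean_and_standard_error([1.0, 2.0, 3.0, 4.0, 5.0]))
```

In an experiment like the one reported here, `values` would hold the cumulative regret of each of the 100 runs at a given round.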

3.2 Thompson Sampling Algorithm

Performance

The horizon \(n\) is chosen to be 100000, and the result is averaged over 100 random experiments. As shown in Table 3, out of 100000 rounds, the optimal arm is picked 99693.94 times on average, which means the model picks the optimal arm more than 99% of the time. The reward generated by the optimal arm is 99690.99, in sharp contrast to the total reward from all other arms (272.83 on average).

Table 3: Thompson sampling algorithm performance

overview.

| | Count of selection | Percentage count | Reward generated |
|---|---|---|---|
| Optimal arm | 99693.94 | 99.69% | 99690.99 |
| Other arms | 306.06 | 0.31% | 272.83 |

As displayed in Figure 2, the mean cumulative regret is marked every 4000 rounds using red crosses, while the standard error is shaded in light pink. The total regret of the Thompson sampling algorithm after 100000 rounds is 305.89, far less than the 2990.93 of the UCB1 algorithm. The regret rises sharply in approximately the first 1000 rounds and then increases extremely slowly, which indicates that the variance of the posterior of the optimal arm has converged towards zero and the model has successfully identified the optimal arm. Overall, the cumulative regret curve does not have a logarithmic shape. The standard error is 36.17, nearly the same as that of the UCB1 algorithm (38.84). The running time of the Thompson sampling algorithm is 120.50 seconds, shorter than that of the UCB1 algorithm (135.65 seconds).

Figure 2: Cumulative regret using Thompson sampling.


4 CONCLUSIONS

This study explores the application of multi-armed

bandit (MAB) algorithms, including UCB1 and

Thompson Sampling, to the problem of credit card

fraud detection. Transactions are categorized into 52

distinct groups, or 'arms,' based on their types and

amounts. The goal of these models is to pinpoint the

arm with the highest likelihood of fraudulent activity,

thereby directing human investigators to the most

suspect transactions within a vast dataset. The

approach is validated by its performance in

minimizing cumulative regret, which enables

financial institutions to efficiently focus on arms

yielding the highest average reward. Furthermore, the

Thompson Sampling algorithm demonstrates

superior performance over the UCB1 algorithm by

achieving lower cumulative regret, exhibiting small

standard errors akin to those of UCB1, and

maintaining low computational complexity. For future work, the arms can be formed more reasonably and comprehensively. In this paper, only transaction types and transaction amounts are taken into account. More features of these transactions can be utilized, since the dataset provides 20 additional unused attributes, and more advanced algorithms such as incremental regression trees and KNN could be used to cluster transactions into arms. Moreover, fraudsters have been reported to constantly modify their behavior in order to escape detection by existing models, a phenomenon known as concept drift (Soemers et al., 2018). The methods for clustering transactions into arms should therefore also take concept drift into consideration and be updated regularly. Furthermore, more MAB algorithms such as LinUCB, Efficient-UCBV, Discounted UCB, and Sliding-window UCB (Garivier and Moulines, 2008) can be implemented and tested so that the computational complexity and cumulative regret can be further reduced, or concept drift may be better handled.

REFERENCES

Lattimore, T., Szepesvári, C., 2020. Bandit Algorithms.

Cambridge University Press.

Mukherjee, S., Naveen, K. P., Nandan, S., Balaraman, R.,

2018. Efficient-UCBV: An almost optimal algorithm

using variance estimates. Proceedings of the AAAI

Conference on Artificial Intelligence.

Soemers, D., Brys, T., Driessens, K., Winands, M., Nowé,

A., 2018. Adapting to concept drift in credit card

transaction data streams using contextual bandits and

decision trees. Proceedings of the AAAI Conference on

Artificial Intelligence.

Awoyemi, J., Adetunmbi, A., Oluwadare, S., 2017.

Credit card fraud detection using Machine Learning

Techniques: A Comparative Analysis. 2017

International Conf. on Computing Networking and

Informatics (ICCNI).

Garivier, A., Moulines, E., 2008. On upper-confidence

bound policies for non-stationary bandit problems.

Bernard P., El Mekkaoui De Freitas N., Maillet B., 2019. A

financial fraud detection indicator for investors: An

IDeA. Annals of Operations Research.

Ferreira, F., Meidutė-Kavaliauskienė, I., 2019. Toward a

sustainable supply chain for Social Credit: Learning by

experience using single-valued neutrosophic sets and

fuzzy cognitive maps. Annals of Operations Research.

Sariannidis, N., Papadakis, S., Garefalakis A., Lemonakis

C., Kyriaki-Argyro T., 2019. Default avoidance on

credit card portfolios using accounting,

Demographical and exploratory factors: Decision

making based on machine learning (ML) techniques.

Annals of Operations Research.

Zhang, S., 2022. Challenges in KNN classification. IEEE

Transactions on Knowledge and Data Engineering.

Dornadula, V., Geetha, S., 2019. Credit card fraud

detection using machine learning algorithms. Procedia

Computer Science.
