Q-Defense: When Q-Learning Comes to Help Proof-of-Work Against the

Selﬁsh Mining Attack

Ali Nikhalat-Jahromi

1 a

, Ali Mohammad Saghiri

1,2 b

and Mohammad Reza Meybodi

1 c

Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran

Department of Computer Science, William Paterson University, New Jersey, U.S.A.

Keywords:

Blockchain, Proof-of-Work, Selﬁsh Mining, Defense Mechanism, Reinforcement Learning, Q-Learning.

Abstract:

The Proof-of-Work (PoW) consensus protocol is widely utilized in various blockchain implementations, in-

cluding Bitcoin. The security of this protocol relies heavily on the incentive-compatibility of participating

miner, who compete against each other to discover new blocks. However, the assumption that competition

will naturally evolve into collaboration, ensuring blockchain security, is not always valid. Certain colluding

miners, known as ”selﬁsh miners,” attempt to unfairly obtain rewards by deviating from the prescribed proto-

col. In this paper, we propose a novel learning-based mechanism to address this challenge and enhance the

PoW protocol. Speciﬁcally, we apply Q-Learning, a prominent technique in reinforcement learning, to each

miner in order to mitigate the impact of selﬁsh collaboration among colluding miners. To best of our knowl-

edge, this is the ﬁrst defense mechanism based on Q-Learning in the literature. Our comprehensive analysis

demonstrates that the proposed modiﬁcation to the PoW protocol can increase the threshold for successful

selﬁsh mining attacks from 25% to 40%. Furthermore, simulation results comparing our defense mechanism

with tie-breaking, a well-known defense approach, validate the effectiveness of our proposed mechanism.

1 INTRODUCTION

In recent years, the emergence of blockchain

(Nakamoto, 2008) technology has garnered con-

siderable interest and has been recognized for its

transformative potential across numerous industries.

By offering a secure and decentralized platform

(Narayanan et al., 2016), (Antonopoulos, 2014) for

transactions and data storage, blockchain has paved

the way for signiﬁcant advancements. Initially in-

troduced in 2009 by an enigmatic individual named

Satoshi Nakamoto, blockchain found its ﬁrst imple-

mentation through Bitcoin(Nakamoto, 2008). Since

its inception, this technology has gained widespread

popularity, with the introduction of various cryptocur-

rencies, most of which are based on the proof-of-

work consensus mechanism (Bonneau et al., 2015),

(Tschorsch and Scheuermann, 2016), (Wang et al.,

2019), (Judmayer et al., 2022).

In the proof-of-work consensus mechanism

(Mingxiao et al., 2017), (Chaudhry and Yousaf,

https://orcid.org/0000-0001-8398-8058

https://orcid.org/0000-0003-0797-314X

https://orcid.org/0000-0003-3775-5565

2018), (Panda et al., 2019), exempliﬁed by Bitcoin’s

current operation, a group of nodes known as min-

ers engage in a competitive effort to uncover new

blocks. The process of discovering a new block

serves two important purposes. Firstly, it involves

recording transactions, ensuring their inclusion in the

blockchain. Secondly, it extends the chain of blocks

by introducing a new block, reinforcing the integrity

and continuity of the blockchain (Narayanan et al.,

2016), (Antonopoulos, 2014).

In the process of mining (Mingxiao et al., 2017),

ensuring the principle of incentive-compatibility

(Courtois and Bahack, 2014), (Eyal, 2015), (Babaioff

et al., 2012) is crucial for maintaining network secu-

rity. Miners actively compete against each other in

the quest to discover blocks, driven by the desire to

obtain two types of rewards. Firstly, they seek the

reward associated with successfully mining a block.

Secondly, they aim to collect transaction fees that are

included in the blocks they discover (Antonopoulos,

2014), (Eyal and Sirer, 2018).

The assumption that all miners in the network con-

sistently adhere to the rules is ﬂawed (Eyal and Sirer,

2018), (Bai et al., 2019). In reality, there are in-

stances where a group of colluding miners forms a

Nikhalat-Jahromi, A., Saghiri, A. and Meybodi, M.

Q-Defense: When Q-Learning Comes to Help Proof-of-Work Against the Selﬁsh Mining Attack.

DOI: 10.5220/0012378600003636

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 16th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2024) - Volume 1, pages 37-46

ISBN: 978-989-758-680-4; ISSN: 2184-433X

pool with the intention of deviating from the rules to

unfairly obtain a greater share of the rewards relative

to their computational power. These miners exhibit

selﬁsh behavior, contrasting with the rational miners

who strictly abide by the mining rules. This form of

self-serving conduct by miners is commonly referred

to as a selﬁsh mining attack (Eyal, 2015), (Eyal and

Sirer, 2018).

In addition to undermining incentive-

compatibility, the selﬁsh mining attack carries

hidden consequences. One of these is the potential to

entice rational miners to align with the selﬁsh miners,

amplifying their combined computational power

(Eyal, 2015), (Eyal and Sirer, 2018). This incre-

mental participation process further consolidates the

dominance of the selﬁsh miners, posing a signiﬁcant

threat to the decentralization of the targeted network.

Moreover, the imposition of the selﬁsh miners’ new

mining rule renders the victim network vulnerable to

other attacks (Nayak et al., 2016), including the risk

of double-spending (Eyal, 2015).

The covert nature of the selﬁsh mining attack

makes it challenging to detect for rational participants

in the network. In such a dire situation, ﬁnding ef-

fective solutions becomes an arduous task. Previous

attempts to address this type of attack have proven to

offer minimal or no substantial relief. To best of our

knowledge, this problem has been remained opened

as long as the introduction of this kind of attack. Con-

sequently, in our perspective, it is imperative to ap-

proach the problem in conjunction with the proof-of-

work consensus mechanism, recognizing the need for

a comprehensive solution.

Our research endeavors to devise an effective so-

lution by going deeper into the intrinsic nature of the

proof-of-work mechanism. To achieve this objective,

we conducted a thorough examination of Nakamoto’s

proof-of-work-based blockchains, including Bitcoin,

Litecoin (CLARKE et al., 2018), and Zcash (Hop-

wood et al., 2016). Our primary emphasis lies on Bit-

coin, given its preeminence within this coin family.

Through meticulous study and analysis, we aim to un-

cover insights that contribute to addressing the chal-

lenges posed by the proof-of-work consensus mecha-

nism.

Our proposed solution involves equipping all min-

ers in the network with Q-Learning (Sutton and Barto,

2018), (Watkins and Dayan, 1992), (Clifton and

Laber, 2020) agents. Traditionally, miners adhere to

the proof-of-work protocol by selecting the longest

chain of blocks when faced with multiple compet-

ing chains. However, our novel solution, leverag-

ing the embedded reinforcement learning agent, aims

to intelligently select the optimal chain, deviating

from the conventional policy. By incorporating Q-

Learning, we enable miners to make more informed

and strategic decisions in choosing the most advan-

tageous chain. In order to assess the efﬁcacy of the

newly proposed defense method, a series of simu-

lations have been performed. Specially, the distinct

properties of the proposed defense mechanism are

thoroughly examined. We should specify that the

code is available on Github

The contributions of this paper can be summarized

as follows: we propose a novel defense algorithm

against selﬁsh mining that incorporates Q-Learning.

The effectiveness of the proposed algorithm is evalu-

ated through a series of experiments and compared

to a well-known existing defense mechanism, tie-

breaking.

The structure of the paper is organized as follows:

in Section 2, we discuss related works. In Section

3, we provide the necessary preliminaries and back-

ground information. The details of the proposed al-

gorithm are presented in Section 4. The performance

of the defense algorithm is thoroughly evaluated in

Section 5. Section 6 would discuss the paper’s limita-

tions and problems. Finally, we conclude the paper in

Section 7, highlighting the key ﬁndings and contribu-

tions.

2 RELATED WORK

In this section, we provide a summary of existing de-

fenses against selﬁsh mining attacks. The defenses

are categorized based on the similarity of the meth-

ods used. We focus on the most popular defenses that

have been proposed.

First and foremost, it is important to highlight

that our work draws signiﬁcant inspiration from

previous research. Speciﬁcally, we have proposed

the Nik defense (Nikhalat-Jahromi et al., 2023a),

which stands as the pioneering defense against self-

ish mining attacks employing learning automaton

agents. Additionally, we have devised VDHLA (Vari-

able Depth Hybrid Learning Automaton) (Nikhalat-

Jahromi et al., 2023b), a novel family of learning au-

tomata that plays a central role in the implementation

of the Nik defense. The subsequent paragraphs will

categorize and present other defenses based on their

similarities.

Certain defenses require fundamental changes to

the blockchain structure, often necessitating signiﬁ-

cant updates to blockchain nodes that are incompati-

ble with previous versions. One defense proposed by

http://github.com/AliNikhalat/SelﬁshMining

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

Bahack (Bahack, 2013) involves imposing a punish-

ment rule for all miners, including honest ones. In this

defense, miners who fork the blockchain are subject

to punishment. However, a drawback of this defense

is that it also punishes honest miners. Solat et al. (So-

lat and Potop-Butucaru, 2016) introduced Zeroblock,

which enforces the timely release of blocks by miners.

If miners withhold their blocks for selﬁsh mining and

fail to broadcast them within the expected time, net-

work peers generate their own dummy blocks. These

defenses require changes to block validity and reward

distribution, necessitating client updates by network

nodes to comply with the new protocol. Another de-

fense proposed by (Saad et al., 2019) suggests alter-

ing the structure of transactions. This defense intro-

duces an additional parameter called ”Expected Con-

ﬁrmation Height” in transactions, which is compared

with the expected value for the published block height

to detect selﬁsh mining attacks. The following para-

graph discusses defenses that are effective when a

new fork is observed in the network.

Defenses aimed at reducing the chances of self-

ish miners creating a fork have been developed. The

most widely accepted solution is the tie-breaking de-

fense, proposed by Eyal and Sirer (Eyal and Sirer,

2018). According to this defense, when a miner dis-

covers competing branches of the same length, they

should propagate all of them and randomly choose

one to mine on. As demonstrated in their paper, self-

ish mining can only be initiated with a minimum min-

ing power of approximately 25%. Heilman (Heil-

man, 2014) proposed another backward-compatible

defense called Freshness Preferred (FP), which pe-

nalizes miners withholding blocks by utilizing an

unforgeable timestamp parameter. The latest value

of the unforgeable timestamp is compared to detect

block withholding. Heilman claimed that the lower

bound threshold for selﬁsh mining would increase

from 25% to 32%. However, a limitation of this solu-

tion is the reliance on a trusted party in the network,

which contradicts the decentralized nature of Bitcoin.

The following paragraph introduces defenses based

on fork-resolving policies.

Defenses operating on fork-resolving policies aim

to modify protocols so that the defense mechanisms

are triggered when the selﬁsh chain becomes longer

than the public chain, in contrast to the tie-breaking

defense. The ﬁrst solution in this category is Pub-

lish or Perish proposed by Zhang and Preneel (Zhang

and Preneel, 2017). In Publish or Perish, blocks not

published in time are disregarded, while blocks that

include links to competing blocks of their predeces-

sors are given preference. Consequently, a block re-

mains secure until a competing block is published,

contributing to neither or both branches, thereby pro-

viding no advantage in winning the block race. The

following paragraph introduces a machine learning-

based algorithm for detecting selﬁsh mining attacks.

Research has been conducted to identify factors

that can detect selﬁsh mining attacks (Chicarino et al.,

2020),(Peterson et al., 2022). These studies utilize ex-

isting data on selﬁsh mining attacks to create training

and test datasets. The research investigates various

factors and explores future research directions in this

context.

3 PRELIMINARIES

In this section, we provide the necessary background

information about the proposed algorithm, encom-

passing key aspects of blockchain, selﬁsh mining

strategies, and Q-Learning.

Bitcoin (Nakamoto, 2008) is a decentralized cryp-

tocurrency that enables users to transfer digital cur-

rency by generating new transactions and appending

them to the blockchain ledger. The blockchain serves

as an immutable ledger maintained by a network of

miners who secure it against data tampering. Miners

are incentivized (Eyal and Sirer, 2018) for their con-

tributions in safeguarding the blockchain by receiving

rewards. Each transaction within the blockchain con-

sists of one or more inputs and outputs, with the dif-

ference between the total inputs and outputs referred

to as the transaction fee. This fee is directed to the

miner responsible for incorporating the transaction

into the blockchain (Nakamoto, 2008), (Narayanan

et al., 2016), (Antonopoulos, 2014). Subsequently,

we thoroughly explore a comprehensive explanation

of the mining process, followed by an exploration of

selﬁsh mining attacks. Lastly, we discuss Q-Learning.

Further details regarding the mining process will be

provided in the subsequent subsection.

3.1 Mining Process

The state of the blockchain is altered through the ex-

ecution of transactions, which are then grouped into

blocks and appended to the blockchain. A typical

block in the blockchain comprises two key compo-

nents: the header and the body (Wang et al., 2021).

The block’s header contains vital information such as

the hash of the previous block, the hash of the current

block, the Merkle root representing all the transac-

tions included in the block, and a nonce value. On

the other hand, the block’s body consists of the trans-

actions that the miner has chosen to include in the

block (Nakamoto, 2008), (Narayanan et al., 2016),

Q-Defense: When Q-Learning Comes to Help Proof-of-Work Against the Selﬁsh Mining Attack

(Antonopoulos, 2014).

To validate a block, a solution to a cryptographic

puzzle must be found (Eyal and Sirer, 2018), (Sapir-

shtein et al., 2017). Miners strive to discover the cor-

rect nonce value to be placed in the block’s header,

resulting in a block hash that is smaller than the pre-

deﬁned block difﬁculty target. The block difﬁculty

is dynamically adjusted to maintain an average block

generation rate of approximately one block every ten

minutes. Upon successfully solving the puzzle and

adding the mined block to the longest chain, the miner

is rewarded with newly created Bitcoins that did not

exist previously, as well as the transaction fees asso-

ciated with the newly included transactions.

The probability of mining a new block is directly

proportional to the computational resources utilized

in solving the puzzle (Eyal and Sirer, 2018), (Sapir-

shtein et al., 2017). However, due to the inherent na-

ture of the mining process, the time interval between

mining events exhibits signiﬁcant variance from the

perspective of an individual miner. As a result, min-

ers often opt to join mining pools, where members

collaborate to collectively mine each block and share

the generated rewards whenever one of them success-

fully mines a block. While participating in a min-

ing pool does not alter a miner’s expected revenue,

it helps reduce the revenue variance and provides a

more predictable monthly income (Eyal and Sirer,

2018), (Sapirshtein et al., 2017).

3.2 Selﬁsh Mining Attack

Bitcoin’s documentation provides a comprehensive

overview of the block release process after mining.

When a miner successfully mines a new valid block,

it is expected to promptly share it with the network.

However, Eyal and Sirer (Eyal and Sirer, 2018) intro-

duced the concept of ”selﬁsh mining,” which involves

certain miners deviating from Bitcoin’s standard min-

ing rules to unfairly increase their revenue. This par-

ticular strategy, known as ”SM1,” is the ﬁrst and most

widely recognized form of selﬁsh mining.

In the SM1 selﬁsh mining strategy (Eyal and Sirer,

2018), miners attempting to act selﬁshly conceal their

newly discovered blocks by refraining from broad-

casting them to the network. This deliberate conceal-

ment creates a fork in the blockchain. One fork is

openly known to all network participants, while the

other remains hidden and known only to the selﬁsh

miners. In this scenario, honest miners continue to

work on the stale chain, unaware of the hidden chain

pursued by the selﬁsh miners. Consequently, honest

chain will be discarded and the relative revenue of the

selﬁsh miners increases, which in turn motivates other

Honest Chain

Selfish Chain

Figure 1: Development of

private chain using selﬁsh

miners(white blocks).

Honest Chain

Selfish Chain

Figure 2: After publishing

private chain and adoption

of private chain.

miners to join their ranks.

To gain a better understanding of the SM1 strat-

egy, let’s consider a simpliﬁed scenario. In this sce-

nario, selﬁsh miners aim to carry out their mining ac-

tivities covertly. When they are two blocks ahead of

the honest miners (white blocks in Figure 1), upon

discovering the ﬁrst block mined by the honest min-

ers (blue block in Figure 1), they reveal their private

chain. As a result, the work done by the honest min-

ers is discarded (orange block in Figure 2), and the

network adopts the secret blocks mined by the selﬁsh

miners (blue blocks in Figure 2).

As well as SM1 strategy, new strategies for selﬁsh

mining come to play. Sapirshtein et al.(Sapirshtein

et al., 2017) employed Markov Decision Process

(MDP) to study the proﬁt threshold, which represents

the minimum fraction of resources necessary for a

proﬁtable attack. They determined a bound that en-

sures system security against such attacks and modi-

ﬁed the protocol to evaluate its susceptibility to selﬁsh

mining by computing the optimal attack under differ-

ent scenarios. They revealed situations in which self-

ish miners could retain control of a selﬁsh chain, even

if the public chain appears longer.

Recently, new research areas have emerged in the

ﬁeld of selﬁsh mining, particularly leveraging ma-

chine learning techniques (Wang et al., 2021), (Hou

et al., 2019). Many of these studies have utilized rein-

forcement learning methods to enhance the optimality

of the attack (Sapirshtein et al., 2017). For instance,

(Bar-Zur et al., 2022), (Bar-Zur et al., 2023) devel-

oped an improved MDP-based solution by applying

reinforcement learning algorithms to maximize rev-

enue. They introduced a novel deep reinforcement

learning framework to analyze the incentives of ratio-

nal miners under various conditions and established

an upper bound for the security threshold of proof-of-

work-based blockchains.

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

3.3 Q-Learning

Q-Learning (Sutton and Barto, 2018), (Watkins and

Dayan, 1992), (Clifton and Laber, 2020) is a widely

used reinforcement learning algorithm that has gained

signiﬁcant attention in the ﬁeld of machine learning.

It is a model-free approach that enables an agent to

learn optimal actions in an environment through trial

and error.

At its core, Q-Learning involves the exploration

and exploitation of a state-action space. The agent

interacts with the environment by taking actions and

receiving rewards or penalties based on those actions.

The goal is to maximize the cumulative reward over

time.

Agent

Environment

reward r

state s

t+1

action a

Figure 3: Interaction between Q-Learning agent and envi-

ronment.

The algorithm maintains a Q-table, which is a ma-

trix that stores the expected cumulative rewards for

each state-action pair. Initially, the Q-values in the ta-

ble are arbitrary or set to zero. As the agent explores

the environment, it updates the Q-values based on the

rewards received and the new information obtained.

The following equation has shown the updating rule

of the Q-learning:

Q(s, a) ← (1 −α

) Q(s, a)+α

(r + γ

Q(s

′

, a

′

)) (1)

Where:

• Q(s, a) represents the Q-value of state s and action

• α

is the learning rate, controlling the impact of

new information on the Q-values

• r is the immediate reward received after taking ac-

tion a in state s

• γ

is the discount factor, determining the impor-

tance of future rewards

• s

′

is the next state after taking action a in state s

• a

′

is the action with the highest Q-value in state s

′

This updating rule calculates the new Q-value

based on a weighted combination of the previous Q-

value and the new information obtained from the re-

ward and the maximum Q-value of the next state. The

learning rate determines the balance between the new

information and the existing knowledge.

By repeatedly applying this updating rule during

the learning process, the Q-values converge to their

optimal values, allowing the agent to make informed

decisions and maximize its cumulative reward over

time.

4 PROPOSED ALGORITHM:

Q-DEFENSE

This section goes deeper into the details of the pro-

posed method. We begin by providing the system

model and necessary deﬁnitions to establish a clear

understanding of our proposed algorithm. Subse-

quently, we present the proposed defense algorithm

in a step-by-step manner, highlighting its key compo-

nents and mechanisms.

4.1 System Model

Every algorithm proposition requires a thorough un-

derstanding of the environment in which it is intended

to operate. Therefore, it is essential to establish a clear

model of our target blockchain system. Speciﬁcally,

we need to analyze the attack scenario from the per-

spective of honest miners.

Firstly, let us consider a network consisting of two

groups of miners. The ﬁrst group comprises selﬁsh

miners who deviate from the prescribed mining rules.

We denote their mining power as α percent of the total

mining power. Conversely, the second group consists

of rational miners who diligently adhere to the mining

rules and possess 1 − α percent of the total mining

power.

The impact of network propagation delay is disre-

garded in our analysis. It is assumed that miners in

the network strive to propagate blocks swiftly to min-

imize disruptions in the subsequent mining process.

Moreover, the blocks in the network are organized

in a tree structure to construct a chain. If two blocks

share the same previous block hash, it indicates the

creation of a new fork in the chain. However, in our

analysis, we only consider one type of fork, which

arises from selﬁsh mining. From the fork, two chains

emerge: the selﬁsh chain and the honest chain.

If a selﬁsh miner becomes aware that an honest

miner has discovered a new block, it will attempt to

replace its own private block. We introduce a parame-

ter γ as the advertisement factor, representing the pro-

portion of computing power required for nodes to ac-

cept the selﬁsh miner’s block instead of the honest

Q-Defense: When Q-Learning Comes to Help Proof-of-Work Against the Selﬁsh Mining Attack

miner’s block. In terms of block height, when the Bit-

coin network is at height h, the block of the selﬁsh

miner at height h has a probability of γ(1 − α) of be-

ing accepted.

4.2 Required Deﬁnitions

This subsection is dedicated to presenting the prereq-

uisite deﬁnitions that are essential for a better under-

standing of the proposed algorithm. It is important to

note that these deﬁnitions are based on the character-

istics of each branch of the fork resulting from selﬁsh

mining. The following deﬁnitions are provided:

• A fork branch refers to one of the competing

chains in the network that is created as a result of

selﬁsh mining. It consists of a sequence of blocks

starting from the common ancestor block up to the

current block in that branch.

• Let L represent the length of a branch, indicating

the number of blocks it contains. The length of a

branch serves as a crucial parameter for assessing

the size and extent of competing chains resulting

from selﬁsh mining.

• Let W denote the weight of a branch, which is de-

termined based on a comparison of blocks within

that branch and blocks of the same height in other

branches. The branch with the most recent cre-

ation time will have its weight increased by one,

encompassing all blocks from the ﬁrst to the last

in that particular branch.

• Let K denote the fail-safe parameter, which assists

the miner in selecting a branch based on either its

length (L) or its weight (W ). If the length of a

branch in the fork exceeds the others by at least

K, that branch is chosen. Otherwise, the weight

parameter W is used to determine the preferred

branch.

• Let τ denote the decision-making time, which rep-

resents the time taken by a miner to check for the

presence of forks. If a fork is detected, the miner

must select one of the branches, taking into ac-

count the parameter K.

• The time parameter θ is deﬁned to determine the

next value for K using the automaton. It is com-

posed of multiple τ values.

4.3 Algorithm

Having established the necessary deﬁnitions, we can

now proceed to deﬁne our algorithm for defending

against selﬁsh mining attacks. The proposed algo-

rithm is outlined in the following steps:

1. Calculate the length L of each branch.

2. Calculate the weight W of each branch.

3. If the difference between the length of the longest

branch and the length of the second longest

branch is greater than K, choose the longest

branch. Otherwise, choose the heaviest branch

based on the calculated W parameter.

4. When τ reaches the end, the Q-Learning agent de-

termines the next value for K. Typically, K oscil-

lates between K

min

and K

max

. It is important to

note that the Q-Learning agent has one state and

three actions. The actions correspond to: 1-Grow,

which increases K by one; 2-Stop, which keeps K

unchanged; and 3-Shrink, which decreases K by

one.

5. When θ reaches the end, the Q-Learning agent re-

ceives feedback from the environment. We have

designed a virtual environment for the Q-Learning

agent to provide information about its decisions.

The reward is calculated by dividing the number

of weight decisions (W ) by the total number of

decisions. The total number of decisions includes

the number of height decisions (L) plus the num-

ber of weight decisions (W ). The following equa-

tion illustrates the calculation of the r parameter

(reward) of the Q-Leaning agent:

r =

Number o f Weight Decisions

Total Number o f Decisions

(2)

5 EVALUATION

To evaluate the proposed algorithm, we conducted

a series of experiments that explore its performance

from various perspectives. In these experiments, we

considered a network with two types of miners: self-

ish miner and rational miner. In our simulations, the

mining process follows a Poisson process, where a

new block is mined with a probability of α by the

selﬁsh miner and a probability of 1 −α by the rational

miner.

Moreover, we modiﬁed the behavior of the selﬁsh

miner to operate within a learning-based mining envi-

ronment. This allowed us to observe the dynamics of

the proposed algorithm in a realistic setting.

In order to fulﬁll our objectives, we conducted ex-

periments involving a total of 10,000 blocks. The fail-

safe parameter, denoted by K, was allowed to vary

within the range K

min

= 1 and K

max

= 3. This range

of K was chosen to allow Q-Learning agents to reach

consensus in a short amount of time since choosing K

is a time-consuming process. The Q-Learning agents

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

(a) γ = 0 (b) γ = 0.5

Figure 4: Comparing Q-defense with tie-breaking using various values of α and γ.

were conﬁgured with speciﬁc values: a learning rate

of 0.1 and a discount factor of 0.75. These settings

were chosen to optimize the learning and decision-

making process of the agents within the proposed al-

gorithm.

To demonstrate the effectiveness of the proposed

defense, we compared it with the tie-breaking defense

mechanism. In these experiments, τ was set to the

mining time of ﬁve blocks, and θ consisted of ten τ.

These values are set for θ and τ to give Q-Learning

agents enough time to explore and exploit accurately.

The results are shown in Figure 4.

The results of the experiments highlight several

interesting ﬁndings. Firstly, they demonstrate the su-

periority of the proposed defense over tie-breaking

in terms of the relative revenue of selﬁsh miners.

This indicates that the proposed defense effectively

reduces the revenue gained by selﬁsh miners. Par-

ticularly, at γ = 1, where selﬁsh miners have max-

imum power, the proposed defense performs ex-

ceptionally well in decreasing the relative revenue.

These results provide strong evidence of the effective-

ness of Q-Learning in complex environments like the

blockchain.

In addition to relative revenue, another impor-

tant metric, the lower bound threshold, was exam-

ined. The results reveal another signiﬁcant observa-

tion. The proposed defense, leveraging the power of

Q-Learning, signiﬁcantly increases the lower bound

threshold for initiating the attack. Speciﬁcally, the

lower bound threshold is raised from 0.25 to approx-

imately 0.4. This indicates that the proposed defense

enhances the security and stability of the blockchain

network by making it more difﬁcult for selﬁsh miners

to launch successful attacks.

Furthermore, the detailed results of these experi-

ments, including the relative revenue for various val-

ues of α and γ, are provided in Appendix 7.

Q-Defense: When Q-Learning Comes to Help Proof-of-Work Against the Selﬁsh Mining Attack

6 DISCUSSION

Our work, while promising, acknowledges some in-

herent limitations that warrant discussion. One po-

tential concern is the computational complexity intro-

duced by machine learning methods, which can lead

to overheads. To address this, we deliberately opted

for Q-learning due to its favorable characteristics. Q-

learning’s off-policy nature allows it to efﬁciently ex-

plore and learn from suboptimal actions, preventing

entrapment in local optima. Being model-free, it can

handle diverse problems without requiring knowledge

of the environment’s dynamics. Moreover, its im-

plementation simplicity and efﬁcient memory usage

make it well-suited for our speciﬁc application, effec-

tively mitigating computational concerns and facili-

tating practical adoption.

Another aspect deserving attention is the potential

non-deterministic nature of our approach. The value

of parameter K heavily relies on the learning agent

at each node, which could introduce variability. To

address this, we limited the range of K, as demon-

strated in the experiment section, opting for a con-

trolled range, such as [1, 3]. This limitation helps to

reduce inconsistency and ensures a more stable per-

formance across different scenarios.

Considering scalability in large-scale networks is

crucial. As all nodes must reach consensus on one

chain in forks, the uniform convergence to the same

K value becomes vital. With a larger number of nodes

in the network, there is a higher likelihood of K reach-

ing a consistent value across more nodes. This solu-

tion addresses both scalability issues and potential in-

consistencies related to K, making our approach more

robust and applicable in real-world, complex network

environments.

Lastly, our system effectively prevents block with-

holding attempts through the τ and θ parameters. All

miners actively participate in chain selection and eval-

uating their K value, making block withholding an

ineffective strategy. This ensures the security and

fairness of the consensus process and strengthens the

overall integrity of our approach. By incorporating τ

and θ, we have established a robust defense against

block withholding and enhanced the reliability of the

system.

Consequently, our work demonstrates signiﬁcant

promise despite the acknowledged limitations. By

deliberately selecting Q-learning for its advantages,

controlling parameter variability, addressing scalabil-

ity concerns, and recognizing challenges like block

withholding, our approach provides valuable insights

and sets the foundation for future advancements in

consensus algorithms for large-scale networks. As

with any research, acknowledging and discussing lim-

itations contribute to the scientiﬁc discourse and pave

the way for even more robust solutions in the ﬁeld.

7 CONCLUSION

One of the most signiﬁcant challenges to the proof-

of-work consensus mechanism is posed by colluding

miners who aim to deviate from the prescribed min-

ing rules. These colluding miners, commonly known

as selﬁsh miners, have remained a persistent threat

since the early days of Bitcoin. In order to address

this problem, we were inspired to leverage the power

of Q-Learning, a well-established method in the ﬁeld

of reinforcement learning.

Our proposed model has undergone comprehen-

sive evaluation from various perspectives. In terms

of relative revenue, our method has successfully re-

duced the revenue obtained by selﬁsh miners com-

pared to conventional proof-of-work approaches that

lack learning mechanisms. Furthermore, our evalua-

tions reveal a signiﬁcant increase in the initial thresh-

old of the selﬁsh attack, rising from 25% of the to-

tal mining power to 40% approximately. The results

show the superiority of the proposed approach over

tie-breaking.

The suggested mechanism opens up new av-

enues at the intersection of artiﬁcial intelligence and

blockchain technology. As a future work, we will

consider more implementation details and model-

ings to deﬁne comprehensive defense mechanism for

proof-of-work consensus algorithms.

REFERENCES

Antonopoulos, A. M. (2014). Mastering Bitcoin: unlocking

digital cryptocurrencies. ” O’Reilly Media, Inc.”.

Babaioff, M., Dobzinski, S., Oren, S., and Zohar, A. (2012).

On bitcoin and red balloons. In Proceedings of the

13th ACM conference on electronic commerce, pages

56–73.

Bahack, L. (2013). Theoretical bitcoin attacks with less than

half of the computational power (draft). arXiv preprint

arXiv:1312.7013.

Bai, Q., Zhou, X., Wang, X., Xu, Y., Wang, X., and Kong,

Q. (2019). A deep dive into blockchain selﬁsh mining.

In Icc 2019-2019 Ieee International Conference On

Communications (Icc), pages 1–6. IEEE.

Bar-Zur, R., Abu-Hanna, A., Eyal, I., and Tamar, A. (2022).

Werlman: to tackle whale (transactions), go deep (rl).

In Proceedings of the 15th ACM International Confer-

ence on Systems and Storage, pages 148–148.

Bar-Zur, R., Dori, D., Vardi, S., Eyal, I., and Tamar, A.

(2023). Deep bribe: Predicting the rise of bribery in

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

blockchain mining with deep rl. Cryptology ePrint

Archive.

Bonneau, J., Miller, A., Clark, J., Narayanan, A., Kroll,

J. A., and Felten, E. W. (2015). Sok: Research per-

spectives and challenges for bitcoin and cryptocurren-

cies. In 2015 IEEE symposium on security and pri-

vacy, pages 104–121. IEEE.

Chaudhry, N. and Yousaf, M. M. (2018). Consensus al-

gorithms in blockchain: Comparative analysis, chal-

lenges and opportunities. In 2018 12th International

Conference on Open Source Systems and Technolo-

gies (ICOSST), pages 54–63. IEEE.

Chicarino, V., Albuquerque, C., Jesus, E., and Rocha, A.

(2020). On the detection of selﬁsh mining and stalker

attacks in blockchain networks. Annals of Telecom-

munications, 75:143–152.

CLARKE, S., CRAIG, I., and WYSZYNSKI, M. (2018).

Litecoin cash: The best of all worlds sha256 cryp-

tocurrency. URL: https://litecoinca. sh/downloads/lcc

whitepaper. pdf.[Last accessed on 2018 Sep 25].

Clifton, J. and Laber, E. (2020). Q-learning: Theory and

applications. Annual Review of Statistics and Its Ap-

plication, 7:279–301.

Courtois, N. T. and Bahack, L. (2014). On subversive miner

strategies and block withholding attack in bitcoin dig-

ital currency. arXiv preprint arXiv:1402.1718.

Eyal, I. (2015). The miner’s dilemma. In 2015 IEEE sym-

posium on security and privacy, pages 89–103. IEEE.

Eyal, I. and Sirer, E. G. (2018). Majority is not enough:

Bitcoin mining is vulnerable. Communications of the

ACM, 61(7):95–102.

Heilman, E. (2014). One weird trick to stop selﬁsh min-

ers: Fresh bitcoins, a solution for the honest miner.

In Financial Cryptography and Data Security: FC

2014 Workshops, BITCOIN and WAHC 2014, Christ

Church, Barbados, March 7, 2014, Revised Selected

Papers 18, pages 161–162. Springer.

Hopwood, D., Bowe, S., Hornby, T., Wilcox, N., et al.

(2016). Zcash protocol speciﬁcation. GitHub: San

Francisco, CA, USA, 4(220):32.

Hou, C., Zhou, M., Ji, Y., Daian, P., Tramer, F., Fanti,

G., and Juels, A. (2019). Squirrl: Automating at-

tack analysis on blockchain incentive mechanisms

with deep reinforcement learning. arXiv preprint

arXiv:1912.01798.

Judmayer, A., Stifter, N., Krombholz, K., and Weippl, E.

(2022). Blocks and chains: introduction to bitcoin,

cryptocurrencies, and their consensus mechanisms.

Springer Nature.

Mingxiao, D., Xiaofeng, M., Zhe, Z., Xiangwei, W., and

Qijun, C. (2017). A review on consensus algorithm

of blockchain. In 2017 IEEE international conference

on systems, man, and cybernetics (SMC), pages 2567–

2572. IEEE.

Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic

cash system. Decentralized business review.

Narayanan, A., Bonneau, J., Felten, E., Miller, A., and

Goldfeder, S. (2016). Bitcoin and cryptocurrency

technologies: a comprehensive introduction. Prince-

ton University Press.

Nayak, K., Kumar, S., Miller, A., and Shi, E. (2016). Stub-

born mining: Generalizing selﬁsh mining and com-

bining with an eclipse attack. In 2016 IEEE Euro-

pean Symposium on Security and Privacy (EuroS&P),

pages 305–320. IEEE.

Nikhalat-Jahromi, A., Saghiri, A. M., and Meybodi, M. R.

(2023a). Nik defense: An artiﬁcial intelligence based

defense mechanism against selﬁsh mining in bitcoin.

arXiv e-prints, pages arXiv–2301.

Nikhalat-Jahromi, A., Saghiri, A. M., and Meybodi, M. R.

(2023b). Vdhla: Variable depth hybrid learning au-

tomaton and its application to defense against the

selﬁsh mining attack in bitcoin. arXiv preprint

arXiv:2302.12096.

Panda, S. S., Mohanta, B. K., Satapathy, U., Jena, D., Goun-

tia, D., and Patra, T. K. (2019). Study of blockchain

based decentralized consensus algorithms. In TEN-

CON 2019-2019 IEEE Region 10 Conference (TEN-

CON), pages 908–913. IEEE.

Peterson, M., Andel, T., and Benton, R. (2022). Towards

detection of selﬁsh mining using machine learning. In

International Conference on Cyber Warfare and Secu-

rity, volume 17, pages 237–243.

Saad, M., Njilla, L., Kamhoua, C., and Mohaisen, A.

(2019). Countering selﬁsh mining in blockchains. In

2019 International Conference on Computing, Net-

working and Communications (ICNC), pages 360–

364. IEEE.

Sapirshtein, A., Sompolinsky, Y., and Zohar, A. (2017). Op-

timal selﬁsh mining strategies in bitcoin. In Finan-

cial Cryptography and Data Security: 20th Interna-

tional Conference, FC 2016, Christ Church, Barba-

dos, February 22–26, 2016, Revised Selected Papers

20, pages 515–532. Springer.

Solat, S. and Potop-Butucaru, M. (2016). Zeroblock:

Timestamp-free prevention of block-withholding at-

tack in bitcoin. arXiv preprint arXiv:1605.02435.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-

ing: An introduction. MIT press.

Tschorsch, F. and Scheuermann, B. (2016). Bitcoin and be-

yond: A technical survey on decentralized digital cur-

rencies. IEEE Communications Surveys & Tutorials,

18(3):2084–2123.

Wang, T., Liew, S. C., and Zhang, S. (2021). When

blockchain meets ai: Optimal mining strategy

achieved by machine learning. International Journal

of Intelligent Systems, 36(5):2183–2207.

Wang, W., Hoang, D. T., Hu, P., Xiong, Z., Niyato, D.,

Wang, P., Wen, Y., and Kim, D. I. (2019). A survey on

consensus mechanisms and mining strategy manage-

ment in blockchain networks. Ieee Access, 7:22328–

22370.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine

learning, 8:279–292.

Zhang, R. and Preneel, B. (2017). Publish or perish: A

backward-compatible defense against selﬁsh mining

in bitcoin. In Topics in Cryptology–CT-RSA 2017: The

Cryptographers’ Track at the RSA Conference 2017,

San Francisco, CA, USA, February 14–17, 2017, Pro-

ceedings, pages 277–292. Springer.

Q-Defense: When Q-Learning Comes to Help Proof-of-Work Against the Selﬁsh Mining Attack

APPENDIX

In this appendix, we present detailed results of the

evaluation conducted in the Evaluation section (Sec-

tion 5), focusing on the impact of varying the α

and γ parameters. The analysis provides valuable in-

sights into the performance and effectiveness of our

proposed approach under different conﬁgurations of

these parameters.

Table 1: Q-Defense results for different values of α and γ.

Selﬁsh Miner Computational Power α Rational Miner Revenue Expected Rational Miner Revenue Selﬁsh Miner Revenue Expected Selﬁsh Miner Revenue

γ = 0.

0.0 100 100.0 0.0 0.0

0.05 99.67 95.00 0.33 5.00

0.10 98.62 90.00 1.38 10.00

0.15 99.10 85.00 0.90 15.00

0.20 93.60 80.00 6.4 20.00

0.25 95.08 75.00 4.92 25.00

0.30 93.14 70.00 6.85 30.00

0.35 87.04 65.00 12.96 35.00

0.40 77.51 60.00 22.49 40.00

0.45 56.73 55.00 43.26 45.00

0.50 24.33 50.00 75.66 50.00

γ = 0.5

0.0 100.0 100.0 0.0 0.0.00

0.05 99.41 95.00 0.59 5.00

0.10 97.73 90.00 2.27 10.00

0.15 95.15 85.00 4.85 15.00

0.20 93.44 80.00 6.56 20.00

0.25 86.06 75.00 13.94 25.00

0.30 83.31 70.00 16.69 30.00

0.35 70.55 65.00 29.45 35.00

0.40 65.73 60.00 34.27 40.00

0.45 49.58 55.00 50.49 45.00

0.50 12.07 50.00 87.93 50.00

γ = 1.0

0.0 100.0 100.0 0 0.00

0.05 98.89 95.00 1.11 5.00

0.10 96.30 90.00 3.70 10.00

0.15 93.04 85.00 6.96 15.00

0.20 86.31 80.00 13.69 20.00

0.25 82.17 75.00 17.83 25.00

0.30 74.72 70.00 25.28 30.00

0.35 64.95 65.00 35.05 35.00

0.40 56.91 60.00 43.09 40.00

0.45 45.05 55.00 54.95 45.00

0.50 12.68 50.00 87.32 50.00

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence