Catch Me if You Can: Improving Adversaries in Cyber-Security with Q-Learning Algorithms

Arti Bandhana, Ondřej Lukáš, Sebastian Garcia and Tomáš Kroupa
Czech Technical University, Czech Republic
Keywords:
Q-Learning, Reinforcement Learning, MDP, Cybersecurity, Learning Agents, Advanced Persistent Threat.
Abstract:
The ongoing rise in cyberattacks and the lack of skilled professionals in the cybersecurity domain to combat these attacks show the need for automated tools capable of detecting an attack with good performance. Attackers disguise their actions and launch attacks that consist of multiple actions, which are difficult to detect. Therefore, improving defensive tools requires their calibration against a well-trained attacker. In this work, we propose a model of an attacking agent and environment and evaluate its performance using basic Q-Learning, Naive Q-Learning, and Double Q-Learning, all of which are variants of Q-Learning. The attacking agent is trained with the goal of exfiltrating data whereby all the hosts in the network have a non-zero detection probability. Results show that the Double Q-Learning agent has the best overall performance by successfully achieving the goal in 70% of the interactions.
1 INTRODUCTION
The risk of cyber attacks is constantly increasing. Attackers continue to become more sophisticated and manage to find new vulnerabilities to exploit, making the role of network defenders skewed and asymmetric. Most attack techniques involve little direct interaction between the attacker and the defender. In attacks such as ransomware (ENISA, 2022), port scanning, or cryptocurrency mining, the interaction can be as little as a single action from the attacker. In more complex attacks, such as banking trojans or Advanced Persistent Threat (APT) attacks (Drašar et al., 2020), the attacker has to perform a series of steps within the network or target device to be successful while remaining undetected. Such attacks are extremely difficult to detect, yet they are the most impactful. APT attacks are usually long-term, with many decisions typically taken by a human who adapts their tactics and techniques to avoid detection, and in most cases the defense mechanisms are not versatile enough to adapt to the behavior of an attacker.
APT attackers can be modeled as agents who pur-
sue their goals while interacting with an environment
(target device or network). Most of these interactions are captured by game-theoretic or Reinforcement Learning (RL) models with the intent of improving defenses in the network. Game-theoretic frameworks are used to provide solutions for optimal defenses (such as honeypot allocation), whereas RL models are mostly used to improve penetration testing attacks (Durkota et al., 2016; Mitchell and Healy, 2018). LSTM networks and Q-Learning techniques have also been applied to predict the attacker's actions in APT data sets (Dehghan et al., 2022). However, modeling realistic defenses inevitably requires learning almost optimal decisions for attackers. To the best of our knowledge, there are no studies about modeling an APT attacker's behavior with the goal of improving the decisions made by the attacker. Creating a realistic inference model for the attacker requires consideration of factors such as the attacker's intent, capabilities, objectives, opportunities, and available resources (Moskal et al., 2018; Liu et al., 2005). Due to the complexity of these attributes, developing a general framework becomes challenging. To overcome these challenges, RL models are generally applied to learn an optimal policy from a defender's perspective; however, we are unaware of an RL model that optimizes the actions of an APT attacker.
In this paper, we model both an APT attacker and
a network environment to train RL agents that opti-
mize the attack. The goal of the attacker is to exfiltrate
data from a specific server inside a local network to a command and control (C&C) server on the Internet. To find the optimal policy for the attacker, three off-policy RL algorithms are trained: Q-Learning, Naive Q-Learning, and Double Q-Learning.
Our results show that the Double Q-Learning-based attacker agent is able to exfiltrate data in almost 70% of the interactions.
Furthermore, we show that the agent can learn how to plan and execute a multistage data exfiltration attack while being detected less than 40% of the time. From a cybersecurity point of view, this means that a model of an attacker can be learned and improved, and therefore a better model of the defender could be learned in future research.
The main contributions of this paper are:
- a novel model of a decision-making entity (an APT attacker) in an adversarial environment;
- an implementation of RL algorithms for an attacking agent in a custom environment; and
- an analysis of the impact of APT attacker models on the cybersecurity domain.
The paper is structured as follows. Section 2 pro-
vides the motivation and previous work. Section 3
describes the RL environment. Section 4 presents the
RL algorithms; Section 5 presents the setup of the
experiments; Section 6 presents the results and dis-
cusses their impact. The conclusions and future work
are contained in Section 7.
2 MOTIVATION & RELATED WORK
There are two main sources of motivation for study-
ing the behavioral models of attackers in APT attacks
for local networks. First, improving defense mecha-
nisms (algorithms, antivirus systems, etc.) based on
the knowledge of past attacks highlights the need to
better understand the characteristics of nearly optimal
attack behaviors in realistic networks. Second, by cre-
ating and training RL models of the attacker’s behav-
ior, it is possible to optimize future defense mecha-
nisms and the dynamic properties of such systems.
Game theory and RL (Shiva et al., 2010) have
gained traction over the years in modeling attack and
defense mechanisms in many domains, including net-
work security.
Network security problems are inherently complex and require rational decision-making. Game theory
provides mathematical models of strategic interaction
among multiple decision makers (players or agents)
along with algorithms for finding solutions (equilib-
ria) in such scenarios. The potential benefit of apply-
ing game theory to network security is the automa-
tion of the exhaustive threat detection process for net-
work administrators. However, real-world cybersecu-
rity models may have limitations with regard to the
information observed by players. Typically, the de-
fender’s knowledge of the attacker’s strategy and de-
cisions is limited (Patil et al., 2018). This leads to
games with partial observation or incomplete infor-
mation, which are extremely difficult to scale to the
required size of the problem.
In the area of game theory for security, there
has been promising research in honeypot technolo-
gies (Anwar and Kamhoua, 2022). The authors de-
signed an optimal approach for honeypot allocation
by formulating a two-player zero-sum game between
the defender and the attacker, which is played on top
of an attack graph. The defender places honeypots
on machines, while the attacker selects an attack path
through the attack graph, which would lead to the tar-
get machine without being detected. In addition to
solving an effective strategy for honeypot placement
in the network, the authors also experiment with a di-
versity of honeypot configurations. Diversifying the
honeypot configuration ensures that not all honeypots
are discovered if one is compromised; however, this
adds to the operational cost. To automate response to
a cyber attack, (Hammar and Stadler, 2020) investi-
gate methods where strategies evolve without human
intervention and do not require domain knowledge.
The authors model the cyber interaction as a Markov
game and use simulations of self-play where agents
interact and update their strategies based on experi-
ence from previously played games.
Another promising research direction used Proxi-
mal Policy Optimization (PPO) with self-play to solve
a stochastic (Markov) two-player game with sequen-
tial moves between defender and attacker (Du et al.,
2022). The game is played on top of an attack graph,
and the authors show that the performance of a PPO
policy is better than that of a heuristic policy. The ini-
tial results are promising, but the setting used by the
authors is limited to the attack graph with five nodes
and four edges. By contrast, our work deals only with
a single-agent environment.
Attack graphs are helpful, as they can predict
the attacker’s path depending on the vulnerabilities
present in the network. At the same time, defend-
ers can leverage attack graphs to find an effective de-
fense strategy. In particular, (Guo et al., 2021) provide defense solutions through edge blocking in Active Directory-style attack graphs. Another stream of research focuses on the assistance of
attacking tools for better penetration testing or cyber-
training, for example, using Deep Q-Learning (Nicu-
lae et al., 2020). The authors compare Q-Learning,
Extended Classifier Systems (XCS), and Deep Q-
Networks (DQN) to find attacker strategies. To determine the best response to a suspicious user on the network, (Chung et al., 2016) compare variations of Q-Learning in a stochastic game.
3 ENVIRONMENT MODEL
Q-Learning is one of the most widely applied model-free off-policy RL algorithms (Jang et al., 2019a). It allows agents to learn in domains with Markovian properties, which can therefore be modeled as a Markov Decision Process (MDP). Sufficient exploration of the environment is achieved with an ε-greedy policy: an ε-greedy method chooses a uniformly random action with probability ε and the greedy action with probability 1 − ε. The hyperparameter ε is chosen to balance exploration and exploitation, with the intent of maximizing the cumulative reward.
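For illustration, a minimal sketch of ε-greedy action selection in Python (the function and variable names are ours, not taken from the implementation used in this work):

    import random

    def epsilon_greedy(q_values, actions, state, epsilon):
        # With probability epsilon choose a uniformly random action,
        # otherwise choose the action with the highest Q-value in this state.
        if random.random() < epsilon:
            return random.choice(actions)
        # Unseen (state, action) pairs default to 0.0, matching a lazily built Q-table.
        return max(actions, key=lambda a: q_values.get((state, a), 0.0))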
An MDP is used as the underlying model (Sut-
ton and Barto, 2018) as the focus is on training a sin-
gle attacking agent. Such an approach results in the
defender being part of the environment. In real-life
scenarios, successful detection requires several steps,
from placement of the defensive measures, detecting
and generating alerts, to evaluating and addressing
threats. In this work, the defender is modeled as a
stochastic and global part of the environment.
3.1 Network
The computer network used for the definition of the environment represents a small organization with five clients, five servers, and a router that provides Internet access; see figure 1. Each host in the network has Internet access. The router is also a firewall that controls which clients from subnetwork 2 can access the servers in subnetwork 1 (corresponding to the dotted lines in figure 1). Computers can connect to each other if they are in the same subnetwork.
In the environment, we assume that the attacker
has already gained access to one of the clients on the
network. Additionally, the attacker knows the address
of an external C&C server on the Internet. The at-
tacker’s goal is to find and exfiltrate data located in
one of the servers in subnetwork 1.
Figure 1: Network topology with two local subnetworks
and a C&C server on the Internet. The solid black lines
represent direct network connectivity (such as Ethernet ca-
bles). The dotted lines represent logical connections from
clients to servers as allowed by the firewall. In the non-
randomized experiments, the attacker starts in Client 1. In
the experiments with a randomized start, the attacker starts
in one of the clients in subnetwork 2. The IP address of the
C&C server is always known to the attacker.
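For concreteness, the topology of figure 1 could be encoded roughly as follows; the host names and the particular firewall rules are illustrative assumptions, since the paper does not list them explicitly:

    # Illustrative encoding of the topology in figure 1 (all names are hypothetical).
    topology = {
        "subnet1": ["server1", "server2", "server3", "server4", "server5"],
        "subnet2": ["client1", "client2", "client3", "client4", "client5"],
        "internet": ["cc_server"],
    }
    # Firewall rules: which clients in subnetwork 2 may reach which servers in
    # subnetwork 1 (the dotted lines in figure 1); hosts within the same subnetwork
    # can always connect to each other, and every host has Internet access.
    firewall_allow = {
        "client1": ["server3"],
        "client2": ["server1", "server2"],
    }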
3.2 Defender
The defender in our model is an entity present in all clients and servers simultaneously, with an assigned probability of detecting the attacker's actions. Once the attacker is detected, the episode ends and the environment is reset to the initial state. This is represented by a terminal state in the environment. Given that the defender has full network visibility, there is a probability of detection for each action on all clients and servers.
3.3 Attacker
Attackers usually do not have information about the network, so they must compensate for this lack of knowledge by learning through trial and error. Following our assumption, we simulate an attacker who has already gained a foothold in one of the clients in subnetwork 2 (figure 1). This assumption is realistic, as the initial breach can be achieved in many ways given the number of connected devices on a network, and preventing it entirely is extremely hard. Therefore, the attacker's entry is not modeled in our current setup. The attacker's objective is to find the optimal path to the server in subnetwork 1 containing sensitive data, find and exfiltrate this data, and make it accessible on the web. The available actions are the minimal actions required to complete the goal: find hosts, find services, get access, find data, and exfiltrate. The attacker is modeled as a rational agent aiming to behave optimally.
3.4 States
A state is an abstract representation of the environ-
ment from the attacker’s perspective. It contains sev-
eral assets the attacker can use or has discovered with
previous actions. Therefore, the state of the environ-
ment changes based on the actions of the attacker and
the current state. The probabilities p(a|s) represent
the probability of success of the attacker’s action a in
a state s and p(detection|s,a) represents the probabil-
ity of detection given the action a played in the state s.
These success and detection probabilities (table 1) were set based on the evaluations of expert penetration testers, whose domain knowledge was compared and matched against the evaluation of various detection tools for malicious behavior discovery reported in (Huang et al., 2022).
Table 1: Probability of success and detection for each action executed by the attacker in the network.

Action | Success probability | Detection probability
ScanNetwork | 0.9 | 0.2
FindServices | 0.9 | 0.3
ExecuteCodeInService | 0.7 | 0.4
FindData | 0.8 | 0.1
ExfiltrateData | 0.8 | 0.1
Success probabilities are based on known tools and techniques. While network issues cause most action failures, in the case of ExecuteCodeInService other factors, such as service versions and exploit quality, have to be taken into account. Detection probabilities consider the false positives found in real networks with benign traffic by a human player. Some actions, such as ScanNetwork with an ARP scan, are highly successful and barely recognized (off-the-shelf state-of-the-art IDS cannot detect it (Hou et al., 2010)). Often, even if these scans are detected, the alerts are dismissed for the sake of limiting false positives. The same applies to FindData, which is performed locally and is thus nearly undetectable, and to ExfiltrateData, which, when done correctly, is known to be extremely hard to distinguish from benign traffic.
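A minimal sketch of how a single environment step could combine the success and detection probabilities of Table 1 (the structure and names are our illustration, not the authors' implementation):

    import random

    # Success and detection probabilities from Table 1.
    ACTION_PROBS = {
        "ScanNetwork":          (0.9, 0.2),
        "FindServices":         (0.9, 0.3),
        "ExecuteCodeInService": (0.7, 0.4),
        "FindData":             (0.8, 0.1),
        "ExfiltrateData":       (0.8, 0.1),
    }

    def resolve_action(action):
        # Returns (succeeded, detected) for a single attacker action.
        # Detection is sampled independently of success, since the defender
        # observes every action on every host.
        p_success, p_detect = ACTION_PROBS[action]
        succeeded = random.random() < p_success
        detected = random.random() < p_detect
        return succeeded, detected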
At each time step, the following information is part of the state (a minimal sketch of such a state object is given after the list):
- the set of networks the attacker has discovered;
- the set of hosts the attacker has discovered;
- the set of hosts that the attacker controls;
- the set of services the attacker has discovered on each host; and
- the set of data the attacker has discovered in a host.
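A minimal sketch of such a state object, assuming it is kept hashable so that it can be used as a Q-table key (field names are illustrative):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class State:
        # Attacker's view of the environment as sets of discovered assets.
        # Frozensets keep the state hashable so it can be used as a Q-table key.
        known_networks: frozenset = frozenset()
        known_hosts: frozenset = frozenset()
        controlled_hosts: frozenset = frozenset()
        known_services: frozenset = frozenset()   # (host, service) pairs
        known_data: frozenset = frozenset()       # (host, data) pairs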
Since the states consist of sets of assets, we can follow the well-known STRIPS representation originally designed for planning (Fikes and Nilsson, 1971). STRIPS describes transitions in a system as operators, which are applicable if their preconditions are met. In the original formulation, both add and delete effects can be specified for each operator. However, in our approach, we completely omit the delete effect, which results in a relaxed problem representation (Bonet and Geffner, 2001). Problem relaxation is a commonly used method in a variety of AI areas, and such an approach simplifies the problem of traversing the state space.
3.5 Actions
The attacker's actions follow a subset of the techniques for adversary behavior listed in Mitre ATT&CK (https://attack.mitre.org/). As we represent only one type of goal in this model, data exfiltration, only the subset of Mitre techniques related to data exfiltration is used:
1. active scanning:
(a) find computers in the network
(b) find services run on the hosts in the network
(c) find data in the computer
2. attack service to execute code; and
3. exfiltrate data to the Internet.
The attacker in our model follows a five-step action sequence, represented in Table 2, to reach its goal.
Table 2: List of actions and their effects on the network, leading to a change in state.

Action | Description | Preconditions | Effects
ScanNetwork | Scans the complete range of a given network | network + mask | extends 'known hosts'
FindServices | Scans a given host for running services | host IP | extends 'known services' with host:service pairs
ExecuteCodeInService | Runs an exploit in a service to gain control of a host | host:service | extends 'controlled hosts'
FindData | Runs code to discover data in a controlled host | host IP | extends 'known data' with host:data pairs
ExfiltrateData | Moves data from one controlled host to another | host:data:host | extends 'known data' with 'target:data'
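Following Table 2, each action can be treated as a relaxed STRIPS operator: it is applicable when its preconditions are contained in the current state, and its effect only adds assets. A hedged sketch for FindServices, reusing the State object sketched in section 3.4 (the discovered services are placeholder values):

    def apply_find_services(state, host_ip):
        # Relaxed STRIPS-style operator for FindServices:
        #   precondition: the host is already in 'known hosts';
        #   add effect:   extend 'known services' with (host, service) pairs.
        if host_ip not in state.known_hosts:
            return state                                      # precondition not met
        discovered = {(host_ip, "ssh"), (host_ip, "smb")}     # hypothetical scan result
        return State(
            known_networks=state.known_networks,
            known_hosts=state.known_hosts,
            controlled_hosts=state.controlled_hosts,
            known_services=state.known_services | discovered,  # add only, no delete effect
            known_data=state.known_data,
        )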
3.6 Rewards
The reward is an incentive that the agent receives with respect to the state-action pair. In our model, the reward of the agent is constructed as: −1 for every action taken, −50 if the action is detected, and +100 if the goal state is reached.
The small negative reward per action is intended
to motivate the agent to find the shortest path to the
goal. The +100 reward for the achievement of the
goal allows the attacker to take actions with a higher
expected detection probability if they lead to a higher
expected reward.
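A minimal sketch of this reward scheme (whether the step cost is also applied on terminal steps is our assumption):

    def step_reward(goal_reached, detected):
        # -1 for every action taken, -50 if the action is detected,
        # and +100 if the goal state is reached.
        r = -1
        if detected:
            r -= 50
        elif goal_reached:
            r += 100
        return r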
3.7 Implementation
The representation of a state, as described in section 3.4, allows the environment to be modified without the need to retrain the agent from scratch. This differentiates our environment model from others and offers a higher degree of modularity for various cybersecurity scenarios. Instead of allocating the complete Q-table prior to training, our agents create the Q-values dynamically, saving both memory and time during training.
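For example, a Q-table backed by a default dictionary only stores values for the state-action pairs that were actually visited, which is what allows the state space to grow as new assets are discovered (a sketch under our naming):

    from collections import defaultdict

    # Q-values are created lazily: an unseen (state, action) pair defaults to 0.0,
    # so no table covering the whole state space has to be allocated before training.
    q_table = defaultdict(float)

    def q_value(state, action):
        return q_table[(state, action)]   # the entry is created on first access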
4 LEARNING AGENTS
To train and evaluate the attacker’s performance, we
use Q-Learning (Jang et al., 2019a) and its vari-
ants: Naive Q-Learning and Double Q-Learning. Q-
Learning is a reinforcement learning algorithm that
approximates the optimal state-action value function
independently of the policy being followed. It is an
off-policy algorithm that separates learning from the
current acting policy by updating the Q-value Q(s,a),
which is an indication of how good a state-action pair
is. The equation for the Q-value update is:
    Q(s,a) := Q(s,a) + α (R_{t+1} + γ V_t(s') − Q(s,a)),    (1)
where α ∈ [0,1] is the learning rate and γ ∈ [0,1] is the discount factor that captures the concept of depreciation: a value of γ closer to 0 means that the current reward is preferred over future rewards.
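A direct transcription of equation (1), where we take V_t(s') to be the maximum Q-value over the actions available in s' (the paper does not define V_t explicitly, so this is an assumption):

    def q_learning_update(q, state, action, reward, next_state, actions, alpha, gamma):
        # One tabular Q-Learning step following equation (1),
        # with V_t(s') taken as the maximum Q-value over the actions in s'.
        v_next = max(q.get((next_state, a), 0.0) for a in actions)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (reward + gamma * v_next - old)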
In Naive Q-Learning, the learning rate is partially
allocated to the previous result to combine the knowl-
edge of the past history during learning, the actual
immediate reward in the current iteration, and the ex-
pected future reward (Chung et al., 2016). This leads
to the following variation of equation (1):
    Q(s,a) := α Q(s,a) + (1 − α)(R_{t+1} + γ V_t(s'))    (2)
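The corresponding update for equation (2), in which α weights the old estimate and 1 − α weights the new target (a sketch with our naming):

    def naive_q_update(q, state, action, reward, next_state, actions, alpha, gamma):
        # Naive Q-Learning step following equation (2): the old estimate is weighted
        # by alpha and the new target R + gamma * V(s') by (1 - alpha).
        v_next = max(q.get((next_state, a), 0.0) for a in actions)
        old = q.get((state, action), 0.0)
        q[(state, action)] = alpha * old + (1 - alpha) * (reward + gamma * v_next)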
Double Q-Learning (Hasselt, 2010; Jang et al., 2019b) proposes learning two Q-functions instead of one. Each Q-function gets its update from the other for the next state, which provides an unbiased estimate of the value of the action. Action selection is then performed by averaging or adding the two Q-values for each action and then performing ε-greedy action selection with the resulting Q-values. In this paper, action selection is performed by adding the two Q-values before the ε-greedy selection:
    Q_A(s,a) := Q_A(s,a) + α (R + γ Q_B(s',a') − Q_A(s,a))    (3)

    Q_B(s,a) := Q_B(s,a) + α (R + γ Q_A(s',a') − Q_B(s,a))    (4)
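A sketch of one Double Q-Learning step together with the summed-Q action selection described above; updating one of the two tables chosen at random each step follows Hasselt (2010) and is our assumption about the implementation:

    import random

    def double_q_update(q_a, q_b, state, action, reward, next_state, actions, alpha, gamma):
        # One Double Q-Learning step: one of the two tables is chosen at random and
        # updated; the greedy action of the updated table is evaluated with the other
        # table, as in equations (3) and (4).
        if random.random() < 0.5:
            a_greedy = max(actions, key=lambda a: q_a.get((next_state, a), 0.0))
            old = q_a.get((state, action), 0.0)
            q_a[(state, action)] = old + alpha * (reward + gamma * q_b.get((next_state, a_greedy), 0.0) - old)
        else:
            a_greedy = max(actions, key=lambda a: q_b.get((next_state, a), 0.0))
            old = q_b.get((state, action), 0.0)
            q_b[(state, action)] = old + alpha * (reward + gamma * q_a.get((next_state, a_greedy), 0.0) - old)

    def select_action(q_a, q_b, state, actions, epsilon):
        # Epsilon-greedy over the sum of the two Q-tables, as used in this paper.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q_a.get((state, a), 0.0) + q_b.get((state, a), 0.0))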
The other two learning agents also use ε-greedy as the action selection criterion, in accordance with the original papers.
5 EXPERIMENT SETUP
Three different scenarios were used to train the learn-
ing agents: specific attacker position, random attacker
position, and random target server to attack.
In the first scenario, the attacker is placed on client 1 in subnetwork 2 (figure 1). We define a client as a regular device on the network used for work, and a server as a device that holds data and offers services accessed by the clients. The attacker's goal is to reach the target server, specified as server 3 in subnetwork 1, and to exfiltrate the data from the target server to the C&C server outside the local network.
There are five clients in subnetwork 2, and in reality, any connected device within the network is susceptible to an attack; therefore, for the second scenario, we randomly assign the starting position of the attacker. This was done to compare the performance of the learning agents and see how they adapt to randomness in the starting position. In addition to randomizing the starting position, we also randomized the target server for data exfiltration, which constitutes our third scenario.
In all three scenarios, reaching the goal state requires at least 5 successful actions; if the agent exceeds the limit of 25 actions per episode, the interaction is terminated.
The defender in all 3 scenarios is an entity with
unlimited visibility and is present in all hosts, that is,
every action can be detected with a predefined proba-
bility. Additionally, we assume that all services run-
ning on the hosts are exploitable and that a connection
to the Internet is available on all hosts.
The learning parameters for each algorithm are presented in Table 3. Experiments start with a random attacker, which picks an action uniformly at random. The Q-Learning and Double Q-Learning agents were trained with a learning rate of 0.3, while the Naive Q-Learning agent was trained with a learning rate of 0.8. The action selection parameter ε
was kept at 0.2 for all the agents; however, Double Q-
Learning used a linearly decaying ε from 0.2 to 0.05.
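The linearly decaying ε used for Double Q-Learning can be implemented, for instance, as follows; decaying over the whole training run is our assumption, since the paper only gives the start and end values:

    def linear_epsilon(episode, total_episodes, eps_start=0.2, eps_end=0.05):
        # Linearly decay epsilon from eps_start to eps_end over the training run.
        fraction = min(episode / total_episodes, 1.0)
        return eps_start + fraction * (eps_end - eps_start)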
In all experiments, we measured the win rate, the detection rate, and the mean return of the episodes. The win rate represents the percentage of interactions that were successful for the attacker, that is, the number of times the attacker was able to reach the goal state and exfiltrate the data in 10 000 episodes. The detection rate represents the percentage of interactions that were detected and resulted in the attacker receiving a reward of −50.
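These metrics can be computed directly from the episode outcomes, for example (the outcome encoding is ours):

    def summarize(episodes):
        # episodes: list of (reached_goal, detected, episodic_return) tuples.
        n = len(episodes)
        win_rate = 100.0 * sum(1 for goal, _, _ in episodes if goal) / n
        detection_rate = 100.0 * sum(1 for _, det, _ in episodes if det) / n
        mean_return = sum(ret for _, _, ret in episodes) / n
        return win_rate, detection_rate, mean_return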
Table 3: Training parameters: the Q-Learning and Double Q-Learning agents were trained for 10 000 episodes, while Naive Q-Learning was trained for 5 000 episodes. The discount factor γ was kept at 0.9 for all learning agents.

Algorithm | α | ε | γ | No. of episodes
Random | - | - | - | -
Q-Learning | 0.3 | 0.2 | 0.9 | 10 000
Naive Q-Learning | 0.8 | 0.2 | 0.9 | 5 000
Double Q-Learning | 0.3 | 0.2 | 0.9 | 10 000
6 EXPERIMENTAL RESULTS
The following results were obtained in the first scenario of the experiment, where the attacker's position in the network was fixed. Table 4 summarizes the performance of the different learning agents. The random attacker, without any knowledge of the network and without any strategy, has a detection rate of 99.58%, while the Double Q-Learning attacker has a detection rate of 33%. The Q-Learning and Naive Q-Learning agents have similar detection rates.
Table 4: Performance comparison of the learning agents with a fixed attacker starting in client 1. Q-Learning and Double Q-Learning were trained with 10 000 episodes, while Naive Q-Learning was trained with 5 000 episodes.

Algorithm | Winning rate (%) | Detection rate (%) | Mean return
Random | 0.48 | 99.58 | -53.03
Q-Learning | 66.4 | 40.4 | 43.94
Naive Q-Learning | 66.91 | 40.19 | 43.94
Double Q-Learning | 74.0 | 33.0 | 54.61
Randomizing the starting position decreases the win rate and increases the detection rate for all learning agents, as shown in table 5. Among all learning agents, the Naive Q-Learning agent was affected the most by the randomness of the starting position: its detection rate increased from 40.4% to 50.78%.
Table 5: Comparison of performance for the learning agents in a scenario with the attacker's starting position randomized. Q-Learning and Double Q-Learning were trained with 10 000 episodes, while Naive Q-Learning was trained with 5 000 episodes.

Algorithm | Winning rate (%) | Detection rate (%) | Mean return
Random | 0.34 | 99.48 | -54.04
Q-Learning | 65.4 | 39.27 | 41.97
Naive Q-Learning | 54.27 | 50.78 | 25.59
Double Q-Learning | 68.9 | 36.8 | 47.58
Table 6: Comparison of learning agents in a scenario where the attacker's starting point and target server were randomized. All algorithms were trained on 10 000 episodes.

Algorithm | Winning rate (%) | Detection rate (%) | Mean return
Q-Learning | 53.3 | 53 | 23.45
Naive Q-Learning | 61.8 | 44.1 | 36.8
Double Q-Learning | 64.9 | 41.7 | 41.2
Figure 2: Comparison of the mean cumulative reward of
agents during the learning process in the scenario with de-
fender and randomized starting position for the attacker.
6.1 Analysis of Results
We compared how agents with varying parameters
learned a policy in a network with ten hosts in the
presence of a defender with full visibility. The detec-
tion probability was nonzero for actions at all clients
and servers. When comparing the win rate and the
detection rate for all learning agents, it is clear that
Double Q-Learning outperforms all other agents in
all scenarios. In such an agent, two Q-functions are trained from different episodes, which makes the training more robust, and a sum of the Q-functions is used during inference. This avoids the overestimation
Figure 3: Comparison of the winning rate of agents dur-
ing the learning process in the scenario with defender and
randomized starting position.
Figure 4: Comparison of the detection rate of agents dur-
ing the learning process in the scenario with defender and
randomized starting position.
bias of Q-Learning and leads to better training stability even in a noisy environment. The Q-Learning attacker and the Naive Q-Learning attacker have the same performance in the first scenario, where the starting point was fixed. This is due to the distribution of the learning rate according to equations (1) and (2): Naive Q-Learning weights the old estimate by α and the new estimate by 1 − α, so a learning rate of 0.8 gives results similar to standard Q-Learning with a learning rate of 0.2. However, the performance of Naive Q-Learning decreased when the starting position was randomized. This is attributed to the weighting of the Q-value update rule shown in equation (2): with negative rewards, the update affects the Q-value more than in standard Q-Learning because of how α splits the update. Although this can be beneficial in the case of high positive rewards, the results show that this approach lacks adaptability in a stochastic environment.
Figures 3 and 4 show that Double Q-Learning outperforms the other two agents in terms of winning and detection rates. The high variance of the mean returns, shown in figure 2, is the result of the stochastic environment and the reward distribution described in section 3.6. The graphs also show that even though Double Q-Learning performs badly in the beginning, over time, as the number of episodes increases and the state-action values are updated, it outperforms the other two learning agents. In particular, even if the agent's policy is optimal, it cannot influence the detection and the subsequent reward of −50. Therefore, the three agents share a similarly high variance in mean returns but differ significantly in the metrics that focus on reaching the goal, in which Double Q-Learning shows the most promising results.
Despite the random exploration (ε-greedy) used by the three Q-Learning-based agents, the results from the first and second scenarios show that the environment and the goal are non-trivial and unsolvable for an agent performing purely random actions, which reached the goal in fewer than 1% of the cases. For that reason, the Random agent was excluded from the comparison in figures 2, 3 and 4 and from the third scenario.
The results of our experiments show that despite
the defender having full visibility of the network, a
rational attacker was still able to reach the target and
exfiltrate data. From a security perspective, this in-
dicates that the defensive tools in the network need
to be improved so as to prevent the attacker’s lateral
movement in the system.
7 CONCLUSION
In this paper, we propose a Q-Learning-based attack-
ing agent capable of performing data exfiltration.
Our results show that even though all three learning agents can find meaningful policies, Double Q-Learning outperforms the others and provides the most stable training. It reached the goal in 70% of the interactions while being detected in roughly 37% of them. This shows that despite a globally present defender, a rational attacker could still reach the target.
The initial success and detection probabilities were set based on expert knowledge; however, our results clearly show that there is room for improvement in the detection capability of the defender. The high success probability of the attacker's actions highlights the need for a robust defense mechanism capable of detecting a stealthy attacker. This provides a foundation for studying and improving attacker techniques in order to increase the defense capability of the network.
Currently, the method is limited to small or
medium-sized networks. Although the interaction
and world representation model can be easily extended to a more complex setup in terms of network size and action space, the scalability and computational feasibility of such extensions have yet to be evaluated. Therefore, the natural direction for future research is to expand our approach towards larger environments, which will require subsequent scalability testing due to their complexity. We also plan to incorporate other types of cyber attacks from the Mitre taxonomy and to model the defender as a rational entity with its own set of actions in the interaction. In addition, we plan to test the performance of our agent in a simulated environment.
Along with increasing the complexity of the environment, more complex goals for the attacker are also in the pipeline, which will require more reconnaissance from the agent.
ACKNOWLEDGMENTS
The authors acknowledge support from the Research Center for Informatics (CZ.02.1.01/0.0/0.0/16_019/0000765) and from the Strategic Support for the Development of Security Research in the Czech Republic 2019–2025 (IMPAKT 1) program of the Ministry of the Interior of the Czech Republic under No. VJ02010020, AI-Dojo: Multi-agent testbed for the research and testing of AI-driven cyber security technologies.
REFERENCES
Anwar, A. H. and Kamhoua, C. A. (2022). Cyber deception
using honeypot allocation and diversity: A game theo-
retic approach. In 2022 IEEE 19th Annual Consumer
Communications & Networking Conference (CCNC).
Bonet, B. and Geffner, H. (2001). Planning as heuristic
search. Artificial Intelligence, 129(1):5–33.
Chung, K., Kamhoua, C. A., Kwiat, K. A., Kalbarczyk,
Z. T., and Iyer, R. K. (2016). Game theory with learn-
ing for cyber security monitoring. In 2016 IEEE 17th
International Symposium on High Assurance Systems
Engineering (HASE), pages 1–8.
Dehghan, M., Sadeghiyan, B., Khosravian, E., Moghad-
dam, A. S., and Nooshi, F. (2022). ProAPT: Projection
of APT Threats with Deep Reinforcement Learning.
arXiv:2209.07215 [cs].
Drašar, M., Moskal, S., Yang, S., and Zat'ko, P. (2020). Session-level adversary intent-driven cyberattack simulator. In 2020 IEEE/ACM 24th International Symposium on Distributed Simulation and Real Time Applications (DS-RT), pages 1–9.
Du, Y., Song, Z., Milani, S., Gonzales, C., and Fang, F.
(2022). Learning to play an adaptive cyber decep-
tion game. In The 13th Workshop on Optimization
and Learning in Multiagent Systems, AAMAS.
Durkota, K., Lisy, V., Kiekintveld, C., Bosansky, B., and
Pechoucek, M. (2016). Case studies of network de-
fense with attack graph games. IEEE Intelligent Sys-
tems.
ENISA (2022). ENISA threat landscape for ransomware
attacks. Technical report, ENISA, LU.
Fikes, R. E. and Nilsson, N. J. (1971). Strips: A new ap-
proach to the application of theorem proving to prob-
lem solving. Artificial Intelligence, 2(3):189–208.
Guo, M., Li, J., Neumann, A., Neumann, F., and Nguyen,
H. (2021). Practical fixed-parameter algorithms for
defending active directory style attack graphs.
Hammar, K. and Stadler, R. (2020). Finding effective se-
curity strategies through reinforcement learning and
self-play. In 2020 16th International Conference on
Network and Service Management (CNSM). IEEE.
Hasselt, H. (2010). Double q-learning. Advances in neural
information processing systems, 23.
Hou, X., Jiang, Z., and Tian, X. (2010). The detection and
prevention for arp spoofing based on snort. In 2010 In-
ternational Conference on Computer Application and
System Modeling (ICCASM 2010), volume 5, pages
V5–137–V5–139.
Huang, Y.-T., Lin, C. Y., Guo, Y.-R., Lo, K.-C., Sun, Y. S.,
and Chen, M. C. (2022). Open source intelligence for
malicious behavior discovery and interpretation. IEEE
Transactions on Dependable and Secure Computing.
Jang, B., Kim, M., Harerimana, G., and Kim, J. W. (2019a).
Q-learning algorithms: A comprehensive classifica-
tion and applications. IEEE Access.
Jang, B., Kim, M., Harerimana, G., and Kim, J. W. (2019b).
Q-learning algorithms: A comprehensive classifica-
tion and applications. IEEE access.
Liu, P., Zang, W., and Yu, M. (2005). Incentive-based mod-
eling and inference of attacker intent, objectives, and
strategies. ACM Transactions on Information and Sys-
tem Security (TISSEC), 8(1):78–118.
Mitchell, R. and Healy, B. (2018). A game theoretic
model of computer network exploitation campaigns.
In 2018 IEEE 8th Annual Computing and Communi-
cation Workshop and Conference (CCWC).
Moskal, S., Yang, S. J., and Kuhl, M. E. (2018). Cyber
threat assessment via attack scenario simulation us-
ing an integrated adversary and network modeling ap-
proach. Journal of Defense Modeling and Simulation.
Niculae, S., Dichiu, D., Yang, K., and Bäck, T. (2020). Automating penetration testing using reinforcement learning.
Patil, A., Bharath, S., and Annigeri, N. (2018). Applica-
tions of game theory for cyber security system: A sur-
vey. International Journal of Applied Engineering Re-
search, 13(17):12987–12990.
Shiva, S., Roy, S., and Dasgupta, D. (2010). Game theory
for cyber security. In Proceedings of the Sixth Annual
Workshop on Cyber Security and Information Intelli-
gence Research, pages 1–4.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.