Probabilistic Model Checking of Stochastic Reinforcement Learning
Policies
Dennis Gross and Helge Spieker
Simula Research Laboratory, Norway
Keywords:
Reinforcement Learning, Model Checking, Safety.
Abstract:
We introduce a method to verify stochastic reinforcement learning (RL) policies. This approach is compatible
with any RL algorithm as long as the algorithm and its corresponding environment collectively adhere to the
Markov property. In this setting, the future state of the environment should depend solely on its current state
and the action executed, independent of any previous states or actions. Our method integrates a verification
technique, referred to as model checking, with RL, leveraging a Markov decision process, a trained RL policy,
and a probabilistic computation tree logic (PCTL) formula to build a formal model that can be subsequently
verified via the model checker Storm. We demonstrate our method’s applicability across multiple benchmarks,
comparing it to baseline methods called deterministic safety estimates and naive monolithic model checking.
Our results show that our method is suited to verify stochastic RL policies.
1 INTRODUCTION
Reinforcement Learning (RL) has revolutionized the
industry, enabling the creation of agents that can
outperform humans in sequential decision-making
tasks (Mnih et al., 2013a; Silver et al., 2016; Vinyals
et al., 2019).
In general, an RL agent aims to learn a near-
optimal policy to achieve a fixed objective by taking
actions and receiving feedback through rewards and
observations from the environment (Sutton and Barto,
2018). We call a policy memoryless if it decides based only on the current observation. A determin-
istic policy selects the same action in response to a
specific observation every time, whereas a stochastic
policy might select different actions when faced with
the same observation. A popular choice is to employ
a function approximator like a neural network (NN)
to model such policies.
RL Problem. Unfortunately, learned policies are
not guaranteed to avoid unsafe behavior (García and Fernández, 2015; Carr et al., 2023; Alshiekh et al., 2018). Generally, rewards lack the expressiveness to
encode complex safety requirements (Littman et al.,
2017; Hahn et al., 2019; Hasanbeig et al., 2020; Vam-
plew et al., 2022).
To resolve the issues mentioned above, verifica-
tion methods like model checking (Baier and Ka-
toen, 2008) are used to reason about the safety of
RL, see for instance (Wang et al., 2020; Hasanbeig et al., 2020; Brázdil et al., 2014; Hahn et al., 2019). Model checking is not limited to properties that can be expressed by rewards but supports the broader range of properties expressible in probabilistic computation tree logic (PCTL; Hansson and Jonsson,
1994). At its core, model checking is a formal ver-
ification technique that uses mathematical models to
verify the correctness of a system concerning a given
(safety) property.
Hard to Verify. While existing research addresses the verification of stochastic RL policies (Bacci and Parker, 2020; Bacci et al., 2021; Bacci and Parker, 2022), these verification methods do not scale well to neural network policies with many layers and neurons.
Approach. This paper presents a method for veri-
fying memoryless stochastic RL policies independent
of the number of NN layers, neurons, or the specific
memoryless RL algorithm.
Our approach hinges on three inputs: a Markov
Decision Process (MDP) modeling the RL environ-
ment, a trained RL policy, and a probabilistic compu-
tation tree logic (PCTL) formula (Hansson and Jons-
son, 1994) specifying the safety measurement. Using
an incremental building process (Gross et al., 2022;
Cassez et al., 2005; David et al., 2015) for the for-
mal verification, we build only the portion of the MDP that is reachable under the trained policy. Utilizing Storm as
our model checker (Hensel et al., 2022), we assess
the policy’s safety measurement, leveraging the con-
structed model and the PCTL formula.
Our method is evaluated across various RL bench-
marks and compared to an alternative approach that
only builds the part of the MDP that is reachable via
the highest probability actions and an approach called
naive monolithic model checking. The results confirm
that our approach is suitable for verifying stochastic
RL policies.
2 RELATED WORK
There exists a variety of related work on verifying trained RL policies (Eliyahu et al., 2021; Kazak et al., 2019; Corsi et al., 2021; Dräger et al., 2015; Zhu et al., 2019; Jin et al., 2022). Bacci and
Parker (2020) propose MOSAIC (MOdel SAfe Intel-
ligent Control), which combines abstract interpreta-
tion and probabilistic verification to establish prob-
abilistic guarantees of safe behavior. The authors
model the system as a continuous-space discrete-time
Markov process and focus on finite-horizon safety
specifications. They tackle the challenges of infi-
nite initial configurations and policy extraction from
NN representation by constructing a finite-state ab-
straction as an MDP and using symbolic analysis of
the NN. The main difference from our approach lies in
constructing the induced discrete-time Markov chain
(DTMC). Our approach verifies a specific probabilis-
tic policy derived from an MDP. In contrast, the MO-
SAIC approach verifies safety guarantees by creating
a finite-state abstraction of the MDP that accommo-
dates the NN-based policy and focuses on finding safe
regions of initial configurations. The abstraction must
also extract the policy from its NN representation.
Our approach only queries the trained policy and is
not limited by the time horizon.
Bacci et al. (2021) presented the first technique for
verifying if a NN policy controlling a dynamical sys-
tem maintains the system within a safe region for an
unbounded time. The authors use template polyhedra
to overapproximate the reach set and formulate the
problem of computing template polyhedra as an opti-
mization problem. They introduce a MILP (Mixed-
Integer Linear Programming) encoding for a sound
abstraction of NNs with ReLU activation functions
acting over discrete-time systems. Their method sup-
ports linear, piecewise linear, and non-linear systems
with polynomial and transcendental functions. Their safety verification checks agents against a model of the environment. The main difference from our approach
is that we are independent of the memoryless policy
architecture and do not need to encode the verification
problem as a MILP problem.
Bacci and Parker (2022) define a formal model
of policy execution using continuous-state, finite-
branching discrete-time Markov processes and build
and solve sound abstractions of these models. The
paper proposes a new abstraction based on interval
Markov decision processes (IMDPs) to address the
challenge of probabilistic policies specifying different
action distributions across states. The authors present
methods to construct IMDP abstractions using tem-
plate polyhedra and MILP for reasoning about the NN
policy encoding and the RL agent’s environment. Fur-
thermore, they introduce an iterative refinement ap-
proach based on sampling to improve the precision of
the abstractions.
Gross et al. (2022) incrementally build a formal
model of the trained RL policy and the environment.
They then use the model checker Storm to verify the
policy’s behavior. They support memoryless stochas-
tic policies by always choosing the action with the
highest probability (deterministic safety estimation).
In comparison, we build the model based on all pos-
sible actions with a probability greater than zero.
3 BACKGROUND
This section describes probabilistic model checking and reviews the relevant RL background.
3.1 Probabilistic Model Checking
A probability distribution over a set X is a function µ: X → [0, 1] with Σ_{x∈X} µ(x) = 1. The set of all distributions on X is denoted Distr(X).
Definition 1 (Markov Decision Process). A Markov decision process (MDP) is a tuple M = (S, s₀, Act, Tr, rew, AP, L) where S is a finite, nonempty set of states; s₀ ∈ S is an initial state; Act is a finite set of actions; Tr: S × Act → Distr(S) is a partial probability transition function; rew: S × Act → ℝ is a reward function; AP is a set of atomic propositions; L: S → 2^AP is a labeling function that assigns atomic propositions to states.
We employ a factored state representation where each state s is a vector of features (f₁, f₂, ..., f_d), where each feature f_i ∈ ℤ for 1 ≤ i ≤ d (d is the dimension of the state). Furthermore, we denote Tr(s, a)(s′) by Tr(s, a, s′), and Tr(s, a, s′) can be written as s →ᵃ s′. If |Act(s)| = 1, we can omit the action a in Tr(s, a, s′) and write Tr(s, s′). Also, we define NEIGH(s) as the set of all states s′ ∈ S with Tr(s, s′) ≠ 0.
The available actions in s ∈ S are Act(s) = {a ∈ Act | Tr(s, a) ≠ ⊥}, where Tr(s, a) = ⊥ means that action a has no transition at state s (action a is not available in state s). An MDP with only one action per state (∀s ∈ S: |Act(s)| = 1) is a DTMC D.
Definition 2 (Discrete-time Markov Chain). A discrete-time Markov chain (DTMC) is an MDP such that |Act(s)| = 1 for all states s ∈ S. We denote a DTMC as a tuple M_D = (S, s₀, Tr, rew) with S, s₀, rew as in Definition 1 and transition probability function Tr: S → Distr(S).
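To fix intuition, the following minimal Python sketch shows one way these objects can be represented in code. This is an illustrative assumption on our part, not the data structures used by the tool chain in this paper.

```python
# Sketch of the factored MDP/DTMC structures from Definitions 1 and 2.
# Names (State, Mdp, Dtmc) are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = Tuple[int, ...]   # factored state: a vector of integer features
Action = str

@dataclass
class Mdp:
    initial_state: State
    # transitions[s][a] is a distribution over successor states (sums to 1)
    transitions: Dict[State, Dict[Action, Dict[State, float]]]
    rewards: Dict[Tuple[State, Action], float] = field(default_factory=dict)
    labels: Dict[State, frozenset] = field(default_factory=dict)

    def actions(self, s: State) -> List[Action]:
        """Act(s): actions with a defined transition distribution in s."""
        return list(self.transitions.get(s, {}))

@dataclass
class Dtmc:
    initial_state: State
    # transitions[s] is a single distribution over successors (|Act(s)| = 1)
    transitions: Dict[State, Dict[State, float]]
```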
When an agent is executed in an environment modeled as an MDP, a policy is the rule the agent follows to decide which action to take, based on its current observation of the environment's state, in order to optimize its objective.
Definition 3 (Deterministic Policy). A memoryless deterministic policy for an MDP M = (S, s₀, Act, Tr, rew) is a function π: S → Act that maps a state s ∈ S to an action a ∈ Act.
Applying a policy π to an MDP M yields an induced DTMC D in which all nondeterminism within the system is resolved.
Definition 4 (Stochastic Policy). A memoryless stochastic policy for an MDP M = (S, s₀, Act, Tr, rew) is a function π̂: S → Distr(Act) that maps a state s ∈ S to a distribution over actions Distr(Act).
Permissive policies facilitate a richer exploration
of the state space by allowing the selection of multi-
ple actions at each observation, potentially uncover-
ing nuanced policies.
Definition 5 (Permissive Policy). A permissive policy τ: S → 2^Act selects multiple actions in every state.
We specify the properties of a DTMC via the spec-
ification language PCTL (Wang et al., 2020).
Definition 6 (PCTL Syntax). Let AP be a set of atomic propositions. The following grammar defines a state formula: Φ ::= true | a | Φ₁ ∧ Φ₂ | ¬Φ | P∼p(φ) | Pmax∼p(φ) | Pmin∼p(φ), where a ∈ AP, ∼ ∈ {<, >, ≤, ≥}, p ∈ [0, 1] is a threshold, and φ is a path formula formed according to the following grammar: φ ::= XΦ | φ₁ U φ₂ | φ₁ F^{θ_i t} φ₂ with θ_i ∈ {<, ≤}.
We further define “eventually” as ◊φ := true U φ.
Figure 1: This diagram represents an RL system in which an agent interacts with an environment. The agent receives a state and a reward from the environment based on its previous action. The agent then uses this information to select the next action, which it sends to the environment.

For MDPs, PCTL formulae are interpreted over the states of the DTMC induced by an MDP and a policy. In a slight abuse of notation, we use PCTL state formulas to denote probability values. That is, we sometimes write P(φ), omitting the threshold ∼ p. For instance, in this paper, P(◊ coll) denotes the reachability probability of eventually running into a collision. There exist a variety of model checking algorithms for verifying PCTL properties (Courcoubetis and Yannakakis, 1988, 1995). PRISM (Kwiatkowska et al., 2011) and Storm (Hensel et al., 2022) offer efficient and mature tool support for verifying probabilistic systems.
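As a small worked example of what such a reachability property evaluates to (our own illustration, with numbers chosen to mirror the Freeway result in Section 5), consider a DTMC with initial state s₀, an absorbing state labeled goal, and an absorbing sink state, where Tr(s₀, goal) = 0.7 and Tr(s₀, sink) = 0.3. The reachability probabilities x_s = P_s(◊ goal) satisfy the standard linear equations x_goal = 1, x_sink = 0, and

x_{s₀} = Tr(s₀, goal) · x_goal + Tr(s₀, sink) · x_sink = 0.7 · 1 + 0.3 · 0 = 0.7.

Solving such equation systems (or running value iteration over them) is precisely what a probabilistic model checker does on the induced DTMC.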
3.2 Reinforcement Learning
For MDPs with many states and transitions, obtaining optimal policies π* is difficult. The standard learning goal for RL is to learn a policy π in an MDP such that π maximizes the accumulated discounted reward, that is, E[Σ_{t=0}^{N} γ^t R_t], where γ with 0 ≤ γ ≤ 1 is the discount factor, R_t is the reward at time t, and N is the total number of steps (see Figure 1). In RL, an agent learns through interaction with its environment to maximize a reward signal, via RL algorithms such as deep Q-learning (Mnih et al., 2013b). Note that rewards lack the expressiveness to encode complex safety requirements, which can, however, be specified via PCTL (Littman et al., 2017; Hahn et al., 2019; Hasanbeig et al., 2020; Vamplew et al., 2022). Therefore, model checking is needed to verify that trained RL policies satisfy such requirements.
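The learning objective can be made concrete with a small sketch (not the authors' code): a Monte-Carlo estimate of E[Σ_{t=0}^{N} γ^t R_t] from sampled episodes, each given as a list of rewards.

```python
# Illustrative estimate of the discounted-return objective.
from typing import List

def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma^t * R_t over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_objective(episodes: List[List[float]], gamma: float = 0.99) -> float:
    """Average discounted return over sampled episodes."""
    return sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)

# Example: two short episodes with per-step reward 1.
print(estimate_objective([[1, 1, 1], [1, 1]], gamma=0.9))
```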
4 METHODOLOGY
In this section, we introduce a method for verifying the safety of stochastic RL policies via probabilistic model checking with respect to a safety measurement. Given the MDP M of the environment, a trained RL policy π̂, and a safety measurement m, the general workflow is as follows and will be detailed below:
1. Induced MDP Construction: We construct a new MDP using the original MDP and the trained RL policy π̂. This new MDP includes only the states that are reachable by the policy and actions
that the policy considers, i.e., those for which the policy selection probability is greater than zero.
2. Induced DTMC Transformation: The newly constructed MDP M̂ is transformed into an induced DTMC, denoted as D̂. The transition probabilities between states in D̂ are updated based on both the modified MDP M̂ and the action distribution of the original RL policy π̂.
3. Safety Verification: Finally, the safety measurement m of D̂ is rigorously verified using the Storm model checker.
Running Example. To elucidate the methodology,
let’s consider a running example featuring an MDP
that models a grid-like environment (see Figure 2). In
this environment, an RL agent can navigate among
four discrete states, denoted as x = 1, x = 2, x = 3,
and x = 4, corresponding to positions A, B, C, and D,
respectively. From each state, the agent can execute
one of three actions: UP, NOP (No Operation), or
DOWN. The transition probabilities between states
are defined based on the chosen actions. For instance,
executing the UP action from state A (x = 1) gives
the agent a 0.2 probability of transitioning to state B
(x = 2) and a 0.8 probability of transitioning to state
C (x = 3). For simplicity, let’s assume a reward of one for each state-action pair. This example functions as
a foundational MDP against which we will test our
verification method.
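To make the running example concrete, here is one possible encoding of M and of the stochastic policy π̂ from Figure 3 as plain Python data. This is an illustrative sketch, not the authors' implementation: the UP transition probabilities follow the text, while the DOWN targets and the self-loops in B, C, and D are read off Figures 2-4.

```python
# Running example (Figure 2) as plain Python data; illustrative only.
mdp = {
    # state (value of x) -> action -> {successor state: probability}
    1: {"UP":   {2: 0.2, 3: 0.8},
        "DOWN": {3: 0.4, 4: 0.6}},      # DOWN targets inferred from Figure 4
    2: {"UP": {2: 1.0}, "NOP": {2: 1.0}, "DOWN": {2: 1.0}},  # absorbing
    3: {"UP": {3: 1.0}, "NOP": {3: 1.0}, "DOWN": {3: 1.0}},  # absorbing
    4: {"UP": {4: 1.0}, "NOP": {4: 1.0}, "DOWN": {4: 1.0}},  # absorbing
}
initial_state = 1

def policy(state):
    """Stochastic policy from Figure 3: pi(UP|x=1)=0.3, pi(DOWN|x=1)=0.7."""
    if state == 1:
        return {"UP": 0.3, "DOWN": 0.7}
    return {"NOP": 1.0}  # arbitrary choice in the absorbing states
```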
4.1 Induced MDP Construction
To perform the verification of stochastic RL policies, we initially construct an induced MDP M̂ from a given MDP M (as shown in Figure 2), a trained stochastic RL policy π̂, and a safety measurement m.
We commence the construction process at the initial state s₀ of the original MDP M. Starting from s₀, we iteratively visit each state s that is reachable under the policy π̂, only including actions a for which π̂(a|s) > 0. This creates an MDP M̂ induced by a permissive policy τ and ensures that M̂ is reduced to those states and actions that are reachable by π̂ (see Figure 3). This induced MDP serves as the basis for the subsequent transformation into an induced DTMC D̂ for formal verification.
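The construction of M̂ can be sketched as a breadth-first traversal that queries the policy at each discovered state and keeps exactly the actions with positive probability. The following minimal sketch reuses the `mdp` dictionary and `policy` function defined above; it is not the authors' COOL-MC implementation.

```python
# Step 1 (induced MDP construction): keep only actions with pi(a|s) > 0 and
# the states reachable through them, starting from the initial state.
from collections import deque

def build_induced_mdp(mdp, policy, initial_state):
    induced = {}
    frontier = deque([initial_state])
    visited = {initial_state}
    while frontier:
        s = frontier.popleft()
        induced[s] = {}
        for a, prob_a in policy(s).items():
            if prob_a <= 0.0:          # skip actions the policy never selects
                continue
            successors = mdp[s][a]
            induced[s][a] = successors
            for s_next in successors:
                if s_next not in visited:
                    visited.add(s_next)
                    frontier.append(s_next)
    return induced
```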
4.2 Induced DTMC Transformation
To convert the induced MDP M̂ into an induced DTMC D̂, we perform a probabilistic transformation on the state transitions (refer to Figure 4).
Figure 2: The MDP M with S = {x = 1, x = 2, x = 3, x = 4}, A = {UP, NOP, DOWN}, and rew: S × A → {1}. (The figure shows the transition graph; for example, from state A, action UP leads to B with probability 0.2 and to C with probability 0.8.)
Figure 3: Induced MDP M̂. For each action a with π̂(a|s) greater than zero, we add the corresponding state-action transition to the MDP. In this example, the chosen actions at state x = 1 are UP and DOWN, with probabilities π̂(UP | x = 1) = 0.3 and π̂(DOWN | x = 1) = 0.7.
Figure 4: Transition update for D̂, shown for the transitions out of x = 1 (with weights 0.3 · 0.2 to x = 2, 0.7 · 0.4 + 0.3 · 0.8 to x = 3, and 0.7 · 0.6 to x = 4). For example, T̂r_D̂(x = 1)(x = 2) = Tr(x = 1, UP, x = 2) · π̂(UP | x = 1) = 0.2 · 0.3 = 0.06. The step from Figure 3 is repeated for all states reachable by the trained RL policy.
Specifically, we update the transition function T̂r_D̂(s, s′), which defines the probability of transitioning from state s to state s′, for each neighboring state s′ ∈ NEIGH(s) of s within the set of all states S_M̂ of the induced MDP M̂.
The transition function T̂r_D̂(s, s′) is given by the following equation:

T̂r_D̂(s, s′) = Σ_{a ∈ Act_M̂(s)} Tr_M̂(s, a, s′) · π̂(a | s)

where Act_M̂(s) represents the set of actions available in state s of M̂, Tr_M̂(s, a, s′) denotes the original transition probability of moving from state s to state s′ when action a is taken in M̂, and π̂(a|s) is the action-selection probability of action a in state s under the RL policy π̂. The equation essentially performs a weighted sum over all possible transitions from state s to state s′, using the probabilities of selecting each action a in M̂ according to π̂ as the weights. This transformation inherently incorporates the decision-making behavior of the RL policy into the probabilistic structure of D̂, enabling us to perform verification tasks.
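A sketch of this weighting, continuing the example code above: the printed distribution for x = 1 reproduces the products shown in Figure 4 (0.3 · 0.2 = 0.06 to x = 2, 0.7 · 0.4 + 0.3 · 0.8 = 0.52 to x = 3, and 0.7 · 0.6 = 0.42 to x = 4).

```python
# Step 2 (induced DTMC transformation): weight each MDP transition by the
# policy's action-selection probability and sum over actions.
from collections import defaultdict

def induce_dtmc(induced_mdp, policy):
    dtmc = {}
    for s, actions in induced_mdp.items():
        out = defaultdict(float)
        for a, successors in actions.items():
            for s_next, p in successors.items():
                out[s_next] += p * policy(s)[a]   # Tr_M(s, a, s') * pi(a|s)
        dtmc[s] = dict(out)
    return dtmc

dtmc = induce_dtmc(build_induced_mdp(mdp, policy, initial_state), policy)
# Expected (up to floating-point rounding): {2: 0.06, 3: 0.52, 4: 0.42},
# matching the transition weights in Figure 4.
print(dtmc[1])
```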
Table 1: Trained PPO policies. The learning rates are 0.0001, the batch sizes are 32, and the seeds are 128. "Ep." is shorthand for "episode". The reward is averaged over a sliding window of 100 episodes.

Environment     Layers   Neurons   Ep.      Reward
Freeway         2        64        898      0.78
Crazy Climber   2        1024      3,229    76.22
Avoidance       2        64        7,749    6,194
4.3 Safety Verification
Once the induced DTMC D̂ is constructed, the final step in our methodology is to compute the safety measurement using the Storm model checker. Storm takes as input the DTMC D̂ and a safety measurement m. Upon feeding D̂ and the specified safety measurement m into Storm, the model checker systematically explores all possible states and transitions in D̂ and returns the safety measurement result.
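One way to run this step is via Storm's Python bindings (stormpy); the exact interface used by the authors' tool chain may differ. The sketch below assumes the induced DTMC has been exported to a PRISM file named induced_dtmc.prism (a hypothetical file name) with a state label "coll" marking collisions, whereas in the paper the incrementally built model is handed to Storm directly.

```python
# Step 3 (safety verification) with stormpy, under the assumptions above.
import stormpy

program = stormpy.parse_prism_program("induced_dtmc.prism")
properties = stormpy.parse_properties('P=? [ F "coll" ]', program)  # P(eventually coll)
model = stormpy.build_model(program, properties)
result = stormpy.model_checking(model, properties[0])
print(result.at(model.initial_states[0]))  # safety measurement at the initial state
```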
4.4 Limitations
Our method is independent of the specific RL algorithm and, because it only queries the trained policy, it can handle neural networks of any size.
However, it requires the policy to obey the Markov
property, i.e., there is no memory, and the action is
selected only using the current state. Furthermore, we
are limited to discrete state and action spaces due to
the discrete nature of the model checking component.
5 EXPERIMENTS
We now evaluate our method in multiple RL environ-
ments. We focus on the following research questions:
• Can we model check RL policies in common benchmarks such as Freeway?
• How does our approach compare to COOL-MC's deterministic estimation and naive monolithic model checking?
• How does the method scale?
The experiments are performed by initially training
the RL policies using the PPO algorithm (Barhate,
2021), then using the trained policies to answer our
research questions.
5.1 Setup
We now describe our setup. First, we describe the
used environments, then the trained policies, and fi-
nally, our technical setup.
5.1.1 Environments
We focus on environments that have previously been
used in the RL literature.
Freeway. The RL agent controls a chicken (up,
down, no operation) running across a highway filled
with traffic to get to the other side. Every time the
chicken gets across the highway, it earns a reward of
one. An episode ends if the chicken gets hit by a car
or reaches the other side. Each state is an image of the
game’s state. Note that we use an abstraction of the
original game, which sets the chicken into the middle
column of the screen and contains fewer pixels than
the original game, but uses the same reward function
and actions (Mnih et al., 2015).
Avoidance. This environment contains one agent
and two moving obstacles in a two-dimensional grid
world. The environment terminates when a collision
between the agent and an obstacle happens. For each step that does not result in a collision, the agent receives a reward of 100. The environment con-
tains a slickness parameter, which defines the proba-
bility that the agent stays in the same cell (Gross et al.,
2022).
Crazy Climber. It is a game where the player has to climb a wall (Mnih et al., 2015). This environment is a PRISM abstraction based on the game. Each state is an image. A pixel with value one indicates the player's position, a pixel with value zero an empty cell, a pixel with value three a falling object, and a pixel with value four a collision between the player and an object. The right side of the wall consists of a window front. The player must avoid climbing up there since the windows are unstable. For every level the player climbs, the player receives a reward of 1. To avoid falling obstacles, the player has the option to move left, move right, or stay idle.
5.1.2 Trained RL Policies
We trained PPO agents in the previously introduced
RL environments. The RL training results are sum-
marized in Table 1.
5.1.3 Technical Setup
We executed our benchmarks in a Docker container with 16 GB RAM and an AMD Ryzen 7 7735HS processor with Radeon graphics (16 threads), running Ubuntu 20.04.5 LTS. For model checking, we
use Storm 1.7.1 (dev) (Hensel et al., 2022).
Table 2: RL benchmarks across various environments and metrics. Columns display the number of built states, transitions, safety measures, and computational time required. The time is calculated by adding the time taken to build the model to the time taken to check it and is expressed in seconds. Notably, the time for model checking is negligible and, therefore, not explicitly mentioned.

                                    Deterministic Policy            Stochastic Policy                 Naive Monolithic (No RL Policy)
Environment    Measurement          States  Transitions Result Time  States  Transitions Result Time    States  Transitions Result Time
Freeway        P(◊ goal)            123     340         0.7    2.5   496     2,000       0.7    11      496     2,800       1.0    0.06
Freeway        P(mid U mid₁)        21      36          1.0    0.5   272     1,120       1.0    6       272     1,552       1.0    0.07
Crazy Climber  P(◊ coll)            8,192   32,768      0.0    146   40,960  270,336     1.0    1,313   40,960  385,024     1.0    0.4
Avoidance      P(◊^{≤100} coll)     625     8,698       0.84   1.5   15,625  892,165     0.84   1,325   15,625  1,529,185   1.0    2.0
5.2 Analysis
We now go through our evaluation and answer our
research questions.
5.2.1 Can We Verify Stochastic RL Policies in
Common Benchmarks Such as Freeway?
In this experiment, we investigate the applicability of
our approach to evaluating RL policies, specifically
focusing on the Freeway environment.
Execution. Our analysis encompasses two dimensions: (1) the probability of reaching the desired goal, denoted by P(◊ goal), and (2) a more complex behavior, the likelihood of the agent reverting to its starting position, denoted by P(mid U mid₁).
Results. For the first dimension, we evaluated the
probability of the trained Freeway RL agent suc-
cessfully crossing the street without colliding with a
car. The model checking result for this safety mea-
surement yielded P(goal) = 0.7, indicating that the
agent has a 70% chance of safely reaching the other
side of the road.
For the second dimension, we examined the
agent’s likelihood of reversing direction and return-
ing to the starting position after having entered the
street. The model checking process, in this case, revealed that P(mid U mid₁) = 1, confirming that the
agent will indeed return to the starting position with
certainty under the analyzed conditions.
In summary, our findings demonstrate that it is
not only feasible to apply model checking methods to
stochastic RL policies in commonly used benchmarks
like Freeway but also that these methods are versatile
enough to evaluate complex safety requirements.
5.2.2 How Does Our Approach Compare to
COOL-MC’s Deterministic Estimation
and Naive Monolithic Model Checking?
This experiment compares our approach with deter-
ministic estimations of the safety measures of trained
policies and naive monolithic model checking. Deter-
ministic estimation incrementally builds the induced
DTMC by querying the policy at every state for the
highest probability action (Gross et al., 2022). Naive monolithic model checking is called "naive" because it does not take into account the complexity of the system or the number of possible states it can be in, and it is called "monolithic" because it treats the entire system as a single entity, without considering the individual components of the system or the interactions between them (Gross et al., 2023a). In short, it does not verify the trained RL policy but rather the overall environment. For instance, P(◊ goal) = 1 indicates that some path eventually reaches the goal state.
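The difference between the two policy-based constructions boils down to which actions are expanded at each state. A small illustration, again reusing the `policy` function from the running-example sketch in Section 4 (deterministic safety estimation as in Gross et al. (2022) versus our construction):

```python
# Deterministic estimation: follow only the highest-probability action.
def actions_deterministic_estimate(policy, state):
    dist = policy(state)
    best = max(dist, key=dist.get)          # argmax_a pi(a|s)
    return {best: 1.0}

# Our approach: keep every action with positive selection probability.
def actions_stochastic(policy, state):
    return {a: p for a, p in policy(state).items() if p > 0.0}

print(actions_deterministic_estimate(policy, 1))   # {'DOWN': 1.0}
print(actions_stochastic(policy, 1))               # {'UP': 0.3, 'DOWN': 0.7}
```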
Execution. We use the trained stochastic policy to build and verify the induced DTMCs for both the deterministic estimation and our approach. For naive monolithic model checking, we build and verify the whole MDP.
Results. The data presented in Table 2 indicates
that our method yields precise results (see Crazy
Climber). In contrast, the deterministic estimation
technique exhibits faster performance. This is due
to its method of extending only the action transitions
associated with the highest-probability action. The
naive monolithic model checking results are bounds
and do not reflect the actual RL policy performance.
For naive monolithic model checking, the number of states and transitions is larger than for the other two approaches. This indicates that naive monolithic model checking runs out of memory sooner than the other two approaches, which is critical in environments with many states and transitions (Gross et al., 2023b).
5.2.3 How Does the Method Scale?
Based on our experiments, detailed in Table 2, we
find that the primary limitations of our approach stem
from the increasing number of states and transitions.
Given the probabilistic characteristics inherent in the
RL policy, there is an increase in the number of states
and transitions that can be reached compared to a deterministic RL policy. This expansion results in longer overall model checking times. Op-
timizing our method’s incremental building process
may increase the model checking performance (Gross
et al., 2023a).
6 CONCLUSION
We presented a methodology for verifying memo-
ryless stochastic RL policies, thus addressing a gap
in the current body of research regarding the safety
verification of RL policies with complex, layered
NNs. Our method operates independently of the spe-
cific RL algorithm in use. Furthermore, we demon-
strated the effectiveness of our approach across vari-
ous RL benchmarks, confirming its capability to ver-
ify stochastic RL policies comprehensively.
For future work, exploring the integration of safe
RL (Carr et al., 2023) and stochastic RL verification
offers a promising path to validate policies’ reliabil-
ity and enhance their operational safety across diverse
environments. Additionally, merging stochastic RL
verification with interpretability (Zhao et al., 2023)
and explainability (Vouros, 2023) approaches could
significantly bolster the understanding of RL policies.
REFERENCES
Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B.,
Niekum, S., and Topcu, U. (2018). Safe reinforcement
learning via shielding. In AAAI, pages 2669–2678. AAAI
Press.
Bacci, E., Giacobbe, M., and Parker, D. (2021). Verifying
reinforcement learning up to infinity. In IJCAI, pages
2154–2160. ijcai.org.
Bacci, E. and Parker, D. (2020). Probabilistic guarantees for
safe deep reinforcement learning. In FORMATS, volume
12288 of Lecture Notes in Computer Science, pages 231–
248. Springer.
Bacci, E. and Parker, D. (2022). Verified probabilistic poli-
cies for deep reinforcement learning. In NFM, volume
13260 of Lecture Notes in Computer Science, pages 193–
212. Springer.
Baier, C. and Katoen, J. (2008). Principles of model check-
ing. MIT Press.
Barhate, N. (2021). Minimal pytorch imple-
mentation of proximal policy optimization.
https://github.com/nikhilbarhate99/PPO-PyTorch.
Brázdil, T., Chatterjee, K., Chmelik, M., Forejt, V., Kretínský, J., Kwiatkowska, M. Z., Parker, D., and Ujma, M. (2014). Verification of Markov decision processes us-
ing learning algorithms. In ATVA, volume 8837 of Lec-
ture Notes in Computer Science, pages 98–114. Springer.
Carr, S., Jansen, N., Junges, S., and Topcu, U. (2023). Safe
reinforcement learning via shielding under partial ob-
servability. In AAAI, pages 14748–14756. AAAI Press.
Cassez, F., David, A., Fleury, E., Larsen, K. G., and Lime,
D. (2005). Efficient on-the-fly algorithms for the analysis
of timed games. In CONCUR, volume 3653 of Lecture
Notes in Computer Science, pages 66–80. Springer.
Corsi, D., Marchesini, E., and Farinelli, A. (2021). For-
mal verification of neural networks for safety-critical
tasks in deep reinforcement learning. In de Campos, C.
and Maathuis, M. H., editors, Proceedings of the Thirty-
Seventh Conference on Uncertainty in Artificial Intelli-
gence, volume 161 of Proceedings of Machine Learning
Research, pages 333–343. PMLR.
Courcoubetis, C. and Yannakakis, M. (1988). Verifying
temporal properties of finite-state probabilistic programs.
In FOCS, pages 338–345. IEEE Computer Society.
Courcoubetis, C. and Yannakakis, M. (1995). The complex-
ity of probabilistic verification. J. ACM, 42(4):857–907.
David, A., Jensen, P. G., Larsen, K. G., Mikucionis, M., and
Taankvist, J. H. (2015). Uppaal stratego. In TACAS, vol-
ume 9035 of Lecture Notes in Computer Science, pages
206–211. Springer.
Dräger, K., Forejt, V., Kwiatkowska, M. Z., Parker, D., and
Ujma, M. (2015). Permissive controller synthesis for
probabilistic systems. Log. Methods Comput. Sci., 11(2).
Eliyahu, T., Kazak, Y., Katz, G., and Schapira, M. (2021).
Verifying learning-augmented systems. In SIGCOMM,
pages 305–318. ACM.
García, J. and Fernández, F. (2015). A comprehensive sur-
vey on safe reinforcement learning. J. Mach. Learn. Res.,
16:1437–1480.
Gross, D., Jansen, N., Junges, S., and Pérez, G. A. (2022).
COOL-MC: A comprehensive tool for reinforcement
learning and model checking. In SETTA, volume 13649
of Lecture Notes in Computer Science, pages 41–49.
Springer.
Gross, D., Schmidl, C., Jansen, N., and Pérez, G. A.
(2023a). Model checking for adversarial multi-agent re-
inforcement learning with reactive defense methods. In
ICAPS, pages 162–170. AAAI Press.
Gross, D., Schmidl, C., Jansen, N., and Pérez, G. A.
(2023b). Model checking for adversarial multi-agent re-
inforcement learning with reactive defense methods. In
Proceedings of the International Conference on Auto-
mated Planning and Scheduling, volume 33, pages 162–
170.
Hahn, E. M., Perez, M., Schewe, S., Somenzi, F., Trivedi,
A., and Wojtczak, D. (2019). Omega-regular objectives
in model-free reinforcement learning. In TACAS (1), vol-
ume 11427 of Lecture Notes in Computer Science, pages
395–412. Springer.
Hansson, H. and Jonsson, B. (1994). A logic for reason-
ing about time and reliability. Formal Aspects Comput.,
6(5):512–535.
Hasanbeig, M., Kroening, D., and Abate, A. (2020). Deep
reinforcement learning with temporal logics. In FOR-
MATS, volume 12288 of Lecture Notes in Computer Sci-
ence, pages 1–22. Springer.
Hensel, C., Junges, S., Katoen, J., Quatmann, T., and Volk,
M. (2022). The probabilistic model checker Storm. Int.
J. Softw. Tools Technol. Transf., 24(4):589–610.
Jin, P., Wang, Y., and Zhang, M. (2022). Efficient LTL
model checking of deep reinforcement learning systems
using policy extraction. In SEKE, pages 357–362. KSI
Research Inc.
Kazak, Y., Barrett, C. W., Katz, G., and Schapira, M.
(2019). Verifying deep-rl-driven systems. In Ne-
tAI@SIGCOMM, pages 83–89. ACM.
Kwiatkowska, M. Z., Norman, G., and Parker, D. (2011).
PRISM 4.0: Verification of probabilistic real-time sys-
tems. In CAV, volume 6806 of LNCS, pages 585–591.
Springer.
Littman, M. L., Topcu, U., Fu, J., Isbell, C., Wen, M., and
MacGlashan, J. (2017). Environment-independent task
specifications via GLTL. CoRR, abs/1704.04341.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M. A.
(2013a). Playing atari with deep reinforcement learning.
CoRR, abs/1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M. A.
(2013b). Playing atari with deep reinforcement learning.
CoRR, abs/1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A.,
Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C.,
Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wier-
stra, D., Legg, S., and Hassabis, D. (2015). Human-
level control through deep reinforcement learning. Nat.,
518(7540):529–533.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
van den Driessche, G., Schrittwieser, J., Antonoglou, I.,
Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe,
D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap,
T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Has-
sabis, D. (2016). Mastering the game of go with deep
neural networks and tree search. Nat., 529(7587):484–
489.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.
Vamplew, P., Smith, B. J., Källström, J., de Oliveira Ramos,
G., Radulescu, R., Roijers, D. M., Hayes, C. F., Heintz,
F., Mannion, P., Libin, P. J. K., Dazeley, R., and Foale,
C. (2022). Scalar reward is not enough: a response to
silver, singh, precup and sutton (2021). Auton. Agents
Multi Agent Syst., 36(2):41.
Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu,
M., Dudzik, A., Chung, J., Choi, D. H., Powell, R.,
Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss,
M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Aga-
piou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R.,
Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy,
J., Paine, T. L., Gülçehre, Ç., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K.,
Smith, O., Schaul, T., Lillicrap, T. P., Kavukcuoglu, K.,
Hassabis, D., Apps, C., and Silver, D. (2019). Grandmas-
ter level in starcraft II using multi-agent reinforcement
learning. Nat., 575(7782):350–354.
Vouros, G. A. (2023). Explainable deep reinforcement
learning: State of the art and challenges. ACM Comput.
Surv., 55(5):92:1–92:39.
Wang, Y., Roohi, N., West, M., Viswanathan, M., and
Dullerud, G. E. (2020). Statistically model checking
PCTL specifications on Markov decision processes via
reinforcement learning. In CDC, pages 1392–1397.
IEEE.
Zhao, C., Deng, C., Liu, Z., Zhang, J., Wu, Y., Wang, Y.,
and Yi, X. (2023). Interpretable reinforcement learning
of behavior trees. In ICMLC, pages 492–499. ACM.
Zhu, H., Xiong, Z., Magill, S., and Jagannathan, S. (2019).
An inductive synthesis framework for verifiable rein-
forcement learning. In PLDI, pages 686–701. ACM.