Assured Reinforcement Learning for Safety-critical Applications

George Mason¹, Radu Calinescu¹, Daniel Kudenko¹ and Alec Banks²
¹Department of Computer Science, University of York, Deramore Lane, York, U.K.
²Defence Science and Technology Laboratory, Salisbury, U.K.
{grm504, radu.calinescu, daniel.kudenko}@york.ac.uk, abanks@mail.dstl.gov.uk
1 RESEARCH PROBLEM
Reinforcement learning (RL) is a branch of machine
learning used to solve sequential decision making
problems. This is achieved by using an autonomous
agent to explore an initially unknown problem envi-
ronment in order to learn a set of actions (known as an
optimal policy) to perform in the environment’s states
that will return the maximum possible expected re-
ward from the system (Wiering and Otterlo, 2012).
Despite RL having successes in areas such as
robotics (Kober et al., 2013) and gaming (Szita, 2012), it has had little appeal in the domain of safety-critical
applications. This is because RL has no inherent guar-
antees that a learned solution will satisfy safety, legal
or regulatory requirements. This limitation has pre-
vented RL from being adopted in areas such as health-
care, business or legal systems, where behaviours by
the agent can be dangerous to itself, other systems or
humans. Furthermore, the agent may learn a set of
behaviours that are unpredictable or unfamiliar to hu-
man operators, therefore making the system difficult
to trust even if its operation may ultimately prove cor-
rect.
In recent years there has been growing interest in this limitation, with research emerging that aims to resolve it (García and Fernández, 2015). How-
ever, current approaches are still largely theoretical,
suffer from scalability issues, have difficulty in ex-
pressing non-trivial safety properties, or are unable to
offer firm guarantees that their RL solutions will sat-
isfy specific requirements.
The root of the problem lies in how objectives are
expressed in RL. Objectives are specified through a
reward function that returns a numerical “reward” to
the RL learning agent according to how beneficial an
action it performs in a system state is to achieving its
objectives. Those actions that will cause the agent
to complete an objective will yield rewards that are
greater than those actions that do not. The problem
with this mechanism is that it can be infeasible to
express complex requirements using rewards alone.
Furthermore, it can necessitate introducing more de-
tails into the underlying RL environment, which will
exacerbate the state-space explosion problem that af-
fects RL (García and Fernández, 2015) and therefore
will significantly reduce the rate of learning. When objectives conflict with each other, there is the additional issue of how to define a reward function that simultaneously rewards the agent for one objective and punishes it for another. The nature of RL is to maximize a reward (or minimize a cost), so traditional RL is inherently unable to find a solution that balances these two extremes.
A further problem of this learning approach is that
since the agent is motivated solely through accumu-
lating as much reward as possible it can learn a solu-
tion that, despite being optimal with respect to max-
imizing reward acquisition, may not follow conven-
tional human behaviours. Unfamiliar, “quirky” be-
haviours by the system further reduce its appeal in
scenarios where it is crucial that the system can be
trusted (Lange et al., 2012).
This project aims to address the problem of how to develop assurances that an RL system will a) find a solution that is guaranteed to satisfy a wide range of safety requirements, and b) learn solutions that can be trusted to conform to conventional behaviours of the domain where the system is deployed.
2 OUTLINE OF OBJECTIVES
Research towards safe RL faces two common problems. First, complex safety properties are often difficult to express using a reward function. Second, it is difficult to provide firm guarantees that RL solutions satisfy specific safety requirements, as opposed to merely obtaining a solution that is "generally safer" than the one learned by traditional RL.
The overarching aim of our project is to address
these limitations by developing a generic approach for
assured reinforcement learning (ARL) that achieves
two key objectives:
1. ARL should support the specification and realisation of a broad range of complex requirements (including constraints and optimisation objectives for the RL solution) without impacting the size of the underlying model.
2. ARL solutions should be guaranteed to satisfy requirements within pre-specified probability boundaries and execution costs.

[Figure 1: The envisaged approach for assured reinforcement learning. Stage 1, abstract policy generation: an AI engineer and a domain expert provide (1) a high-level problem model (abstract Markov decision process) and constraints & optimisation objectives (safety, cost, reliability, rewards, etc.) that drive (2) quantitative verification, which generates (3) a Pareto front of safe abstract policies; this informs (4) the selection (5) of a safe abstract policy. Stage 2, safe reinforcement learning: the selected safe abstract policy constrains (6) reinforcement learning, which produces (7) the learned policy and updates (8) the abstract MDP.]
The envisaged two-stage operation of our ARL
approach is depicted in Figure 1. In Stage 1, termed
abstract policy generation, we propose the use of an
abstract Markov decision process (AMDP) (Marthi,
2007; Li et al., 2006; Sutton et al., 1999) to model a
high-level representation of the RL problem. Also,
a set of safety constraints and optimisation objec-
tives are specified using probabilistic computation
tree logic (PCTL) (Hansson and Jonsson, 1994), an
expressive temporal logic that allows complex prop-
erties of Markov decision processes to be formu-
lated as concise formulae. Devising the AMDP and
the PCTL-encoded constraints and optimisation crite-
ria for the RL problem requires both domain knowl-
edge and AI expertise. Accordingly, teams compris-
ing both an AI engineer and a domain expert pro-
vide (1) these inputs for the first ARL stage. The
problem AMDP and constraints/optimisation criteria
are then used to drive (2) the search for safe AMDP
policies using quantitative verification (QV), a vari-
ant of model checking for the analysis and verifica-
tion of stochastic models (Kwiatkowska et al., 2007).
By exploring different areas of the AMDP parameter
spaces, QV generates (3) a set of abstract policies for
the AMDP, with the policies that are verified as satis-
fying all the constraints organised into a Pareto front.
This Pareto front captures the safe abstract policies
that are Pareto optimal with respect to the optimisa-
tion criteria, and can therefore be used to inform (4)
the users’ selection (5) of a suitable safe abstract pol-
icy.
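For illustration, the kinds of properties specified in this stage might include a safety constraint bounding the probability of ever reaching an unsafe state, and an optimisation objective querying the maximum probability of reaching a goal. A hypothetical PCTL encoding (the atomic propositions unsafe and goal are assumed state labels, not labels taken from our case studies) is:

$$P_{\leq 0.05}\,[\,\mathrm{F}\;\mathit{unsafe}\,] \qquad\text{and}\qquad P_{\max=?}\,[\,\mathrm{F}\;\mathit{goal}\,]$$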
In Stage 2 of ARL, termed safe reinforcement
learning, the selected safe abstract policy is translated
into a set of safety rules that constrains (6) the RL
agent’s exploration to low-level states and actions that
map to the high-level states and actions of the AMDP
known to be safe. As a result, the RL agent produces (7) a safe reinforcement learning policy, i.e. an RL policy that, when followed, provides safety levels equal to those verified for the abstract policy, thus meeting the safety requirements.
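To make the translation of a safe abstract policy into safety rules concrete, the sketch below shows one way such rules could filter the agent's low-level action choices. This is a hypothetical illustration rather than the exact mechanism of our implementation; abstract_of, abstract_action_of and env.actions are assumed helper functions.

```python
def safe_actions(state, env, abstract_policy, abstract_of, abstract_action_of):
    """Filter the low-level actions available in `state` down to those consistent
    with the selected safe abstract policy.

    Assumed helpers: `abstract_of` maps a low-level state to its abstract state,
    `abstract_action_of` maps a (low-level state, low-level action) pair to the
    abstract action it implements, and `env.actions(state)` lists the low-level
    actions available in `state`.
    """
    permitted = abstract_policy[abstract_of(state)]
    return [a for a in env.actions(state) if abstract_action_of(state, a) == permitted]
```

The RL agent would then restrict both its exploratory and its greedy action choices to this filtered set, so that every trajectory it can follow maps to behaviour permitted by the verified abstract policy.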
The two-stage ARL approach described above
makes two important assumptions: i) the AMDP
model will contain all necessary information for ab-
stract safe policies to fully apply to the low-level RL
model, and ii) this information will accurately reflect
the RL model. Should one or both of these assump-
tions not be satisfied, e.g. because the initial knowl-
edge is incomplete or the parameters of the system
change during runtime, then an abstract policy may
not necessarily provide the levels of safety that it was
verified to give. This necessitates a means of detecting inaccuracies in the AMDP model and of then finding an alternative safe abstract policy. We therefore intend to extend the basic ARL
approach to incorporate knowledge revision of the
AMDP (Efthymiadis and Kudenko, 2015; Calinescu
et al., 2011), with the RL agent identifying discrepan-
cies between the safe abstract policy and the explored
environment and updating (8) the AMDP accordingly.
3 STATE OF THE ART
Research into safe RL is scant but has attracted growing interest in recent years. Existing approaches generally fall into one of two categories (García and Fernández, 2015). The first looks into modifying how the accumulation of rewards is optimised. The second focuses on adjusting the strategies employed by the agent to explore the environment.
3.1 Preliminaries
RL uses a Markov decision process (MDP) as its underlying framework. Formally, an MDP is a tuple $(S, A, T, R)$, where: $S$ is a finite set of states; $A$ is a finite set of actions; $T : S \times A \times S \to [0,1]$ is a state transition function such that, for any $s, s' \in S$ and any action $a \in A$ that is allowed in state $s$, $T(s, a, s')$ gives the probability of transitioning to state $s'$ when performing action $a$ in state $s$; and $R : S \times A \times S \to \mathbb{R}$ is a reward function such that $R(s, a, s') = r$ is the reward received by the agent when action $a$ performed in state $s$ leads to state $s'$ (Wiering and Otterlo, 2012).
Supplemental to this definition are policies, denoted $\pi$. A policy is a mapping of states to actions such that for each state $s \in S$ there is a corresponding action $a \in A$. Solving an MDP means finding a policy that, when followed, will return the maximum possible expected reward from the MDP environment; such a policy is optimal and is denoted $\pi^*$. MDPs can be solved using dynamic or linear programming; however, when the transition and/or reward functions are initially unknown, RL is used.
RL uses an autonomous agent to explore the MDP environment in order to learn its dynamics and the whereabouts of the rewards contained within it. Exploration initially proceeds through the arbitrary selection of actions in states, and over time the agent encounters rewards in the environment. Knowledge of these rewards is retained in the form of Q-values: for each state-action pair, $Q(s, a)$ specifies the utility of performing action $a$ in state $s$. In subsequent learning episodes the agent can then reuse this information by selecting the action in a state which has the highest utility value. These Q-values are updated each time the agent revisits the state according to an update function such as Q-learning (Watkins and Dayan, 1992). Given sufficient learning episodes the agent will converge to accurate (i.e. unchanging) utility values; when this eventuality is reached, the agent has learned an optimal solution.
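A minimal sketch of the tabular Q-learning scheme described above is given below. The environment interface (reset, actions, step) and the learning-rate and discount parameters are assumptions made for illustration; they are not the YORLL API used in our experiments.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a) by interacting with `env`.

    `env` is assumed to expose reset() -> state, actions(state) -> list of
    actions, and step(state, action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)  # unseen (state, action) pairs default to 0

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy selection: occasionally explore, otherwise exploit
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(s, a)

            # move Q(s, a) towards the reward plus the best estimated future value
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions(s_next))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next

    return Q
```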
3.2 Optimisation Strategies
An intuitive approach for safe RL is to give a negative reward (i.e. a cost) to the agent if it performs actions which lead to the agent entering states that are denoted as unsafe. However, this simplistic approach suffers from several limitations. Firstly, it requires knowing a priori exactly which states of the RL environment are unsafe, which may not be possible. Secondly, assigning a cost of suitable magnitude is not always obvious and requires trial and error to determine, especially if safety objectives conflict with mission objectives. Finally, it can often be difficult to define complex safety requirements in the reward function, and doing so may necessitate significantly expanding the state space of the model to accommodate the properties.
Therefore, instead of focussing on how to define
rewards for unsafe behaviours, various approaches
have been proposed that consider the criterion for how
the accumulation of rewards is optimised.
Ultimately we desire the agent to accumulate as much reward as possible for achieving mission objectives, but if the agent's behaviour leads it into states where it cannot achieve the objectives then it will not acquire these rewards. Even though on average the learned solution may receive the highest reward possible from the system, the variance of this reward indicates how risky the solution is. That is, if the agent regularly receives a low reward for not achieving all mission objectives then the solution is not a particularly reliable one. Therefore, redefining how the agent optimises its reward accumulation is one avenue pursued by safe RL approaches.
One method is to optimise a policy so that no actions are irreversible, i.e. the agent never performs an action from which it cannot recover, and is thus safe. One example is (Moldovan and Abbeel, 2012), where a set of ergodic policies is identified and the agent optimises over them. Whilst this approach guarantees that a solution will never lead to the agent being unsafe, the solution is often excessively far from being optimal: even when significant rewards could be gained at very low risk, since the risk is not zero these rewards are not considered.
The worst-case criterion (Heger, 1994) optimises a solution so that its worst possible outcome offers the maximum reward relative to the worst possible outcomes of all other solutions. This approach guarantees that a minimum level of return will always be achieved; however, the resulting solutions typically yield significantly lower returns than an optimal solution would. Even though the probability of a worst-case scenario occurring may be very low, this optimisation approach will disregard potentially large future rewards.
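In equation form, the idea behind the worst-case criterion is to replace the usual expected-return objective with a maximin objective of roughly the following shape (a sketch of the idea rather than Heger's exact formulation), where $\Omega^{\pi}$ denotes the set of trajectories possible under policy $\pi$ and $R(\omega)$ the return of trajectory $\omega$:

$$\pi^{\mathrm{wc}} \in \arg\max_{\pi}\; \min_{\omega \in \Omega^{\pi}} R(\omega)$$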
A similar approach is the risk-sensitive criterion (Mihatsch and Neuneier, 2002), which uses a parameter to specify what amount of variability in return is permissible. This parameter can be tuned so that a solution is optimised either to avoid variability or to seek it. Through this approach a solution can be found that satisfies the level of risk that the user is comfortable with. However, as with the worst-case criterion, enforcing low variability can produce solutions that are far from optimal.
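One common way to realise such a risk parameter (a sketch in the spirit of (Mihatsch and Neuneier, 2002), not necessarily their exact definition) is to scale the temporal-difference error $\delta$ of an update such as Q-learning asymmetrically, using a parameter $\kappa \in (-1, 1)$:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,\chi^{\kappa}(\delta), \qquad \chi^{\kappa}(\delta) = \begin{cases} (1-\kappa)\,\delta & \text{if } \delta > 0,\\ (1+\kappa)\,\delta & \text{otherwise.} \end{cases}$$

For $\kappa > 0$ negative surprises are weighted more heavily, giving risk-averse value estimates; for $\kappa < 0$ the agent becomes risk-seeking; $\kappa = 0$ recovers the standard risk-neutral update.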
3.3 Exploration Strategies
An alternative to modifying how the agent’s solutions
are optimised is to modify how the agent actually ex-
plores the environment. Traditionally, an RL agent
starts with no knowledge of the environment and must
initially explore it randomly. This can lead to the
agent finding solutions which traverse unsafe states.
To ameliorate this problem, the exploration strategy
of the agent can be influenced so that it has some
knowledge of which states to transition to and which
to avoid.
The Lyapunov design (Perkins and Barto, 2003) uses control Lyapunov functions to measure the distance from a safe system state to a failure (unsafe) state. By restricting the set of actions to only those that cause the system to descend on a control Lyapunov function, i.e. towards a stable equilibrium, safe RL can be achieved. However, finding appropriate Lyapunov functions is often a difficult task, and the functions are bespoke to every problem scenario.
The Policy Improvement through Safe Reinforcement Learning (PI-SRL) algorithm (García and Fernández, 2012) is a two-stage process in which the first stage defines a safe baseline policy that is assumed to be suboptimal, and the second stage learns an improvement of it. The approach differs from policy iteration in that it uses two new components: a risk function to determine the risk of a particular state, and a baseline behaviour which can be used when in states of risk. However, in problems where there is no graded region of risk, only a discrete change from safe to unsafe states, the algorithm can still result in failure.
Another approach is to use a series of demonstrations to help the agent discover an initial solution to the problem (Argall et al., 2009). This initial so-
lution provides a basis for the agent so that it need
not explore unnecessary or unsafe states. The so-
lution derived from demonstrations will typically be
sub-optimal so the agent uses it only as guidance and
optimises it using traditional RL techniques. This
approach is limited, though, by the fact that it is
not always feasible to provide a safe demonstration
for every possible scenario. In such an eventuality
the agent must rely on conventional exploration tech-
niques, falling back to the problem of the agent un-
knowingly exploring dangerous states.
4 METHODOLOGY
To achieve our project objectives we have decom-
posed the research work into the following tasks.
1. Review existing literature on safe RL to learn the
current state of research and the limitations of ex-
isting approaches. Specifically, we aimed to iden-
tify limitations of existing RL solutions in order
to determine the trajectory of our research.
2. Form a theoretical method as a potential solution
for the identified limitations.
3. Implement the theoretical method as an actual ap-
plication. This can be done in an interleaving
fashion with task 2.
4. Evaluate the method using at least two qualita-
tively different case studies taken from bench-
mark RL experiments and real-world applications.
Evaluation will focus on how well safety levels
have been assured using our method compared to
traditional RL.
5. Extend the approach to accommodate incorrect knowledge contained in the initial high-level problem model. Information that differs in the RL model can be relayed back to the high-level model and an updated solution generated.
6. Further evaluate the full framework for scalability and generality by expanding existing case studies and developing new ones.
To implement the RL experiments we will be using the York Reinforcement Learning Library (YORLL), which is programmed in Java, is developed by the University of York, and is freely available from http://www.cs.york.ac.uk/rl/software.php. YORLL supports a wide range of environ-
ments and learning algorithms. To perform QV we
will use the PRISM model checker (Kwiatkowska
et al., 2011), which supports the verification of
reward-extended PCTL properties for MDPs. QV and
PRISM have been successfully used to analyse simi-
lar models of systems ranging from cloud infrastruc-
ture (Calinescu et al., 2012) and service-based sys-
tems (Calinescu et al., 2013) to unmanned vehicles
(Gerasimou et al., 2014), and thus we expect them to
also work well for ARL.
The evaluation of our ARL approach will focus
on how closely the learned RL policy adheres to the
safety requirements, relative to a control experiment
using traditional RL. Validating the policies can be
done empirically by running the learned safe RL pol-
icy and comparing the outcome to the safety levels
of the selected high-level policy. Given the stochas-
tic nature typical of RL experiments it is necessary to
run the safe RL policy numerous times to obtain an
average result (Arcuri and Briand, 2011).
5 EXPECTED OUTCOME
The expected outcome of our project is a general
framework to provide an assured RL solution that will
satisfy a broad range of safety requirements. Our novel approach will mitigate the existing limitation of safe RL research that learned solutions cannot be guaranteed to fulfil specific safety requirements. By using QV to formally verify constraints on candidate solutions, our assured RL framework will guarantee that the safety requirements are satisfied.
We intend that the framework can be used across
multiple domains and will support large-scale scenar-
ios. The framework will require a set of requirements
expressed using PCTL as well as a high-level model
of the problem scenario. The end result will be a set
of rules specifying which actions the agent should, or
should not, perform in specific states of the RL model.
The outcome is that the RL agent will learn a solution under which it is guaranteed not to violate the safety requirements whilst also being optimal subject to the constraints.
Our framework will contribute to the ongoing re-
search into safe RL. Specifically, our approach will alleviate the recurring limitation that safety cannot be guaranteed, or that, when guarantees are given, they come at an undesirably large detriment to the optimality of the solution.
6 STAGE OF THE RESEARCH
So far, steps (1)–(7) of the approach shown in Figure 1
have been developed and evaluated. Preliminary work
on these steps is summarised in (Mason et al., 2016),
and the formalisation of this part of the approach and
experimental results of its evaluation are presented in
(Mason et al., 2017). Our current work is towards de-
veloping the knowledge-revision algorithm for updat-
ing the AMDP as discussed in Section 2 and shown
by step (8) from Figure 1.
6.1 Completed Work
For our high-level model we use an AMDP, which is an abstract version of the low-level RL MDP. AMDPs differ from conventional MDPs in being significantly smaller in terms of their state space and action set (Marthi, 2007). To achieve this, superfluous states of the MDP are ignored and similar states conflated; only those states with significant features, such as those containing rewards, are retained (Li et al., 2006). The low-level action set can be abstracted so that, instead of requiring a series of individual actions to enter the next state of interest, high-level options are used (Sutton et al., 1999). These options replace the actions with single transitions between states. The resulting AMDP has several orders of magnitude fewer states than the RL MDP, and can be formally analysed using QV. This allows the rapid verification of candidate high-level policies against the PCTL-encoded problem requirements.
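As a rough illustration of this abstraction step (a hypothetical sketch, not the construction used in our case studies), low-level states can be grouped into abstract states via a user-supplied mapping, with states that have no significant features conflated into shared abstract states:

```python
from collections import defaultdict

def build_abstract_states(low_level_states, abstract_of):
    """Group low-level MDP states into abstract states.

    `abstract_of` is a user-supplied function mapping a low-level state to the
    identifier of the abstract state it belongs to (e.g. the room containing a
    grid cell); low-level states without significant features are thereby
    conflated into a single abstract state.
    """
    abstract_states = defaultdict(set)
    for s in low_level_states:
        abstract_states[abstract_of(s)].add(s)
    return dict(abstract_states)
```

An abstract action (option) then corresponds to moving between two such groups of states, rather than to an individual low-level action.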
An algorithm has been developed to automate the
process of generating and verifying abstract policies
and assembling a Pareto-front of those policies that
were verified as being safe; this algorithm is presented
in (Mason et al., 2017). To generate candidate ab-
stract policies, the algorithm can utilize a range of
search techniques, including hill climbing, genetic al-
gorithms, e.g. (Gerasimou et al., 2015), or a simple
random search. As different search techniques can
perform better than others for certain types of prob-
lem scenarios, experiments are planned to identify
suitable techniques for different classes of RL prob-
lems (e.g., planning and navigation).
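For illustration, a minimal sketch of the Pareto-dominance filter used in assembling such a front is given below, under the simplifying assumption that each verified safe policy carries exactly two objective values, both to be maximised (the numeric values in the example are hypothetical):

```python
def pareto_front(policies):
    """Keep only the non-dominated policies.

    `policies` is a list of (name, prob_goal, expected_reward) tuples for
    policies already verified as satisfying the safety constraints. A policy is
    dominated if another policy is at least as good on both objectives and
    strictly better on at least one.
    """
    front = []
    for i, (_, p_i, r_i) in enumerate(policies):
        dominated = any(
            p_j >= p_i and r_j >= r_i and (p_j > p_i or r_j > r_i)
            for j, (_, p_j, r_j) in enumerate(policies) if j != i
        )
        if not dominated:
            front.append(policies[i])
    return front

# Hypothetical example: 'D' is dominated by 'B', so only A, B and C remain.
safe = [("A", 0.90, 2.85), ("B", 0.81, 3.62), ("C", 0.78, 4.50), ("D", 0.80, 3.50)]
print([name for name, _, _ in pareto_front(safe)])  # ['A', 'B', 'C']
```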
We have evaluated the method in two qualitatively
different case studies. The first case study is based on
the benchmark RL flag-collection experiment (Dear-
den et al., 1998) which we have extended by intro-
ducing the risk of the agent being captured. Details
of this case study are given below. The second case
study is based on an assisted living system for demen-
tia patients (Boger et al., 2006). In this experiment an
autonomous agent must give voice prompts to a de-
mentia sufferer to instruct them on what task to do
next when undertaking the activity of washing their
hands. The system learns what style of voice prompt
is most appealing to the patient, considering the vol-
ume of the prompt, the gender of the voice and the
explicitness of the instructions. This system must not
overload the patient with prompts as this can become
stressful for them; conversely the agent must mini-
mize the necessity of summoning a carer to intervene
should the patient not progress effectively.
The environment for the guarded flag-collection
case study is shown in Figure 2; the objective in this
benchmark RL experiment is to learn a route through
the environment to collect the flags A-F.

[Figure 2: The layout of the guarded flag-collection environment, with the 'Start' and 'Goal' areas marked. The areas of risk are illustrated by security cameras which, with a certain probability, can detect the agent as it traverses the doorways.]

In our case
study we have augmented the environment with secu-
rity cameras which can detect the agent as it passes
through certain doorways. Detection of the agent re-
sults in its capture and the experiment ending in fail-
ure, regardless of any flags already collected.
Each of the cameras has a different probability of
detecting the agent. Therefore, along with the origi-
nal optimisation objective to maximize the number of
flags collected we also have the conflicting safety ob-
jective of minimizing the probability that the agent is
captured. For our case study we specify the following
constraints:
C1: The agent should reach the 'goal' area with probability at least 0.75.
C2: The agent should accumulate more than two flags before it reaches the 'goal' area.

Furthermore, we aim to maximise:

O1: The probability that the agent reaches the 'goal'.
O2: The number of collected flags.
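For illustration, constraint C1 and objective O1 might be encoded in PCTL roughly as follows (a hypothetical encoding that assumes a state label goal in the AMDP; the exact properties used in our experiments may differ):

$$C_1:\; P_{\geq 0.75}\,[\,\mathrm{F}\;\mathit{goal}\,] \qquad\qquad O_1:\; P_{\max=?}\,[\,\mathrm{F}\;\mathit{goal}\,]$$

Objective O2 would correspondingly be expressed as a reward-extended property over a reward structure counting the collected flags, e.g. $R^{\mathit{flags}}_{\max=?}\,[\,\mathrm{F}\;\mathit{goal}\,]$.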
After constructing an AMDP for the problem we used QV to identify a set of abstract policies that satisfy the constraints C1 and C2. From these safe policies we identified a Pareto front, i.e. a "front" of policies whose expected reward and probability of reaching 'goal' cannot both be bettered by any other safe policy. These safe policies and their associated Pareto front are shown in Figure 3.
From this Pareto front three policies were selected
to be used for safe ARL; these are labelled A, B and
C on the Pareto front. These safe policies were trans-
lated into safe RL rules for ARL. The results of the
evaluation of the learned ARL policies are presented
in Table 1 alongside a baseline experiment using tra-
ditional RL, which we carried out in order to establish
the effects of ARL.

[Figure 3: Plot of the abstract policies that satisfy the safety constraints C1 and C2 (expected reward vs. probability of reaching 'goal'), and the Pareto front of safe policies A, B and C.]
In both case studies our approach was successful
at satisfying the safety requirements (Mason et al.,
2017). Furthermore, the learned policy matched the
specific safety results that were verified for the high-
level safe policy. These results show that our approach can achieve the required levels of safety.
6.2 Ongoing Work
Currently, research is progressing with the development of a knowledge-revision algorithm, as discussed in Section 2. An algorithm is being formulated where an
RL agent first attempts ARL using what is presumed
to be a suitable abstract policy. Should the agent dis-
cover that the policy is not viable (e.g. the abstract
policy dictates the RL agent should perform an ac-
tion in a state where the action is not available), then
knowledge of the error is fed back to the AMDP and
a revised model is constructed.
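A minimal sketch of this discrepancy check is given below (hypothetical interfaces, consistent with the earlier sketches; the actual knowledge-revision algorithm is still under development):

```python
import random

def detect_discrepancies(abstract_policy, abstract_of, env, episodes=100):
    """Follow the abstract policy's safety rules and record every situation in
    which the prescribed abstract action cannot be realised by any available
    low-level action, so that the AMDP can be revised.

    Assumed interfaces: `abstract_policy` maps abstract states to abstract
    actions, `abstract_of` maps low-level states to abstract states, and `env`
    exposes reset(), step(state, action) and
    low_level_actions(state, abstract_action).
    """
    discrepancies = []
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            abstract_action = abstract_policy[abstract_of(s)]
            allowed = env.low_level_actions(s, abstract_action)
            if not allowed:
                # the abstract policy prescribes an action that is not
                # available here: flag it for knowledge revision
                discrepancies.append((abstract_of(s), abstract_action))
                break
            s, _, done = env.step(s, random.choice(allowed))
    return discrepancies
```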
Work is currently focussing on a means of recy-
cling redundant abstract policies where possible, thus
significantly reducing the verification effort required
to update the Pareto front of safe abstract policies.
Table 1: The results for safe abstract policies A, B and C when used for ARL, along with a baseline traditional RL experiment, for the guarded flag collection (results averaged over 5 independent experiments).

Abstract Policy | Probability of Reaching 'goal' | Standard Error | Expected Reward | Standard Error
None            | 0.72                           | 0.0073         | 4.01            | 0.031
A               | 0.9                            | 0.0012         | 2.85            | 0.0029
B               | 0.81                           | 0.0019         | 3.62            | 0.0037
C               | 0.78                           | 0.0012         | 4.5             | 0.0041

Since it is often the case that there are only minor differences between the initial, incorrect AMDP model
and the corrected model, we envisage that the abstract
policies for the two models will be similar. Therefore,
it is intuitively not necessary to generate an entirely
different abstract policy. Since it is time consuming
to generate policies as well as to verify them, we aim
to reuse those elements of the initial abstract policy
that still match the abstract model.
To evaluate the algorithm, a series of experiments
will be conducted using extensions of the two case
studies described in the previous section. For each case study we will produce a series of RL environments, each differing in a distinct way from the environment modelled by the initial AMDP. We will then deter-
mine if on average the knowledge revision algorithm
is faster at finding a new safe abstract policy than if
an AI engineer were to manually inspect the initial
RL model, reconstruct the high-level model, and gen-
erate safe abstract policies from scratch.
6.3 Future Work
Future work will involve a more in-depth analysis of the framework's performance. This includes evaluation of how long it takes for a safe solution to be learned and the processing overheads incurred. Additionally, further experiments will be conducted by expanding the existing case studies to establish how well the framework scales. New case studies will also be developed for different domains to determine the range of scenarios to which the technique can be applied.
ACKNOWLEDGEMENTS
This paper presents research sponsored by the UK
MOD. The information contained in it should not be
interpreted as representing the views of the UK MOD,
nor should it be assumed it reflects any current or fu-
ture UK MOD policy.
REFERENCES
Arcuri, A. and Briand, L. (2011). A practical guide for us-
ing statistical tests to assess randomized algorithms in
software engineering. In 33rd Intl. Conf. Software En-
gineering, pages 1–10.
Argall, B. D., Chernova, S., Veloso, M., et al. (2009). A
survey of robot learning from demonstration. Robotics
and Autonomous Systems, 57(5):469–483.
Boger, J., Hoey, J., Poupart, P., et al. (2006). A planning
system based on Markov decision processes to guide
people with dementia through activities of daily liv-
ing. IEEE Transactions on Information Technology in
Biomedicine, 10(2):323–333.
Calinescu, R., Johnson, K., and Rafiq, Y. (2011). Using ob-
servation ageing to improve Markovian model learn-
ing in QoS engineering. In 2nd Intl. Conf. Perfor-
mance Engineering, pages 505–510.
Calinescu, R., Johnson, K., and Rafiq, Y. (2013). Devel-
oping self-verifying service-based systems. In 28th
IEEE/ACM Intl. Conf. Automated Software Engineer-
ing, pages 734–737.
Calinescu, R., Kikuchi, S., and Johnson, K. (2012). Compo-
sitional reverification of probabilistic safety properties
for large-scale complex IT systems. In Large-Scale
Complex IT Systems. Development, Operation and
Management, pages 303–329. Springer Berlin Heidel-
berg.
Dearden, R., Friedman, N., and Russell, S. (1998).
Bayesian Q-learning. In 15th National Conference on
Artificial Intelligence, pages 761–768.
Efthymiadis, K. and Kudenko, D. (2015). Knowledge revi-
sion for reinforcement learning with abstract MDPs.
In 14th Intl. Conf. Autonomous Agents and Multiagent
Systems, pages 763–770.
García, J. and Fernández, F. (2012). Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45(1):515–564.
García, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480.
Gerasimou, S., Calinescu, R., and Banks, A. (2014). Effi-
cient runtime quantitative verification using caching,
lookahead, and nearly-optimal reconfiguration. In 9th
International Symposium on Software Engineering for
Adaptive and Self-Managing Systems, pages 115–124.
Gerasimou, S., Tamburrelli, G., and Calinescu, R. (2015).
Search-based synthesis of probabilistic models for
quality-of-service software engineering. In 30th
IEEE/ACM Intl. Conf. Automated Software Engineer-
ing, pages 319–330.
Hansson, H. and Jonsson, B. (1994). A logic for reasoning
about time and reliability. Formal Aspects of Comput-
ing, 6(5):512–535.
Heger, M. (1994). Consideration of risk in reinforcement
learning. In 11th Intl. Conf. Machine Learning, pages
105–111.
Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforce-
ment learning in robotics: A survey. International
Journal of Robotics Research, 32(11):1238–1274.
Kwiatkowska, M., Norman, G., and Parker, D. (2007).
Stochastic model checking. In 7th Intl. Conf. Formal
Methods for Performance Evaluation, volume 4486,
pages 220–270.
Kwiatkowska, M., Norman, G., and Parker, D. (2011).
PRISM 4.0: Verification of probabilistic real-time sys-
tems. In 23rd Intl. Conf. Computer Aided Verification,
volume 6806, pages 585–591.
Lange, D. S., Verbancsics, P., Gutzwiller, R. S., et al. (2012). Command and control of teams of autonomous systems. In Large-Scale Complex IT Systems. Development, Operation and Management, pages 81–93. Springer Berlin Heidelberg.
Li, L., Walsh, T. J., and Littman, M. L. (2006). Towards a
unified theory of state abstraction for MDPs. In 9th In-
ternational Symposium on Artificial Intelligence and
Mathematics, pages 531–539.
Marthi, B. (2007). Automatic shaping and decomposition of
reward functions. In 24th Intl. Conf. Machine learn-
ing, pages 601–608.
Mason, G., Calinescu, R., Kudenko, D., and Banks, A.
(2016). Combining reinforcement learning and quan-
titative verification for agent policy assurance. In 6th
Intl. Workshop on Combinations of Intelligent Meth-
ods and Applications, pages 45–52.
Mason, G., Calinescu, R., Kudenko, D., and Banks, A.
(2017). Assured reinforcement learning with formally
verified abstract policies. In 9th International Confer-
ence on Agents and Artificial Intelligence. To appear.
Mihatsch, O. and Neuneier, R. (2002). Risk-sensitive re-
inforcement learning. Machine Learning, 49(2):267–
290.
Moldovan, T. M. and Abbeel, P. (2012). Safe exploration
in Markov decision processes. In 29th Intl. Conf. Ma-
chine Learning, pages 1711–1718.
Perkins, T. J. and Barto, A. G. (2003). Lyapunov design
for safe reinforcement learning. Journal of Machine
Learning Research, 3(1):803–832.
Sutton, R. S., Precup, D., and Singh, S. (1999). Between
MDPs and semi-MDPs: A framework for temporal
abstraction in reinforcement learning. Artificial Intel-
ligence, 112(1-2):181–211.
Szita, I. (2012). Reinforcement learning in games. In Rein-
forcement Learning: State-of-the-art, pages 539–577.
Springer-Verlag Berlin Heidelberg.
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Ma-
chine Learning, 8(3):279–292.
Wiering, M. and Otterlo, M. (2012). Reinforcement learn-
ing and Markov decision processes. In Reinforcement
Learning: State-of-the-art, pages 3–42. Springer-
Verlag Berlin Heidelberg.