Search for Robust Policies in Reinforcement Learning
Qi Li
University College London, 105 Gower Street, London, U.K.
Keywords:
Reinforcement Learning, Model Free Policy Search, Robust Agents.
Abstract:
While Reinforcement Learning (RL) often operates in idealized Markov Decision Processes (MDPs), its applications to real-world tasks often encounter noise, such as uncertain initial state distributions and noisy dynamics models. Further noise can also be introduced in actions, rewards, and observations. In this paper we specifically focus on the problem of making agents act in a robust manner when the observation noise distributions during training and during testing differ. Such a characterization of separate training and testing distributions is not common in RL, where it is more usual to train and deploy the agent on the same MDP. In this work, two methods of improving agent robustness to observation noise - training on noisy environments, and modifying the reward function directly to encourage stable policies - are proposed and evaluated. We show that training on noisy observation distributions, even when the distribution differs from the one used at test time, can benefit agent performance at test time, while the reward modifications are less generally applicable, improving the optimisation only in some cases.
1 INTRODUCTION
A common formulation of Reinforcement Learning (RL) is to learn agents that act in an environment with previously unknown dynamics, and sometimes unknown observation models, in a way that optimises some stationary reward signal. The traditional RL formulation operates on a single Markov Decision Process (MDP), where an agent is allowed to interact with the MDP over many episodes. This data collection process often involves variation in the initial state distribution - that is, the task given to the agent is expected to begin not at one particular state, but at one of a set of states. In theory, with an adequate exploration policy, and under the Markovian assumption together with assumptions about observability and the stationarity of dynamics and rewards, it should not matter in which states the agent begins during training, as the agent should learn a (probabilistic) optimal policy or value function for all possible states. In practice, especially for RL tasks involving high-dimensional and continuous state or action spaces, this does not happen, as the agent only interacts with the MDP a finite number of times. This finite sampling means that the agent usually performs well if it is initialized within the initial state distribution it was trained on, and not in states outside of that distribution. Such out-of-distribution issues are usually not a concern for most RL applications, as there is no clear difference between the “training” and “testing” initial state distributions.
For more complex tasks, however, a similar mismatch might occur in the state-action transition model between “training” and “testing”. This distributional shift is natural to formulate when the MDP has stochastic transitions. During training, the agent might interact with the MDP under one transition distribution, but during testing, the agent might be expected to act in an environment with a slightly different dynamics model. If the divergence between the training and testing dynamics models is too wide, then of course the agent will not perform well - it is like acting on a different MDP altogether. If the transition models differ in some structured manner, then one might be able to adapt the agent's behavior through informed exploration, or train the agent on a wide enough distribution during training such that the distribution of the testing dynamics is covered by the training distribution. A subset of the former approach is known as Transfer Learning, and the latter technique is called Domain Randomization (DR). These techniques have been studied by many works in the past.
Different from initial state distribution and dy-
namics mismatch, this work focuses on another as-
pect of uncertainty that is of interest to RL, which is
the potential noise in observations. The observation
noise studied in this work is assumed to be uncorre-
lated across time and independent of an agent’s state
or actions. Further, it is assumed the noise is drawn
from the same, stationary distribution at all times. In
this way, the observation noise is i.i.d. This obser-
vation noise is of interest to us, because it is a very
common phenomenon when deploying RL agents in
real applications. Observations in the real world are
always noisy. This is the case for physical sensors such as cameras and accelerometers, and it is also the case for digital “sensors”, such as survey results (human subjects may not respond accurately) and web traffic (dropped packets, unreliable cookies).
The general formulation for noisy observations is the Partially Observable Markov Decision Process (POMDP). In a POMDP, the agent does not know the actual underlying state that it is in; instead, it relies on an observation model to maintain a distribution over the possible states that it could be in. This distribution is then used for policy or value function optimization. POMDPs are very difficult to solve, because the uncertainty in maintaining an agent's state distribution widens quickly as a function of time. This makes planning in the general POMDP problem very inefficient, so it is hard to make model-based approaches to POMDPs tractable in practice.
A model-free approach, while more difficult to
provide optimality bounds, tends to be more achiev-
able, and is the focus of this preliminary work. In
short, we’d like to train an agent in a model-free man-
ner that is robust to observation noise during “test”
time.
There are many ways to accomplish this goal, and
two approaches are presented:
1. Like training with a distribution over initial states and a distribution over the dynamics model, the agent can be trained to act in an MDP with a distribution over observation noise. This “observation randomization” hopefully encourages the agent to take actions that are more “reversible” or “conservative”, so that even if the observation is incorrect, the agent can recover from suboptimal actions in the future.
2. The reward function can also be augmented, or
modified, with additional reward signals that ex-
plicitly encourage the policy the agent learns to
be robust. One additional reward that is exper-
imented with is adding a cost that corresponds
to the standard deviation of episode rewards of
an agent across many rollouts. This cost favors
agents that show consistent behavior across many
samples of the noisy observation environments.
Another cost experimented with is the standard deviation of episode rewards of an agent with slightly perturbed parameters. This parameter-space perturbation encourages the optimization of more “stable” agents. Stability in the parameter space implicitly corresponds to a more stable local minimum during optimization, which may correspond to a higher-quality policy that is less likely to change given new agent interactions. The hypothesis is that an agent less likely to change during training is also an agent that is more robust to observation noise.
The following sections give detailed background discussion, describe our policy search algorithms, and present the results of the experiments performed to evaluate the proposed robustness modifications. By leveraging fast model-free parameter-space policy search algorithms applied to an easy-to-understand toy task (CartPole), we are able to characterize the agents' behaviors across many different observation noise models in carefully designed train and test distributions. This allows us to gain a clear understanding of the performance and impact of the proposed modifications.
2 RELATED WORKS
One context where the robustness framework has ap-
peared is on-line reinforcement learning (Singh et al.,
1994). Previous work has also studied robustness in
terms of input disturbance (action noise) (Morimoto
and Doya, 2005) (Anderson et al., 2007) (Kretchmar
et al., 2001) (Tessler et al., 2019) and modeling er-
rors (dynamics noise or stochastic transitions) (Sami
and Memon, 2018) (Rajeswaran et al., 2016) (Kinjo
et al., 2018). Other works introduce modifications to RL algorithms to improve the robustness of the agent in specific settings, such as in Parti-
game (Al-Ansari and Williams, 1999), multi-task
learning (Teh et al., 2017), signal temporal logic (Ak-
saray et al., 2016) (Jones et al., 2015), and hierarchi-
cal options (Mankowitz et al., 2018).
For many works, robustness is framed with re-
spect to an adversary that can affect the RL agent
during training or testing. Many works have pro-
posed ways to improve RL in such adversarial envi-
ronments (Lim et al., 2013) (Gu et al., 2018) (Pat-
tanaik et al., 2018) (Gu et al., 2018) (Abdullah et al.,
2019). In particular, these works might focus on “dis-
tributional robustness,” which is in general applicable
to the dynamics noise case, where the agent is made
to perform well even under distribution shifts in the
stochastic transition model.
Another way to improve robustness of RL is to
make the agent easy to transfer to new domains.
(Oguni. et al., 2014) studies a case to transfer learned
transition and reward models, so the agent can poten-
tially adapt better online to say an environment with a
different noise distribution. (Killian et al., 2017) stud-
ies policy transfer in the case of MDPs with hidden
parameters.
Unlike previous works that focus on input distur-
bance (action noise) or modeling errors (dynamics
noise), this work is unique in that it explicitly con-
siders distributional shifts in observation noise, and
studies how an agent might learn to be robust to them.
3 ROBUST POLICY SEARCH
3.1 Markov Decision Processes
A mathematically idealized model, the Markov decision process (MDP) constitutes a basic framework for dynamically controlling systems that may evolve stochastically. An MDP can have discrete or continuous state and action spaces, and can have finite or infinite time horizons. An MDP has states $s_t \in S$, actions $a_t \in A$, and rewards $R_t \in \mathbb{R}$. The focus of this work is on the case of continuous states and actions, so $s_t \in \mathbb{R}^n$ and $a_t \in \mathbb{R}^m$. Importantly, the Markovian assumption made about MDPs states that the future state, conditioned on the current state and action, is independent of previous states and actions.
Reinforcement learning aims to find a policy $\pi : S \rightarrow A$, mapping from states to actions of an MDP, that maximizes the expected sum of discounted rewards:
$$\mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t} R_t\right] \quad (1)$$
where $\gamma$ is a discount factor between 0 and 1. Here the expectation can be taken over the distribution of transitions if the dynamics is stochastic, or over the distribution of the policy if the policy is stochastic. In general, $T$ is the time horizon of the task, and it can be infinite. However, even if the time horizon is infinite, a $\gamma$ smaller than 1 effectively imposes a finite time horizon, as future rewards fall off exponentially.
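As a small worked example of the objective in Eq. (1), the following sketch computes the discounted return of a single rollout from a list of per-step rewards; it is an illustration of ours rather than code from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return of one episode: sum over t of gamma^t * R_t (Eq. 1 for a single rollout)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma < 1, distant rewards contribute little: 200 steps of reward 1
# give roughly 86.6, versus an undiscounted total of 200.
print(discounted_return([1.0] * 200, gamma=0.99))
```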
Different from supervised learning, RL is not su-
pervised with the optimal action, or actions, to take
at each state. Instead, it is given evaluative supervision - how good an action or a sequence of actions is. In addition, actions chosen by the RL agent
will affect future states that the agent is in, hence it
will affect future inputs into the model. The sequen-
tial decision nature of RL may lead to distributional
shifts over time in the states and dynamics the agent
encounters, making RL much harder to learn than su-
pervised learning.
3.2 Policy Search in Parameter Space
There are many ways to search for a policy π in RL, and they roughly separate into model-based and model-free methods. Model-based methods first sample transitions from the MDP, then fit a function to model the dynamics of the environment, as well as the reward, to learn the MDP. With the MDP learned, standard MDP optimization algorithms can be used to find the optimal policy in the learned MDP.
Model-based methods often suffer from distribution
mismatch issues - the data collected during model
learning may not correspond well to the data seen by
the agent during execution. However, when the distribution shift is small and the learned model has low estimation bias, model-based RL tends to be more sample efficient than model-free RL.
Model-free methods can be further separated into on-policy and off-policy algorithms, as well as value-based and policy-based algorithms. This work focuses on the on-policy, policy-based variation. This type of policy search usually first represents the policy in some parameterized way, such as a linear or Neural Network (NN) function. Denote these parameters as $\theta$, and denote the policy with such parameters as $\pi_\theta$. Then, exploration is performed either in the action space, by adding noise to the policy outputs, or in the parameter space, by directly perturbing the policy parameters. The former method corresponds to Policy Gradients, while the latter is usually done with derivative-free optimization.
Derivative-free policy search in the parameter space is used here because it achieves comparable performance to policy gradients but with lower variance and higher sample efficiency (Mania et al., 2018).
Generally, a derivative-free policy search contains the following steps (a generic skeleton is sketched below):
1. Sample policies
2. Evaluate each policy
3. Update the optimiser
4. Keep track of the best policy so far, until a reward threshold is reached or the compute budget is exhausted
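A generic episode-based loop corresponding to these four steps might look as follows. The functions sample_policies, evaluate, and update_optimiser are placeholders for the algorithm-specific pieces described in the next subsections; this skeleton is our own illustration rather than the paper's implementation.

```python
import numpy as np

def derivative_free_search(sample_policies, evaluate, update_optimiser,
                           n_iters=100, reward_threshold=200.0):
    """Generic episode-based, derivative-free policy search:
    sample -> evaluate -> update -> keep the best policy seen so far."""
    best_theta, best_reward = None, -np.inf
    for _ in range(n_iters):
        thetas = sample_policies()                        # 1. sample candidate parameters
        rewards = [evaluate(theta) for theta in thetas]   # 2. one full episode per candidate
        update_optimiser(thetas, rewards)                 # 3. update the optimiser's state
        i = int(np.argmax(rewards))                       # 4. track the best policy so far
        if rewards[i] > best_reward:
            best_theta, best_reward = thetas[i], rewards[i]
        if best_reward >= reward_threshold:               # stop at threshold or budget
            break
    return best_theta, best_reward
```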
Training policies can be done through either step-based or episode-based feedback. Episode-based training waits for each episode to finish before starting another policy, whereas step-based training gives a reward at each state-transition pair. Derivative-free policy search methods are usually episode-based, so the policy training feedback takes into account the agent's cumulative performance across the entire task.
In our experiments, we evaluate and analyze 3 dif-
ferent derivative-free optimisers to search for agent
policies in the parameter space:
3.3 Random Search (RS)
RS evaluates a random set of directions in which the policy parameters can move, and weights these directions by how well, in relative terms, the resulting policies perform. At each training iteration, it first samples $N$ perturbations $\Delta\theta_n$, then evaluates $\pi_{\theta+\Delta\theta_n}$ and $\pi_{\theta-\Delta\theta_n}$. The update formula is as follows:
$$\theta_{t+1} = \theta_t + \frac{\alpha}{N} \sum_{n=1}^{N} \left( R(\pi_{\theta_t+\Delta\theta_n}) - R(\pi_{\theta_t-\Delta\theta_n}) \right) \Delta\theta_n \quad (2)$$
where $\alpha$ is the learning rate.
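A minimal NumPy sketch of one RS iteration implementing Eq. (2) is given below. The evaluate function, which returns the total episode reward of a policy with the given parameters, and the hyperparameter values are assumptions of ours, not part of the paper.

```python
import numpy as np

def rs_update(theta, evaluate, n_dirs=8, alpha=0.02, noise_std=0.03):
    """One Random Search step (Eq. 2): probe each random direction in both
    signs and move theta along the reward-weighted average direction."""
    deltas = np.random.randn(n_dirs, theta.size) * noise_std
    diffs = np.array([evaluate(theta + d) - evaluate(theta - d) for d in deltas])
    return theta + (alpha / n_dirs) * diffs @ deltas
```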
3.3.1 Cross Entropy Method (CEM)
CEM maintains a Gaussian distribution over the policy parameters, $\mathcal{N}(\mu_t, \Sigma_t)$. At each training iteration, CEM samples $N$ parameter vectors $\theta_n \sim \mathcal{N}(\mu_t, \Sigma_t)$, then evaluates each $\pi_{\theta_n}$. The update proceeds as follows:
1. Sort the $\theta_n$ by their rewards.
2. Choose the top $N_{best}$ of the $\theta_n$.
3. Refit the Gaussian to these elites:
$$\mu_{t+1} = \frac{1}{N_{best}} \sum_{n=1}^{N_{best}} \theta_n \quad (3)$$
$$\Sigma_{t+1} = \frac{1}{N_{best}} \sum_{n=1}^{N_{best}} (\theta_n - \mu_t)(\theta_n - \mu_t)^{T} \quad (4)$$
3.3.2 Genetic Algorithm (GA)
GA is a “population-based” algorithm. Unlike RS and CEM, it does not explicitly maintain a current policy or a current distribution over policies. Instead, it maintains a population of policies.
Initially, GA samples $N$ parameter vectors $\theta_t^{(1)}$ to $\theta_t^{(N)}$. Then, at each training iteration:
1. Choose $N$ pairs of previously sampled solutions, i.e. $(\theta_t^{(i)}, \theta_t^{(j)})$, where $i \neq j$.
2. Create $N$ new policies, with parameter values set to $\theta_n[k] = \beta_n \theta_t^{(i)}[k] + (1 - \beta_n)\theta_t^{(j)}[k] + \varepsilon_n$, where $\beta_n$ is either 0 or 1, sampled from a Bernoulli distribution with $p = 0.5$, and $\varepsilon_n$ is noise sampled from a Gaussian distribution, $\varepsilon \sim \mathcal{N}(0, \sigma)$.
Unlike RS and CEM, GA can handle multi-modal objective landscapes, because it is not forced to converge to a single policy; it can maintain many policies that all perform well, even if they have very different parameters (Loughlin et al., 2001; Zhan et al., 2013).
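A sketch of one GA generation using the crossover-and-mutation rule above. The text does not specify how parent pairs are chosen or how the next population is formed, so uniform parent selection and keeping the best N of parents plus children are assumptions made here for illustration.

```python
import numpy as np

def ga_generation(population, evaluate, sigma=0.05):
    """One GA generation: per-parameter crossover with beta ~ Bernoulli(0.5)
    plus Gaussian mutation noise, then survivor selection by reward."""
    n, dim = population.shape
    children = np.empty_like(population)
    for c in range(n):
        i, j = np.random.choice(n, size=2, replace=False)   # a parent pair with i != j
        beta = np.random.binomial(1, 0.5, size=dim)          # 0/1 mask per parameter
        eps = np.random.normal(0.0, sigma, size=dim)         # mutation noise
        children[c] = beta * population[i] + (1 - beta) * population[j] + eps
    combined = np.vstack([population, children])
    rewards = np.array([evaluate(theta) for theta in combined])
    return combined[np.argsort(rewards)[-n:]]                # keep the best n policies
```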
3.4 Noisy MDPs and Robust Policies
Noise in RL is common for real world applications.
Three types of noise models can be used to simulate
a realistic learning environment: 1) Action noise - for
example, in robotics applications, a robot controller
might not follow exactly what the commanded actions
ask. 2) Reward noise - during training, the reward
feedback is noisy. 3) Observation noise - an observa-
tion that an agent receives is always a perturbed ver-
sion of the true observation. This paper studies how
to make policy search robust to observation noise.
In reinforcement learning, a policy is said to be robust if it maximizes the reward when noise is introduced during the process. One way to measure robustness is to see how wide the variation, or the standard deviation, of the agent's performance is over multiple rollouts:
$$\sigma_R = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (R_k - \mu_R)^2} \quad (5)$$
where $R_k$ is the total cumulative reward of the $k$th policy rollout. An agent with a high $\mu_R$ and a small $\sigma_R$ is said to be more robust than one with a high $\sigma_R$.
The standard deviation of rewards over multiple policies around a given policy is:
$$\sigma_{\pi_\theta} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( R(\pi_{\theta+\delta\theta_k}) - \mu_R \right)^2} \quad (6)$$
where $\delta\theta_k$ is a small perturbation of the true policy parameters, sampled from a Gaussian distribution. This measures how sensitive the policy is to parameter perturbations, and hence to future training data points.
Another way to measure an agent's robustness is to take the ratio of rollout rewards under some observation noise distribution $o_1$ while the agent was trained under another distribution $o_2$:
$$\phi_{12} = \frac{\mu_{R(o_1)}}{\mu_{R(o_2)}} \quad (7)$$
The higher this ratio is, the more robust the agent is to changes in the observation noise distribution.
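The three robustness measures above can be estimated directly from rollouts. The sketch below assumes a rollout(theta, noise_level) helper returning one episode's total reward; the helper and the sample counts are our placeholders rather than the paper's code.

```python
import numpy as np

def reward_std(theta, rollout, noise_level, k=20):
    """sigma_R (Eq. 5): standard deviation of episode returns over K rollouts."""
    return np.std([rollout(theta, noise_level) for _ in range(k)])

def param_perturbation_std(theta, rollout, noise_level, k=20, delta=0.02):
    """sigma_{pi_theta} (Eq. 6): std of returns under small Gaussian
    perturbations of the policy parameters."""
    returns = [rollout(theta + delta * np.random.randn(theta.size), noise_level)
               for _ in range(k)]
    return np.std(returns)

def robustness_ratio(theta, rollout, test_noise, train_noise, k=20):
    """phi_12 (Eq. 7): mean return under the testing noise distribution
    divided by mean return under the training noise distribution."""
    test_mean = np.mean([rollout(theta, test_noise) for _ in range(k)])
    train_mean = np.mean([rollout(theta, train_noise) for _ in range(k)])
    return test_mean / train_mean
```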
The following experiments evaluate two ways of
training that may help improve robustness of a policy:
1. First, the policy is trained in a noisy environment, and is therefore expected to perform better during evaluation than a policy trained in environments with no observation noise.
2. Second, the training objective function of the optimiser is changed to optimise the robustness of the policy, while the reward of the environment stays the same. The modified reward function for the optimisers can be written as follows (a sketch of this objective is given after the list):
$$R' = R - \alpha_r \sigma_R - \alpha_\pi \sigma_{\pi_\theta} \quad (8)$$
The $\alpha_r$ and $\alpha_\pi$ terms are scaling hyperparameters.
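A sketch of the modified objective of Eq. (8), which an optimiser would maximise in place of the raw mean return. The rollout helper and the perturbation scale delta are assumed, and alpha_r, alpha_pi follow the settings swept in Section 4.

```python
import numpy as np

def modified_objective(theta, rollout, noise_level,
                       alpha_r=0.25, alpha_pi=0.25, k=20, delta=0.02):
    """R' = R - alpha_r * sigma_R - alpha_pi * sigma_{pi_theta} (Eq. 8)."""
    returns = np.array([rollout(theta, noise_level) for _ in range(k)])
    perturbed = np.array([
        rollout(theta + delta * np.random.randn(theta.size), noise_level)
        for _ in range(k)
    ])
    return returns.mean() - alpha_r * returns.std() - alpha_pi * perturbed.std()
```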
4 EXPERIMENTS
4.1 Benchmark Tasks
The CartPole task is used to perform learning experi-
ments. In a CartPole task, a pole is attached to a cart
by an un-actuated joint, and the cart can be moved left or right along a track. So the action space of the task is
discrete and binary (either left or right). An agent ob-
serves the current position and velocity of the cart, as
well as the current angle and angular velocity of the
pole. These 4 numbers make up the state space.
By moving the cart backwards and forwards, the
pole can be controlled to stay upright. The goal of this
task is to prevent the pole from falling over. The ini-
tial state of the task has the pole angled upright with
0 angular and linear velocity. A reward of +1 is given for every time step the pole stays upright. The performance of an
agent is measured by the cumulative reward at the end
of a fixed time horizon. If the time horizon is H, then
the maximum total reward is also H. A time horizon
of H = 200 is used in all experiments below.
The agent uses a linear policy consisting of a vector of 4 numbers: $\theta \in \mathbb{R}^4$. The action is determined by $\pi_\theta(s) = \mathrm{sign}(\theta^{\top} s)$. If the output is 1, the agent moves right; if the output is -1, the agent moves left.
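A sketch of the linear sign policy and one CartPole episode, written against the classic OpenAI gym API (env.reset() returning the observation and env.step() returning four values); the mapping from sign(theta^T s) to gym's discrete actions is our own, chosen so that +1 corresponds to pushing right.

```python
import numpy as np
import gym

def rollout(theta, horizon=200):
    """One CartPole episode with the linear policy pi_theta(s) = sign(theta^T s).
    In gym's CartPole, action 1 pushes the cart right and action 0 pushes it left."""
    env = gym.make('CartPole-v1')
    s = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        action = 1 if float(np.dot(theta, s)) > 0 else 0
        s, r, done, _ = env.step(action)
        total_reward += r
        if done:
            break
    env.close()
    return total_reward

# Example: evaluate a random linear policy (maximum possible return is 200).
print(rollout(np.random.randn(4)))
```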
The CartPole task and the linear policy are cho-
sen, because they are fast to simulate, compute, and
optimise, allowing us to perform many ablation ex-
periments to understand the effects of the algorithms.
4.2 Observation Noise
To add observation noise, we compute $s' = s + \beta\varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \Sigma_s)$. $\Sigma_s$ is a diagonal covariance matrix, and its values are set to the corresponding standard deviations of states encountered by a random agent. This ensures the magnitude of the added noise is at an appropriate scale for each dimension of the observation vector, since the dimensions all have different units. $\beta$ is a hyperparameter that scales the noise amount.
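One way to implement this noise model is sketched below, again against the classic gym API: the per-dimension standard deviations for $\Sigma_s$ are estimated from states visited by a uniformly random agent, and noise scaled by beta is then added to each observation. The helper names and the number of calibration steps are our assumptions.

```python
import numpy as np
import gym

def estimate_state_stds(env, n_steps=2000):
    """Per-dimension standard deviation of states visited by a random agent,
    used as the diagonal scale Sigma_s for the observation noise."""
    states, s = [], env.reset()
    for _ in range(n_steps):
        s, _, done, _ = env.step(env.action_space.sample())
        states.append(s)
        if done:
            s = env.reset()
    return np.std(np.array(states), axis=0)

def noisy_observation(s, state_stds, beta):
    """s' = s + beta * eps with eps ~ N(0, diag(state_stds^2))."""
    return s + beta * np.random.normal(0.0, state_stds)

env = gym.make('CartPole-v1')
stds = estimate_state_stds(env)
print(noisy_observation(env.reset(), stds, beta=0.5))
```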
Our experiments swept across 4 types of parameters to evaluate the performance of the agent with different modifications on the CartPole task:
1. Random seed - each of the following experiments was run with 3 different random seeds.
2. Training noise level - agents were trained at 4 different noise levels, $\beta \in [0, 0.25, 0.5, 1]$.
3. Training algorithm - agents were trained with the 3 derivative-free optimization algorithms explained above (RS, CEM, GA).
4. Reward modification - the reward function used by the optimiser was modified to explicitly account for both the standard deviation of rollout rewards and the standard deviation across policy parameter perturbations. Four modification settings were used: $(\alpha_R, \alpha_\pi) \in [(0, 0), (0.5, 0), (0, 0.5), (0.25, 0.25)]$.
The CartPole environment is based on OpenAI gym¹. Derivative-free optimisation was performed in NumPy with Python 3.6, and the implementation is fairly fast given the small size of the observation and action spaces. All trained policies were then evaluated under the 4 different observation noise levels ($\beta \in [0, 0.25, 0.5, 1]$). In total, $3 \times 4 \times 3 \times 4 = 144$ policies were trained, and each of them was evaluated in 4 testing environments. The experiments were run on a computer with an i7-8550U CPU at 1.80 GHz, taking about 2 hours.
4.3 Results
Table 1: Robustness ratio φ averaged across all algorithms and random seeds. Rows are testing noise levels and columns are training noise levels. For example, row 1, column 2 is $\phi_{12}$, which gives the mean ratio of the total reward of the agent trained at observation noise level 0.25 and tested at noise level 0 over the reward it obtained during training.

Test \ Train |    0 | 0.25 |  0.5 |    1
0            |    1 | 1.15 | 1.38 | 2.36
0.25         | 0.91 |    1 | 1.25 | 2.14
0.5          | 0.73 | 0.83 |    1 | 1.71
1            | 0.42 | 0.48 | 0.58 |    1
Table 2: Robustness ratio φ for the Genetic Algorithm (GA). Rows are testing noise levels and columns are training noise levels; each entry is normalised by the reward obtained with 0 training noise and 0 testing noise. For example, row 1, column 2 is $\phi_{12}$, which gives the ratio of the reward of the agent trained at observation noise level 0.25 and tested at noise level 0 over the reward it obtained with 0 training noise and 0 testing noise.

Test \ Train |    0 | 0.25 |  0.5 |    1
0            |    1 |    1 |    1 |    1
0.25         | 0.78 |    1 | 0.99 | 0.98
0.5          | 0.62 | 0.77 |  0.9 | 0.91
1            | 0.36 | 0.39 | 0.56 | 0.61
¹ https://gym.openai.com/envs/CartPole-v1/
Table 1 shows the robustness ratios φ, where the entry in the $i$th row and $j$th column corresponds to the average $\phi_{ij}$ across all 3 random seeds and 3 optimisation algorithms. The row headers give the testing noise level, while the column headers give the training noise level. For example, row 1, column 2 is $\phi_{12}$, the ratio for the agent trained at observation noise level 0.25 and tested at noise level 0. Of course, an agent tested on the same observation noise distribution as it was trained on will perform the same, hence the 1s along the diagonal. It can be seen that, across the board, testing performance decreases as the testing noise level increases and increases as the training noise level increases. The agent tends to learn better and to be more robust when it is trained in a noisy environment.
Table 2 shows the robustness ratios φ for the Genetic Algorithm (GA), where the entry in the $i$th row and $j$th column corresponds to $\phi_{ij}$. As in Table 1, the row headers give the testing noise level and the column headers give the training noise level, with each entry normalised by the reward obtained with 0 training noise and 0 testing noise. For example, row 1, column 2 is $\phi_{12}$, the ratio of the reward of the agent trained at observation noise level 0.25 and tested at noise level 0 over the reward it obtained with 0 training noise and 0 testing noise.
This shows that training in noisy environments leads to better performance overall. For example, $\phi_{44}$ gives the robustness ratio of the agent trained at noise level 1 and tested at noise level 1, and the agent reaches a relative reward of 0.61, whereas $\phi_{41}$, the robustness ratio of the agent trained at noise level 0 and tested at noise level 1, shows that the agent only reaches a relative reward of 0.36.
Figure 1 shows the testing rewards of the different policies (RS, CEM, GA). Each policy was trained with 0 noise and tested at the noise level indicated on the x-axis. Each bar and its error bar give the mean and standard deviation of rewards over all evaluations across seeds 0, 1 and 2.
It shows that testing rewards decrease as the observation noise rises from 0 to 1 for each algorithm. Overall, GA gives the highest rewards at all three noise levels, followed by Random Search and Cross Entropy. However, the standard deviation of the mean rewards for GA increases as the noise level increases, which means that its performance is less reliable and stable. At noise levels 0.25 and 1, RS and CEM have roughly the same mean rewards.
Figure 1: Performance of different algorithms trained with 0 observation noise and tested under different observation noise levels. The length of the black bar gives the standard deviation of mean rewards across 3 random seeds.

Figure 2 gives training curves for all 3 algorithms under different training observation noise levels. Each curve is accompanied by a shaded region that denotes the standard deviation across multiple rollouts, while the mean is computed across the random seeds.
For RS, as the training noise level increases, the growth of the mean reward slows and the final reward decreases. Although the performance of this policy is not good in the noisy environments, its variance is relatively smaller than that of the other algorithms.
For CEM, at noise levels 0, 0.25 and 0.5, the mean rewards rise rapidly relative to noise level 1. However, the standard deviation of the mean rewards at these three noise levels is quite large, meaning that the reward-earning process is not very stable.
For GA, the agent rapidly learns to achieve the optimal reward when the noise level is 0. At noise levels 0.25 and 0.5, the agent still learns and is able to achieve the optimal reward, but with high fluctuation during the learning process. When the noise level is 1, the agent learns slowly, with high fluctuations, and is not able to reach the optimum.
Figure 3 illustrates the mean and standard deviation of the mean reward that an agent earns with RS, CEM, and GA. Each agent is trained at noise level 0 with different reward modification settings, but evaluated at noise levels 0, 0.25, 0.5 and 1, across seeds 0, 1 and 2. The reward modification settings add weights to the standard deviation of rollout rewards and to the standard deviation under policy parameter perturbations.
The motivation for adding a cost term on the standard deviation across episode rollouts is to encourage the agent to show stable and consistent performance in noisy environments. The motivation for adding a cost term on the standard deviation under policy parameter perturbations is to make the agent behave roughly the same across new training iterations, and to decrease the possibility that the optimisation is merely 'getting lucky'.
For all algorithms, the mean rewards decrease as the test noise level rises.
Figure 2: Comparison of training curves of three different policies for different training noise levels aggregated across 3
random seeds. CEM and GA perform better than RS at higher noise levels, but RS has the least reward variance.
Figure 3: Performance of different reward modifications for all optimisation algorithms (left = RS, middle = CEM, right = GA). All agents were trained at noise level 0 with the reward modifications, and the noise levels in the plot are testing noise levels. The first number in each legend tuple is $\alpha_R$ and the second is $\alpha_\pi$. It can be observed that while the reward modifications improve performance, through either a higher mean or a lower reward variance, in some cases for CEM, in other cases they perform worse than having no reward modification at all.
The standard deviation of the mean rewards increases with the noise level, meaning that the consistency of the agent's performance across iterations is reduced. At the 0 test noise level, the modification setting makes little difference to the mean rewards achieved by the agent (first group of bars). The differences between rewards collected under different modification settings then grow as the noise level increases. For CEM, the differences between mean rewards collected under the various modification settings are staggered at each noise level. Also, the standard deviation of the mean rewards is large for each modification setting, which shows that the agent may not behave consistently in noisy environments, or that the policy is likely to change its results as the agent learns over new iterations. For GA, the modification setting with weight 0 on rewards and weight 0.5 on the policy perturbation term does not collect rewards as well as the other modification settings in noisy environments, and its mean reward stays at a relatively low level once noise is present. However, the standard deviation for this setting drops quickly when noise is added.
In general, it is observed that the reward modifica-
tions can be beneficial for improving the performance
of CEM (higher rewards and lower reward variance)
at some noise levels, but they do worse than having no reward modifications for RS and GA.
5 CONCLUSIONS
The paper investigates the robust agent problem for
Reinforcement Learning in the context of observa-
tion noise. Two main proposed methods were in-
troduced - training in noisy environments and mod-
ifying optimisation objectives to encourage more sta-
ble learnt policies. Extensive experiments were per-
formed on the CartPole problem via three derivative-
free parameter-space policy search algorithms: Ran-
dom Search, Cross Entropy Method, and the Ge-
netic Algorithm. Several experiments were performed to compare performance in both noisy and noise-free environments. It is observed that agent performance decreases as the observation noise the agent is tested on grows beyond the noise it was trained on, and that training in noisy environments helps to improve agent performance in noisy environments. The reward modifications, used to encourage the optimisers to find stable policies, are also analyzed. In
the experiments, it is observed that while these modi-
fications help CEM in some cases, in other cases they
perform worse. It is likely that the decreased reward
makes the agents overly conservative, as it also dis-
courages policy exploration and obfuscates the true
reward signal. Therefore, improving the robustness of
an agent via training on noisier environments is pre-
ferred.
REFERENCES
Abdullah, M. A., Ren, H., Ammar, H. B., Milenkovic, V.,
Luo, R., Zhang, M., and Wang, J. (2019). Wasser-
stein robust reinforcement learning. arXiv preprint
arXiv:1907.13196.
Aksaray, D., Jones, A., Kong, Z., Schwager, M., and Belta,
C. (2016). Q-learning for robust satisfaction of signal
temporal logic specifications. In 2016 IEEE 55th Con-
ference on Decision and Control (CDC), pages 6565–
6570. IEEE.
Al-Ansari, M. A. and Williams, R. J. (1999). Robust, effi-
cient, globally-optimized reinforcement learning with
the parti-game algorithm. In Advances in Neural In-
formation Processing Systems, pages 961–967.
Anderson, C. W., Young, P. M., Buehner, M. R., Knight,
J. N., Bush, K. A., and Hittle, D. C. (2007). Robust re-
inforcement learning control using integral quadratic
constraints for recurrent neural networks. IEEE Trans-
actions on Neural Networks, 18(4):993–1002.
Gu, Z., Jia, Z., and Choset, H. (2018). Adversary a3c for
robust reinforcement learning.
Jones, A., Aksaray, D., Kong, Z., Schwager, M., and Belta,
C. (2015). Robust satisfaction of temporal logic spec-
ifications via reinforcement learning. arXiv preprint
arXiv:1510.06460.
Killian, T. W., Daulton, S., Konidaris, G., and Doshi-
Velez, F. (2017). Robust and efficient transfer learning
with hidden parameter markov decision processes. In
Advances in Neural Information Processing Systems,
pages 6250–6261.
Kinjo, K., Uchibe, E., and Doya, K. (2018). Robustness
of linearly solvable markov games employing inac-
curate dynamics model. Artificial Life and Robotics,
23(1):1–9.
Kretchmar, R. M., Young, P. M., Anderson, C. W., Hit-
tle, D. C., Anderson, M. L., and Delnero, C. C.
(2001). Robust reinforcement learning control with
static and dynamic stability. International Journal of
Robust and Nonlinear Control: IFAC-Affiliated Jour-
nal, 11(15):1469–1500.
Lim, S. H., Xu, H., and Mannor, S. (2013). Reinforce-
ment learning in robust markov decision processes. In
Advances in Neural Information Processing Systems,
pages 701–709.
Loughlin, D. H., Ranjithan, S. R., Brill Jr, E. D., and
Baugh Jr, J. W. (2001). Genetic algorithm approaches
for addressing unmodeled objectives in optimization
problems. Engineering Optimization, 33(5):549–569.
Mania, H., Guy, A., and Recht, B. (2018). Simple random
search provides a competitive approach to reinforce-
ment learning. arXiv preprint arXiv:1803.07055.
Mankowitz, D. J., Mann, T. A., Bacon, P.-L., Precup, D.,
and Mannor, S. (2018). Learning robust options. In
Thirty-Second AAAI Conference on Artificial Intelli-
gence.
Morimoto, J. and Doya, K. (2005). Robust reinforcement
learning. Neural computation, 17(2):335–359.
Oguni, K., Narisawa, K., and Shinohara, A. (2014). Reducing sample complexity in reinforcement learning by transferring transition and reward probabilities. In Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, pages 632–638. INSTICC, SciTePress.
Pattanaik, A., Tang, Z., Liu, S., Bommannan, G., and
Chowdhary, G. (2018). Robust deep reinforcement
learning with adversarial attacks. In Proceedings
of the 17th International Conference on Autonomous
Agents and MultiAgent Systems, pages 2040–2042. In-
ternational Foundation for Autonomous Agents and
Multiagent Systems.
Rajeswaran, A., Ghotra, S., Ravindran, B., and Levine,
S. (2016). Epopt: Learning robust neural network
policies using model ensembles. arXiv preprint
arXiv:1610.01283.
Sami, A. and Memon, A. Y. (2018). Robust optimal control
of continuous time linear system using reinforcement
learning. In 2018 Australian & New Zealand Control
Conference (ANZCC), pages 154–159. IEEE.
Singh, S. P., Barto, A. G., Grupen, R., and Connolly, C.
(1994). Robust reinforcement learning in motion plan-
ning. In Advances in neural information processing
systems, pages 655–662.
Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick,
J., Hadsell, R., Heess, N., and Pascanu, R. (2017).
Distral: Robust multitask reinforcement learning. In
Advances in Neural Information Processing Systems,
pages 4496–4506.
Tessler, C., Efroni, Y., and Mannor, S. (2019). Action robust
reinforcement learning and applications in continuous
control. arXiv preprint arXiv:1901.09184.
Zhan, Z.-H., Li, J., Cao, J., Zhang, J., Chung, H. S.-H.,
and Shi, Y.-H. (2013). Multiple populations for multi-
ple objectives: A coevolutionary technique for solving
multiobjective optimization problems. IEEE transac-
tions on cybernetics, 43(2):445–463.