Data Augmentation Through Expert-Guided Symmetry Detection to
Improve Performance in Offline Reinforcement Learning
Giorgio Angelotti 1,2, Nicolas Drougard 1,2 and Caroline P. C. Chanel 1,2
1 ISAE-SUPAERO, University of Toulouse, France
2 ANITI, University of Toulouse, France
ORCID: 0000-0002-1878-5833 (G. Angelotti), 0000-0003-0002-9973 (N. Drougard), 0000-0003-3578-4186 (C. P. C. Chanel)
Keywords: Offline Reinforcement Learning, Batch Reinforcement Learning, Markov Decision Processes, Symmetry Detection, Homomorphism, Density Estimation, Data Augmentation, Normalizing Flows, Deep Neural Networks.
Abstract: Offline estimation of the dynamical model of a Markov Decision Process (MDP) is a non-trivial task that greatly depends on the data available in the learning phase. Sometimes the dynamics of the model is invariant with respect to some transformations of the current state and action. Recent works showed that an expert-guided pipeline relying on Density Estimation methods, such as Deep Neural Network based Normalizing Flows, effectively detects this structure in deterministic environments, both categorical and continuous-valued. The acquired knowledge can be exploited to augment the original data set, leading eventually to a reduction in the distributional shift between the true and the learned model. Such a data augmentation technique can be exploited as a preliminary process to be executed before adopting an Offline Reinforcement Learning architecture, increasing its performance. In this work we extend the paradigm to also tackle non-deterministic MDPs; in particular, 1) we propose a detection threshold in categorical environments based on statistical distances, and 2) we show that the former results lead to a performance improvement when solving the learned MDP and then applying the optimized policy in the real environment.
1 INTRODUCTION
In Offline Reinforcement Learning (ORL) and Offline Learning for Planning, the environment dynamics and/or value functions are inferred from a batch of already pre-collected experiences. Wrong predictions lead to bad decisions. The distributional shift, defined as the discrepancy between the learned model and reality, is mainly responsible for the performance deficit of the (sub)optimal policy obtained in the offline setting compared to the true optimal policy (Levine et al., 2020; Angelotti et al., 2020). Is there a way to exploit expert knowledge or intuition about the environment to limit the distributional shift? Several models benefit from a dynamics that is invariant with respect to some transformations of the frame of reference. In physics, such a property of a system is called a symmetry (Gross, 1996). In the context of Markov Decision Processes (MDPs) (Bellman, 1966), a symmetry can be defined as a particular case of an MDP homomorphism
(Angelotti et al., 2022). Knowing that a system to be learned is endowed with a symmetry or a homomorphic structure can lead to more data-efficient solutions of an MDP.
The automatic discovery of homomorphic structures in MDPs has a long history (Dean and Givan, 1997; Ravindran and Barto, 2001; Ravindran and Barto, 2004). In (Li et al., 2006), a theoretical analysis of the possible types of MDP state abstractions established which properties of the original MDP remain invariant under the transformation: the optimal value function, the optimal policy, etc. Eventually, the fully automatic discovery of a factored MDP representation was proven to be as hard as verifying whether two graphs are isomorphic (Narayanamurthy and Ravindran, 2008). In recent years, (van der Pol et al., 2020a; van der Pol et al., 2020b; Angelotti et al., 2022) rekindled interest in the topic.
In (van der Pol et al., 2020a) a contrastive loss
function that enforces action equivariance on a to-
be-learned representation of an MDP was adopted to
learn a structured latent space that was then exploited
to increase the data efficiency of a data-driven plan-
ner. (van der Pol et al., 2020b) introduced peculiar
classes of Deep Neural Network (DNN) architectures that by construction enforce the invariance of the optimal MDP policy under some set of transformations, obtained through other Deep RL paradigms. The latter also provided an increase in data efficiency. In (Angelotti et al., 2022), an expert-guided detection of alleged symmetries, based on Density Estimation statistical techniques, was proposed in the context of the offline learning of both continuous and categorical environments, in order to eventually augment the starting data set. The authors showed that correctly detecting a symmetry (based on the computation of a symmetry confidence value $\nu_k > \nu$) and augmenting the starting data set by exploiting this information led to a decrease in the distributional shift. Unfortunately, that work concerned only deterministic MDPs and did not include an analysis of the performance of the policy obtained in the end. In other fields of Machine Learning, data augmentation has been extensively exploited to boost the efficiency of algorithms in data-limited setups (van Dyk and Meng, 2001; Shorten and Khoshgoftaar, 2019; Park et al., 2019).
Recently, (Yarats et al., 2022) showed the importance of large and diverse datasets for ORL by demonstrating empirically that offline learning using a vanilla online RL algorithm over a sufficiently diverse batch can lead to performance that is comparable to, or even better than, that of pure ORL approaches.
In this context, the present work addresses the following research questions. First, is it possible to develop a method for expert-guided detection of alleged symmetries, based on Density Estimation statistical techniques in the context of offline learning, that also works for stochastic MDPs? The main idea is to extend previous works (van der Pol et al., 2020a; van der Pol et al., 2020b; Angelotti et al., 2022) to deal with stochastic MDPs. Second, is Data Augmentation exploiting a detected symmetry really beneficial to the learning of an MDP policy in the offline context? We would like to empirically demonstrate (O)RL policy improvement when enriching the batch as proposed by (Yarats et al., 2022).
Contributions. In this work, we take over and extend the state of the art with the aim of answering the listed research questions. More specifically, the contributions of this paper are the following:
1. Algorithmic Contribution. A refinement of the decision threshold, based on statistical distances, is defined for categorical MDPs. This new decision threshold is valid in both stochastic and deterministic environments, hence improving over the state of the art, which only tackled deterministic scenarios;
2. Experimental Contribution. The improvement of the policy performance obtained by augmenting the data with the symmetric images of the transitions is demonstrated experimentally in an offline learning context. The benefit of the method is clear in the categorical setting, while it is less clear-cut in the continuous setting, since offline methods with Deep Neural Networks are sensitive to the (non-trivial) choice of hyperparameters.
It is worth noting that the presented work does not aim to compete with ORL algorithms, but to augment the batch by validating expert intuition. Once the batch has been augmented, one can use any offline RL method.
2 BACKGROUND
Definition 1 (Markov Decision Process). An MDP (Bellman, 1966) is a tuple $\mathcal{M} = (S, A, R, T, \gamma)$. $S$ and $A$ are the sets of states and actions, $R : S \times A \to \mathbb{R}$ is the reward function, $T : S \times A \to \mathrm{Dist}(S)$ is the transition function, where $\mathrm{Dist}(S)$ is the set of probability distributions on $S$, and $\gamma \in [0, 1)$ is the discount factor. Time is discretized and at each step $t \in \mathbb{N}$ the agent observes a system state $s = s_t \in S$, acts with $a = a_t \in A$ drawn from a policy $\pi : S \to \mathrm{Dist}(A)$, and with probability $T(s, a, s')$ transits to a next state $s' = s_{t+1}$, earning a reward $R(s, a)$. The value function of $\pi$ and $s$ is defined as the expected total discounted reward using $\pi$ and starting from $s$:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s\right].$$
The optimal value function $V^{*}$ is the maximum of the latter over every policy $\pi$.
Definition 2 (MDP Symmetry). Given an MDP $\mathcal{M}$, let $k$ be a surjection on $S \times A \times S$ such that $k(s, a, s') = \big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s'), k_{\sigma'}(s, a, s')\big) \in S \times A \times S$. Let $(T \circ k)(s, a, s') = T(k(s, a, s'))$. $k$ is a symmetry if $\forall (s, s') \in S^{2}$, $\forall a \in A$ both $T$ and $R$ are invariant with respect to the image of $k$:
$$(T \circ k)(s, a, s') = T(s, a, s'), \qquad (1)$$
$$R\big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s')\big) = R(s, a). \qquad (2)$$
As in (Angelotti et al., 2022), in this paper we focus only on the invariance of $T$, and therefore we only demand the validity of Equation 1. Problems with a known reward function, as well as model-based approaches, can thus benefit directly from the method.
Probability Mass Function Estimation for Discrete MDPs. Let $D = \{(s_i, a_i, s'_i)\}_{i=1}^{n}$ be a batch of recorded transitions. Performing mass estimation over $D$ amounts to computing the probabilities that define the
categorical distribution $T$ by estimating the transition frequencies in $D$. In other words:
$$\hat{T}(s, a, s') = \begin{cases} \dfrac{n_{s,a,s'}}{\sum_{\bar{s}} n_{s,a,\bar{s}}} & \text{if } \sum_{\bar{s}} n_{s,a,\bar{s}} > 0, \\[4pt] |S|^{-1} & \text{otherwise,} \end{cases} \qquad (3)$$
where $n_{s,a,s'}$ is the number of times the transition $(s_t = s, a_t = a, s_{t+1} = s')$ appears in $D$.
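As an illustration (not taken from the original paper; function and variable names are our own), the following minimal Python sketch computes the empirical estimate of Equation 3 from a batch of categorical transitions.

```python
from collections import Counter, defaultdict

def estimate_pmf(batch, states):
    """Empirical categorical transition model (Eq. 3).

    batch  : iterable of (s, a, s_next) tuples with hashable entries.
    states : list of all states, used for the uniform fallback |S|^-1.
    Returns a function T_hat(s, a, s_next) giving the estimated probability.
    """
    counts = defaultdict(Counter)          # counts[(s, a)][s_next] = n_{s,a,s'}
    for s, a, s_next in batch:
        counts[(s, a)][s_next] += 1

    def T_hat(s, a, s_next):
        n_sa = sum(counts[(s, a)].values())
        if n_sa > 0:
            return counts[(s, a)][s_next] / n_sa   # observed frequency
        return 1.0 / len(states)                   # unvisited (s, a): uniform fallback

    return T_hat

# Usage on a toy two-state, one-action batch:
batch = [(0, "a", 1), (0, "a", 1), (0, "a", 0)]
T_hat = estimate_pmf(batch, states=[0, 1])
print(T_hat(0, "a", 1))   # 2/3
print(T_hat(1, "a", 0))   # 0.5, uniform fallback for the unseen pair (1, "a")
```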
Probability Density Function Estimation for Continuous MDPs. Performing density estimation over $D$ means obtaining an analytical expression for the probability density function (pdf) of transitions $(s, a, s')$ given $D$: $L(s, a, s' \mid D)$. Normalizing flows (Dinh et al., 2015; Kobyzev et al., 2020) allow defining a parametric flow of continuous transformations that reshapes a known initial pdf into one that best fits the data.
Expert-Guided Detection of Symmetries. The paradigm described in (Angelotti et al., 2022) can be summarized as follows:
1. An expert presumes that a to-be-learned model is endowed with the invariance of $T$ with respect to a transformation $k$;
2. She/He computes the probability function estimation based on the batch $D$:
(a) (categorical case) She/He computes $\hat{T}$, an estimate of $T$, using the transitions in the batch $D$ by applying Equation 3;
(b) (continuous case) She/He performs Density Estimation over $D$ using Normalizing Flows;
3. She/He applies $k$ to all transitions $(s, a, s') \in D$ and then checks whether the symmetry confidence value $\nu_k$, defined as the fraction
(a) (categorical case) of samples $k(s, a, s') = \big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s'), k_{\sigma'}(s, a, s')\big) \in k(D)$ such that $T(s, a, s') = (T \circ k)(s, a, s')$, exceeds an expert-given threshold $\nu$;
(b) (continuous case) of probability values $L$ evaluated on $k(D)$ that exceed a threshold $\theta$, corresponding to the $q$-order quantile of the distribution of probability values evaluated on the original batch, exceeds $\nu$. The quantile order $q$ is given as an input to the procedure by an expert (see Algorithm 2);
4. If the last condition is fulfilled, then $D$ is augmented with $k(D)$ (a minimal code sketch of this augmentation step is given below).
Note that once a transformation $k$ is detected as a symmetry, the dataset is potentially augmented with transitions that are not present in the original batch, hence injecting unseen and totally novel information into the dataset.
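For instance, step 4 requires building the image $k(D)$ of the batch. Below is a minimal sketch, assuming $k$ is supplied through its three components and using toy 1-D states; all names and the toy "opposite dynamics" transformation are ours, not the authors'.

```python
def apply_symmetry(batch, k_sigma, k_alpha, k_sigma_prime):
    """Image k(D) of a batch under an alleged symmetry k = (k_sigma, k_alpha, k_sigma')."""
    return [(k_sigma(s, a, s_next), k_alpha(s, a, s_next), k_sigma_prime(s, a, s_next))
            for s, a, s_next in batch]

# Toy example: a 1-D chain where s' = s + a, with an alleged transformation that
# keeps the start state, inverts the action and mirrors the displacement.
batch = [(0, +1, 1), (1, -1, 0)]
k_of_D = apply_symmetry(
    batch,
    k_sigma=lambda s, a, s_next: s,
    k_alpha=lambda s, a, s_next: -a,
    k_sigma_prime=lambda s, a, s_next: s - (s_next - s),
)
print(k_of_D)   # [(0, -1, -1), (1, 1, 2)]
```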
3 ALGORITHMIC CONTRIBUTION
Our algorithmic contribution consists in improving the calculation of $\nu_k$ in step (3.a) of the previous list (Angelotti et al., 2022). Indeed, that approach does not yield valid results when applied to stochastic environments. For the method to work in stochastic environments we need to measure a distance between distributions. Such a distance was implicitly considered in the version of the approach handling continuous deterministic environments, since learning a distribution over transitions represented by their features is independent of the nature of the dynamics. However, when dealing with categorical states, the notion of distance between features cannot be exploited.
We propose to compute the percentage $\nu_k$ relying on a distance between categorical distributions. Since the transformation $k$ is a surjection on transition tuples, we do not know a priori which will be the correct mapping $s' \mapsto k_{\sigma'}(s, a, s') \in S$. In other words, we can compute $k_{\sigma'}$, the symmetric image of $s'$, only when we receive as input the whole tuple $(s, a, s')$, since an inverse mapping might not exist.
Therefore we resort to computing a pessimistic approximation of the Total Variation Distance (proportional to the $L_1$-norm). In particular, given $(s, a, s')$, we aim to calculate the Chebyshev distance (the $L_{\infty}$-norm) between $T(s, a, \cdot)$ and $T\big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s'), \cdot\big)$. Recall that given two vectors $x$ and $y$, both in $\mathbb{R}^{d}$, $\|x - y\|_{\infty} \le \|x - y\|_{1}$.
Let us then define the following four functions:
$$m(s, a, s') = \min_{\bar{s} \in S \setminus \{s'\} : \, \hat{T}(s,a,\bar{s}) \neq 0} \hat{T}(s, a, \bar{s}), \qquad (4)$$
$$M(s, a, s') = \max_{\bar{s} \in S \setminus \{s'\}} \hat{T}(s, a, \bar{s}), \qquad (5)$$
$$m_k(s, a, s') = \min_{\substack{\bar{s} \in S \text{ s.t. } \bar{s} \neq k_{\sigma'}(s,a,s') \\ \text{and } \hat{T}(k_{\sigma}(s,a,s'), k_{\alpha}(s,a,s'), \bar{s}) \neq 0}} \hat{T}\big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s'), \bar{s}\big), \qquad (6)$$
$$M_k(s, a, s') = \max_{\bar{s} \in S \text{ s.t. } \bar{s} \neq k_{\sigma'}(s,a,s')} \hat{T}\big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s'), \bar{s}\big), \qquad (7)$$
where $m$ ($M$) and $m_k$ ($M_k$) are the minimum (maximum) of the probability mass function (pmf) $\hat{T}$ when evaluated respectively on the initial state and action $(s, a)$ and on $\big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s')\big)$, over next states for which $\hat{T} \neq 0$. Those zero values are excluded because, in the context of a small dataset, many transitions are unexplored, and including values equal to $0$ would often lead to over-pessimistic estimates.
In order to approximate the Chebyshev distance between $\hat{T}(s, a, \cdot)$ and $\hat{T}\big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s'), \cdot\big)$, we
define a pessimistic approximation $d_k$ as follows:
$$d_k(s, a, s') = \max\Big\{ \underbrace{M(s, a, s') - m_k(s, a, s')}_{\text{(I)}}, \; \underbrace{M_k(s, a, s') - m(s, a, s')}_{\text{(II)}}, \; \underbrace{\big|\hat{T}(s, a, s') - (\hat{T} \circ k)(s, a, s')\big|}_{\text{(III)}} \Big\}. \qquad (8)$$
For the moment, consider $\hat{T}(s, a, \cdot)$ and $\hat{T}\big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s'), \cdot\big)$ just as two sets of numbers. Remove the value corresponding to $s'$ from the first set, the one corresponding to $k_{\sigma'}(s, a, s')$ from the second set, and any remaining zeros from both. Taking the maximum between (I) and (II) then amounts to selecting the maximum possible difference between any two values of these modified sets. Equation 8 simply tells us to select the worst possible case, since we do not know which permutation of states we should compare when computing the Chebyshev distance. $s'$ is removed from $\hat{T}(s, a, \cdot)$ and $k_{\sigma'}(s, a, s')$ is removed from $\hat{T}\big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s'), \cdot\big)$ since we know that $k$ maps $(s, a, s')$ to $\big(k_{\sigma}(s, a, s'), k_{\alpha}(s, a, s'), k_{\sigma'}(s, a, s')\big)$, and hence we can compare those values directly (III). Notice that
$$0 < d_k(s, a, s') \le 1 \quad \forall (s, a, s') \in S \times A \times S. \qquad (9)$$
In the following, we propose to improve the algorithms proposed in (Angelotti et al., 2022). In detail, we redefine the symmetry confidence value $\nu_k$. We propose to estimate $\nu_k$ as in Line 2 of Algorithm 1:
$$\nu_k(D) = 1 - \frac{1}{|D|} \sum_{(s, a, s') \in D} d_k(s, a, s'). \qquad (10)$$
From Equations 8 and 9, it follows that: (i) in deterministic environments $\nu_k$ (Eq. 10) coincides with the one prescribed in (Angelotti et al., 2022); and (ii) $1 > \nu_k \ge 0$, so $\nu_k$ can be interpreted as a percentage. This allows us to regard $\nu_k$ as an estimate of the probability of $k$ being a symmetry of the dynamics, and therefore we can relax the need for an expert-given threshold $\nu$ (cf. (Angelotti et al., 2022), Alg. 1). We then set $\nu = 0.5$ as an input in Algorithm 1 and eventually augment the batch if $\nu_k > 0.5$ (Lines 3-5).
Algorithm 1: Symmetry detection and data augmentation in a categorical MDP.
Input: batch of transitions $D$, alleged symmetry $k$
Output: possibly augmented batch $D \cup D_k$
1: $\hat{T} \leftarrow$ most likely categorical pmf from $D$ (Eq. 3)
2: $\nu_k = 1 - \frac{1}{|D|} \sum_{(s,a,s') \in D} d_k(s, a, s')$ (where $d_k$ is defined in Equation 8)
3: if $\nu_k > 0.5$ then
4:     $D_k = k(D)$ (alleged symmetric transitions)
5:     return $D \cup D_k$ (the augmented batch)
6: else
7:     return $D$ (the original batch)
8: end if

Remark (Extreme Case Scenario). Is Equation 8 too pessimistic? Consider that for a given state-action couple $(s, a)$ we have a transition distributed over 3 states $\bar{s} \in S = \{\text{One}, \text{Two}, \text{Three}\}$ with probabilities $T(s, a, \text{One}) = 0.01$, $T(s, a, \text{Two}) = 0.01$ and $T(s, a, \text{Three}) = 0.98$. Now, assume the estimate of the transition function is perfect. Does the distance in Equation 8 converge to $0$? Not always, but what matters for the detection of symmetries is the average of the distances over the whole batch (Eq. 10). Suppose that these probabilities were inferred from a batch containing the transition $(s, a, \text{One})$ once, $(s, a, \text{Two})$ once and $(s, a, \text{Three})$ ninety-eight times. Consider $(s, a, \text{Three})$: $M(s, a, \text{Three}) = M_k(s, a, \text{Three}) = m(s, a, \text{Three}) = m_k(s, a, \text{Three}) = 0.01$. Following Eq. 8, $d_k(s, a, \text{Three}) = 0$. However, $d_k(s, a, \text{One}) = d_k(s, a, \text{Two}) = 0.97$, which is an overly pessimistic estimate. Nevertheless, let us calculate $\nu_k$ (Eq. 10). For this state-action pair $(s, a)$, the average over the batch is $\big(d_k(s, a, \text{One}) + d_k(s, a, \text{Two}) + 98\, d_k(s, a, \text{Three})\big)/100 = 0.0194$. If the estimation is the same for other pairs $(s, a)$, then $\nu_k = 1 - 0.0194 = 0.9806$. This value, close to 1, suggests $k$ is a symmetry.
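A minimal Python sketch of Algorithm 1 follows; it is our own illustration (not the authors' released code), reusing the hypothetical `estimate_pmf` helper and a transformation `k` that returns the transformed tuple, as in the earlier examples.

```python
import numpy as np

def d_k(T_hat, states, k, s, a, s_next):
    """Pessimistic distance of Eq. 8 for a single transition (s, a, s')."""
    ks, ka, ks_next = k(s, a, s_next)                         # image of the transition under k
    p  = [T_hat(s, a, x)   for x in states if x != s_next]    # T_hat(s, a, .) without s'
    pk = [T_hat(ks, ka, x) for x in states if x != ks_next]   # image row without k_sigma'(s, a, s')
    M,  m  = max(p),  min([v for v in p  if v > 0], default=0.0)   # zeros excluded from minima
    Mk, mk = max(pk), min([v for v in pk if v > 0], default=0.0)
    direct = abs(T_hat(s, a, s_next) - T_hat(ks, ka, ks_next))     # term (III)
    return max(M - mk, Mk - m, direct)

def detect_and_augment(batch, T_hat, states, k, nu=0.5):
    """Algorithm 1: augment the batch with k(D) if the confidence nu_k exceeds nu."""
    nu_k = 1.0 - np.mean([d_k(T_hat, states, k, s, a, sn) for s, a, sn in batch])
    return batch + [k(s, a, sn) for s, a, sn in batch] if nu_k > nu else batch
```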
4 EXPERIMENTS
In order to show the improvements provided by our contribution, we tested the algorithms in a stochastic version of the toroidal Grid environment and in two continuous-state environments of the OpenAI Gym learning suite: CartPole and Acrobot. We have chosen the same scenarios as (Angelotti et al., 2022) in order to demonstrate that our approach, contrary to the one proposed in (Angelotti et al., 2022), generalizes well to the stochastic case.
4.1 Setup
We collect a batch of transitions $D$ using a uniform random policy. An expert alleges the presence of a symmetry $k$ and we proceed to its detection using
Algorithm 1 (categorical case) or Algorithm 2 (continuous case). In the continuous case, Density Estimation is performed by a Masked Autoregressive Flow architecture (Papamakarios et al., 2017) with 3 layers of bijectors.

Algorithm 2: Symmetry detection and data augmentation in a continuous MDP with detection threshold $\nu = 0.5$ (Angelotti et al., 2022).
Input: batch of transitions $D$, quantile order $q \in [0, 1)$, alleged symmetry $k$
Output: possibly augmented batch $D \cup D_k$
1: $L \leftarrow$ Density Estimate($D$) (e.g. with Normalizing Flows)
2: $\Lambda \leftarrow$ distribution of $L(D)$ ($L$ evaluated over $D$)
3: $\theta = q$-order quantile of $\Lambda$
4: $D_k = k(D)$ (alleged symmetric transitions)
5: $\nu_k = \frac{1}{|D_k|} \sum_{(s,a,s') \in D_k} \mathbf{1}_{\{L(s,a,s' \mid D) > \theta\}}$
6: if $\nu_k > 0.5$ then
7:     return $D \cup D_k$ (the augmented batch)
8: else
9:     return $D$ (the original batch)
10: end if
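To illustrate the structure of Algorithm 2 (density fit, $q$-quantile threshold $\theta$ on $L(D)$, fraction of $k(D)$ above $\theta$), the sketch below uses scikit-learn's KernelDensity as a simple stand-in for the Masked Autoregressive Flow actually used in the paper; the estimator choice, bandwidth and all names are our own assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def detect_and_augment_continuous(D, k, q=0.1, nu=0.5):
    """Algorithm 2 sketch. D is an (n, d) array of flattened (s, a, s') transitions;
    k maps an (n, d) array to its symmetric image of the same shape."""
    density = KernelDensity(bandwidth=0.2).fit(D)        # stand-in for a normalizing flow
    log_L_D = density.score_samples(D)                   # log-density of the original batch
    theta = np.quantile(log_L_D, q)                      # q-order quantile threshold
    D_k = k(D)                                           # alleged symmetric transitions
    nu_k = np.mean(density.score_samples(D_k) > theta)   # fraction of k(D) above the threshold
    return (np.vstack([D, D_k]), nu_k) if nu_k > nu else (D, nu_k)

# Toy usage: 2-D transitions symmetric under a sign flip.
rng = np.random.default_rng(0)
D = rng.normal(size=(500, 2))
augmented, nu_k = detect_and_augment_continuous(D, k=lambda X: -X)
```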
The experiments were performed using 2 Dodeca-
core Skylake Intel® Xeon® Gold 6126 @ 2.6 GHz
and 96 GB of RAM and 2 GPU NVIDIA® V100 @
192GB of RAM. The code to run the experiments is
available at https://github.com/giorgioangel/dsym.
Computation of $\nu_k$ and Batch Augmentation. We report the $\nu_k$ obtained with an ensemble of $M$ different iterations of the procedure: we generate $z \in \mathbb{N}$ sets of $M$ different batches $D$ of increasing size. Remember that since $\nu_k \in [0, 1)$ we can interpret it as the probability of the presence of a symmetry and select a detection threshold $\nu = 0.5$ or higher, while in (Angelotti et al., 2022) the threshold $\nu$ was expert-given. We calculate $\nu_k$ with both the method of (Angelotti et al., 2022) and the approach presented here.
Evaluation of the Performance (Categorical Case). Finally, let $\rho$ be the distribution of initial states $s_0 \in S$ and let the performance $U^{\pi}$ of a policy $\pi$ be $U^{\pi} = \mathbb{E}_{s \sim \rho}[V^{\pi}(s)]$. Our experimental contribution is the comparison between the performances obtained by acting in the real environment with $\hat{\pi}$ (the optimal policy solving the MDP defined with $\hat{T}$) and $\hat{\pi}_k$ (the optimal policy obtained with $\hat{T}_k$). In particular, we consider the quantity
$$\Delta U = U^{\hat{\pi}_k} - U^{\hat{\pi}}. \qquad (11)$$
$\Delta U > 0$ means that data augmentation leads to better policies. In categorical environments the policies are obtained with Policy Iteration and evaluated with Policy Evaluation.
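For the categorical case these quantities can be computed exactly; the sketch below (our own notation and tolerances, not the paper's code) runs iterative policy evaluation on a tabular model and then compares two policies through Equation 11.

```python
import numpy as np

def policy_evaluation(T, R, pi, gamma=0.95, tol=1e-8):
    """V^pi for a tabular MDP. T: (S, A, S) transition tensor, R: (S, A) rewards,
    pi: (S, A) stochastic policy. Returns the value of every state."""
    V = np.zeros(T.shape[0])
    while True:
        # Bellman expectation backup: V(s) = sum_a pi(a|s) [R(s,a) + gamma sum_s' T(s,a,s') V(s')]
        V_new = np.einsum("sa,sa->s", pi, R + gamma * np.einsum("sap,p->sa", T, V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def delta_U(T_true, R, rho, pi_hat, pi_hat_k, gamma=0.95):
    """Eq. 11: performance gap of the two learned policies, measured in the true MDP."""
    U = lambda pi: rho @ policy_evaluation(T_true, R, pi, gamma)
    return U(pi_hat_k) - U(pi_hat)
```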
Evaluation of the Performance (Continuous Case). In continuous environments, Offline Learning is not trivial. We use the implementations of two Model-Free Deep RL architectures, Deep Q-Network (DQN) (Mnih et al., 2015) and Conservative Q-Learning (CQL) (Kumar et al., 2020), from the d3rlpy learning suite (Seno and Imai, 2021) to obtain a policy starting from the batches. The first method is the one that originally established the validity of Deep RL and is used in online RL, while the second was specifically developed to tackle offline RL problems. Since the convergence of the training of Deep RL baselines greatly depends on hyperparameter tuning, which itself depends on both the environment and the batch (Paine et al., 2020), we apply DQN and CQL with the default parameters provided by d3rlpy, hence abiding more faithfully by an offline learning setting. This means that sometimes the learning might not converge to a good policy. We find this philosophy more honest than showing the results obtained with the best seed or the finest-tuned hyperparameters. Each architecture is trained for a number of steps equal to fifty times the number of transitions present in the batch.
4.2 Environments
Stochastic Grid (Categorical). In this environment, the agent can move along fixed directions over a torus by acting with any $a \in A = \{\uparrow, \downarrow, \leftarrow, \rightarrow\}$ (see Figure 1). The grid meshing the torus has size $l = 10$.
Figure 1: Representation of the Grid environment (Angelotti et al., 2022). The red dot is the position of a state $s$ on the torus. A possible displacement obtained by acting with an action $a$ is shown as a red arrow.
The agent can spawn anywhere on the torus with uniform probability and must reach a fixed goal. At every time step, the agent receives a reward $r = -1$ if it does not reach the goal and a reward $r = 1$ once the goal has been reached, terminating the episode. When performing an action, the agent has a 60% chance of moving in the intended direction, 20% in the opposite
one, and 10% along an orthogonal direction. We collect $z = 10$ sets of $M = 100$ batches with respectively $N = 1000 \times i_z$ steps in each batch ($i_z$ going from $1$ to $z$).
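To make the dynamics concrete, here is a minimal sketch of one step of the stochastic toroidal grid (our own code and naming, reading the description above as 60% intended move, 20% opposite move, and 10% for each of the two orthogonal moves).

```python
import numpy as np

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def grid_step(s, a, l=10, rng=np.random.default_rng(0)):
    """One transition of the stochastic toroidal grid."""
    dx, dy = MOVES[a]
    candidates = [(dx, dy), (-dx, -dy), (-dy, dx), (dy, -dx)]   # intended, opposite, two orthogonals
    move = candidates[rng.choice(4, p=[0.6, 0.2, 0.1, 0.1])]
    return ((s[0] + move[0]) % l, (s[1] + move[1]) % l)         # wrap around the torus

print(grid_step((0, 0), "up"))
```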
Table 1: Toroidal Grid: proposed transformations and labels. Each transformation $k$ is specified through its components $k_{\sigma}$, $k_{\alpha}$ and $k_{\sigma'}$. The labels are: TRSAI (time reversal symmetry with action inversion), SDAI (same dynamics with action inversion), ODAI (opposite dynamics and action inversion), ODWA (opposite dynamics but wrong action), TI (translation invariance), TIOD (translation invariance with opposite dynamics).
The proposed transformations for this environment are outlined in Table 1. We check the invariance of the dynamics with respect to the following six transformations (the valid symmetries are TRSAI, ODAI and TI): (1) Time reversal symmetry with action inversion (TRSAI); (2) Same dynamics with action inversion (SDAI); (3) Opposite dynamics and action inversion (ODAI); (4) Opposite dynamics but wrong action (ODWA); (5) Translation invariance (TI); (6) Translation invariance with opposite dynamics (TIOD). The $N$-dependent average results for symmetry detection using the method from (Angelotti et al., 2022) are reported in Figure 2, and results using our method are displayed in Figure 3a. Figure 3b presents the performance improvement $\Delta U$, with its standard deviation represented by a vertical error bar.
Figure 2: Stochastic toroidal Grid environment. Probability of symmetry $\nu_k$ calculated with the method proposed by (Angelotti et al., 2022). The threshold at $\nu = 0.5$ is displayed as a dashed line. Since all $\nu_k < 0.5$, no transformation is detected as a symmetry.

Stochastic CartPole (Continuous). A pole is precariously balanced on a cart and an agent can push the whole system left or right to prevent the pole from falling. The dynamics is similar to that of CartPole (Brockman et al., 2016) (see Figure 5); however, the force that the agent uses to push the cart is sampled from a normal distribution with mean $f$ (the force defined in the deterministic version) and standard deviation $\tilde{\sigma} = 2$. Recall that the state is represented by the features $(x, \theta, v, \omega)$ and $A = \{\leftarrow, \rightarrow\}$. For the evaluation of $\nu_k$
we set the quantile $q = 0.1$ and we collect $z = 10$ sets of $M = 100$ batches with respectively $N = 1000 \times i_z$ steps in each batch ($i_z$ going from $1$ to $10$). We evaluate $\Delta U$ by training the agent on single batches of $N = 5000 \times i_z$ transitions ($i_z$ going from $1$ to $6$), both augmented and not augmented with $k$. The alleged transformations are the following (the valid symmetric transformations are SAR and TI): (1) State and action reflection with respect to an axis in $x = 0$ (SAR); (2) Initial state reflection (ISR); (3) Action inversion (AI); (4) Single feature inversion (SFI); (5) Translation invariance (TI). Their effects on the transition $(s, a, s')$ are listed in Table 2. Average results and errors are displayed in Figure 4a. The results considering the evaluation of the performance gain ($\Delta U$) are shown in Table 4.
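As a concrete illustration of one such transformation, the sketch below applies SAR (state and action reflection about the $x = 0$ axis) to a single CartPole-like transition. It is our own toy code: the feature ordering and the 0/1 action encoding are assumptions for the example, not a specification taken from the paper.

```python
import numpy as np

def sar(s, a, s_next):
    """State-and-action reflection (SAR): mirror the state about x = 0 and swap the push direction.
    States are assumed to be (x, theta, v, omega) arrays; actions are assumed encoded 0 (left) / 1 (right)."""
    return -np.asarray(s), 1 - a, -np.asarray(s_next)

# Image of a single recorded transition under SAR:
s, a, s_next = np.array([0.1, 0.02, -0.3, 0.05]), 1, np.array([0.09, 0.025, -0.28, 0.06])
print(sar(s, a, s_next))
```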
Stochastic Acrobot (Continuous). The Acrobot is a planar two-link robotic arm working against gravity; the agent can decide whether to swing the elbow left or right, or not at all, to balance the arm straightened up (see Figure 6). It is the very same Acrobot of (Brockman et al., 2016), but at every time step a noise $\varepsilon$ is sampled from a uniform distribution on the interval $[-0.5, 0.5]$ and added to the torque. A state is represented by the features $(s_1, c_1, s_2, c_2, \omega_1, \omega_2)$, where $s_i$ and $c_i$ are respectively $\sin(\alpha_i)$ and $\cos(\alpha_i)$ in shorthand notation. The action set is $A = \{-1, 0, 1\}$. For the evaluation of $\nu_k$ we set $q = 0.1$. For the detection case, we collected $z = 5$ sets of $M = 100$ batches with $N = 1000 \times i_z$ steps within each one ($i_z$ going from $1$ to $z$). The evaluation of the performance was carried out on single batches, with and without data augmentation, with $N = 10000 \times i_z$ steps and $i_z$ going from $1$ to $4$. For the evaluation of $\Delta U$, $z = 5$ due to computational constraints.
We allege the following transformations $k$ (the only valid symmetry among them is AAVI): (1) Angles and angular velocities inversion (AAVI); (2) Cosines and angular velocities inversion (CAVI); (3) Action inversion (AI); (4) Starting state inversion (SSI).
(a) Probability of symmetry $\nu_k$ with our approach. The threshold at $\nu = 0.5$ is displayed as a dashed line. $\nu_k > 0.5$ means that the transformation is detected as a symmetry.
(b) Performance difference $\Delta U$ (Eq. 11). The threshold at $\Delta U = 0$ is displayed as a dashed line. $\Delta U > 0$ means that data augmentation leads to better policies.
Figure 3: Stochastic toroidal Grid environment. $\nu_k$ and $\Delta U$ for the transformations $k$, computed over sets of 100 different batches of size $N$. Points are mean values and bars standard deviations.
The images of the transformations are reported in Table 3. The $N$-dependent average results and standard deviations are reported in Figure 4b. The results considering the evaluation of the performance gain ($\Delta U$) are shown in Table 5.
(a) Stochastic CartPole. Probability of symmetry $\nu_k$. The threshold at $\nu = 0.5$ is displayed as a dashed line. $\nu_k > 0.5$ means that the transformation is detected as a symmetry.
(b) Stochastic Acrobot. Probability of symmetry $\nu_k$. The threshold at $\nu = 0.5$ is displayed as a dashed line. $\nu_k > 0.5$ means that the transformation is detected as a symmetry.
Figure 4: $\nu_k$ for the transformations $k$ computed over sets of different batches of size $N$ in Stochastic CartPole (left) and Stochastic Acrobot (right). Points are mean values and are slightly shifted horizontally for the sake of display. Standard deviation is displayed as a vertical error bar.

Figure 5: The cart on the right is a representation of a CartPole state $s_t$ with $x_t > 0$ and action $a_t$ (Angelotti et al., 2022). The dashed cart on the left is the image of $(s_t, a_t)$ under the transformation $h$, which inverts the state, $f(s) = -s$, and the action, $g(a) = -a$.

5 DISCUSSION

Stochastic Grid (Categorical). Detection phase ($\nu_k$). We see from Figure 2 that, using the state-of-the-art approach, no transformation is detected as a symmetry because $\nu_k < 0.5$ $\forall k$ in the proposed set of transformations. This result highlights the inadequacy of the state-of-the-art method to deal with stochastic environments. On the contrary, our novel algorithm perfectly manages to identify the real symmetries of the environment (see Figure 3a): $\nu_k > 0.5$ $\forall k \in \{\text{TRSAI}, \text{ODAI}, \text{TI}\}$. Moreover, there are no false positives: $\nu_k < 0.5$ $\forall k \in \{\text{SDAI}, \text{ODWA}, \text{TIOD}\}$.
We notice that while in a deterministic environment $\nu_k = 0$ for every $k$ which is not a symmetry, here the stochasticity makes the detection more complicated, since $\nu_k \approx 0.5$ for $N = 2000$.

Figure 6: Representation of a state of the Acrobot environment (Angelotti et al., 2022), with $\alpha_1 > 0$ and $\alpha_2 < 0$.

Table 2: Proposed transformations and labels for Stochastic CartPole.
SAR: $k_{\sigma}(s, a, s') = -s$; $k_{\alpha}: (\leftarrow, \rightarrow) \mapsto (\rightarrow, \leftarrow)$; $k_{\sigma'}(s, a, s') = -s'$.
ISR: $k_{\sigma}(s, a, s') = -s$; $k_{\alpha}(s, a, s') = a$; $k_{\sigma'}(s, a, s') = s'$.
AI: $k_{\sigma}(s, a, s') = s$; $k_{\alpha}: (\leftarrow, \rightarrow) \mapsto (\rightarrow, \leftarrow)$; $k_{\sigma'}(s, a, s') = s'$.
SFI: $k_{\sigma}(s = (x, \dots), a, s') = (-x, \dots)$; $k_{\alpha}(s, a, s') = a$; $k_{\sigma'}(s, a, s') = s'$.
TI: $k_{\sigma}(s = (x, \dots), a, s') = (x + 0.3, \dots)$; $k_{\alpha}(s, a, s') = a$; $k_{\sigma'}(s, a, s' = (x', \dots)) = (x' + 0.3, \dots)$.
Evaluation of performance gain ($\Delta U$). The difference in the performance of the deployed policies, $\Delta U$, perfectly fits the expected behavior. When $k$ is a symmetry, $\Delta U > 0$ and saturates to $0$ as $N$ increases. When $k$ is not a symmetric transformation of the dynamics, $\Delta U < 0$ and keeps decreasing with $N$ (see Figure 3b).

Table 3: Proposed transformations and labels for Stochastic Acrobot.
AAVI: $k_{\sigma}(s = (s_1, s_2, \omega_1, \omega_2, \dots), a, s') = (-s_1, -s_2, -\omega_1, -\omega_2, \dots)$; $k_{\alpha}: (1, 0, -1) \mapsto (-1, 0, 1)$; $k_{\sigma'}(s, a, s' = (s_1, s_2, \omega_1, \omega_2, \dots)) = (-s_1, -s_2, -\omega_1, -\omega_2, \dots)$.
CAVI: $k_{\sigma}(s = (c_1, c_2, \omega_1, \omega_2, \dots), a, s') = (-c_1, -c_2, -\omega_1, -\omega_2, \dots)$; $k_{\alpha}: (1, 0, -1) \mapsto (-1, 0, 1)$; $k_{\sigma'}(s, a, s' = (c_1, c_2, \omega_1, \omega_2, \dots)) = (-c_1, -c_2, -\omega_1, -\omega_2, \dots)$.
AI: $k_{\sigma}(s, a, s') = s$; $k_{\alpha}: (1, 0, -1) \mapsto (-1, 0, 1)$; $k_{\sigma'}(s, a, s') = s'$.
SSI: $k_{\sigma}(s, a, s') = -s$; $k_{\alpha}(s, a, s') = a$; $k_{\sigma'}(s, a, s') = s'$.
Stochastic CartPole (Continuous). Detection phase ($\nu_k$). In Stochastic CartPole the algorithm fails to detect the symmetry $k = \text{TI}$. This could be due to the fact that the translation invariance symmetry is in this case fixed to a specific value (see TI in Table 2, where the translation is set to $0.3$). If the translation is too small, the neural network fails to discern the transformation from the noise. The algorithm correctly classifies $k = \text{SAR}$ as a symmetry and the remaining transformations as non-symmetries (see Figure 4a).
Evaluation of performance gain ($\Delta U$). Results are displayed in Table 4. ORL is very unstable and sensitive to the choice of hyperparameters. On top of that, the training is carried out for a fixed number of epochs. We notice that, on average over different batch sizes, $\Delta U > 0$ for DQN with the SAR and SFI transformations. While SAR is a valid symmetry, SFI is not. A more conservative algorithm like CQL shows a clear gain only for SAR. The performance difference for TI, both for DQN and CQL, is so close to zero that we think augmenting the dataset with this symmetry might not be a substantial improvement over using just the information contained in the original batch.
Stochastic Acrobot (Continuous). Detection phase ($\nu_k$). In this environment the only real symmetry of the dynamics, AAVI, gets successfully detected by the algorithm with $q = 0.1$. Non-symmetries yield $\nu_k < 0.5$ (Figure 4b).
Evaluation of performance gain ($\Delta U$). Results are displayed in Table 5 and show that training in Stochastic Acrobot is harder than in Stochastic CartPole since, even with a large dataset, the algorithms sometimes do not manage to learn a good policy. In particular, while CQL manages to learn how to behave in the environment by exploiting the AAVI symmetry (average $\Delta U = 52.9$), DQN still struggles with every $k$, valid or not. Nevertheless, CQL apparently also benefits, to a smaller extent, from augmenting the dataset with wrong symmetries. We suppose this effect is due to the instability of ORL training.
Table 4: $\Delta U$ for every alleged symmetry in Stochastic CartPole with two baselines and different batch sizes $N$ (number of transitions in the original batch).
k     Baseline   5000    10000   15000   20000   25000   30000   Average ∆U
SAR   DQN        -7.3    25.4    41.8    7.2     9.0     3.4     13.3
SAR   CQL        37.4    -2.5    -4.1    20.1    17.9    -9.0    10.0
ISR   DQN        -1.3    -48.5   -29.9   -78.7   -107.8  -29.1   -49.2
ISR   CQL        6.4     1.6     -2.2    -22.3   -10.3   -25.9   -8.8
AI    DQN        26.9    -48.5   -43.7   -74.6   -41.3   -84.6   -44.3
AI    CQL        -13.1   -7.6    -29.8   -6.5    -22.3   -15.3   -15.8
SFI   DQN        -33.4   17.9    21.4    45.4    -6.9    -0.1    7.4
SFI   CQL        -5.5    -2.1    7.4     -3.9    -3.6    -18.5   -4.4
TI    DQN        36.9    -28.1   34.5    15.7    6.1     -9.1    -0.2
TI    CQL        7.6     -1.3    -2.1    11.8    -16.5   5.2     0.8

Table 5: $\Delta U$ for every alleged symmetry in Stochastic Acrobot with two baselines and different batch sizes $N$.
k      Baseline   10000    20000    30000    40000    Average ∆U
AAVI   DQN        24.7     -17.5    -63.4    -10.6    -16.7
AAVI   CQL        -2.8     10.5     -9.5     213.3    52.9
CAVI   DQN        8.9      -9.3     -24.6    -48.0    -12.2
CAVI   CQL        -8.8     0.5      4.4      1.1      -0.7
AI     DQN        -377.3   -399.3   -386.8   -388.5   -388.0
AI     CQL        -25.6    235.3    -88.2    -49.9    17.9
SSI    DQN        265.7    -408.2   -334.9   -396.3   -218.4
SSI    CQL        35.8     4.0      11.9     -22.8    7.2

6 CONCLUSIONS

Data efficiency in the offline learning of MDPs is highly coveted. Exploiting the intuition of an expert
about the nature of the model can help to learn dynamics that better represent reality.
In this work, we built a semi-automated tool that can aid an expert by providing a statistical, data-driven validation of her/his intuition about some properties of the environment. Correct deployment of the tool can improve the performance of the optimal policy obtained by solving the learned MDP. Indeed, our results suggest that the proposed algorithm can effectively detect a symmetry of the dynamics of an MDP with high accuracy and that exploiting this knowledge can not only reduce the distributional shift but also provide a performance gain in an envisaged optimal control of the system. However, when applied to ORL environments with DNNs, all the prescriptions (and issues) about hyperparameter fine-tuning well known to ORL practitioners persist.
Besides its benefits, the current work is still constrained by several limitations. We note that the quality of the approach in continuous MDPs is greatly affected by the architecture of the Normalizing Flow used for Density Estimation and, more generally, by the state-action space preprocessing. In detail, sometimes an environment is endowed with symmetries that an expert cannot straightforwardly perceive in the default representation of the state-action space, and a transformation would be required (imagine the very same CartPole, but with the linear speed and position of the cart expressed in polar coordinates).
In the future we plan: (i) to expand this approach by trying out more recent Normalizing Flow architectures like FFJORD (Grathwohl et al., 2019); (ii) to consider combinations of multiple symmetries; (iii) after the offline detection of a symmetry, to exploit the data augmentation to improve the learning phase of online agents.
ACKNOWLEDGEMENTS
This work was funded by the Artificial and Natural
Intelligence Toulouse Institute (ANITI) - Institut 3iA
(ANR-19-PI3A-0004).
REFERENCES
Angelotti, G., Drougard, N., and Chanel, C. P. C. (2020).
Offline learning for planning: A summary. In Proceed-
ings of the 1st Workshop on Bridging the Gap Between
AI Planning and Reinforcement Learning (PRL) at the
30th International Conference on Automated Planning
and Scheduling, pages 153–161.
Angelotti, G., Drougard, N., and Chanel, C. P. C. (2022). Expert-guided symmetry detection in Markov decision processes. In Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, pages 88–98. INSTICC, SciTePress.
Bellman, R. (1966). Dynamic Programming. Science,
153(3731):34–37.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J.,
Schulman, J., Tang, J., and Zaremba, W. (2016). Ope-
nAI Gym. arXiv preprint arXiv:1606.01540.
Dean, T. and Givan, R. (1997). Model Minimization in
Markov Decision Processes. In AAAI/IAAI, pages 106–
111.
Dinh, L., Krueger, D., and Bengio, Y. (2015). NICE: Non-
linear Independent Components Estimation. In Bengio,
Y. and LeCun, Y., editors, 3rd International Confer-
ence on Learning Representations, ICLR 2015, San
Diego, CA, USA, May 7-9, 2015, Workshop Track Pro-
ceedings.
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I.,
and Duvenaud, D. (2019). FFJORD: Free-Form Con-
tinuous Dynamics for Scalable Reversible Generative
Models. In 7th International Conference on Learning
Representations, ICLR 2019, New Orleans, LA, USA,
May 6-9, 2019. OpenReview.net.
Gross, D. J. (1996). The role of symmetry in fundamen-
tal physics. Proceedings of the National Academy of
Sciences, 93(25):14256–14259.
Kobyzev, I., Prince, S., and Brubaker, M. (2020). Normal-
izing Flows: An Introduction and Review of Current
Methods. IEEE Transactions on Pattern Analysis and
Machine Intelligence.
Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Con-
servative q-learning for offline reinforcement learning.
Advances in Neural Information Processing Systems,
33:1179–1191.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline
reinforcement learning: Tutorial, review, and perspec-
tives on open problems. ArXiv, abs/2005.01643.
Li, L., Walsh, T. J., and Littman, M. L. (2006). Towards a
Unified Theory of State Abstraction for MDPs. ISAIM,
4:5.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness,
J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidje-
land, A. K., Ostrovski, G., et al. (2015). Human-level
control through deep reinforcement learning. Nature,
518(7540):529–533.
Narayanamurthy, S. M. and Ravindran, B. (2008). On the
Hardness of Finding Symmetries in Markov Decision
Processes. In Proceedings of the 25th international
conference on Machine learning, pages 688–695.
Paine, T. L., Paduraru, C., Michi, A., Gulcehre, C., Zolna, K.,
Novikov, A., Wang, Z., and de Freitas, N. (2020). Hy-
perparameter selection for offline reinforcement learn-
ing. arXiv preprint arXiv:2007.09055.
Papamakarios, G., Pavlakou, T., and Murray, I. (2017).
Masked autoregressive flow for density estimation.
In Proceedings of the 31st International Conference
on Neural Information Processing Systems, NIPS’17,
page 2335–2344, Red Hook, NY, USA. Curran Asso-
ciates Inc.
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B.,
Cubuk, E. D., and Le, Q. V. (2019). Specaugment: A
simple data augmentation method for automatic speech
recognition. In Proc. Interspeech 2019, pages 2613–
2617.
Ravindran, B. and Barto, A. G. (2001). Symmetries and
Model Minimization in Markov Decision Processes.
Technical report, USA.
Ravindran, B. and Barto, A. G. (2004). Approximate Homo-
morphisms: A Framework for Non-exact Minimization
in Markov Decision Processes.
Seno, T. and Imai, M. (2021). d3rlpy: An offline deep
reinforcement library. In NeurIPS 2021 Offline Rein-
forcement Learning Workshop.
Shorten, C. and Khoshgoftaar, T. M. (2019). A survey on
Image Data Augmentation for Deep Learning. Journal
of Big Data, 6(1):1–48.
van der Pol, E., Kipf, T., Oliehoek, F. A., and Welling,
M. (2020a). Plannable Approximations to MDP Ho-
momorphisms: Equivariance under Actions. In Pro-
ceedings of the 19th International Conference on Au-
tonomous Agents and MultiAgent Systems, AAMAS
’20, page 1431–1439, Richland, SC. International
Foundation for Autonomous Agents and Multiagent
Systems.
van der Pol, E., Worrall, D., van Hoof, H., Oliehoek, F.,
and Welling, M. (2020b). MDP Homomorphic Net-
works: Group Symmetries in Reinforcement Learning.
In Larochelle, H., Ranzato, M., Hadsell, R., Balcan,
M. F., and Lin, H., editors, Advances in Neural Infor-
mation Processing Systems, volume 33, pages 4199–
4210. Curran Associates, Inc.
van Dyk, D. A. and Meng, X.-L. (2001). The art of data aug-
mentation. Journal of Computational and Graphical
Statistics, 10(1):1–50.
Yarats, D., Brandfonbrener, D., Liu, H., Laskin, M., Abbeel,
P., Lazaric, A., and Pinto, L. (2022). Don’t change the
algorithm, change the data: Exploratory data for offline
reinforcement learning. In ICLR 2022 Workshop on
Generalizable Policy Learning in Physical World.