Learning Independently from Causality in Multi-Agent Environments
Rafael Pina (https://orcid.org/0000-0003-1304-3539), Varuna De Silva (https://orcid.org/0000-0001-7535-141X) and Corentin Artaud
Institute for Digital Technologies, Loughborough University London, 3 Lesney Avenue, London, U.K.
Keywords:
Multi-Agent Reinforcement Learning, Causality, Deep Learning.
Abstract:
Multi-Agent Reinforcement Learning (MARL) is an area of growing interest in the field of machine learning. Despite notable advances, there are still problems that require investigation. The lazy agent pathology is a well-known problem in MARL that denotes the event when some of the agents in a MARL team do not contribute to the common goal, letting their teammates do all the work. In this work, we investigate this problem from a causality-based perspective. We intend to build a bridge between the fields of MARL and causality and argue about the usefulness of this link. We study a fully decentralised MARL setup where agents need to learn cooperation strategies and show that there is a causal relation between individual observations and the team reward. The experiments carried out show how this relation can be used to improve independent agents in MARL, resulting not only in better performance as a team but also in the emergence of more intelligent behaviours in individual agents.
1 INTRODUCTION
The use of causality in the field of Artificial Intelligence (AI) has been gaining the attention of the research community. Recent discussions argue that causality can play an important role in improving many traditional machine learning approaches (Peters et al., 2017). More specifically, recent works argue that causality can be used to gain a deeper understanding of the underlying properties of systems within the field of AI. While it can be relatively straightforward to learn the underlying distributions of a given system, understanding the cause of the events in an environment can be key to a richer representation of the dynamics of the system (Peters et al., 2017). For instance, humans apply causal reasoning in their everyday lives, whereas AI entities are currently incapable of such reasoning. A popular example in the healthcare field is a model trained to make predictions that mistakes correlation for causation. If the model predicts patient health needs based on many factors, it is likely to do so based on some correlation it has found. However, in certain cases the correlations found might not explain the prediction, since the correlated factor is not the actual cause of the predicted outcome (Sgaier et al., 2020).
Motivated by these indications of how successfully causality can be linked to machine learning, its applications have been studied in different fields. In neurology, causality has been used to find causal relations among different regions of the brain (Glymour et al., 2019). This can be important to understand the reason behind certain events triggered in different parts of our brains. Beyond its relevance to the healthcare field, in agriculture it can be critical to understand what is causing a harvest to be less fruitful in one year than in the previous one, and not only to see what is correlated with this event (Sgaier et al., 2020).
Applications involving more advanced machine learning methods have also been rising. For instance, in (Kipf et al., 2018) the authors show how causal relations can be found in human bodies to understand the relations among joints during movement. When it comes to time series analysis, the applications are numerous. Starting from the foundations of Granger Causality (Granger, 1969), causal discovery in time series has evolved quickly with notable progress. Causal discovery has proved to be helpful in time series prediction in areas such as weather forecasting or finance (Hlaváčková-Schindler et al., 2007). More advanced techniques have also shown how the basic concepts of traditional statistical causality can be integrated with deep learning methods. A popular approach is to use an encoder-decoder architecture to model the causal relations of
a certain system (Zhu et al., 2020; Löwe et al., 2022; Huang et al., 2020). Alternatives inspired by Granger Causality have also been presented. For instance, the application of Transfer Entropy (Schreiber, 2000) is a popular alternative for causal discovery in time series. In fact, this metric is so closely related to traditional Granger Causality that the two have been proved to be equivalent for Gaussian variables (Barnett et al., 2009).
Despite the advances made with causal relations in machine learning, finding these relations is still challenging (Sgaier et al., 2020). Most machine learning models struggle to recognise the patterns necessary to tell whether two events are merely correlated or whether one is the cause of the other. Prompted by the inspiring discoveries in the field, we intend to show how causal estimations can be used within the context of AI with the aim of solving complex cooperative tasks. In particular, we aim to demonstrate how causal estimations can be used in the field of Multi-Agent Reinforcement Learning (MARL) to train AI agents to work as a team and solve cooperative tasks. MARL is a growing topic in the field of machine learning with many challenges yet to be addressed (Canese et al., 2021). In this sense, the goal of this work is to show how causal estimations can be used to improve independent learners in MARL by tackling a well-known problem in the field: the lazy agent problem (Sunehag et al., 2018). By building the bridge between MARL and causal estimations, we hope to open doors for future work in the field, demonstrating how beneficial the link between causal estimations and MARL can be to develop more intelligent and capable entities.
2 BACKGROUND
2.1 Decentralised Partially Observable
Markov Decision Processes
(Dec-POMDPs)
In this work we consider Dec-POMDPs (Oliehoek and Amato, 2016), defined by the tuple $\langle S, A, O, P, Z, r, \gamma, N \rangle$, where $S$ and $A$ represent the state and joint action spaces, respectively. We consider a setting where each agent $i \in N \equiv \{1, \dots, N\}$ receives an observation $o_i \in Z$ according to the observation function $O(s, i) : S \times N \to Z$. Each agent keeps an action-observation history $\tau_i \in T \equiv (Z \times A)^*$, on which it conditions a stochastic policy $\pi_i(a_i \mid \tau_i) \in [0, 1]$. At each time step, each agent $i$ takes an action $a_i \in A$, forming a joint action $a = \{a_1, \dots, a_N\}$. Taking the joint action at a state $s$ makes the environment transition to a next state $s'$ according to a transition probability function $P(s' \mid s, a) : S \times A \times S \to [0, 1]$. All the agents in the team share a reward $r(s, a) : S \times A \to \mathbb{R}$. Let $\gamma \in [0, 1)$ be a discount factor.
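To make the notation concrete, the following is a minimal sketch of the Dec-POMDP tuple as a plain Python container. The paper does not prescribe any implementation, so the field names below are purely illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


# Sketch of the Dec-POMDP tuple <S, A, O, P, Z, r, gamma, N> as a container.
# Purely a notational aid; all field names are illustrative assumptions.
@dataclass
class DecPOMDP:
    states: Sequence          # S: state space
    joint_actions: Sequence   # A: joint action space
    observation_fn: Callable  # O(s, i): S x N -> Z, per-agent observation
    transition_fn: Callable   # P(s' | s, a): state transition probabilities
    reward_fn: Callable       # r(s, a): team reward shared by all agents
    gamma: float              # discount factor in [0, 1)
    num_agents: int           # N: number of agents
```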
2.2 Independent Deep Q-Learning
Independent Q-learning (IQL) was introduced by (Tan, 1993) as one of the foundations of MARL. This approach follows the principles of single-agent reinforcement learning and applies the basics of Q-learning (Watkins and Dayan, 1992) to independent learners that update their Q-functions individually with a learning rate $\alpha$ following the rule

$Q(s, a) = (1 - \alpha) Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]$  (1)

Later on, (Mnih et al., 2015) introduced Deep Q-Networks (DQNs), a method that combines deep neural networks with Q-learning, allowing agents to approximate their Q-functions with deep neural networks instead of the lookup tables used in tabular Q-learning. This approach introduces a replay buffer that stores past experiences of the agents, and a target network that aims to stabilise the learning process. The DQN is updated in order to minimise the loss (Mnih et al., 2015)

$L(\theta) = \mathbb{E}_{b \sim B} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]$  (2)

where $\theta$ and $\theta^-$ are the parameters of the Q-network and a target Q-network, respectively, for a certain experience sample $b$ sampled from a replay buffer $B$. (Tampuu et al., 2015) put together the concepts of IQL and DQN to create independent learners that use DQNs. In this paper we refer to this method as Independent Deep Q-learning (IDQL).
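To make the independent learning setup concrete, the snippet below is a minimal NumPy sketch of the per-agent loss in Eq. 2. The array names, shapes and the episode-termination mask `dones` are our own illustrative assumptions rather than details taken from the cited works.

```python
import numpy as np


def idql_loss(q_values, target_q_values, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error of Eq. 2 for one independent learner (sketch).

    Assumes q_values and target_q_values have shape (batch, num_actions) and
    come from the online network Q(.; theta) and target network Q(.; theta^-).
    """
    batch = np.arange(len(actions))
    q_taken = q_values[batch, actions]        # Q(s, a; theta)
    max_next_q = target_q_values.max(axis=1)  # max_a' Q(s', a'; theta^-)
    # TD target r + gamma * max_a' Q(s', a'; theta^-), cut off at episode end
    td_target = rewards + gamma * (1.0 - dones) * max_next_q
    return np.mean((td_target - q_taken) ** 2)
```

In IDQL each agent simply minimises this loss with its own network and its own replay buffer, with no information exchanged between agents.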
2.3 Causality in Time Series
To build our causality-based approach, we use as motivation the concept of Transfer Entropy (Schreiber, 2000). Transfer entropy is a metric that has been widely used to infer causal relationships in time series across different fields (Smirnov, 2013). Correlation and causality are two concepts that are often confused: given two time series, while they might be correlated with each other, one is not necessarily impacted by the other (Rohrer, 2018). Prompted by these concepts, we use the transfer entropy to create the link between causality and MARL scenarios.
Figure 1: Architecture of the framework described to improve independent learners in MARL using causality detection. Each
agent is controlled by an independent network that is updated independently based on the output of the causality estimator, as
per Eq. 6.
Definition 1 states the formal definition of Transfer Entropy.
Definition 1. Given a bivariate time series with variables $X$ and $Y$, the transfer entropy of $X$ on $Y$ can be expressed as (Aziz, 2017)

$T_{X \to Y} = H(Y_t \mid Y_t^-) - H(Y_t \mid Y_t^-, X_t^-)$  (3)

where $H$ represents Shannon's entropy at a given time stamp $t$, and $X_t^-$ and $Y_t^-$ represent the previous values of the time series, up to $t$.
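For illustration, the sketch below computes a plug-in estimate of Eq. 3 for discrete-valued series with histories of length one. This simplification and the histogram-based entropy estimator are our own assumptions; the paper does not commit to a specific estimator.

```python
import numpy as np
from collections import Counter


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def transfer_entropy(x, y):
    """Plug-in estimate of T_{X->Y} = H(Y_t | Y_{t-1}) - H(Y_t | Y_{t-1}, X_{t-1})."""
    y_t, y_prev, x_prev = y[1:], y[:-1], x[:-1]
    # Conditional entropies via H(A | B) = H(A, B) - H(B)
    h_y_given_y = entropy(list(zip(y_t, y_prev))) - entropy(list(y_prev))
    h_y_given_yx = (entropy(list(zip(y_t, y_prev, x_prev)))
                    - entropy(list(zip(y_prev, x_prev))))
    return h_y_given_y - h_y_given_yx


# Example: Y copies X with a one-step delay, so X should "cause" Y.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=5000)
y = np.roll(x, 1)
print(transfer_entropy(x, y) > transfer_entropy(y, x))  # True
```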
3 CAUSALITY IN MARL
In this section we present the method proposed in this paper, named Independent Causal Learning (ICL). This method aims to punish lazy agents and, as a result, improve the performance of MARL agents in cooperative tasks. The key to the proposed method is an agent-wise causality factor $c_i$ that each agent $i \in \{1, \dots, N\}$ uses to decide whether or not to attribute the team reward to itself. In this way, the proposed method intends to improve credit assignment to the agents in cooperative multi-agent tasks. Each individual agent should be able to understand whether or not it is helping the team to achieve the intended team goal of the task. At the same time, this method encourages the agents to learn only by themselves. This can be beneficial, since in many real scenarios it is often unfeasible to provide the agents with full information about the environment (Canese et al., 2021). Hence, to learn optimal policies in such scenarios, agents may be forced to rely only on their individual observations to understand whether they are performing well or not. To motivate the proposed method, we explore the concept of temporal causality, commonly used in the context of time series, as stated in Definition 1. Building on this definition, we can then create the bridge between causality and a MARL problem.
Definition 2. Let $E$ represent a certain episode sampled from a replay buffer of experiences. Let $E$ be denoted as a multivariate time series of observations $O$ and rewards $R$. From Definition 1, we can then define the transfer entropy for the time series $E$ as

$T_{O \to R} = H(R_t \mid R_t^-) - H(R_t \mid R_t^-, O_t^-)$  (4)

where $T$ defines the amount of information reduced in the future values of $R$ by knowing the previous values of $R$ and $O$, and $H$ is Shannon's entropy.
Definition 2 states the motivation that supports the existence of a causal relationship between observations and rewards when we see a reinforcement learning episode as a time series. Assuming that these causal relations are indeed present in MARL and can be estimated accurately, we can define a function that calculates the causal relation between the team reward and the individual observation of each agent $i$,

$c_i(o_i, r) = \begin{cases} 1, & o_i \text{ causes } r \\ 0, & \neg (o_i \text{ causes } r) \end{cases}, \quad \forall i \in \{1, \dots, N\}$  (5)
When this causality factor is calculated accurately, each agent can adjust the team reward received from the environment and update its individual network following the rule

$Q_i(\tau_i, a_i) = (1 - \alpha) Q_i(\tau_i, a_i) + \alpha \left[ c_i(\tau_i, r) \times r + \gamma \max_{a_i'} Q_i(\tau_i', a_i') \right]$  (6)
In the considered training setting, the loss calculated by each agent is the same as in IDQL (Eq. 2), but with respect to its individual network $Q_i$ and following the update in Eq. 6, resulting in the loss

$L(\theta_i) = \mathbb{E}_{b \sim B} \left[ \left( c_i(\tau_i, r) \times r + \gamma \max_{a_i'} Q_i(\tau_i', a_i'; \theta_i^-) - Q_i(\tau_i, a_i; \theta_i) \right)^2 \right]$  (7)
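Putting Eqs. 5-7 together, the sketch below shows how the causality factor would gate the team reward inside the per-agent loss. As before, array names, shapes and the `dones` mask are illustrative assumptions; `causality_factors` holds the per-sample values of $c_i$.

```python
import numpy as np


def icl_loss(q_values, target_q_values, actions, team_rewards,
             causality_factors, dones, gamma=0.99):
    """Sketch of the per-agent ICL loss in Eq. 7 (illustrative, not the authors' code)."""
    batch = np.arange(len(actions))
    q_taken = q_values[batch, actions]        # Q_i(tau_i, a_i; theta_i)
    max_next_q = target_q_values.max(axis=1)  # max_a' Q_i(tau_i', a'; theta_i^-)
    gated_reward = causality_factors * team_rewards  # c_i(tau_i, r) * r
    td_target = gated_reward + gamma * (1.0 - dones) * max_next_q
    return np.mean((td_target - q_taken) ** 2)
```

When $c_i = 1$ for every sample, this reduces exactly to the IDQL loss in Eq. 2, so ICL only changes the update for rewards an agent did not cause.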
3.1 Causality Effect in MARL
Environments
In this subsection we describe the MARL environment used for the experiments in this paper and show how $c_i$ can be intuitively estimated when there is some prior knowledge about the task. Fig. 1 illustrates the architecture of the proposed method and how the calculation of $c_i$ is incorporated within MARL. Note that the team rewards given result from the sum of $N$ individual rewards.
Warehouse. Warehouse is a grid-world environment with dimensions 10x15 (Fig. 2). The goal of this task is to carry boxes from a box delivery queue (red) to a pre-defined dropping station (yellow), simulating a workstation in a real factory where robots have delivery tasks to complete. However, in this environment each box can only be carried by two agents at the same time. In this sense, a team of 4 agents needs to cooperate in order to maximise the number of boxes delivered to the dropping station over the duration of the episodes (300 steps). Every time a box is successfully dropped at the station, each agent receives a reward of +5. Additionally, there is an intermediate reward of +2 for when a box is lifted from the delivery queue. However, in this environment there is also a wrong dropping station that the agents should avoid (left yellow). If the agents drop a box at this fake station, then each one will receive a reward of -5. So, beyond learning to pick up the boxes together, the agents should also learn to always choose the right dropping station (right yellow). The observation space for this environment consists of a vector with the position of the agent, a boolean flag $F_i$ that indicates whether agent $i$ is carrying a box or not, and an observation mask with dimensions 5x5 around the agent.
Figure 2: Cooperative Warehouse environment used in the
experiments.
In this sense, the causality factors are calculated based on the following condition: there is a positive team reward (a box was either picked from the queue or delivered), and the agent was carrying a box at the moment right before the reward (box delivered) or the agent is carrying a box at the moment of the reward (box picked),

$C_1 = \begin{cases} \text{True}, & (F_i^{t-1} \lor F_i^{t}) \wedge r^t > 0 \\ \text{False}, & \text{otherwise} \end{cases}$
where $F_i$ is the boolean flag in the observation $o_i$ of agent $i$ at timestep $t$. In this scenario, learning is improved by each agent's perception of whether or not it is carrying a box at the moment of receiving a team reward. This intuition coaxes the agents to be more cooperative with the team, eliminating lazy agents and maximising the number of boxes delivered over the duration of the episodes.
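Concretely, the condition $C_1$ can be evaluated from each agent's own observation, as in the short sketch below. The argument names are our own; the paper only specifies the boolean carrying flag $F_i$ and the team reward.

```python
def warehouse_causality_factor(was_carrying, is_carrying, team_reward):
    """Return c_i in {0, 1} for agent i at the current timestep (sketch of C_1).

    A positive team reward is attributed to agent i only if it is carrying a
    box now (box picked) or was carrying one at the previous step (box delivered).
    """
    if team_reward > 0 and (was_carrying or is_carrying):
        return 1
    return 0
```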
4 EXPERIMENTS AND
DISCUSSIONS
In this section we evaluate the performance of the proposed approach, using IDQL as the benchmark. The motivation behind the use of IDQL as the baseline is to demonstrate how the proposed method improves the quality of the behaviours learned by fully independent learners that use only local observations to learn the tasks. Importantly, we investigate the performance of independent learners and how we can improve independent learning. Hence, we do not use the popular parameter-sharing convention in MARL (Gupta et al., 2017) and do not follow the CTDE paradigm (Lowe et al., 2017). In other words, the learning process is fully decentralised and independent.
Figure 3: Team rewards obtained in the Warehouse environment (6 independent runs). The shaded area represents the 95% confidence interval.
4.1 Team Performance and Rewards
Fig. 3 illustrates the results for the Warehouse task, where ICL achieves a much higher performance. Although IDQL can still achieve an intermediate reward, the discrepancy in performance between the two methods is clear. The sub-optimal reward of IDQL is explained by insufficient exploration, caused by the lack of cooperation of some agents. As Fig. 3 shows, ICL encouraged the agents to learn different policies and hence to cooperate more, leading to an optimal reward that could not be achieved by the fully independent learners of IDQL. This means that using a causality estimator that lets independent agents understand whether or not they have caused the team rewards benefits the team as a whole, pushing the agents to develop more cooperative behaviours and not become lazy.
4.2 Individual Learned Behaviours
We have seen that the agents trained by ICL with a causality estimator can achieve higher rewards as a team in the complex environment used in the experiments. We now investigate the quality of the individual behaviours learned by ICL compared to the purely independent learners of IDQL. The aim of this analysis is to demonstrate how ICL enables independent and individual agents to learn more intelligent behaviours. For the results presented in this subsection, we selected trained policies of IDQL and ICL for the Warehouse environment.
As we discuss in detail below, Fig. 4a and Fig. 4b show how the proposed method eliminates lazy agents in the Warehouse task, by looking at how much the agents move during the task and how many boxes they deliver (i.e., contribute to the goal of the team).
Figure 4: Behaviour metrics for ICL vs IDQL in the Warehouse environment. From top to bottom: (a) distance travelled per agent (one successful episode); (b) boxes delivered per agent (same episode as (a)).
The discrepancy in rewards between IDQL and ICL in Fig. 3 is explained by the existence of agents that do not cooperate and let the others do the task all by themselves. For instance, this can be confirmed in Fig. 4a, which shows that some of the agents trained by IDQL do not move nearly as much as the agents trained by ICL. This means that some of them have become lazy and will not cooperate with the team, waiting for the other agents to solve the task for them. They keep receiving the team reward, but without deserving credit for it. Additionally, Fig. 4b shows that, while for ICL all the agents carry roughly the same number of boxes, for IDQL some of them barely cooperate. As an example, agent 4 barely participates in the task in IDQL, carrying almost no boxes during the episode. As Fig. 3 demonstrates, such non-cooperative individual behaviours are very harmful for the team as
a whole, leading to worse overall outcomes. In addition, agents with such undesirable behaviours would not be suited to being placed in a new team with other trained agents. For instance, if we train a couple of agents to solve a certain task and want to transfer them to a different team to help other agents, they would not be able to do so.
5 CONCLUSION AND FUTURE
WORK
This paper introduced Independent Causal Learning (ICL), a method for learning fully independent behaviours in cooperative MARL tasks that bridges the concepts of causality and MARL. When there is some prior knowledge of the environment, a causal relationship between the individual observations and the team reward becomes perceptible. This allows the proposed method to improve learning in a fully decentralised and fully independent manner. The results showed that providing an environment-dependent causality estimation allows agents to perform efficiently in a fully independent manner and achieve a certain goal as a team. In addition, we showed how using causality in MARL can improve individual behaviours, eliminating the lazy agents that arise with standard independent learners and enabling more intelligent behaviours, leading to better overall performance in the tasks. These preliminary results are inspiring, as they highlight the potential of the link between causality and MARL.
In the future, we aim to study how causality estimations can also be used to improve centralised learning. In addition, although the recognition of patterns that identify causal relations has been shown to be challenging for machine learning methods, we intend to extend this method to more cases and show that causal discovery can be generalised to MARL problems. Furthermore, we intend to study how ICL can be applied in real scenarios that require online learning and prohibit excessive trial-and-error episodes due to potentially catastrophic events caused by the learning agents. We believe that this can be a breakthrough for online learning in real scenarios where reliable communication or a centralised oracle is not available and agents must learn to coordinate independently. Lastly, we expect that this link can bring more relevant research questions to the field of MARL.
ACKNOWLEDGEMENTS
This work was funded by the Engineering and Physical Sciences Research Council (EPSRC) in the United Kingdom, under grant number EP/T000783/1.
REFERENCES
Aziz, N. A. (2017). Transfer entropy as a tool for inferring
causality from observational studies in epidemiology.
preprint, Epidemiology.
Barnett, L., Barrett, A. B., and Seth, A. K. (2009). Granger
causality and transfer entropy are equivalent for gaus-
sian variables. Phys. Rev. Lett., 103:238701.
Canese, L., Cardarilli, G. C., Di Nunzio, L., Fazzolari, R., Giardino, D., Re, M., and Spanò, S. (2021). Multi-Agent Reinforcement Learning: A Review of Challenges and Applications. Applied Sciences, 11(11):4948.
Glymour, C., Zhang, K., and Spirtes, P. (2019). Review of
causal discovery methods based on graphical models.
Frontiers in Genetics, 10.
Granger, C. W. J. (1969). Investigating causal relations
by econometric models and cross-spectral methods.
Econometrica, 37(3):424–438.
Gupta, J. K., Egorov, M., and Kochenderfer, M. (2017). Co-
operative Multi-agent Control Using Deep Reinforce-
ment Learning. In Sukthankar, G. and Rodriguez-
Aguilar, J. A., editors, Autonomous Agents and Multi-
agent Systems, volume 10642, pages 66–83. Springer
International Publishing, Cham. Series Title: Lecture
Notes in Computer Science.
Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., and Bhattacharya, J. (2007). Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441(1):1–46.
Huang, X., Zhu, F., Holloway, L., and Haidar, A. (2020).
Causal discovery from incomplete data using an en-
coder and reinforcement learning.
Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel,
R. (2018). Neural Relational Inference for Interact-
ing Systems. arXiv:1802.04687 [cs, stat]. arXiv:
1802.04687.
Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P.,
and Mordatch, I. (2017). Multi-Agent Actor-Critic
for Mixed Cooperative-Competitive Environments. In
Proceedings of the 31st International Conference on
Neural Information Processing Systems, pages 6382–
6393.
Löwe, S., Madras, D., Zemel, R., and Welling, M. (2022). Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data. arXiv:2006.10833 [cs, stat].
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller,
M., Fidjeland, A. K., Ostrovski, G., Petersen, S.,
Beattie, C., Antonoglou, I., King, H., Kumaran, D.,
Wierstra, D., Legg, S., Hassabis, D., and Sadik, A.
(2015). Human-level control through deep reinforce-
ment learning. Nature, 518(7540):529–533.
Oliehoek, F. A. and Amato, C. (2016). A Concise Introduction to Decentralized POMDPs. Springer Publishing Company, Incorporated, 1st edition.
Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference. MIT Press.
Rohrer, J. M. (2018). Thinking Clearly About Correlations
and Causation: Graphical Causal Models for Obser-
vational Data. Advances in Methods and Practices in
Psychological Science, page 16.
Schreiber, T. (2000). Measuring Information Transfer.
Physical Review Letters, 85(2):461–464.
Sgaier, S. K., Huang, V., and Charles, G. (2020). The case for causal AI. Stanford Social Innovation Review, 18:50–55.
Smirnov, D. A. (2013). Spurious causalities with transfer
entropy. Physical Review E, 87(4):042917.
Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., and Graepel, T. (2018). Value-Decomposition Networks For Cooperative Multi-Agent Learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2085–2087, Stockholm, Sweden.
Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Kor-
jus, K., Aru, J., Aru, J., and Vicente, R. (2015). Mul-
tiagent Cooperation and Competition with Deep Re-
inforcement Learning. arXiv:1511.08779 [cs, q-bio].
arXiv: 1511.08779.
Tan, M. (1993). Multi-Agent Reinforcement Learning: In-
dependent vs. Cooperative Agents. In Proceedings
of the Tenth International Conference on Machine
Learning, pages 330–337.
Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.
Zhu, S., Ng, I., and Chen, Z. (2020). Causal discovery with
reinforcement learning. In International Conference
on Learning Representations.