Learning to Participate Through Trading of Reward Shares
Michael Kölle, Tim Matheis, Philipp Altmann and Kyrill Schmid
Institute of Informatics, LMU Munich, Oettingenstraße 67, Munich, Germany
Keywords:
Multi-Agent Systems, Reinforcement Learning, Social Dilemma.
Abstract:
Enabling autonomous agents to act cooperatively is an important step to integrate artificial intelligence in our
daily lives. While some methods seek to stimulate cooperation by letting agents give rewards to others, in
this paper we propose a method inspired by the stock market, where agents have the opportunity to participate
in other agents’ returns by acquiring reward shares. Intuitively, an agent may learn to act according to the
common interest when it is directly affected by the other agents’ rewards. Empirical results in the tested
general-sum Markov games show that this mechanism promotes cooperative policies among independently
trained agents in social dilemma situations. Moreover, as demonstrated in a temporally and spatially extended
domain, participation can lead to the development of roles and the division of subtasks between the agents.
1 INTRODUCTION
The field of cooperative AI seeks to explore methods
which establish cooperative behavior among indepen-
dent and autonomous agents (Dafoe et al., 2020). The
ability to act cooperatively is a mandatory step for
integrating artificial intelligence into our daily
lives, especially in applications where different deci-
sion makers interact, such as autonomous driving. Vari-
ous breakthroughs in the field of single agent domains
(Mnih et al., 2015; Silver et al., 2017) have also led to
the successful application of reinforcement learning
in the field of multi-agent systems (Leibo et al., 2017;
Phan et al., 2018; Vinyals et al., 2019). However,
while purely cooperative scenarios, where all agents
receive the same reward and thus pursue the same
goal, can be addressed with centralized training tech-
niques, this is not the case if agents have individual re-
wards and goals. Moreover, if agents share resources,
it is likely that undesired behaviors are learned, espe-
cially when resources become scarce (Leibo et al.,
2017). Independent optimization may lead to sub-
optimal outcomes such as for the Prisoner’s dilemma
or public good games. In more complex games, the
agents’ risk-aversion as well as information asymme-
try further reduce the likelihood of a desired
outcome (Schmid et al., 2020).
In recent years, various approaches have been
proposed to promote cooperation among independent
agents, such as learning proven game theoretic strate-
gies like tit-for-tat (Lerer and Peysakhovich, 2018),
      C              D
C   (-1, -1)       (-3, 0)
D   (0, -3)        (-2, -2)
(a) PD without participation

      C              D
C   (-1, -1)       (-1.5, -1.5)
D   (-1.5, -1.5)   (-2, -2)
(b) PD with 50% participation

Figure 1: By enabling agents to trade shares in their payoffs, a socially optimal outcome may be achieved.
the possibility for agents to incentivize each other to
be more cooperative (Schmid et al., 2018; Lupu and
Precup, 2020; Yang et al., 2020), or the integration of
markets to let agents trade for increased overall wel-
fare (Schmid et al., 2021b). In this work, we adopt the
market concept in order to generate increased coop-
eration between independent decision makers. More
specifically, we propose a method that allows agents
to trade shares of their own rewards. In the presence
of such a participation market, we argue that a bet-
ter equilibrium can be reached by letting agents di-
rectly participate in other agents’ rewards. A socially
optimal equilibrium may be established due to the di-
rect incorporation of all global rewards instead of only
incorporating individual rewards. Enabling agents to
trade their shares at a fair price, while at the same time
creating a trading path that in fact leads to a beneficial
distribution of shares, represents the hardest challenge
when implementing this model.
To motivate the effectiveness of participation,
consider the Prisoner’s dilemma (PD) as depicted in
Figure 1. In the standard version of the Prisoner’s
dilemma, each agent receives its individual reward.
For two rational decision makers, the only Nash equi-
librium lies in mutual defection (DD), which is so-
cially undesirable. However, if we introduce the abil-
ity of reward trading and each agent holds shares in
the other agent’s return (say 50% participation), then
the overall dynamic changes and mutual cooperation
becomes a dominant strategy.
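To make the payoff transformation concrete, the following minimal Python sketch (our own illustration, not code from the paper) computes each player's effective payoff under participation as a convex combination of the two individual payoff matrices from Figure 1:

```python
import numpy as np

# Row player's payoff matrix over (C, D) x (C, D) from Figure 1(a);
# the column player's payoffs are the transpose because the game is symmetric.
R1 = np.array([[-1.0, -3.0],
               [ 0.0, -2.0]])
R2 = R1.T

def effective_payoff(r_own, r_other, own_share):
    """Payoff when holding `own_share` of one's own reward and
    (1 - own_share) of the other agent's reward."""
    return own_share * r_own + (1.0 - own_share) * r_other

# With 50% participation, both off-diagonal cells become -1.5 for the row
# player, reproducing Figure 1(b); defection no longer pays against C.
print(effective_payoff(R1, R2, own_share=0.5))
```

Under this transformation, cooperation yields more than defection against either opponent action, which is why mutual cooperation becomes stable.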
In order to train agents to participate, we apply
different methods of reinforcement learning, so that
the agents autonomously find strategies. Thus, par-
ticipation in other agents’ returns is realized over the
course of many training episodes. To that end, differ-
ent variants of the market for participation are tested
in this work, which differ in the way the participa-
tion mechanism is implemented. More specifically,
the following contributions are made:
- A theoretical motivation is given for why market participation is beneficial.
- The participation mechanism is empirically evaluated in the Prisoner’s dilemma as well as in a complex multi-agent scenario, the clean-up game (Yang et al., 2020).
- All code for the experiments is available at https://github.com/TimMatheis/Learning-to-participate.
2 RELATED WORK
Despite the increasing success of reinforcement learn-
ing on an expanding set of tasks, most effort has been
devoted to single-agent environments as well as fully
cooperative multi-agent environments (Mnih et al.,
2015; Silver et al., 2017; Vinyals et al., 2019; Ope-
nAI et al., 2019). However, with multiple agents in-
volved, their goals are often not (perfectly) aligned,
which renders centralized training techniques generally
infeasible. The drawback of fully decentralized
models is that agents focus only on their individual
rewards, which therefore might result in undesirable
collective performance especially in situations of so-
cial dilemmas or with common pool resources (Leibo
et al., 2017; Perolat et al., 2017).
Previously, one way of tackling this problem has
been to give independent agents intrinsic rewards (Ec-
cles et al., 2019; Hughes et al., 2018; Wang et al.,
2019). The concept of intrinsic rewards draws from
concepts in behavioral economics such as altruistic
behavior, reciprocity, inequity aversion, social influ-
ence, and peer evaluations. These intrinsic rewards
are usually either predefined or they evolve based on
the other agents’ performance over the game. Other
works suggest that reward mechanisms (Yang et al.,
2020) or penalty mechanisms (Schmid et al., 2021a)
may lead to cooperation in sequential social dilem-
mas. The literature distinguishes between selective
incentives and sanctioning mechanisms which incen-
tivize cooperative behavior in social dilemmas (Kol-
lock, 1998). Selective incentives describe methods
that attempt to positively promote cooperation. For
instance, this could occur by giving monetary rewards
to reduce the consumption of common pool goods,
such as water or electricity (Maki et al., 1978). Con-
trarily, penalties could be a method to reduce defec-
tive behavior. In fact, experiments with humans sug-
gest that penalties are effective in reducing defective
behavior (Komorita, 1987).
In LIO (Yang et al., 2020), a reward-giver’s in-
centive function is learned on the same timescale as
policy learning. Adding an incentive function is a de-
viation from classical reinforcement learning, where
the reward function is the exclusive property of the
environment, and is only altered by external factors.
As shown by empirical research (Lupu and Precup,
2020), augmenting an agent’s action space with a
“give-reward” action can improve cooperation during
certain training phases. Through opponent shaping,
an agent can influence the learning update of other
agents for its own benefit. A different attempt at op-
ponent shaping is to account for the impact of one
agent’s policy on the anticipated parameter update
of the other agents (Foerster et al., 2018). Through
this additional learning component in LOLA, strate-
gies like tit-for-tat can emerge in the iterated Pris-
oner’s dilemma, whereby cooperation can be main-
tained.
Other work in the field suggests markets as vehi-
cles for cooperativeness (Schmid et al., 2018; Schmid
et al., 2021b). Usually, agents only receive their in-
dividual rewards. As they are not affected by the
other agents’ individual rewards, they only act in their
own interest. However, by only receiving individual
rewards, the agents are exposed to substantial risk.
According to economic theory, it is usually benefi-
cial to be diversified. In portfolio theory, it is
commonly agreed that diversification improves risk-adjusted
returns. Although people do not seem to diver-
sify enough, which is called the diversification puzzle
(Statman, 2004), a rational agent should be perfectly
diversified and should only hold a combination of the
market portfolio and a risk-free asset (such as a safe
government bond). When applying this to games such
as the Prisoner’s dilemma, the agents should be inter-
ested in receiving a combination of their individual
rewards and the other agents’ individual rewards to
minimize their risk exposure.
3 LEARNING TO PARTICIPATE
Multi-agent systems consist of multiple agents that
share a common environment. An agent is an au-
tonomous entity with two main capabilities: perceiv-
ing and acting. The perception of the current state of
the environment allows the agent to choose an appro-
priate action out of an action set. The chosen action
depends on an agent’s policy. Reinforcement learning
methods are commonly applied to learn a good
policy in multi-agent settings.
Since cooperative strategies can be difficult to find
and maintain, we suggest a participation mechanism.
The idea is to let agents participate in other agents’
environmental rewards directly in order to align their
formerly conflicting goals. In the following, we use
the trading of participation shares as an instrument to
make cooperation possible. If the agents are willing
to hold a significant amount of shares in all agents’
rewards, they may act in the “society’s interest”.
Namely, the difference between the individual interest of a
single agent and the common interest of the collective
of all agents could vanish. In our implementation, two
agents only trade shares when both simultaneously choose to in-
crease, or both choose to decrease, their holdings of their own shares. In
the initial state, agents hold 100% of their own shares.
If the agents want to be perfectly diversified, each
agent could, after some trading steps, hold 100%/n of
every agent’s reward shares, with n denoting the
number of agents in the environment.
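For more than two agents, the same idea can be expressed with a share matrix; the sketch below (our own illustration, with assumed variable names) redistributes environmental rewards according to the current share allocation:

```python
import numpy as np

n_agents = 2

# shares[i, j] = fraction of agent j's environmental reward held by agent i.
# Every column sums to 1; initially each agent holds 100% of its own shares.
shares = np.eye(n_agents)

def redistribute(shares, env_rewards):
    """Effective rewards after applying the participation shares."""
    return shares @ env_rewards

# Fully diversified allocation: every agent holds 100%/n of each agent's
# reward, so all agents receive the average of the individual rewards
# (this coincides with the equal-distribution variants tested below).
diversified = np.full((n_agents, n_agents), 1.0 / n_agents)
print(redistribute(diversified, np.array([-3.0, 0.0])))  # [-1.5, -1.5]
```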
4 EXPERIMENTS
In this section, we examine our theoretical considera-
tions in the iterated prisoner’s dilemma and the clean-
up game experimentally.
4.1 The Iterated Prisoner’s Dilemma
The Prisoner’s dilemma can be thought of as the
situation of two burglars. When they get caught,
each can decide between a_0 = admitting and
a_1 = not admitting the crime. If both admit the crime, they
receive a high punishment. If neither of them admits
the crime, they only receive a low punishment. The
dilemma evolves from the case in which only one of
them admits the crime. Then, the admitter is not pun-
ished due to its status as a principal witness, whereas
the denier receives a very high punishment. Admit-
ting is the defective (D) action and not admitting is
the cooperative (C) action in Figure 1. In a Nash equilibrium,
by definition, no agent can put itself in a better position by
unilaterally changing its strategy. Agent 1
knows that agent 2 can defect and cooperate. When
agent 2 defects, agent 1 is better off by defecting as
well. If agent 2 cooperates instead, agent 1 is again
better off by defecting. Thus, agent 1 should defect in
any case, which makes defecting a dominant strategy
in a one-shot game. By symmetry, agent 2 faces the
same problem and should also defect. The tragedy is
that this outcome is not desirable, as mutual coopera-
tion leads to a better payoff for both agents. When the
game is iterated multiple times, defecting is in theory
not necessarily dominant.
Although the agents could defect in every itera-
tion based on backward induction, the agents could
develop strategies to incentivize cooperation.
4.2 Participation in the Iterated
Prisoner’s Dilemma
For testing different implementations of the iterated
prisoner’s dilemma, we use an actor-critic method.
We adapt Algorithm 1 from LIO (Yang et al., 2020,
page 5) by replacing the incentive function and its pa-
rameter η with an extended action space during the whole
episode or a preliminary trade action. Hence, either
the actions specify an environment action as well as a
trade action during the whole episode, or there is one
additional step at the start of each episode in which
the participation is determined, or both.
We use the same fully-connected neural network
for function approximation as LIO. However, we only
use the policy network, as we do not make use of the
incentive function, for which LIO uses another neu-
ral network. The policy network has a softmax output
for discrete actions in all environments. For all ex-
periments, we use the same architecture and the same
hyperparameters: β = 0.1, ε_start = 1.0, ε_end = 0.01,
α_θ = 1.00E-03. As the agents do not participate in
the other agent’s rewards at the start, max steps is set
to 40 instead of 5 in the implementations with trading.
Hence, they have enough steps for trading. We test the
following implementations of the iterated Prisoner’s
dilemma with two agents.
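Before the individual variants, the shared policy parameterization can be pictured with a minimal sketch. The hidden-layer sizes and the PyTorch framing below are our assumptions for illustration; the paper only states that a fully-connected network with a softmax output over discrete actions is used:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Minimal fully-connected policy with a softmax output over discrete
    actions; layer sizes are assumptions, not the exact LIO architecture."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        # Action probabilities over the (possibly trade-augmented) action set.
        return torch.softmax(self.net(obs), dim=-1)
```

For the trading variants described next, n_actions simply grows with the augmented action set.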
(i) No Participation. The possible environment ac-
tions of each agent are cooperation and defection. In
the implementation, these two actions are encoded as
0 and 1. There are four possible states: (cooperation,
cooperation), (cooperation, defection), (defection, co-
operation), (defection, defection). If the agents chose
their actions randomly, the average individual reward
would be (−1 + 0 − 3 − 2)/4 = −1.5. If instead the
agents learned to maintain a cooperative equilibrium
in which both choose cooperation, the individual re-
wards would converge to −1. But according to theory,
defection is a dominant action in the “classical” Pris-
oner’s dilemma. In that case, the individual rewards
would converge to −2. Indeed, in the implementa-
tion without participation shares, a stable equilibrium
evolves in which both agents defect. The accumulated
reward converges to −2 + (−2) = −4 (Figure 2).
(ii) Equal Distribution of Individual Rewards. In
this implementation, the agents always receive the av-
erage of all individual rewards. Mathematically
speaking, all individual rewards are
aggregated and then divided by the number of agents:
(1/n) × ∑_{i=1}^{n} reward_i for n agents. The action space and ev-
erything else stays the same. An agent cannot choose
between sharing the rewards or receiving the individ-
ual reward. Each agent learns to cooperate, which
leads to an accumulated reward of −1 + (−1) = −2.
In this implementation, cooperating is a dominant ac-
tion. This implementation demonstrates that the shar-
ing of rewards can lead to cooperation. However, it
does not demonstrate whether the agents can actively
find and maintain such an equilibrium, when they can
freely trade (Figure 2).
(iii) Choosing Whether to Share Rewards. The
agents can decide to share their individual rewards.
The action space is extended. In addition to the
two environment actions, an agent can choose be-
tween sharing the rewards or receiving the individ-
ual reward. The action space is defined by an ac-
tion tuple: the environment action and the trade ac-
tion. Hence, there are now 2 × 2 = 4 possible actions
per agent. Importantly, an individual agent cannot
determine whether the combined rewards are evenly
divided between both agents. Instead, this is only
the case if both agents decide to share their rewards.
The state space is extended to eight states: sharing ×
env. action 1 × env. action 2 = 2 × 2 × 2 = 8. Both
agents learn to defect, which leads to an accumulated
reward of −2 + (−2) = −4 (Figure 2). In this im-
plementation, defecting without sharing is a dominant
action. Hence, the sole opportunity to share rewards
is not enough.
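As a minimal illustration of this rule (our own sketch; function and variable names are assumptions, not the authors' code), the combined reward is only split when every agent picks the share action:

```python
def step_rewards(env_rewards, share_flags):
    """Return each agent's effective reward for one step.

    env_rewards: list of individual environment rewards.
    share_flags: list of booleans, True if the agent chose the 'share' action.
    Rewards are averaged only if every agent chooses to share; otherwise
    each agent simply keeps its own individual reward.
    """
    if all(share_flags):
        avg = sum(env_rewards) / len(env_rewards)
        return [avg] * len(env_rewards)
    return list(env_rewards)

# Both share: (-3, 0) becomes (-1.5, -1.5); only one shares: rewards unchanged.
print(step_rewards([-3, 0], [True, True]))   # [-1.5, -1.5]
print(step_rewards([-3, 0], [True, False]))  # [-3, 0]
```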
(iv) Trading 50% Shares. Initially, both agents do
not hold participation shares of the other agent. In
every step, they can choose between six actions. An
action can be regarded as a 3-tuple: the environment
action, whether to increase the shares of the own re-
wards, and whether to increase the shares of the other
agent’s rewards. When an agent decides to cooper-
ate, there are three possible actions: (cooperate, not
buy own shares, not buy other agent’s shares), (coop-
erate, buy own shares, not buy other agent’s shares),
(cooperate, not buy own shares, buy other agent’s
shares). Symmetrically, there are three possible ac-
tions for defection. Again, a trade of shares is only
executed if both agents intend to do so. For instance,
if both choose to buy own shares and they hold 50%
in each agent’s rewards, they exchange shares and
now hold 100% of their own shares. This would
mean that there is no participation anymore. How-
ever, if they now choose to buy own shares again,
this trade cannot occur, as they already hold all of
their own shares. As a consequence, only the envi-
ronment action has an effect. The implication is that
all shares are valued at the same price, they are just
exchanged in the same proportions, and the sum of
shares per agent is always 100%. There are 36 states:
env. actions × portion of own shares × trade = 4 × 3 × 3.
The portion of own shares can be 0, 0.5, or 1. The
“trade” variable represents whether there is no trade,
a trade which leads to an increasing amount of own
shares, or a trade which leads to a decreasing amount
of own shares. This implementation is successful in
establishing cooperation. The accumulated rewards
converge to −1 + (−1) = −2 (Figure 2b). However,
the learning process is rather slow. Additionally,
the actions and rewards first drift toward the previous
inefficient equilibrium.
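The trade-execution rule described above can be sketched as follows (our own simplification, assuming both agents always hold symmetric proportions, which holds here because trades are mutual exchanges); with step=0.1 the same logic covers implementation (v):

```python
def execute_trade(own_share, intent_1, intent_2, step=0.5):
    """Update the (symmetric) proportion each agent holds in its own reward.

    A trade only executes when both agents signal the same intention:
    'buy_own'   -> each agent takes back `step` of its own shares,
    'buy_other' -> each agent hands over `step` of its own shares.
    Trades that would push a proportion outside [0, 1] are ignored, so
    only the environment action has an effect in that case.
    """
    if intent_1 == intent_2 == "buy_own" and own_share + step <= 1.0:
        return own_share + step
    if intent_1 == intent_2 == "buy_other" and own_share - step >= 0.0:
        return own_share - step
    return own_share

# Starting from full ownership, one mutual 'buy_other' trade leads to 50%
# participation; a further mutual 'buy_other' trade leads to 0% own shares.
share = 1.0
share = execute_trade(share, "buy_other", "buy_other")  # 0.5
share = execute_trade(share, "buy_other", "buy_other")  # 0.0
```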
(v) Trading 10% Shares. Compared to the trading of
50% shares, the state space increases to 4 × 11 × 3 =
132 because the proportion of own shares can now be
0, 0.1, 0.2, ..., 1. Again, a socially optimal equilibrium
can be established. A proportion of around 40% own
shares and 60% other shares appears sufficient to
keep the agents from deviating to defection (Figure 2).
4.3 The Clean-Up Game
In the clean-up game, multiple agents simultaneously
attempt to collect apples in the same environment. An
agent gets rewarded +1 for each apple that they col-
lect. The apples spawn randomly on the right side
of a square 7x7 or 10x10 map. On the left side
of the map, there is a river that gets increasingly pol-
luted with waste. As the waste level increases and ap-
proaches a depletion threshold, the apple spawn rate
decreases linearly to zero. To avoid a quick end of
the game, an agent can fire a cleaning beam to clear
waste, but an agent can only do so while it is in
the river. The cleaning beam then clears all the waste
upwards from the agent. Each agent’s observation is
an egocentric RGB image of the whole map. The
dilemma is that clearing waste by staying in the river
and firing cleaning beams is less attractive than re-
ceiving rewards by collecting apples. However, if all
agents focus on only collecting apples, the game is
quickly over and the total reward of the agents re-
mains very low. Hence, the game is an intertemporal
social dilemma, in which there is a trade-off between
short-term individual incentives and long-term collective interest (Hughes et al., 2018).

Figure 2: Results from the iterated Prisoner’s dilemma with and without participation. (a) Rewards. (b) Rewards with trading of 10% or 50% shares. (c) Shares in own reward.

This domain is
especially interesting as models based on behavioral
economics can only explain cooperation in simple,
unrealistic, stateless matrix games. In contrast, the
cleanup game is a temporally and spatially extended
Markov game.
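The coupling between waste and apple growth can be pictured with a small sketch; the linear decrease is stated in the text, while the functional form, threshold handling, and base rate below are our assumptions for illustration:

```python
def apple_spawn_probability(waste_level, depletion_threshold, base_rate=0.1):
    """Hypothetical per-cell apple spawn probability.

    base_rate is assumed; the spawn probability falls off linearly and
    reaches zero once the waste level hits the depletion threshold.
    """
    fraction_clean = max(0.0, 1.0 - waste_level / depletion_threshold)
    return base_rate * fraction_clean
```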
4.4 Participation in the Clean-Up Game
with Two Agents
Figure 3: Small 7x7 map with two agents (A1, A2). Agent 1 gets
spawned by the river, whereas agent 2 gets spawned close
to the apples. If they do not divide their tasks, the game is
likely to stop soon, as the waste expands downwards if it is
not cleared by the agents.
The smaller 7x7 map is used as described in (Yang
et al., 2020). The initial state of the game is displayed
in Figure 3. The blue cells on the left side of the map
represent the river, the green cells indicate the area
where apples are randomly spawned, and the brown
cells represent the waste. Agent 1 gets spawned in the
river, which enables it to clear the waste. Agent 2 gets
spawned close to the green area on the right where
apples are randomly spawned. This already gives a
hint at one possible strategy, which consists of agent
1 clearing the waste, and agent 2 collecting the apples.
However, agent 1 has no incentive to
clear waste as long as it neither gets the chance to col-
lect apples nor profits from agent 2’s rewards. Agent 2
is in a similar dilemma. If it attempts to collect apples
by being in the green area, it cannot fire the cleaning
beam. Without a division of work between the agents,
the game must stop early, as the blue and green areas
are too far apart. The possible environment actions of
each agent are move left, move right, move up, move
down, no operation, and cleaning. In all implemen-
tations of the clean-up domain, an actor-critic method
is used. The optimization is decentralized.
We use the same convolutional networks to pro-
cess image observations in Cleanup as LIO. The pol-
icy network has a softmax output for discrete actions
in all environments. For all experiments, we use the
same neural architecture. We test the following im-
plementations of the clean-up game with two agents.
(i) No Participation. Without any additional incen-
tive structures, the social dilemma seems unsolvable.
As depicted in Figure 4a, the accumulated reward re-
mains at a very low level. Cooperation is not attrac-
tive for any of the agents. It is difficult to learn that
clearing waste creates value, since the apples are
spawned far away from the river. If the agent moves
back from the river to the apples, the game is likely to
stop before it even collects an apple.
(ii) LIO. The baseline scenario is augmented with
the possibility for each agent to give rewards to the
other agent as an additional channel for cooperation.
Importantly, the additional reward payment r^j_{η_i} from
agent i to agent j is learned via direct gradient as-
cent on the agent’s own extrinsic objective, involv-
ing its effect on all other agents’ policies. Hence, the
payments are not part of the reinforcement learning,
leaving the action space of the reinforcement learner
unaffected (instead of augmenting it). The agents suc-
cessfully divide their tasks. The one that is closer to
the river and waste becomes the cleaner, whereas the
one that is closer to the apples specializes in collect-
ing apples. See (Yang et al., 2020) for implementation
details.
(iii) Equal Distribution of Individual Rewards. In
this scenario, the baseline scenario is augmented with
the equal distribution of the joint rewards between the
agents after each step. This additional computation
does not affect the state or action space. Similar to
LIO, the one that is closer to the river turns into the cleaner, whereas the one that is closer to the apples specializes in collecting apples.

Figure 4: Results from the clean-up game. (a) Two agents. (b) Three agents.

Both profit from this
task division, as both are evenly rewarded for any ap-
ple being collected by anyone. In comparison with
LIO, the learning process looks less volatile.
(iv) Pre-Trade of Participation Rights. In this sce-
nario, there is an additional first step added to each
episode of the baseline scenario. In this first step,
both agents can choose between six actions (0-5),
representing 0%, 20%, 40%, 60%, 80%, and 100%.
The maximum of both agents’ chosen values deter-
mines each agent’s participation in its own rewards over the
episode. The remaining portion is the participation in
the other agent’s individual rewards. For instance, if
agent 1 chooses 40%, and agent 2 prefers 80%, they
will receive 80% of their individual rewards and 20%
of the other agent’s individual rewards. The additional
first step is part of the reinforcement learning, but
no rewards are distributed in this step. The idea be-
hind the additional trade step is that both agents can
avoid cooperation by choosing 100%. By taking the
maximum, the more conservative action is executed.
Again, the agents successfully divide their tasks. The
amount of waste cleared by agent 1 moves in parallel
with the accumulated rewards. However, the magni-
tude of the joint rewards only reaches around 11, and
it is unclear whether this is a stable level.
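A minimal sketch of the pre-trade rule (our own illustration; the function name is an assumption):

```python
def pretrade_participation(action_1, action_2):
    """Map both agents' first-step actions (0-5) to the episode-wide shares.

    Actions correspond to own-reward shares of 0%, 20%, ..., 100%; the
    maximum (more conservative) choice is executed for both agents.
    Returns (share in own reward, share in the other agent's reward).
    """
    levels = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    own_share = max(levels[action_1], levels[action_2])
    return own_share, 1.0 - own_share

# Example from the text: choices of 40% and 80% result in each agent keeping
# 80% of its own reward and receiving 20% of the other agent's reward.
print(pretrade_participation(2, 4))  # (0.8, 0.2)
```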
5 CONCLUSION
By introducing the idea of participating in other
agents’ rewards, we suggest a new method for co-
ordination and cooperation in shared multi-agent en-
vironments. Agents learn that direct participation in
other agents’ rewards reinforces policies that opti-
mize a global objective. Through this mechanism,
no additional extrinsic incentive structures, such as
those in LIO (Yang et al., 2020), are needed. Intrinsic
incentives, on which other previous works have focused
(Eccles et al., 2019; Hughes et al., 2018; Wang et al., 2019),
are not needed either. In particular, the simple and
intuitive extension of standard models makes
participation appealing. In the two tested social
dilemma problems, the Iterated Prisoner’s dilemma
and Cleanup, the opportunity to participate via shares
is used by the agents to discover cooperative behav-
iors. In fact, the division of labor in Cleanup is
effectively enabled. Without the participation, the
agents cannot learn to divide their task into sub-
tasks, as clearing waste does not directly lead to re-
wards. But through the participation, the positive im-
pact of clearing waste is directly observed through the
other agents’ rewards from collecting apples. Impor-
tantly, the other agent can only collect apples thanks
to the clearing of the waste beforehand. The intro-
duced method of participation can achieve optimal
collective performance in the Prisoner’s dilemma. In
Cleanup, the improvement in the collective perfor-
mance through participation is clearly visible. Al-
though it is not clear whether the agents exhaustively
learn the perfect participation allocation, the agents
manage to coordinate and enhance their performance
over time.
Our approach attempts to answer some open ques-
tions regarding cooperation in a decentralized multi-
agent population. Firstly, although the agents must
simultaneously learn how to participate via shares in
addition to learning their environment actions, learn-
ing to keep shares of other agents may be a simpler
task than punishing other agents whenever they act
only in their own interest. Secondly, punishments or
other incentivizing rewards must first be earned,
whereas shares can be traded directly and
without cost. Thirdly, a share market can be
implemented in various ways. We are certain that a
suitable market structure can be found in most social
dilemmas. Furthermore, LIO assumes that recipients
cannot reject an incentive, but an agent may accept only some incentives. In the case of shares, this is not a problem, as the respective rewards are simply distributed according to the owned shares.

Figure 5: Waste cleared in clean-up with three agents for the basic implementation, LIO, and participation. (a) In the basic implementation, none of the agents learns to clear waste. (b) In the LIO implementation, no clear strategy evolves. (c) With participation, one of the agents learns to clear waste.

Another ben-
efit of the participation approach is that there is no
clear strategy for the agents to misuse the additional
market feature to exploit the other agents. There is
an emerging literature on reward tampering (Everitt
et al., 2021), and a participation market could be a
step in the right direction of deploying safe applica-
tions.
Our work contributes to the aim of ensuring
the common good in environments with independent
agents. Although the participation approach works
well in the tested social dilemmas, it remains unclear
whether this is also the case in other environments.
Another open question is whether the agents can always find
a stable allocation of shares. We suggest experiments
with a broker that sets prices in combination with a
limit order book for matching demand and supply.
Additionally, participation needs to be tested in other
environments with more agents as well as other game
structures.
REFERENCES
Dafoe, A., Hughes, E., Bachrach, Y., Collins, T., Mc-
Kee, K. R., Leibo, J. Z., Larson, K., and Graepel,
T. (2020). Open problems in cooperative AI. CoRR,
abs/2012.08630.
Eccles, T., Hughes, E., Kramár, J., Wheelwright, S., and
Leibo, J. Z. (2019). Learning Reciprocity in Complex
Sequential Social Dilemmas. arXiv:1903.08082 [cs].
arXiv: 1903.08082.
Everitt, T., Hutter, M., Kumar, R., and Krakovna, V. (2021).
Reward tampering problems and solutions in rein-
forcement learning: a causal influence diagram per-
spective. Synthese, 198(27):6435–6467.
Foerster, J. N., Chen, R. Y., Al-Shedivat, M., Whiteson, S.,
Abbeel, P., and Mordatch, I. (2018). Learning with
Opponent-Learning Awareness. arXiv:1709.04326
[cs]. arXiv: 1709.04326.
Hughes, E., Leibo, J. Z., Phillips, M. G., Tuyls, K., Duéñez-
Guzmán, E. A., Castañeda, A. G., Dunning, I., Zhu,
T., McKee, K. R., Koster, R., Roff, H., and Graepel,
T. (2018). Inequity aversion improves cooperation in
intertemporal social dilemmas. arXiv:1803.08884 [cs,
q-bio]. arXiv: 1803.08884.
Kollock, P. (1998). Social Dilemmas: The Anatomy of
Cooperation. Annual Review of Sociology, 24(1):183–214.
Komorita, S. S. (1987). Cooperative Choice in Decomposed
Social Dilemmas. Personality and Social Psychology
Bulletin, 13(1):53–63.
Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J., and
Graepel, T. (2017). Multi-agent Reinforcement Learn-
ing in Sequential Social Dilemmas. arXiv:1702.03037
[cs]. arXiv: 1702.03037.
Lerer, A. and Peysakhovich, A. (2018). Maintaining coop-
eration in complex social dilemmas using deep rein-
forcement learning. arXiv:1707.01068 [cs]. arXiv:
1707.01068.
Lupu, A. and Precup, D. (2020). Gifting in Multi-Agent
Reinforcement Learning. New Zealand, page 9.
Maki, J. E., Hoffman, D. M., and Berk, R. A. (1978). A time
series analysis of the impact of a water conservation
campaign. Evaluation Quarterly, 2(1):107–118.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller,
M., Fidjeland, A. K., Ostrovski, G., Petersen, S.,
Beattie, C., Sadik, A., Antonoglou, I., King, H., Ku-
maran, D., Wierstra, D., Legg, S., and Hassabis,
D. (2015). Human-level control through deep re-
inforcement learning. Nature, 518(7540):529–533.
OpenAI, Berner, C., Brockman, G., Chan, B., Cheung,
V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q.,
Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Ols-
son, C., Pachocki, J., Petrov, M., Pinto, H. P. d. O.,
Raiman, J., Salimans, T., Schlatter, J., Schneider, J.,
Sidor, S., Sutskever, I., Tang, J., Wolski, F., and
Zhang, S. (2019). Dota 2 with Large Scale Deep Re-
inforcement Learning. arXiv:1912.06680 [cs, stat].
arXiv: 1912.06680.
Perolat, J., Leibo, J. Z., Zambaldi, V., Beattie, C., Tuyls,
K., and Graepel, T. (2017). A multi-agent reinforce-
ment learning model of common-pool resource ap-
propriation. arXiv:1707.06600 [cs, q-bio]. arXiv:
1707.06600.
Phan, T., Belzner, L., Gabor, T., and Schmid, K.
(2018). Leveraging Statistical Multi-Agent Online
Planning with Emergent Value Function Approxima-
tion. arXiv:1804.06311 [cs]. arXiv: 1804.06311.
Schmid, K., Belzner, L., Gabor, T., and Phan, T. (2018).
Action Markets in Deep Multi-Agent Reinforcement
Learning. In Kůrková, V., Manolopoulos, Y., Ham-
mer, B., Iliadis, L., and Maglogiannis, I., editors,
Artificial Neural Networks and Machine Learning –
ICANN 2018, Lecture Notes in Computer Science,
pages 240–249, Cham. Springer International Pub-
lishing.
Schmid, K., Belzner, L., and Linnhoff-Popien, C. (2021a).
Learning to penalize other learning agents. MIT Press.
Schmid, K., Belzner, L., Müller, R., Tochtermann, J.,
and Linnhoff-Popien, C. (2021b). Stochastic Market
Games. In Zhou, Z.-H., editor, Proceedings of the
Thirtieth International Joint Conference on Artificial
Intelligence, IJCAI-21, pages 384–390. International
Joint Conferences on Artificial Intelligence Organiza-
tion.
Schmid, K., Belzner, L., Phan, T., Gabor, T., and Linnhoff-
Popien, C. (2020). Multi-agent Reinforcement Learn-
ing for Bargaining under Risk and Asymmetric Infor-
mation.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I.,
Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M.,
Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L.,
van den Driessche, G., Graepel, T., and Hassabis, D.
(2017). Mastering the game of Go without human
knowledge. Nature, 550(7676):354–359.
Statman, M. (2004). The Diversification Puzzle. Financial
Analysts Journal, 60(4):44–53.
Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu,
M., Dudzik, A., Chung, J., Choi, D. H., Powell, R.,
Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss,
M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Aga-
piou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond,
R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y.,
Molloy, J., Paine, T. L., Gulcehre, C., Wang, Z.,
Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch,
D., McKinney, K., Smith, O., Schaul, T., Lillicrap,
T., Kavukcuoglu, K., Hassabis, D., Apps, C., and
Silver, D. (2019). Grandmaster level in StarCraft
II using multi-agent reinforcement learning. Nature,
575(7782):350–354.
Wang, J. X., Hughes, E., Fernando, C., Czarnecki, W. M.,
Duenez-Guzman, E. A., and Leibo, J. Z. (2019).
Evolving intrinsic motivations for altruistic behavior.
arXiv:1811.05931 [cs]. arXiv: 1811.05931.
Yang, J., Li, A., Farajtabar, M., Sunehag, P., Hughes, E.,
and Zha, H. (2020). Learning to Incentivize Other
Learning Agents. arXiv:2006.06051 [cs, stat]. arXiv:
2006.06051.