A Study on Cooperative Action Selection Considering Unfairness in

Decentralized Multiagent Reinforcement Learning

Toshihiro Matsui and Hiroshi Matsuo

Nagoya Institute of Technology, Gokisyo-cho, Showa-ku, Nagoya, Aichi, 466-8555, Japan

Keywords:

Multiagent System, Reinforcement Learning, Distributed Constraint Optimization, Unfairness, Leximin.

Abstract:

Reinforcement learning has been studied for cooperative learning and optimization methods in multiagent systems. In several frameworks of multiagent reinforcement learning, the system's whole problem is decomposed into local problems for the agents. To choose an appropriate cooperative action, the agents perform an optimization method that can be performed in a distributed manner. While the conventional goal of the learning is the maximization of the total reward among the agents, in practical resource allocation problems, unfairness among the agents is critical. In several recent studies of decentralized optimization methods, unfairness was considered as a criterion. We address an action selection method based on the leximin criterion, which reduces the unfairness among agents, in decentralized reinforcement learning. We experimentally evaluated the effects and influences of the proposed approach on classes of sensor network problems.

1 INTRODUCTION

Reinforcement learning has been studied as cooperative learning and optimization methods in multiagent systems (Hu and Wellman, 2003; Zhang and Lesser, 2011; Nguyen et al., 2014). In several frameworks of multiagent reinforcement learning, the system's entire problem is decomposed into local problems for the agents. To choose an appropriate cooperative action, the agents perform an optimization method that can be performed in a distributed manner.

A class of networked distributed POMDPs was defined for a sensor network domain (Zhang and Lesser, 2011), and in the reinforcement learning for such problems, the joint actions of the agents are selected using the max-sum algorithm (Farinelli et al., 2008), which is a solution method for distributed constraint optimization problems (DCOPs).

Similarly, Markovian dynamic DCOPs and their solution method were proposed (Nguyen et al., 2014), where distributed RVI Q-learning and R-learning algorithms employed DPOP (Petcu and Faltings, 2005), which is also a solution method for DCOPs, to select the joint actions of the agents.

While the learning's conventional goal is the maximization of the total reward among agents, in practical resource allocation problems, unfairness among agents is crucial. In several recent studies of decentralized optimization methods, including DCOPs, unfairness is deemed an important criterion (Netzer and Meisels, 2013a; Netzer and Meisels, 2013b).

As a criterion of unfairness, leximin has been studied (Moulin, 1988; Bouveret and Lemaître, 2009). The leximin defines an ordering relation between two vectors in multi-objective optimization problems. Maximization on the leximin ordering reduces unfairness among the objectives. Extended classes of DCOPs applying the leximin criterion, and solution methods for them, have been proposed (Matsui et al., 2014; Matsui et al., 2015).

In this study, we address an action selection method based on the leximin criterion, which reduces the unfairness among agents, in decentralized reinforcement learning. We experimentally evaluated the effects and influences of the proposed approach on classes of sensor network problems.

The remainder of this paper is organized as follows. The next section shows the background of our study, including reinforcement learning with a distributed setting and criteria for the cooperative actions of agents. Section 3 describes a class of sensor network problems as our motivating domain. Our proposed approach is shown in Section 4. In Section 5, our proposed methods are experimentally evaluated. Related works and discussions are addressed in Section 6, and the study is concluded in Section 7.


2 BACKGROUND

2.1 Reinforcement Learning with a Distributed Setting

Several types of reinforcement learning have been applied to multiagent systems to optimize cooperative policies among agents. In several recent studies, multiagent reinforcement learning is modeled and solved in a distributed manner (Zhang and Lesser, 2011; Nguyen et al., 2014). Basically, these approaches resemble standard settings, while the learning tables are distributed among the agents. The optimal joint action is determined using a distributed optimization method.

Here we address standard Q-learning as the basis of such a distributed version. The problem of Q-learning consists of a set of states S, a set of actions A, a function table of Q-values Q, a reward function R, and an update rule:

Q_{t+1}(s_t, a_t) = (1 − α) Q_t(s_t, a_t) + α (r + γ max_a Q_t(s_{t+1}, a)).   (1)

Here s_t ∈ S and a_t ∈ A denote a joint state and a joint action among the agents at time step t. r is the reward received from reward function R in the environment when the agents perform joint action a_t at joint state s_t. The learning is adjusted by learning rate α and discount rate γ. In each time step t, the agents sense the current joint state s_t and choose joint action a_t from A. After the agents perform action a_t, they sense the next state s_{t+1} and obtain reward r from the environment. Based on the reward and the next state, the agents update the Q-value for s_t and a_t. In this computation, the optimal joint action a at joint state s_{t+1} is chosen and the corresponding Q-value is propagated with the discount rate. The above processing is repeated until the agents learn the environment's information.

To choose the next joint action at joint state s_t, several heuristics are employed based on the trade-off between exploration and exploitation. With the ε-greedy heuristic, agents perform a random walk with probability ε, and the optimal joint action a* such that a* = argmax_a Q_t(s_t, a) with probability 1 − ε.
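As a concrete illustration, the following minimal sketch shows a tabular Q-learning update and ε-greedy selection corresponding to Eq. (1). It is a single-table version written only for clarity; the class name, data layout, and default parameter values are our own assumptions and not part of the cited frameworks.

import random
from collections import defaultdict

class QLearner:
    # Minimal tabular Q-learning with epsilon-greedy selection (cf. Eq. (1)).
    def __init__(self, actions, alpha=0.2, gamma=0.5, epsilon=0.2):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)   # Q[(state, action)] -> real value

    def select_action(self, state):
        # Random walk with probability epsilon, otherwise the greedy action.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s_t, a_t, r, s_next):
        # Q_{t+1}(s_t, a_t) = (1 - alpha) Q_t(s_t, a_t)
        #                     + alpha (r + gamma * max_a Q_t(s_{t+1}, a))
        best_next = max(self.q[(s_next, a)] for a in self.actions)
        self.q[(s_t, a_t)] = ((1 - self.alpha) * self.q[(s_t, a_t)]
                              + self.alpha * (r + self.gamma * best_next))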

In a distributed setting, each agent i has a part of the states S_i and actions A_i, where ∪_i S_i = S and ∪_i A_i = A. Note that, for i ≠ j, S_i ∩ S_j and A_i ∩ A_j can be non-empty sets. Local problems are related by this overlap. Each agent i has a function table of Q-values Q_i for S_i and A_i. Also, a partial reward function R_i is defined for S_i and A_i. An advantage of this setting is that huge global joint states and actions are approximated as sets of local joint states and actions. However, cooperation is necessary to choose an appropriate global joint action, while each agent i updates Q_i independently.
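The decomposition can be pictured as each agent holding only its local table Q_i and local reward r_i, as in the rough sketch below. The data layout and names are our own assumptions for illustration; the value max_a Q_{i,t}(s_{i,t+1}, a) for the next joint action still has to come from the distributed optimization described in Section 2.2.

from collections import defaultdict

class LocalAgent:
    # Agent i in the distributed setting: it owns Q_i over its local states S_i
    # and local actions A_i, which may overlap with those of neighbouring agents.
    def __init__(self, agent_id, alpha=0.2, gamma=0.5):
        self.agent_id = agent_id
        self.alpha, self.gamma = alpha, gamma
        self.q = defaultdict(float)   # Q_i[(local_state, local_action)] -> value

    def update(self, s_i, a_i, r_i, best_next_value):
        # Each agent updates Q_i independently from its local reward r_i.
        # best_next_value stands for max_a Q_{i,t}(s_{i,t+1}, a), obtained after
        # the agents have agreed on the next joint action.
        self.q[(s_i, a_i)] = ((1 - self.alpha) * self.q[(s_i, a_i)]
                              + self.alpha * (r_i + self.gamma * best_next_value))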

2.2 Cooperation based on Distributed Constraint Optimization

To choose an appropriate global joint action, an optimization method that resembles distributed problem solving is necessary. In the case of an ε-greedy heuristic, agents perform a random walk or choose the globally optimal action. The latter case is represented as a distributed constraint optimization problem (Modi et al., 2005; Petcu and Faltings, 2005; Farinelli et al., 2008; Zivan, 2008), which is a fundamental problem in multiagent cooperation.

A distributed constraint optimization problem (DCOP) is defined by (A, X, D, F). Here A is a set of agents, X is a set of variables, D is a set of domains of variables, and F is a set of objective functions. The variables and functions are distributed to the agents in A. Variable x_n ∈ X takes its values from its domain D_n ∈ D. Function f_m ∈ F defines the utility values on several variables. X_m ⊂ X defines the set of variables in the scope of f_m. F_n ⊂ F defines the set of functions whose scope contains x_n. f_m is defined as f_m(x_{m0}, ..., x_{mk}) : D_{m0} × ... × D_{mk} → R, where {x_{m0}, ..., x_{mk}} = X_m. f_m(x_{m0}, ..., x_{mk}) is also denoted by f_m(X_m). The aggregation F(X) of all the objective functions is defined as F(X) = Σ_{m s.t. f_m ∈ F, X_m ⊆ X} f_m(X_m). The goal is to find a globally optimal assignment that maximizes the value of F(X).
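For reference, a brute-force evaluation of this objective can be sketched as follows. The sketch only illustrates what F(X) aggregates and what an optimal assignment is; it enumerates all assignments centrally, which is exactly what the distributed solution methods cited above avoid. The data structures and function name are our own assumptions.

from itertools import product

def solve_dcop_by_enumeration(domains, functions):
    # domains: dict variable -> list of values (D_n)
    # functions: list of (scope, table) pairs, where scope is a tuple of
    #            variables X_m and table maps a tuple of their values to a utility.
    variables = list(domains)
    best_value, best_assignment = float('-inf'), None
    for values in product(*(domains[x] for x in variables)):
        assignment = dict(zip(variables, values))
        # F(X) = sum over all f_m of f_m(X_m)
        total = sum(table[tuple(assignment[x] for x in scope)]
                    for scope, table in functions)
        if total > best_value:
            best_value, best_assignment = total, assignment
    return best_assignment, best_value

# Example with two variables and one binary function:
# domains = {'x1': [0, 1], 'x2': [0, 1]}
# functions = [(('x1', 'x2'), {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.0, (1, 1): 3.0})]
# solve_dcop_by_enumeration(domains, functions) -> ({'x1': 1, 'x2': 1}, 3.0)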

Consider the problem of finding the optimal joint action a* such that a* = argmax_a Q_t(s_t, a). In distributed settings, since the Q-values are approximated with local Q-values, each agent i has Q_{i,t} at time step t. Here a variable x_n of the DCOP is defined for A_n ⊂ A_i, and D_n corresponds to A_n. Note that agent i's A_i (and S_i) can overlap with a different agent j's A_j (and S_j). In this case, the partial set A_n = A_i ∩ A_j is shared by both agents and defined as domain D_n of variable x_n. On the other hand, function f_m is defined for Q_{i,t}. Namely, f_m(X_m) = Q_{i,t}(s_{i,t}, a_i), where s_{i,t} is a vector of constant values. While X_m corresponds to A_i, X_m overlaps with the scopes of the functions in other agents.

Since this problem resembles the computation of max_a Q_t(s_{t+1}, a) in Eq. (1), the same solution method can be applied to both the computation of a joint action and the learning.

In the initial state, each agent locally knows only the information of its own variables and the related functions. An optimization method performed in a distributed manner computes the globally optimal solution. Several solution methods have been proposed for DCOPs. Here we employ a dynamic programming method that computes the exact optimal solution.


2.3 Optimization Criteria

In conventional multiagent reinforcement learning, the goal of the problem is to optimize the global sum of the rewards. On the other hand, different studies address the individuality of each agent, such as the Nash equilibrium (Hu and Wellman, 2003).

In several recent DCOP studies, fairness among agents was addressed (Netzer and Meisels, 2013a; Netzer and Meisels, 2013b; Matsui et al., 2014; Matsui et al., 2015). Since the solution methods for such classes of problems are designed as distributed algorithms, the problem of joint actions can be replaced by problems based on fairness. In particular, optimization with the leximin criterion (Moulin, 1988; Bouveret and Lemaître, 2009; Matsui et al., 2014; Matsui et al., 2015), shown below, improves fairness among agents. Here we follow the definitions in (Matsui et al., 2015).

To address multiple objectives for agents, objective vectors are defined. Each value of an objective vector corresponds to a utility value for an agent. Objective vector v is defined as [v_0, ..., v_K], where v_j is an objective value. The vector F(X) of objective functions is defined as [F_0(X_0), ..., F_K(X_K)], where X_j is the subset of X on which F_j is defined. F_j(X_j) is the objective function for objective j. For an assignment X, the vector F(X) of the functions returns an objective vector [v_0, ..., v_K], where v_j = F_j(X_j). In addition, the objective vector is sorted to employ the leximin. Based on objective vector v, in a sorted objective vector, all the values of v are sorted in ascending order.

Based on the sorted objective vectors, the leximin is defined as follows. Let v and v′ denote vectors of identical length K + 1. Let [v_0, ..., v_K] and [v′_0, ..., v′_K] denote the sorted vectors of v and v′. Also, let ≺_leximin denote the relation of the leximin ordering. v ≺_leximin v′ if and only if ∃t, ∀t′ < t, v_{t′} = v′_{t′} ∧ v_t < v′_t.
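Because the comparison is defined on sorted vectors, it can be sketched compactly; the following helper is our own illustration of the definition above, not code from the cited work.

def leximin_less(v, w):
    # Returns True iff v precedes w in the leximin ordering (v "<_leximin" w).
    # Both vectors must have the same length; they are sorted in ascending
    # order, and the first differing position decides the comparison.
    sv, sw = sorted(v), sorted(w)
    for a, b in zip(sv, sw):
        if a != b:
            return a < b
    return False   # identical sorted vectors: neither precedes the other

# Example: [1, 5, 3] precedes [2, 2, 6] because sorted([1, 5, 3]) = [1, 3, 5]
# and sorted([2, 2, 6]) = [2, 2, 6] first differ at position 0, where 1 < 2.
print(leximin_less([1, 5, 3], [2, 2, 6]))   # True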

Maximization on the leximin ordering reduces the unfairness among the utility values of the agents by improving the worst-case utilities. This criterion also chooses a Pareto optimal objective vector.

In several resource allocation problems, reducing unfairness among agents is an important criterion, while such joint actions may not be compatible with conventional reinforcement learning.

3 MOTIVATING DOMAIN

Considering several related works (Zhang and Lesser, 2011; Nguyen et al., 2014), we define an example problem motivated by sensor networks, as shown in Fig. 1.

Figure 1: Sensor network problem (sensors, areas/agents, and targets; variables x_1–x_3 and functions f_1–f_3).

The system consists of sensors, areas, and targets. Each sensor adjoins a few areas, and each area adjoins a few sensors. Here we assume for simplicity that an area is related to two sensors, each of which can be simultaneously allocated to one of its adjoining areas.

While a target stays in one of the areas, multiple targets can be in the same area. The sensing is active and is detected by the targets. A target randomly moves to one of the other neighborhood areas of the sensors adjoining its current area when sensors are allocated to that area. We only consider whether an area is occupied by at least one target. When an area is occupied by targets and sensors are allocated to it, a reward is given that corresponds to the combination of allocated sensors. However, if sensors are allocated to an empty area, a small negative reward is given. The goal of the problem is to improve the global reward aggregated over all the areas.

This system is represented as a problem of multiagent reinforcement learning. To represent cooperative action, agent i corresponds to area i. In actual settings, such an agent will be operated by a sensor node that adjoins the area. The states S_i of agent i correspond to the states of the sensors adjoining area i. Each pair of states (s^a_k, s^o_k) ∈ S_i is defined by the states of sensor k: s^a_k represents the area to which sensor k is allocated, and s^o_k represents whether sensor k detects that an area is occupied by targets. Similarly, the actions A_i are based on the areas to which the sensors are allocated. Each action a_k ∈ A_i represents the area to which sensor k is allocated. Q_i is defined for S_i and A_i. The values of the reward function R_i are defined for area i, as shown above.

The DCOP for a joint action at time step t is represented as follows. Variable x_k is defined for the action a_k of sensor k. Function f_i(X_i) is defined for Q_{i,t}(s_{i,t}, a), where x_k ∈ X_i corresponds to a_k ∈ A_i.

In the example shown in Fig. 1, the sensor network consists of three sensors and three areas. The agent of area 1 has states S_1, actions A_1, and Q-values Q_1. Since its states and actions are related to sensors 1 and 2, its local problem partially overlaps with those of the other agents. In the DCOP, a variable corresponds to the actions of a sensor, and a function corresponds to a part of a table of Q-values.


4 COOPERATIVE ACTION CONSIDERING UNFAIRNESS

4.1 Applying Leximin Optimization

In this study, we apply a leximin operator for action selection to address the unfairness among agents. In a sensor network, when an occupied area is observed by multiple sensors, a higher reward is given for the richer information. On the other hand, such a concentration decreases the number of observed areas. Namely, several occupied areas can be left unobserved even though smaller rewards would be given for observing them.

Since the global sum of utilities among agents does not consider individual utilities, other criteria are required in this situation. A comparison based on the leximin, which captures unfairness, is expected to handle this case. On the other hand, this action selection might be incompatible with the original reinforcement learning. As a first study, we experimentally evaluate the effect and influence of such action selection.

For the multiple objective problem, the objective for agent i of area i corresponds to function f_i. Therefore, a sorted objective vector consists of the values of f_i for all agents, whereas the conventional case employs the total summation of those values.

4.2 Solution Method

To solve the DCOPs for action selection, we employ a dynamic programming approach that can be performed in a distributed manner (Matsui and Matsuo, 2014; Matsui et al., 2015). Since the solution method's basic framework is the same for both criteria, we first briefly sketch the solution method for the summation. A DCOP is represented as a factor graph, which is a bipartite graph that consists of variable nodes, function nodes, and edges. For the factor graph, a pseudo-tree, which is a tree-like graph structure defining the (partial) order of nodes, is generated (Fig. 2). Here we assume that the root node is one of the variable nodes. If the topology of the original sensor network is static, a single pseudo-tree can be reused for each problem solving. The problem is decomposed based on the pseudo-tree. As a result, each node has a partial problem consisting of the subtree rooted at that node. The partial problem of the subtree is related to higher nodes only through the edges cut between the subtree and the higher nodes. The variables that correspond to these cut edges are called separators.

Based on these relations, each node performs its part of the dynamic programming, which consists of two phases. The first phase is performed in a bottom-up manner.

Figure 2: Solution method: (1) DCOP factor graph with variables x_1–x_4 and functions f_1–f_4, (2) pseudo-tree with separators.

Each node aggregates the function tables from its child nodes, if they exist, and generates a function table by adding the function values for each assignment to the related variables. In addition, a function node also aggregates its own function with the function table and maximizes the aggregated function table over the non-separator variables. As a result, a smaller function table defined only on the separators is generated and sent to the parent node.

The second phase is performed in a top-down manner. At the root (variable) node, the optimal assignment for the variable is determined from the aggregated function table. Then the root agent sends its optimal assignment to its child nodes. The child nodes determine the optimal assignment to the related non-separator variables, if necessary, and propagate the optimal assignment, including those of higher nodes, to their child nodes. The computational and space complexity of the solution method is exponential in the number of separators of each partial problem. However, we prefer the exact solution method so that a single optimal solution is chosen.

By replacing the objective values and the addition/maximization operators with sorted objective vectors and extended operators, the solution method solves the maximization problem on the leximin criterion. The addition operator is replaced by a pair of concatenation and re-sorting operators for objective vectors. The maximization is based on the leximin. Since we use real values as Q-values, the opportunities for tie-breaks in leximin comparisons will be fewer than in the case of integer values. However, the leximin operator still works.
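A compact sketch of the two replaced operators follows. The "addition" becomes concatenation followed by re-sorting, and the "maximization" keeps the leximin-greater table entry; the function names are our own, and the sketch ignores the message-passing machinery of the pseudo-tree.

def add_vectors(u, v):
    # Replacement for addition: concatenate two sorted objective vectors
    # and re-sort the result in ascending order.
    return sorted(list(u) + list(v))

def leximin_max(u, v):
    # Replacement for maximization: return the leximin-greater of two
    # sorted objective vectors of identical length.
    for a, b in zip(u, v):
        if a != b:
            return u if a > b else v
    return u   # equal vectors; either can be returned

# Example: aggregating per-area values [0.2] and [0.4, 1.5] gives
# [0.2, 0.4, 1.5]; comparing [0.2, 1.5] with [0.4, 0.4] selects [0.4, 0.4],
# since its worst element 0.4 exceeds 0.2.
print(add_vectors([0.2], [0.4, 1.5]))        # [0.2, 0.4, 1.5]
print(leximin_max([0.2, 1.5], [0.4, 0.4]))   # [0.4, 0.4]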

4.3 Selection of Action in the Learning Equations

Since action selection based on the leximin criterion is different from the original one, it may be incompatible with reinforcement learning. In particular, the selection of an optimal joint action in Eq. (1) might be corrupted, but the actual action selection can be considered an exploration strategy.


Figure 3: Topologies of problems: (1) grid, (2) tree (sensors and areas/agents).

To remove such influence from the learning equation, discount rate γ can be set to zero. Another approach is to employ the original optimization criterion in the learning equation. In this case, how the exploration strategy based on the leximin affects the original method will be evaluated.

5 EVALUATION

5.1 Problem Settings

We experimentally evaluated the effects and the influences of our proposed approach on the following topologies of sensor network problems.

• grid: We placed sensors at the grid vertices and the areas at the edges (Fig. 3 (1)). Since this topology easily increases the number of separators, we limit the size to 3 × 3 sensors.

• tree: A small tree is designed based on the capacity of the sensor resources. Similar to the grids, sensors and areas are placed at the tree's vertices and edges. When two sensors are allocated to an area in the center of this network, at least one sensor can be allocated to each of the other areas (Fig. 3 (2)).

• tree-like: In addition to a randomly generated tree, a few edges/areas are added to compose cycles.

To reduce the size of the state and action spaces, we limited the problems so that an area adjoins two sensors. The number of targets was set relative to the number of areas, based on the average occupancy ratio for the areas.

We set the rewards for the areas as follows.

• 0-2: When an occupied area is sensed by one sensor, no reward is given; in the case of two sensors, a reward of 2 is given.

• 1-2: When an occupied area is sensed by one or two sensors, rewards of 1 and 2 are given, respectively.

Figure 4: Global reward (grid, 0-2, γ = 0); global rewards vs. steps for sum and leximin.

Table 1: Global reward and allocated sensors to occupied areas in the last quarter of the steps (grid, 0-2, γ = 0).

alg.      rwd.    num. of rwd.     num. of alc. sns.      rate of
                  ≤0      2        0      1      2        occ.
sum       10722   24636   5364     5997   2332   5364     0.46
leximin    9006   25494   4506     6546   2918   4506     0.47

• 2-2 0-2: The areas are categorized into two types. For most occupied areas, when an area is sensed by one or two sensors, a reward of 2 is given. On the other hand, in one area, the reward is defined as 0-2.

• 2-2 1-2: Similar to 2-2 0-2, the areas are categorized, and the reward is defined as 1-2 in the one distinguished area.

When a non-occupied area is sensed, a small negative reward of −1.0 × 10⁻³ is given.

We compared the following solution methods for action selection.

• sum: the summation is employed for both the actual actions and the learning.

• leximin: the leximin is employed for both action selections.

• lxmsum: the leximin and the summation are used for the actual actions and the learning, respectively.

In addition, we compared cases where discount rate γ is 0 and 0.5. The probability ε of the random walk and the learning rate α were both set to 0.2. These parameters were chosen based on preliminary experiments so that several typical cases are shown. We performed ten trials for the same problem. Each trial consisted of 10000 steps of joint actions. The results were averaged over the trials.

5.2 Results

First we evaluated the case of grid with the reward setting 0-2. Here discount rate γ was set to zero.


Figure 5: Global reward (grid, 0-2, γ = 0.5); global rewards vs. steps for sum, lxmsum, and leximin.

Table 2: Global reward and allocated sensors to occupied areas in the last quarter of the steps (grid, 0-2, γ = 0.5).

alg.      rwd.    num. of rwd.     num. of alc. sns.      rate of
                  ≤0      2        0      1      2        occ.
sum       10830   24582   5418     6209   2255   5418     0.47
lxmsum     6880   26556   3444     7101   2976   3444     0.47
leximin    7163   26415   3585     7027   4494   3585     0.50

Table 3: Global reward and allocated sensors to occupied areas in the last quarter of the steps (grid, 1-2, γ = 0).

alg.      rwd.    num. of rwd.             num. of alc. sns.        rate of
                  ≤0      1       2        0      1       2         occ.
sum       15270   17976   8773    3251     4863   8773    3251      0.56
leximin   14261   16205   13322   474      3696   13322   474       0.58

Fig. 4 shows the global reward among the agents for 50 steps. Table 1 shows the summation of the global rewards in the last 2500 steps (rwd.), the histograms of the rewards (num. of rwd.), the histograms of the number of sensors allocated to the occupied areas (num. of alc. sns.), and the occupancy ratio over all the areas (rate of occ.). Since this problem contains only a single type of reward, the conventional summation works well. In addition, only a few areas can be covered by two sensors, but rewards are given only when two sensors are allocated. In such cases, the leximin strategy is less effective.

Next we evaluated the same problem with γ = 0.5. Fig. 5 and Table 2 show the results, which also contain the case of lxmsum. In this case, the reward of leximin decreased less than in the case of γ = 0. This reveals that action selection based on the leximin in learning did not improve the result. Also, lxmsum, which employed the conventional summation for learning, was not effective on average.

The next case is grid with 1-2. In this problem, the sensors can cover most areas, earning at least a reward of 1. Fig. 6, Table 3, Fig. 7, and Table 4 show the results. leximin improved the coverage of the areas while decreasing the global rewards.

Figure 6: Global reward (grid, 1-2, γ = 0); global rewards vs. steps for sum and leximin.

Figure 7: Global reward (grid, 1-2, γ = 0.5); global rewards vs. steps for sum, lxmsum, and leximin.

Table 4: Global reward and allocated sensors to occupied areas in the last quarter of the steps (grid, 1-2, γ = 0.5).

alg.      rwd.    num. of rwd.             num. of alc. sns.        rate of
                  ≤0      1       2        0      1       2         occ.
sum       15454   18513   7516    3972     5206   7516    3972      0.56
lxmsum    13787   17899   10408   1693     4861   10408   1693      0.56
leximin   13401   17699   11192   1108     4688   11192   1108      0.56

In this case, lxmsum was slightly closer to sum than leximin. From this result, action selection based on the leximin is relatively effective in simple cases, where the rewards are easily decreased to improve the worst-case agents. For this kind of setting, similar results were obtained in several other topologies including tree-like.

As a different setting, we evaluated tree with 2-2 0-2. In this problem, an area in the center of the network obtains rewards of 0-2, while the other areas obtain rewards of 2-2. Table 5 shows that leximin improved the coverage of the areas. In addition, the total reward of the area in the center improved. On the other hand, for γ = 0.5, the leximin result is worse than sum, as shown in Table 6.

Similarly, we evaluated tree with 2-2 1-2. Tables 7 and 8 show the results. In this case, the number of occurrences of reward 1 increased, although we expected that the total reward of the area in the center would improve.


Table 5: Global reward and allocated sensors to occupied areas in the last quarter of the steps (tree, 2-2 0-2, γ = 0).

alg.      rwd.    num. of rwd.     num. of alc. sns.      rate of    rwd. of
                  ≤0      2        0      1      2        occ.       cnt. area
sum       20288   7353    10147    447    8148   2246     0.62       2951
leximin   21176   6909    10591    268    8561   2204     0.61       3454

Table 6: Global reward and allocated sensors to occupied areas in the last quarter of the steps (tree, 2-2 0-2, γ = 0.5).

alg.      rwd.    num. of rwd.     num. of alc. sns.      rate of    rwd. of
                  ≤0      2        0      1      2        occ.       cnt. area
sum       20889   7052    10448    320    8477   2214     0.63       3220
lxmsum    19953   7520    9980     490    8121   2203     0.61       2671
leximin   20076   7459    10041    469    8128   2231     0.62       2764

Table 7: Global reward and allocated sensors to occupied areas in the last quarter of the steps (tree, 2-2 1-2, γ = 0).

alg.      rwd.    num. of rwd.            num. of alc. sns.      rate of    rwd. of
                  ≤0      1      2        0      1      2        occ.       cnt. area
sum       20052   7084    774    9642     420    8356   2060     0.62       2707
leximin   20148   6648    1550   9303     237    9098   1754     0.63       2318

Instead, the coverage of the areas improved by decreasing the rewards of the center area. The result reveals that simple action selection based on the leximin does not easily adapt to relatively complex cases.

Table 9 shows the computational cost of the solution method. Here the maximum degree of the sensors in tree-like was limited to three, and five instances were averaged for each setting of tree-like. For grid, a pseudo-tree was generated using a zig-zag order from the left-top variable, while a maximum-degree heuristic was employed for the other graphs. The experiments were performed on a single computer with a Core i7-3930K CPU @ 3.20 GHz, 16 GB of memory, Linux 2.6.32, and g++ 4.4.7.

Table 8: Global reward and allocated sensors to occupied areas in the last quarter of the steps (tree, 2-2 1-2, γ = 0.5).

alg.      rwd.    num. of rwd.            num. of alc. sns.      rate of    rwd. of
                  ≤0      1      2        0      1      2        occ.       cnt. area
sum       20863   6754    623    10123    293    8661   2085     0.63       3126
lxmsum    20317   6751    1174   9575     279    8864   1885     0.63       2610
leximin   20338   6752    1151   9597     278    8836   1912     0.63       2616

Table 9: Computational cost.

prb.         alg.      max. sz. of separator        comp.
                       variables   combinations     time [s]
grid         sum       4           72               7.37
             lxmsum    4           72               10.59
             leximin   4           72               13.77
tree         sum       1           4                0.56
             lxmsum    1           4                0.67
             leximin   1           4                0.77
tree-like    sum       4           103              8.83
15 sns.      lxmsum    4           103              13.28
20 areas     leximin   4           103              18.01
tree-like    sum       4           92               9.58
20 sns.      lxmsum    4           92               15.15
25 areas     leximin   4           92               20.63
tree-like    sum       3           23               8.62
50 sns.      lxmsum    3           23               13.80
55 areas     leximin   3           23               19.47

Note that our current experimental implementation can be improved. The bottleneck of the solution methods is the optimization method for action selection. Since the time and space complexity of the dynamic programming exponentially increases with the number of separators, large and dense problems will need relaxations.

6 RELATED WORKS AND DISCUSSIONS

This study was motivated by several previous studies that addressed sensor network domains (Zhang and Lesser, 2011; Nguyen et al., 2014). Since those studies employed dedicated problems, we designed a relatively simple one.


Our setting of the learning methods relatively depends on the random walk of the ε-greedy heuristic because of the activities of the targets. In addition, we set the number of targets so that almost half of the areas are occupied, since unfairness depends on resource capacities. Investigation of different classes of problems will be future work.

Several studies, such as Nash Q-learning (Hu and Wellman, 2003), have addressed the individuality of agents. While such studies mainly focus on selfish agents, we are interested in cooperative actions that consider unfairness. As a first study, we addressed the effects and influence of action selection based on the leximin that can be applied in a decentralized manner.

On the other hand, the results reveal the necessity of dedicated learning rules; in simple cases our proposed approach has some effect. Since unfairness depends on the values in the Q-tables, more discussion of the leximin case is necessary. In particular, normalization methods that account for the different progress of learning in individual agents are potentially important for fairness.

We employed an exact solution method to select joint actions. However, for large and complex problems, approximation methods are necessary, since the time and space complexity of the exact method exponentially increases with the number of separators. Such approximation is also considered a challenging problem.

7 CONCLUSIONS

We addressed action selection based on unfairness among agents in a decentralized reinforcement learning framework and experimentally investigated the effect and the influence of the leximin criterion in action selection. Even though the proposed approach worked effectively in relatively simple settings, our results uncovered several exploration and learning issues. Our future work will analyze the relationship between the proposed cooperative action and learning rules, and applications to other problem domains. Improvement of the learning rules to manage information about unfairness, and scalable solution methods for joint action selection, will also be important challenges.

ACKNOWLEDGEMENTS

This work was supported in part by JSPS KAKENHI Grant Number JP16K00301.

REFERENCES

Bouveret, S. and Lemaître, M. (2009). Computing leximin-optimal solutions in constraint networks. Artificial Intelligence, 173(2):343–364.

Farinelli, A., Rogers, A., Petcu, A., and Jennings, N. R. (2008). Decentralised coordination of low-power embedded devices using the max-sum algorithm. In 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 639–646.

Hu, J. and Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4:1039–1069.

Matsui, T. and Matsuo, H. (2014). Complete distributed search algorithm for cyclic factor graphs. In 6th International Conference on Agents and Artificial Intelligence, pages 184–192.

Matsui, T., Silaghi, M., Hirayama, K., Yokoo, M., and Matsuo, H. (2014). Leximin multiple objective optimization for preferences of agents. In 17th International Conference on Principles and Practice of Multi-Agent Systems, pages 423–438.

Matsui, T., Silaghi, M., Okimoto, T., Hirayama, K., Yokoo, M., and Matsuo, H. (2015). Leximin asymmetric multiple objective DCOP on factor graph. In Principles and Practice of Multi-Agent Systems - 18th International Conference, pages 134–151.

Modi, P. J., Shen, W., Tambe, M., and Yokoo, M. (2005). Adopt: Asynchronous distributed constraint optimization with quality guarantees. Artificial Intelligence, 161(1-2):149–180.

Moulin, H. (1988). Axioms of Cooperative Decision Making. Cambridge: Cambridge University Press.

Netzer, A. and Meisels, A. (2013a). Distributed envy minimization for resource allocation. In 5th International Conference on Agents and Artificial Intelligence, volume 1, pages 15–24.

Netzer, A. and Meisels, A. (2013b). Distributed local search for minimizing envy. In 2013 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, pages 53–58.

Nguyen, D. T., Yeoh, W., Lau, H. C., Zilberstein, S., and Zhang, C. (2014). Decentralized multi-agent reinforcement learning in average-reward dynamic DCOPs. In 28th AAAI Conference on Artificial Intelligence, pages 1447–1455.

Petcu, A. and Faltings, B. (2005). A scalable method for multiagent constraint optimization. In 19th International Joint Conference on Artificial Intelligence, pages 266–271.

Zhang, C. and Lesser, V. (2011). Coordinated multi-agent reinforcement learning in networked distributed POMDPs. In 25th AAAI Conference on Artificial Intelligence, pages 764–770.

Zivan, R. (2008). Anytime local search for distributed constraint optimization. In Twenty-Third AAAI Conference on Artificial Intelligence, pages 393–398.
