Nested Rollout Policy Adaptation for Multiagent System Optimization in Manufacturing

Stefan Edelkamp and Christoph Greulich

Faculty 3 – Mathematics and Computer Science, University of Bremen, Bremen, Germany

International Graduate School for Dynamics in Logistics, University of Bremen, Bremen, Germany

Keywords:

Multiagent System Simulation, Optimization, Monte-Carlo Tree Search, Manufacturing.

Abstract:

In manufacturing there are not only flow lines with stations arranged one behind the other, but also more complex networks of stations where assembly operations are performed. The considerable difference from sequential flow lines is that a partially ordered set of required components is brought together in order to form a single unit at the assembly stations in a competitive multiagent system scenario. In this paper we optimize multiagent control for such flow production units with recent advances of Nested Monte-Carlo Search. The optimization problem is implemented as a single-agent game in a generic search framework. In particular, we employ Nested Monte-Carlo Search with Rollout Policy Adaptation and apply it to a modern flow production unit, comparing it to solutions obtained with a simulator and with a model checker.

1 INTRODUCTION

In this paper, we propose Nested Monte-Carlo Search for solving multiagent optimization problems by applying a search framework that links a (domain-specific) combinatorial problem to an implemented (domain-independent) search algorithm. To solve the problem, we selected a recent variant of Nested Rollout Policy Adaptation (NRPA) (Edelkamp and Cazenave, 2016).

The application scenario we consider is an assembly-line network that is represented as a directed graph. Between any two successive nodes in the network, which represent entrances and exits of assembly stations as well as junctions, we assume a buffer of finite capacity. In those buffers, work pieces are stored, waiting for service. At assembly stations, service is given to work pieces. Travel time is measured and overall time optimized.

Especially in open, unpredictable, and complex environments, Multiagent Systems (MASs) determine adequate solutions for transport problems. For example, agent-based commercial systems are used within the planning and control of industrial processes (Dorer and Calisti, 2005; Himoff et al., 2006), as well as within other areas of logistics (Fischer et al., 1996; Bürckert et al., 2000); see Parragh et al. (2008) for a survey.

Flow line analysis is often done with queuing theory (Manitz, 2008; Burman, 1995). Pioneering work in analyzing assembly queuing systems with synchronization constraints studies assembly-like queues with unlimited buffer capacities (Harrison, 1973). It shows that the time an item has to wait for synchronization may grow without bound, while limitation of the number of items in the system works as a control mechanism and ensures stability. Existing work on assembly-like queues with finite buffers assumes exponential service times (Bhat, 1986; Lipper and Sengupta, 1986; Hopp and Simon, 1989).

Our running case study is the so-called Z2, a physical monorail system for the assembly of tail-lights. Unlike most production systems, Z2 employs agent technology to represent autonomous products and assembly stations. The techniques we develop, however, will be applicable to most flow production systems. We formalize the production floor as a system of communicating agents and apply NRPA to analyze its behavior and optimize the flow of production.

To make the paper self-contained, we repeat the description of the Z2 and of the multiagent system. The contributions of this paper are: the encoding of the optimization problem as a single-player game, a flexible framework implementation, and a discussion of the advantages of employing an MCTS optimizer.

The paper is structured as follows. We begin by introducing the Z2 as an example of a multiagent simulation system and formalize the multiagent optimization problem we face. Next, we present the family of MCTS rollout algorithms including UCT, NMCS and NRPA. In the following section, we map the optimization problem to a single-player game which can be solved in our NMCS framework. We will see that the results compare favorably with other optimization approaches for the same multiagent system. Finally, we conclude and discuss the impact of the work.

Edelkamp S. and Greulich C.
Nested Rollout Policy Adaptation for Multiagent System Optimization in Manufacturing.
DOI: 10.5220/0006204502840290
In Proceedings of the 9th International Conference on Agents and Artificial Intelligence (ICAART 2017), pages 284-290
ISBN: 978-989-758-219-6

Figure 1: Assembly scenario for tail-lights (Morales Kluge et al., 2010).

2 CASE STUDY: Z2

One of the few successful real-world implementations of a multiagent flow production is the so-called Z2 production floor unit (Ganji et al., 2010; Morales Kluge et al., 2010). The Z2 unit consists of six workstations where human workers assemble parts of automotive tail-lights. The system allows production of certain product variations as illustrated in Fig. 2 and reacts dynamically to any change in the current order situation, e.g., a decrease or an increase in the number of orders of a certain variant. At the first station, the basic metal-cast parts enter the manufacturing system on a dedicated shuttle. A monorail connects all stations, and each station is assigned to one specific task, such as adding bulbs or electronics. The structure of the transport system is shown in Fig. 1. Each tail-light is transported from station to station until it is assembled completely. The monorail system has multiple switches which allow the shuttles to enter, leave or pass workstations and the central hubs. The goods transported by the shuttles are also autonomous, which means that each product decides on its own which variant to become and which station to visit. This way, a decentralized control of the production system is possible.

From the given case study, we derive a more general notion of an assembly-line network. System progress is non-deterministic and asynchronous, while the progress of time is monitored.

Figure 2: Assembly states of tail-lights (Ganji et al., 2010).

Definition 1 (Flow Production). A flow production floor is a tuple $F = (A, P, G, \prec, S, Q)$ where
• $A$ is a set of all possible assembling actions
• $P$ is a set of $n$ products; each $P_i \in P$, $i \in \{1, \ldots, n\}$, is a set of assembling actions, i.e., $P_i \subseteq A$
• $G = (V, E, w, s, t)$ is a graph with start node $s$, goal node $t$, and weight function $w : E \to \mathbb{R}_{\ge 0}$
• $\prec = (\prec_1, \ldots, \prec_n)$ is a vector of assembling plans with each $\prec_i \subseteq A \times A$, $i \in \{1, \ldots, n\}$, being a partial order
• $S \subseteq E$ is the set of assembling stations induced by a labeling $\rho : E \to A \cup \{\emptyset\}$, i.e., $S = \{e \in E \mid \rho(e) \ne \emptyset\}$
• $Q$ is a set of (FIFO) queues, all of finite size, together with a labeling $\psi : E \to Q$

Products $P_i$, $i \in \{1, \ldots, n\}$, travel through the network $G$, meeting their assembling plans/order $\prec_i \subseteq A \times A$ of the assembling actions $A$. The cost function uses a set of predecessor edges $\mathit{Pred}(e) = \{e' = (u, v) \in E \mid e = (v, w)\}$.
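The predecessor relation used by the cost function can be illustrated with a minimal sketch. It assumes that edges are stored as pairs of node indices; the names `Edge` and `predecessors` are hypothetical and not part of the framework described in this paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A directed edge (from, to) of the assembly-line network G.
struct Edge { int from; int to; };

// Pred(e): indices of all edges e' = (u, v) whose end node v is the
// start node of the edge e = (v, w).
std::vector<int> predecessors(const std::vector<Edge>& edges, int e) {
  std::vector<int> pred;
  for (std::size_t k = 0; k < edges.size(); ++k)
    if (edges[k].to == edges[e].from) pred.push_back(static_cast<int>(k));
  return pred;
}
```

For the three edges (0, 1), (2, 1) and (1, 3), the last edge has the first two as predecessors, while the first edge has none.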

Definition 2 (Run, Plan, and Path). Let $F = (A, P, G, \prec, S, Q)$ be a flow production floor. A run $\pi$ is a schedule of triples $(e_j, t_j, l_j)$ of edges $e_j$, queue insertion positions $l_j$, and execution time-stamps $t_j$, $j \in \{1, \ldots, n\}$. The set of all runs is denoted as $\Pi$. The run partitions into a set of $n$ plans $\pi_i = (e_1, t_1, l_1), \ldots, (e_m, t_m, l_m)$, one for each product $P_i$, $i \in \{1, \ldots, n\}$. Each plan $\pi_i$ corresponds to a path, starting at the initial node $s$ and terminating at goal node $t$ in $G$.

3 MULTIAGENT SYSTEM

In the real-world implementation of the Z2 system, every assembly station, every monorail shuttle and every product is represented by a software agent. Most agents in this MAS just react to requests or events which were caused by other agents or the human workers involved in the manufacturing process. In contrast, the agents which represent products are actively working towards their individual goal of becoming a complete tail-light and reaching the storage station. In order to complete its task, each product has to reach sub-goals which may change during production as the order situation may change. The number of possible actions is limited by sub-goals which already have been reached, since every possible production step has preconditions as illustrated in Fig. 3.

Figure 3: Preconditions of the various manufacturing stages.

The product agents constantly request updates regarding queue lengths at the various stations and the overall order situation. The information is used to compute the utility of the expected outcome of every action which is currently available to the agent. High utility is given when an action leads to fulfillment of an outstanding order and takes as little time as possible. Time, in this case, is spent either on actions, such as moving along the railway or being processed, or on waiting in line at a station or a switch.

More generally, the objective of products in such a flow production system can be formally described as follows.

Definition 3 (Product Objective, Travel and Waiting Time). The objective for product $i$ is to minimize

$$\max_{1 \le i \le n} \mathit{wait}(\pi_i) + \mathit{time}(\pi_i)$$

over all possible paths with initial node $s$ and goal node $t$, where
• $\mathit{time}(\pi_i)$ is the travel time of product $P_i$, defined as the sum of edge costs $\mathit{time}(\pi_i) = \sum_{e \in \pi_i} w(e)$, and
• $\mathit{wait}(\pi_i)$ is the waiting time, defined as $\mathit{wait}(\pi_i) = \sum_{(e,t,l),(e',t',l') \in \pi_i,\ e' \in \mathit{Pred}(e)} t - (t' + w(e'))$.
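As a worked illustration of the two quantities, the following sketch computes travel and waiting time for a single plan. It assumes that the plan is given as a path, so that each step is a predecessor edge of the next one, and it omits the queue insertion position; the names `Step`, `travelTime` and `waitingTime` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One step of a plan: the edge taken and the time-stamp at which it is entered.
struct Step { int edge; int timestamp; };

// Travel time: sum of the edge weights along the plan.
int travelTime(const std::vector<Step>& plan, const std::vector<int>& weight) {
  int total = 0;
  for (const Step& s : plan) total += weight[s.edge];
  return total;
}

// Waiting time: for consecutive steps (e', t') and (e, t) the gap
// t - (t' + w(e')), summed over the plan.
int waitingTime(const std::vector<Step>& plan, const std::vector<int>& weight) {
  int total = 0;
  for (std::size_t k = 1; k < plan.size(); ++k) {
    const Step& prev = plan[k - 1];
    total += plan[k].timestamp - (prev.timestamp + weight[prev.edge]);
  }
  return total;
}
```

With edge weights 5, 3 and 4 and a plan that enters the three edges at times 0, 7 and 10, the travel time is 12 and the waiting time is 2, since the shuttle waits two time units before entering the second edge.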

For this study, we provided the MAS model with timers to measure the time taken between two graph nodes. Since the hardware includes many RFID readers along the monorail, each of which is represented by an agent and a node within the simulation, we simplified the graph and kept only three types of nodes: switches, production station entrances and production station exits. The resulting abstract model of the system is a weighted graph, where the weight of an edge denotes the traveling/processing time of the shuttle between two respective nodes (Greulich et al., 2015).

4 MONTE-CARLO TREE SEARCH

The randomized optimization scheme we consider belongs to the wider class of Monte-Carlo tree search (MCTS) algorithms (Browne et al., 2004). The main concept of MCTS is the random playout (or rollout) of a position, whose outcome, in turn, changes the likelihood of generating successors in subsequent trials. Prominent members in this class of reinforcement learning algorithms are upper confidence bounds applied to trees (UCT) (Kocsis and Szepesvári, 2006) and nested Monte-Carlo search (NMCS) (Cazenave, 2009). MCTS is state-of-the-art in playing many two-player games (Huang et al., 2013) or puzzles (Bouzy, 2016), and has also been applied to problems other than games, such as mixed-integer programming, constraint problems, function approximation, physics simulation, cooperative path finding, as well as planning and scheduling.

Culminating in the success of AlphaGo (Silver et al., 2016) in winning a match of Go against a professional human player, the importance of MCTS in playing games and AI search is no longer doubted.

5 NESTED MONTE-CARLO SEARCH

Nested Monte-Carlo Search (NMCS) is a randomized search method that has been successfully applied to solve many challenging combinatorial problems, including Klondike Solitaire, Morpion Solitaire, and Same Game, to name a few. Recently, a large fraction of TSP instances have been solved efficiently at or close to the optimum (Cazenave and Teytaud, 2012a). NMCS compares well with other heuristic methods that include much more domain-specific information. NMCS is parameterized with the recursion level of the search, which denotes the depth of the recursion tree, and with the number of iterations, which determines the branching of the search tree. At each leaf of the recursive search, a rollout is executed, which performs and evaluates a random run.

What makes Nested Rollout Policy Adaptation (NRPA) (Rosin, 2011) notably different from UCT and NMCS is the concept of learning a policy through an explicit mapping of moves to selection probabilities.
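To make the idea of an explicit move-to-probability mapping concrete, the following is a minimal sketch of the basic NRPA recursion in the spirit of Rosin (2011), written for minimization. It is not the framework implementation used in this paper: the toy game (matching a hidden target sequence), the constants `LEN`, `MOVES` and `TARGET`, the step size of 1.0 and all function names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Toy single-player game (assumed for illustration): choose LEN moves from
// {0, 1, 2}; the score counts mismatches with a target (lower is better).
const int LEN = 6, MOVES = 3;
const int TARGET[LEN] = {2, 0, 1, 1, 2, 0};

struct Result { int score; std::vector<int> seq; };

int evaluate(const std::vector<int>& seq) {
  int mismatches = 0;
  for (int i = 0; i < LEN; ++i) mismatches += (seq[i] != TARGET[i]);
  return mismatches;
}

// Policy code of a move: one weight per (step, move) pair.
int code(int step, int move) { return step * MOVES + move; }

// Level-0 rollout: sample each move with probability proportional to exp(weight).
Result rollout(const std::vector<double>& policy, std::mt19937& rng) {
  Result r;
  for (int step = 0; step < LEN; ++step) {
    double z = 0.0;
    for (int m = 0; m < MOVES; ++m) z += std::exp(policy[code(step, m)]);
    double u = z * (rng() / (rng.max() + 1.0));  // uniform in [0, z)
    int chosen = MOVES - 1;
    for (int m = 0; m < MOVES; ++m) {
      u -= std::exp(policy[code(step, m)]);
      if (u < 0.0) { chosen = m; break; }
    }
    r.seq.push_back(chosen);
  }
  r.score = evaluate(r.seq);
  return r;
}

// Shift the policy towards the given sequence with step size alpha.
void adapt(std::vector<double>& policy, const std::vector<int>& seq, double alpha) {
  std::vector<double> updated = policy;
  for (int step = 0; step < LEN; ++step) {
    double z = 0.0;
    for (int m = 0; m < MOVES; ++m) z += std::exp(policy[code(step, m)]);
    for (int m = 0; m < MOVES; ++m)
      updated[code(step, m)] -= alpha * std::exp(policy[code(step, m)]) / z;
    updated[code(step, seq[step])] += alpha;
  }
  policy = updated;
}

// Each level runs `iterations` searches of the level below on a copy of its
// policy and adapts its own policy towards the best sequence found so far.
Result nrpa(int level, std::vector<double> policy, int iterations, std::mt19937& rng) {
  if (level == 0) return rollout(policy, rng);
  Result best;
  best.score = LEN + 1;  // worse than any reachable score
  for (int i = 0; i < iterations; ++i) {
    Result r = nrpa(level - 1, policy, iterations, rng);
    if (r.score <= best.score) best = r;
    adapt(policy, best.seq, 1.0);
  }
  return best;
}
```

A call such as `nrpa(2, std::vector<double>(LEN * MOVES, 0.0), 20, rng)` performs 20 · 20 rollouts. The policy is passed by value, so that adaptations made at a lower level are discarded when the recursion returns.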

Beam-NRPA (Cazenave and Teytaud, 2012b) is an extension of NRPA that maintains $B$ instead of one best solution in each level of the recursion. The motivation behind this is to warrant search progress by an increased diversity of existing solutions to prevent the algorithm from getting stuck in local optima. As the NRPA recursion otherwise remains the same, the number of playouts of a search with level $L$ and (iteration) width $N$ rises from $N^L$ to $(N \cdot B)^L$. To control the size of the beam, we allow different beam widths $B_l$ in each level $l$ of the tree (our values for $B_l$ in a level 5 search were (1, 1, 10, 10, 10)). At the end of the procedure, $B_l$ best solutions together with their scores and policies are returned to the next higher recursion level. For each level $l$ of the search, one may also allow the user to specify a varying iteration width $N_l$. This yields the algorithm Beam-NRPA to perform $\prod_{l=1}^{L} N_l B_l$ rollouts.
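The growth of the number of rollouts can be checked with a small helper. The sketch assumes that the per-level count is the product of iteration width and beam width over all levels, which generalizes the uniform count (N · B)^L stated above; the function name `beamNrpaRollouts` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of rollouts of a Beam-NRPA search with per-level iteration widths
// and beam widths: the product of (iterations * beam) over all levels.
long long beamNrpaRollouts(const std::vector<int>& iterations,
                           const std::vector<int>& beams) {
  assert(iterations.size() == beams.size());
  long long total = 1;
  for (std::size_t l = 0; l < iterations.size(); ++l)
    total *= static_cast<long long>(iterations[l]) * beams[l];
  return total;
}
```

With beam width 1 on every level the count reduces to that of plain NRPA, e.g., 1000 rollouts for three levels of width 10.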

Beam-NRPA itself is inspired by the objective of higher diversity in the solution space of NRPA. Still, in very large search spaces NRPA often dwells on inferior solutions. It simply takes too long to backtrack to less determined policies in order to visit other parts of the search space. High-Diversity NRPA (HD-NRPA) (Edelkamp and Cazenave, 2016) elaborates on this observation to increase the diversity of the search and provides some further algorithmic advances (e.g., instead of the moves executed in a rollout, the policy table address of the chosen move and the code of its successors are stored, or the length of the rollout and its score are stored for each bucket in the beam).

6 GAME ENCODING

In the encoding as a single-player game, the number of acting agents is significantly reduced in comparison to the original MAS. Similar to the encoding for model-checker-based approaches (Edelkamp and Greulich, 2016), decision making is modeled into the nodes while shuttles are merely integer values which are passed along the edges. Each edge is modeled as a queue to make sure that no shuttle can pass another. When put on an edge, a shuttle receives a waiting time which corresponds to the cost of the specific edge. A synchronizing function (Greulich and Edelkamp, 2016) ensures that time progresses for all shuttles. The node at the end of a directed edge is allowed to receive a shuttle only if it is first in its queue and its waiting time has passed. If a shuttle can be received by a node, the node provides a legal move for each outgoing edge. Hence, a set of all legal moves over all active agents can be obtained.

class Arena {
public:
  Move rollout[MaxLength];
  int length;

  // Reset all channels and place every shuttle at the initial queue.
  Arena() {
    length = 0;
    for (int i = 0; i < STATIONS; i++) {
      switch2entrance[i]->clear();
      exit2switch[i]->clear();
      entrance2exit[i]->clear();
    }
    for (int i = 0; i < STATIONS + 2*HUBS; i++)
      switch2switch[i]->clear();
    for (int i = 0; i < SHUTTLES; i++) {
      wait[i] = 0; cost[i] = i * 70;
      goals[i] = 0; color[i] = i % 2;
      metalcast[i] = 1; diffusor[i] = 0;
      electronics[i] = bulb[i] = seal[i] = 0;
      switch2entrance[5]->push(i);
    }
  }

  // int code(Move m) { return m; }

  // Collect the legal moves of all agents; advance time while there are none.
  int legalMoves(Move moves[MaxLegalMoves]) {
    int m[3], mvs = 0;
    while (mvs == 0) {
      for (int p = 0; p < agent.size(); p++) {
        int k = agent[p]->nextLegalMove(m);
        for (int l = 0; l < k; l++)
          moves[mvs++] = p*3 + m[l];   // move = agent index * 3 + action
      }
      if (mvs == 0) increase_time();
    }
    return mvs;
  }

  void play(Move m) {
    rollout[length++] = m;
    agent[m/3]->executeMove(m % 3);
  }

  // The game ends when all goals are reached or the maximal length is hit.
  bool terminal() {
    int reached = 1;
    for (int j = 0; j < SHUTTLES; j++)
      reached &= goals[j];
    return (reached) || length == MaxLength - 1;
  }

  // Score: 1000 per unreached goal plus the maximal cost of a shuttle.
  double score() {
    int maximum = 0;
    for (int j = 0; j < SHUTTLES; j++)
      if (cost[j] > maximum) maximum = cost[j];
    int reached = 0;                   // counts the goals NOT yet reached
    for (int j = 0; j < SHUTTLES; j++)
      reached += !goals[j];
    return (reached * 1000) + maximum;
  }
};

Figure 4: Code for Z2 multiagent system optimization.
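The move encoding used by `legalMoves` and `play` in Fig. 4 packs an agent index and one of at most three actions into a single integer. A minimal standalone sketch of this encoding and its inverse is given below; the names `ACTIONS_PER_AGENT`, `AgentAction`, `encodeMove` and `decodeMove` are hypothetical and introduced only for illustration.

```cpp
#include <cassert>

// Fig. 4 encodes a move as p*3 + m, i.e., at most three actions per agent.
const int ACTIONS_PER_AGENT = 3;

struct AgentAction { int agent; int action; };

// Pack an agent index and an action index into one move code.
int encodeMove(int agent, int action) {
  assert(0 <= action && action < ACTIONS_PER_AGENT);
  return agent * ACTIONS_PER_AGENT + action;
}

// Recover the agent (move / 3) and the action (move % 3) from a move code.
AgentAction decodeMove(int move) {
  return AgentAction{move / ACTIONS_PER_AGENT, move % ACTIONS_PER_AGENT};
}
```

For example, action 2 of agent 4 is encoded as move 14, and decoding 14 yields agent 4 and action 2 again.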

To play the game, the player has to choose one of the agents and one of its actions as the next move. The goal of the game is to finish a predefined number of products in the shortest possible time before a predefined length is exceeded. The smaller the makespan for each agent found by the algorithm, the better the score of the play.

class Agent {
public:
  Agent() {}
  virtual ~Agent() {}
  virtual void executeMove(int m) = 0;
  virtual int nextLegalMove(int *moves) = 0;
};

Figure 5: Code for abstract agent class.

class Switch : public Agent {
public:
  int In, Out, Station, B, C;
  Switch(int in, int out, int s, int b, int c) :
    Agent(), In(in), Out(out), Station(s), B(b), C(c) {}

  void executeMove(int move) {
    if (move == 0) {   // enter the station from the incoming switch edge
      int Shuttle = switch2switch[In]->pop();
      wait[Shuttle] += C; cost[Shuttle] += C;
      switch2entrance[Station]->push(Shuttle);
    }
    if (move == 1) {   // pass the station
      int Shuttle = switch2switch[In]->pop();
      wait[Shuttle] += B; cost[Shuttle] += B;
      switch2switch[Out]->push(Shuttle);
    }
    if (move == 2) {   // leave the station towards the next switch
      int Shuttle = exit2switch[Station]->pop();
      wait[Shuttle] += B; cost[Shuttle] += B;
      switch2switch[Out]->push(Shuttle);
    }
  }

  int nextLegalMove(int *moves) {
    int mvs = 0;
    if (receives(SW2SW_EN, In, Station))
      moves[mvs++] = 0;
    if (receives(SW2SW_PASS, In, Station))
      moves[mvs++] = 1;
    if (receives(EX2SW, Station, Station))
      moves[mvs++] = 2;
    return mvs;
  }
};

Figure 6: Code for one agent.

More formally, the (board) game is defined as $(B, b_0, d, F, r)$ where $B$ is the set of (board) positions, in our case consisting of all queue content, shuttle locations, and their respective cost values. In the start position $b_0$, all shuttles and all queues are empty. The function $d : B \to 2^B$ specifies the set of allowed actions for each $q \in B$. The set of final positions $F$ consists of all states in which either all the individual goals or the maximal step size is reached, and $r : B \to \mathbb{N}$ is the score function adding a constant (e.g., 1000) for each individual unreached goal, on top of the maximum of the individual cost values.

The components of the game induce a tree in the natural way with $B$ as nodes, root $b_0$, $d$ as edges and the final positions as leaves. A play(out) is then a path in the tree from $b_0$ to some leaf.

// Advance time by the smallest positive waiting time (or by 1 if there is none).
void increase_time() {
  int min = INF, d = 1;
  for (int p = 0; p < SHUTTLES; p++)
    if (0 < wait[p] && wait[p] < min) min = wait[p];
  if (min < INF) d = min;
  for (int p = 0; p < SHUTTLES; p++)
    if (wait[p] - d >= 0) {
      wait[p] -= d; cost[p] += d;
    }
    else wait[p] = 0;
}

// A node may receive a shuttle if it is first in the queue and has no waiting time left.
bool receives(int channeltype, int i, int station) {
  int result = 0;
  Channel *channel = NULL;
  switch (channeltype) {
    case EN2EX: channel = entrance2exit[i];
      break;
    case EX2SW: channel = exit2switch[i];
      break;
    case SW2SW_PASS: channel = switch2switch[i];
      break;
    case SW2SW_EN: channel = switch2switch[i];
      break;
    case SW2EN:
      if (entrance2exit[station]->length() >= 1)
        channel = NULL;   // station is occupied
      else channel = switch2entrance[station];
      break;
  }
  if (channel != NULL && channel->length() > 0) {
    int shuttle = channel->front();
    if (wait[shuttle] <= 0)
      result = 1;
  }
  return result;
}

Figure 7: Code for increase-time and receive action.
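The time-advance rule of Fig. 7 can be exercised in isolation. The following sketch mirrors its logic on plain vectors instead of the global arrays of the implementation; the function name `advanceTime` and its return value are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Advance time by the smallest positive remaining wait, or by 1 if no
// shuttle is waiting. Shuttles with enough remaining wait have it reduced
// and their cost increased; all others are set to zero wait.
// Returns the amount of time that has passed.
int advanceTime(std::vector<int>& wait, std::vector<int>& cost) {
  int step = 0;
  for (int w : wait)
    if (w > 0 && (step == 0 || w < step)) step = w;
  if (step == 0) step = 1;
  for (std::size_t p = 0; p < wait.size(); ++p) {
    if (wait[p] >= step) { wait[p] -= step; cost[p] += step; }
    else wait[p] = 0;
  }
  return step;
}
```

Starting from waiting times (0, 5, 3) and costs (10, 20, 30), one call advances time by 3 and leaves waiting times (0, 2, 0) and costs (10, 23, 33).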

The software implementation (see Fig. 4–Fig. 7) is based on a framework which allows employing several search algorithms such as MCTS, NMCS, NRPA, Beam-NRPA and HD-NRPA (Edelkamp and Cazenave, 2016). For our experiments, we focused only on HD-NRPA since it is the most advanced implementation and provided the best results.

7 EXPERIMENTS

For the evaluation we used a single core of a personal computer (Ubuntu 14.04 LTS (x64), Intel Core i7-4500U, 1.8 GHz, 8 GB).

We experiment with a rising number of vehicles and compare the results with the discrete event system (DES) model (Edelkamp and Greulich, 2016) and local virtual time (LVT) model (Greulich and Edelkamp, 2016), both implemented in the SPIN model checker using its branch-and-bound facility. While DES was faster than LVT, it had semantic problems with the proper progression of time. Therefore, in the MCTS implementation we decided to use the semantics of the LVT model.

Table 1: Efficiency of MCTS for a rising number of shuttles, compared to previously published results.

 N   MCTS Length   MCTS Cost   MCTS CPU   LVT Cost   DES Cost
 2        48          2:54        1s        3:24       2:53
 3        72          2:59        1s        3:34       3:04
 4        99          3:08        2s        3:56       3:13
 5       123          3:13        2s        4:31       3:25
 6       153          3:22        5s        4:31       3:34
 7       186          3:38        5s        5:08       3:45
 8       213          3:45        5s        5:43       3:55
 9       240          3:52        5s        5:43       4:06
10       267          3:52        5s        5:43       4:15
20       540          5:16        5s        8:59       5:59

Table 1 shows that the MCTS implementation scales best. It shows the length of the plan, the simulation time (Cost), and the runtime for a growing number of vehicles. Here we do not enforce prerequisites, namely that shuttles are prevented from driving into a station if not all required components are available. Given that SPIN is a full-fledged model checker that analyzes the encoding of the problem on the source-code level (resulting in traces that have thousands of steps), the result could have been expected, even though the search space is huge.

The CPU time bound for MCTS was 5 seconds. The RAM requirements remained rather small, less than 4 MB for the largest instance, while the competitors require hundreds of MBs. As with the DES/LVT model, in cost we measure travel time plus some initial waiting time.

To help the solver find valid solutions, we extended the objective function $(\mathit{reached} \cdot 1000) + \mathit{maximum}$ by the term $(e \cdot 10) + (b \cdot 10) + (s \cdot 10) + (d \cdot 100)$, where $e$, $b$, $s$, and $d$ are the violations to the assembling status of electronics, bulbs, seals, and diffusors, respectively.
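A minimal sketch of the extended objective is given below. It assumes that the violation counts are available as plain integers; the struct `Violations` and the function `extendedScore` are hypothetical names and not taken from the implementation.

```cpp
#include <cassert>

// Violation counts of the assembling status, one per component type.
struct Violations { int electronics; int bulbs; int seals; int diffusors; };

// Extended objective: 1000 per unreached goal, plus the maximal shuttle cost,
// plus 10 per electronics, bulb and seal violation and 100 per diffusor violation.
int extendedScore(int unreachedGoals, int maximumCost, const Violations& v) {
  return unreachedGoals * 1000 + maximumCost
       + v.electronics * 10 + v.bulbs * 10 + v.seals * 10 + v.diffusors * 100;
}
```

For one unreached goal, a maximal cost of 232, two bulb violations and one diffusor violation, the score is 1000 + 232 + 20 + 100 = 1352. Without violations and with all goals reached, the score equals the maximal cost.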

We observe that there is a difference in the simulation times of LVT and DES even for two shuttles. Hence, we decided to reimplement LVT and have the two cost functions in a close match. Table 2 shows that (due to RAM usage) this implementation of LVT has difficulties scaling and failed for four vehicles, while MCTS remained sufficiently fast (for larger models the bound of 5s turned out to be insufficient to solve all models with no constraint violations).

Table 2: Efficiency of MCTS for a rising number of shuttles, compared to reimplementation.

 N   MAS Cost   MCTS Cost   MCTS CPU   LVT Cost   LVT CPU
 2     4:01        3:17        5s        3:03       <1s
 3     4:06        3:23        5s        3:19       79s
 4     4:46        3:41        5s         –          –
 5     4:16        3:59        5s         –          –
 6     5:29        4:30        5s         –          –

In Table 2 we also added the results of the simulated multiagent system, where the agents chose the color of the lamp dynamically based on fuzzy logic decision rules that take the incoming orders and observed current queue lengths into account.

8 CONCLUSION AND DISCUSSION

Monte-Carlo Tree Search is a general exploration strategy that leads to concise solver prototypes not only for games but for many combinatorial optimization problems including multiagent optimization problems.

We proposed the application of MCTS to evaluate a multiagent system that controls the industrial production of autonomous products. As the flow of material is asynchronous at each station, queuing effects arise and additional constraints make the problem NP-hard. Besides validating the design of the system, the core objective of this work was to find plans that optimize the throughput of the system.

We modeled the production line as a set of communicating agents, with the movement of items modeled as communication channels. Experiments showed that the implementation is able to analyze the movements of autonomous products for the model, subject to the partial ordering of the product parts. It derived valid and optimized plans with several hundreds of steps using NRPA. A generic search framework helped to perform policy-based benchmarking. Considering the simplicity of the code and the sequential execution on one CPU core, the obtained results are promising.

NRPA is one means to find such a needle in the haystack. It intensifies the search with increasing recursion depth. The nestedness and policy refreshments relate to exponential restarting strategies known to be effective in the SAT community (Gomes et al., 2000).

An open problem is to find necessary/sufficient criteria for the convergence of NMCS/NRPA. As in most MCTS algorithms based on rollouts, we have probabilistic completeness in the sense that an optimal solution can always be found by chance. However, through nesting and adapting policies the success likelihood can become arbitrarily small, so that for now we cannot say with certainty that the optimum will be reached.

REFERENCES

Bhat, U. (1986). Finite capacity assembly-like queues. Queueing Systems, 1:85–101.

Bouzy, B. (2016). An experimental investigation on the pancake problem. In Computer Games: Fourth Workshop on Computer Games, pages 30–43, Cham. Springer International Publishing.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. (2004). A survey of Monte Carlo tree search methods. 4(1):1–43.

Bürckert, H.-J., Fischer, K., and Vierke, G. (2000). Holonic transport scheduling with teletruck. Applied Artificial Intelligence, 14(7):697–725.

Burman, M. (1995). New results in flow line analysis. PhD thesis, MIT.

Cazenave, T. (2009). Nested monte–carlo search. In IJCAI, pages 456–461.

Cazenave, T. and Teytaud, F. (2012a). Application of the Nested Rollout Policy Adaptation Algorithm to the Traveling Salesman Problem with Time Windows, pages 42–54. Springer.

Cazenave, T. and Teytaud, F. (2012b). Beam nested rollout policy adaptation. In ECAI-Workshop on Computer Games, pages 1–12.

Dorer, K. and Calisti, M. (2005). An adaptive solution to dynamic transport optimization. In Proceedings of the fourth international joint conference on Autonomous agents and multiagent systems, pages 45–51. ACM.

Edelkamp, S. and Cazenave, T. (2016). Improved diversity in nested rollout policy adaptation. In German Conference on AI (KI 2016).

Edelkamp, S. and Greulich, C. (2016). Using SPIN for the optimized scheduling of discrete event systems in manufacturing. In SPIN 2016, pages 57–77. Springer.

Fischer, K., Müller, J. R. P., and Pischel, M. (1996). Cooperative transportation scheduling: an application domain for dai. Applied Artificial Intelligence, 10(1):1–34.

Ganji, F., Morales Kluge, E., and Scholz-Reiter, B. (2010). Bringing Agents into Application: Intelligent Products in Autonomous Logistics. In Artificial intelligence and Logistics (AiLog) - Workshop at ECAI 2010, pages 37–42.

Gomes, C. P., Selman, B., Crato, N., and Kautz, H. (2000). Heavy-tailed phenomena in satisfiability and constraint satisfaction problems. J. Autom. Reason., 24(1-2):67–100.

Greulich, C. and Edelkamp, S. (2016). Branch-and-bound optimization of a multiagent system for flow production using model checking. In ICAART 2016.

Greulich, C., Edelkamp, S., and Eicke, N. (2015). Cyber-physical multiagent simulation in production logistics. In MATES 2015.

Harrison, J. (1973). Assembly-like queues. Journal of Applied Probability, 10:354–367.

Himoff, J., Rzevski, G., and Skobelev, P. (2006). Magenta technology multi-agent logistics i-scheduler for road transportation. In AAMAS 06, pages 1514–1521. ACM.

Hopp, W. and Simon, J. (1989). Bounds and heuristics for assembly-like queues. Queueing Systems, 4:137–156.

Huang, S.-C., Arneson, B., Hayward, R. B., Mueller, M., and Pawlewicz, J. (2013). Mohex 2.0: A pattern-based MCTS Hex player. In Computers and Games, pages 60–71.

Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In ECML, pages 282–293.

Lipper, E. and Sengupta, E. (1986). Assembly-like queues with finite capacity: bounds, asymptotics and approximations. Queueing Systems, pages 67–83.

Manitz, M. (2008). Queueing-model based analysis of assembly lines with finite buffers and general service times. Computers & Operations Research, 35(8):2520–2536.

Morales Kluge, E., Ganji, F., and Scholz-Reiter, B. (2010). Intelligent products - towards autonomous logistic processes - a work in progress paper. In Intern. PLM Conf.

Parragh, S. N., Doerner, K. F., and Hartl, R. F. (2008). A Survey on Pickup and Delivery Problems Part II: Transportation between Pickup and Delivery Locations. Journal für Betriebswirtschaft, 58(2):81–117.

Rosin, C. D. (2011). Nested rollout policy adaptation for monte carlo tree search. In IJCAI, pages 649–654.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529:484–503.
