Multi-Agent Path Finding: Policies Instead of Plans
Jakub Mestek (https://orcid.org/0009-0003-8765-5092) and Roman Barták (https://orcid.org/0000-0002-6717-8175)
Charles University, Faculty of Mathematics and Physics, Prague, Czech Republic
Keywords:
Multi-Agent Path Finding, Non-Deterministic Environment, Policies.
Abstract:
The task of the Multi-Agent Path Finding (MAPF) problem is to find collision-free plans for a set of agents moving
from their starting locations to their destinations. In the classical variant of MAPF, a plan for an agent is a
sequence of actions. In this paper, we suggest a novel approach to solving this problem in a non-deterministic
environment: constructing a solution in the form of policies (one for each agent). A policy prescribes which
action the agent should take in a given situation, described by a location and a timestep.
1 INTRODUCTION
Multi-Agent Pathfinding is the problem of constructing
collision-free plans for a group of agents. The prob-
lem has many applications in various areas, including
automated warehouses, aircraft taxiing on airfields,
control of crossroads, and navigation of units
in video games.
A classical MAPF instance (Stern et al., 2019) consists of a graph G = (V, E) and a set of agents A. For
each agent a ∈ A, its source and destination (goal) vertices are given; we denote them s(a) and
g(a), respectively. Time is discretized into unit-long timesteps. During every timestep, each agent can either
move to an adjacent vertex or stay in its current vertex. A plan for an agent is a sequence of actions
that leads the agent from its source to its destination. A solution of the MAPF instance is then a set of plans
for the individual agents. The main requirement for a solution is the absence of conflicts (collisions)
during execution. We use the three common types of conflicts: vertex, edge, and swapping conflicts
(Stern et al., 2019). A vertex conflict is a situation when two agents are located in the same vertex
at the same timestep. An edge or swapping conflict happens when two agents traverse the same edge at the
same timestep in the same or opposite direction, respectively.
In many real-world environments, agents are not able to follow their plans precisely: they might get delayed
or even perform a wrong action (e.g., fail to turn) (Barták et al., 2019; Li et al., 2021a). Therefore,
some kind of robustness is required: a (minor) deviation from the plan should not cause a collision
with another agent.
This paper introduces a new method of achieving robustness: looking for a solution in the form of a
policy that allows the agent to take different actions based on the current time. This approach enables a
delayed agent both to avoid a location that is already
occupied and to use a shorter path that has become
free (due to arriving later than expected).
1.1 Related Work
Probably the simplest form of non-determinism in
an environment is the possibility of agents being delayed. Such a delay might then cause a collision, even
though the original solution is conflict-free. Therefore, the problem of finding a robust solution has already
been studied quite extensively.
Atzmon et al. (Atzmon et al., 2018) formalized the concept of delay as the insertion of an unexpected
wait action into the agent's plan and introduced the notion of k-robustness. A solution of the MAPF problem
is k-robust if any set of at most k delays (for each agent) does not cause a collision. The approach
is quite successful in avoiding collisions (Atzmon et al., 2018; Barták et al., 2019). However, its main
drawback is the prolongation of agents' plans: the condition basically requires each vertex to be free for
at least k timesteps before another agent is allowed to enter it.
Nekvinda and Barták (Nekvinda and Barták, 2021) suggested achieving k-robustness by using
alternative plans, instead of just time-separating the agents (i.e., adding wait actions). A plan with alternatives
has a tree-like structure with one main branch (the original classical plan) and several other branches
(alternatives) rooting from the main branch. During execution, each agent starts by following its main
plan. In case of experiencing a delay (big enough to cause a collision), an agent might be detoured to
one of the alternative branches. Mainly for computational reasons, their method only provides alternatives
for the main plan, not alternatives for alternatives. Furthermore, the absence of collisions is guaranteed only
among pairs of agents that both follow the main plan and among pairs of agents where one follows its main
plan and the other its alternative plan. There is no guarantee for pairs of agents that both follow an alternative
plan.
Another approach to combat unexpected delays is p-robustness (Atzmon et al., 2020), which aims to
find a solution that will be executed successfully (i.e., without collisions) with probability at least p.
Shahar et al. (Shahar et al., 2021) worked with
a different model of delays, called MAPF with Time
Uncertainty (MAPF-TU). In a MAPF-TU instance, an
interval of possible duration (in timesteps) is assigned
to each action (edge in the graph). The real duration
of the action during execution is then a number from
the interval. The task of MAPF-TU is to find a safe
solution. A solution is safe if it guarantees no col-
lisions during execution for any possible set of real
action durations.
2 PRELIMINARIES
2.1 Conflict Based Search
Conflict Based Search (CBS), introduced by (Sharon
et al., 2012), is one of the most widely-used algo-
rithms for solving MAPF. CBS is a two-level algo-
rithm. The upper level solves coordination of agents,
while the lower level searches for plans (shortest
paths) for each agent individually. This modularity
makes CBS easily modifiable for different variants of
MAPF, including ours. In this section, we describe
the basic variant of CBS in detail.
Firstly, let us define some basic notions. A conflict of agents a_i and a_j that should (given the current
solution) be located simultaneously in vertex v at timestep t is denoted by (a_i, a_j, t, v). A constraint for
an agent a_i is a tuple (a_i, v, t) denoting that the agent is prohibited from being in vertex v at timestep t. A
solution (a set of plans) is consistent with a given set of constraints C if all plans respect the given constraints. A
solution is valid if there are no conflicts among agents.
The idea of CBS is as follows. We maintain a set
of constraints, initially empty. We find the cheapest
consistent solution and check it for validity. If there is
any conflict, the conflict is resolved by adding a con-
straint that prevents that conflict. We iterate this step
until a valid solution is found. A conflict (a_i, a_j, t, v) can be resolved by adding one of the constraints
(a_i, v, t) or (a_j, v, t). In order to obtain an optimal solution, we need to explore both options. Ultimately, this
leads to the so-called Constraint Tree. Optimization is
handled by the higher level of CBS that searches the
Constraint Tree using best-first search which ensures
finding an optimal solution.
The task of the low-level algorithm is to compute a consistent solution in each step. This can be done
individually for each agent; basically, it is a shortest-path problem that just has to avoid the prohibited
states.
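
To make the low level concrete, the following minimal Python sketch searches the time-expanded graph with breadth-first search, assuming unit-length actions and vertex constraints only; the function name and data layout are our own illustration, not an existing implementation of CBS.

from collections import deque

def low_level_plan(graph, source, goal, constraints):
    """Low-level CBS sketch: shortest plan for one agent that avoids the
    prohibited (vertex, timestep) pairs in `constraints`.

    graph: dict mapping each vertex to the list of its neighbours
    Returns the plan as a list of vertices, or None if none is found."""
    t_goal = max((t for v, t in constraints if v == goal), default=-1)
    t_max = max((t for _, t in constraints), default=-1)
    horizon = t_max + len(graph) + 1              # safe bound on useful plan length
    start = (source, 0)
    parents = {start: None}
    queue = deque([start])
    while queue:
        v, t = queue.popleft()
        if v == goal and t > t_goal:              # the agent can stay at the goal forever
            plan, state = [], (v, t)
            while state is not None:
                plan.append(state[0])
                state = parents[state]
            return plan[::-1]
        if t >= horizon:
            continue
        for w in graph[v] + [v]:                  # move to a neighbour or wait
            nxt = (w, t + 1)
            if nxt in constraints or nxt in parents:
                continue
            parents[nxt] = (v, t)
            queue.append(nxt)
    return None

# Example: a 4-cycle A-B-D-C-A; the agent must avoid vertex B at timestep 1.
grid = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['A', 'D'], 'D': ['B', 'C']}
print(low_level_plan(grid, 'A', 'D', {('B', 1)}))   # -> ['A', 'C', 'D']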
Over recent years, CBS has received several improvements that make it a state-of-the-art algorithm
for classical MAPF, such as the Conflict Avoidance Table (CAT) (Sharon et al., 2012), Prioritizing Conflicts (PC)
and Conflict Bypassing (Li et al., 2019), or Corridor Reasoning (Li et al., 2021b).
2.2 Markov Decision Process
A Markov Decision Process (MDP) is a sequential decision process in a non-deterministic (stochastic) environment.
Following Sutton and Barto (Sutton and Barto, 2018), an MDP is a tuple (S, A, R, p). S denotes a finite set of
states, including the starting state s_0, A is a finite set of actions, R is a set of possible rewards, and a function
p: S × R × S × A → [0, 1] describes the dynamics of the system: the value p(s', r | s, a) is the probability of
getting to state s' and obtaining reward r when the agent executes action a in state s. Therefore, p is required
to satisfy the following property:

    ∑_{s' ∈ S} ∑_{r ∈ R} p(s', r | s, a) = 1,   for all s ∈ S, a ∈ A.

In other words, for each choice of s and a, p specifies a probability distribution over {(s', r)}.
A solution to the MDP is a policy π: S → A, i.e., a function that for each state s recommends an action
π(s) that the agent should perform in that state.

The quality of a policy π is measured using the return: the cumulative sum of rewards obtained by the agent
in the environment, possibly discounted by a discount factor γ (a real number between 0 and 1). The purpose
of the discount factor is to decrease the effect of distant actions on the actual value of the return and to ensure
that the return is a finite number even in the case of an infinite horizon.
Due to the stochasticity of the environment, the return might differ each time the agent runs in the environment.
We define the value function v_π(s) of a state s under a policy π as the expected return when starting
in s and following π.

The optimal policy π* is then a policy that maximizes the value function v_π(s_0) of the starting
state s_0. The value function of the optimal policy is called the utility, denoted U(s) = v_{π*}(s).
The relation between rewards and value functions is called the Bellman equation. For the utilities (of the
optimal policy), it holds that

    U(s) = max_{a ∈ A(s)} ∑_{s'} ∑_{r} p(s', r | s, a) [r + γ U(s')],     (1)

which we can rewrite using an expectation over p to

    U(s) = max_{a ∈ A(s)} E_{(s', r) ∼ p(· | s, a)} [r + γ U(s')].     (2)

Using the Bellman equation, we can compute the optimal policy, for example, by the value iteration algorithm (Howard, 1960).
For details on MDP, see, e.g., Sutton and Barto
(Sutton and Barto, 2018), or Russell and Norvig (Rus-
sell and Norvig, 2010).
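
As an illustration of Equation (1), the following sketch implements generic value iteration for a finite MDP whose dynamics are given as lists of (probability, next state, reward) triples; the function and parameter names are our own choice and not tied to any particular library.

def value_iteration(states, actions, p, gamma=0.95, eps=1e-6):
    """Generic value iteration for a finite MDP.

    states:  iterable of states
    actions: dict mapping a state to the actions available in it
    p:       dict mapping (s, a) -> list of (probability, next_state, reward)
    Returns the utility U(s) of each state and a greedy policy."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: best expected reward-plus-discounted-utility over actions
            best = max(sum(prob * (r + gamma * U[s2]) for prob, s2, r in p[(s, a)])
                       for a in actions[s])
            delta = max(delta, abs(best - U[s]))
            U[s] = best
        if delta < eps:
            break
    policy = {s: max(actions[s],
                     key=lambda a: sum(prob * (r + gamma * U[s2])
                                       for prob, s2, r in p[(s, a)]))
              for s in states}
    return U, policy

# Tiny example: 'go' moves from s0 to the absorbing state s1 with reward 1.
S = ['s0', 's1']
A = {'s0': ['stay', 'go'], 's1': ['stay']}
P = {('s0', 'stay'): [(1.0, 's0', 0.0)],
     ('s0', 'go'):   [(1.0, 's1', 1.0)],
     ('s1', 'stay'): [(1.0, 's1', 0.0)]}
U, pi = value_iteration(S, A, P)
print(pi['s0'])   # -> 'go'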
3 OUR APPROACH
3.1 Non-Deterministic MAPF
Environment. We use the following model of a
non-deterministic environment, inspired by the model
used in MAPF-TU (Shahar et al., 2021).
The environment is represented in the standard way as a graph G = (V, E). Each edge (u, v), representing an
action available in vertex u, is associated with a set of possible outcomes {(p_i, v_i, l_i)}_i. The outcome (p_i, v_i, l_i)
means that with probability p_i, the agent ends up in vertex v_i, possibly different from v, and the real duration
(length) of the action is l_i timesteps.

This representation enables us to express both the stochasticity of the real durations of actions (modeling
delays) and non-determinism in the real action outcomes (modeling wrong turns). In the former case, the resulting
vertex v_i is v for all outcomes of (u, v); in the latter, the resulting vertices of some outcomes are set
to vertices other than v, e.g., other neighbors of u.
Furthermore, as the explicit distribution of possi-
ble action results is given, we can directly compute
(and optimize) the expected length of the given plan.
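
One possible Python encoding of this action model is sketched below; the dictionary layout and names are our own illustration (not the paper's implementation), and the same (p_i, v_i, l_i) triples are reused in the later sketches.

# One action (an edge (u, v)) maps to its set of possible outcomes,
# each outcome being a triple (p_i, v_i, l_i) with probabilities summing to 1.

# A delay-only edge: the agent always arrives in v, but may need 2 timesteps.
delay_edge = [(0.8, 'v', 1), (0.2, 'v', 2)]

# A "wrong turn" edge: with a small probability the agent ends up in another
# neighbour w of u instead of v (the duration stays 1).
wrong_turn_edge = [(0.9, 'v', 1), (0.1, 'w', 1)]

# A deterministic wait action in u has exactly one outcome.
wait_u = [(1.0, 'u', 1)]

# The whole environment: every edge (action) mapped to its outcome set.
environment = {('u', 'v'): wrong_turn_edge,
               ('u', 'w'): [(1.0, 'w', 1)],
               ('u', 'u'): wait_u}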
Collision-Free Solution. A solution of non-deterministic MAPF is a set of policies {π_a}. A (single-agent)
policy π_a is a function that, for the current timestep t and the current location v of agent a, outputs an
action π_a(v, t). The solution is required to be collision-free. In order to formalize this property, we adopt the
notion of a safe solution (Shahar et al., 2021). A solution is called safe if there exists no combination
of real outcomes of executed actions that would lead to a vertex, edge, or swapping conflict.
Cost of Solution. The real cost of a single-agent policy π_a is the least timestep t at which the agent arrives
at its destination and never leaves it afterwards (the stay-at-target assumption (Stern et al., 2019)). Because of the
non-deterministic environment, the real cost depends on the actual trajectory of the agent (the actual outcomes of
the performed actions) and thus might differ each time the policy is executed. Therefore, we also define
the expected cost of the policy as the expected value of the real cost over all possible trajectories of the agent.

For a complete solution of a given non-deterministic MAPF instance (a set of policies), we then define
the expected SoC of the solution to be the sum of the expected costs of the individual policies.
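
The expected cost of a single policy can also be estimated empirically by sampling trajectories. A minimal Monte Carlo sketch follows, reusing the outcome triples (p_i, v_i, l_i) from the encoding above and assuming a policy given as a function of (vertex, timestep) that, once it prescribes waiting at the goal, keeps doing so; all names are illustrative.

import random

def simulate_real_cost(environment, policy, start, goal, max_t=1000):
    """Sample one execution of a single-agent policy and return its real cost:
    the timestep from which the agent sits at its goal and (by the assumption
    above) only keeps waiting there."""
    v, t = start, 0
    while t < max_t:
        action = policy(v, t)                       # an edge (v, w); (v, v) is a wait
        if v == goal and action == (goal, goal):
            return t                                # arrived and keeps waiting
        outcomes = environment[action]              # list of (p_i, v_i, l_i) triples
        _, v, dur = random.choices(outcomes, weights=[o[0] for o in outcomes])[0]
        t += dur
    return max_t

def expected_cost_estimate(environment, policy, start, goal, samples=1000):
    """Monte Carlo estimate of the expected cost of one agent's policy."""
    return sum(simulate_real_cost(environment, policy, start, goal)
               for _ in range(samples)) / samples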
3.2 DeltaPolicyCBS
We present DeltaPolicyCBS, an algorithm that finds a safe, optimal (in terms of expected SoC) solution of a
given instance of non-deterministic MAPF. DeltaPolicyCBS is a modification of classical Conflict-Based
Search that outputs single-agent plans in the form of policies instead of action sequences.

The most important modification is the lower level of CBS: instead of the shortest path, we need to
output a minimum-expected-cost policy that respects the given constraints. Another necessary modification is
the addition of a procedure that computes the potential presence of an agent under a given policy. The high
level of DeltaPolicyCBS then looks for potential conflicts among agents and resolves them in the standard
way by adding constraints. A potential conflict is an element of a non-empty intersection of the potential
presences of two agents.
3.2.1 Computing Minimum Cost Policy
In case of no constraints, the problem of computing
the minimum expected cost policy for a given agent a
can be easily formulated as a Markov Decision Pro-
cess with an infinite horizon.
We set the set of states to be V (the vertices of the graph), the set of actions available in vertex v to
A_v = {(v, w) | w ∈ V, (v, w) ∈ E} (all neighbors, including a loop (v, v) that represents the wait action), and the
probability P(v_i, l_i | v, a) of ending up in vertex v_i and obtaining reward l_i when action a is performed in vertex
v to be equal to p_i if a ∈ A_v and (p_i, v_i, l_i) is a possible outcome of the action. Otherwise, P(v_i, l_i | v, a) = 0.
We consider wait actions to be always deterministic; therefore, for each vertex v, the wait action (v, v)
has exactly one possible outcome (1, v, 1): with probability 1, the agent stays in vertex v and the duration
of the action is 1 timestep. So, P(v, 1 | v, (v, v)) = 1.

The only exception is the wait action in the agent's destination vertex g = g(a). We set the reward for
such an action to be 0, i.e., P(g, 0 | g, (g, g)) = 1: when the agent executes the wait action in g, it stays in g but
receives reward 0 (instead of 1). Effectively, this means that waiting at the destination does not increase the
total return and makes the destination an absorbing state of the MDP (note that we minimize the return
(utility), see the next paragraph).
For a given destination vertex g (= g(a)) and policy π: V → A, the expected travel time d_π(v) from
a vertex v to g is equal to E_{(v', l) ∼ P(· | v, π(v))} [l + d_π(v')], i.e.,
an expectation over the possible action outcomes of the sum of the action duration and the expected travel time
from the next vertex. Therefore, d_π(v) is equal to the expected utility of state v in the above-defined MDP if
one is minimizing utility. Note that minimization of utility in an MDP is equivalent to maximization of utility
in an MDP with negative rewards.

The optimal policy for an agent can therefore be computed as the optimal policy of the above-mentioned
MDP. We can use standard algorithms, such as value iteration.
In the case of a non-empty set of constraints, the computation of the policy is a bit more complicated. We
have to ensure that an agent a following its policy π_a will never be in a vertex or edge x at timestep t if
(a, t, x) is one of the constraints.

We can still formulate the problem as an MDP, as follows: the set of states is V × T, and the initial state is
(s(a), 0). We require the optimal policy to bring the agent to a state (g(a), t) (for any t) as soon as possible
(and then keep the agent in the vertex g(a)). Therefore, we set the rewards in the same way (to be the
real duration of the action), except for waiting in the agent's destination vertex, where the reward (penalization)
is 0 only for t ≥ T_goal (until then, it is 1 as for other wait actions). T_goal is the timestep from
which the agent can stay at its destination forever, i.e., it is the timestep of the last constraint denying the
destination vertex.
Formally,
- the set of states is V × T,
- the set of actions available in a given state is A_(v,t) ⊆ A_v; given the constraints, an action is not available at a given timestep t if any of its possible outcomes would lead to a banned state (v', t'),
- the probability of ending in state (v_i, t + l_i) and obtaining reward l_i (when executing an action a in state (v, t)) is P((v_i, t + l_i), l_i | (v, t), a) = p_i if a ∈ A_(v,t) and (p_i, v_i, l_i) is a possible outcome of a; it is 0 otherwise.
Still, we need to work with an infinite-horizon MDP. Hence, the presented formulation would lead to an
infinite set of timesteps T and to an infinite set of MDP states. However, we need the timesteps in the states only
to be able to express the variable (un)availability of actions. An action a from vertex v is unavailable only
if using it would lead to a violation of some constraint. Thus, A_(v,t) depends on t only for t less than or equal
to the timestep of the last constraint, let us denote it t_max. Later (for t > t_max), all actions are available, i.e.,
A_(v,t) = A_v.

Therefore, for any t > t_max, the expected travel time d_π(v, t) is equal to the expected travel time in
the version of the MDP without constraints. Thus, the optimal policy is the same as the optimal policy for
the MDP without constraints (for t > t_max).
Provided that the action durations are positive, an optimal policy can be computed as follows. For
t > t_max, we use the precomputed policy from the version without constraints. For t ≤ t_max, we compute
the policy using the Bellman equation. We use dynamic programming and compute the policy "backwards",
successively for t_max, t_max − 1, ..., 0. Due to the positivity of the action durations, all required values
d_π(v', t + d) at each step have already been computed.
The method is summarized in Algorithm 1. To prove correctness, let us note that the algorithm is in
fact value iteration where the individual states are iterated in topological order. Thus, one iteration is sufficient
to compute the exact utilities of the states (and the optimal policy as well).
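
A compact Python rendering of this backward computation is sketched below (it is not the authors' implementation). It assumes the outcome triples (p_i, v_i, l_i) used throughout Section 3.1, positive action durations, and that the expected travel times d_φ of the unconstrained policy are already available; for t > t_max the agent simply follows the precomputed unconstrained policy φ.

def constrained_policy(vertices, actions, outcomes, goal, constraints, d_phi):
    """Backward dynamic programming over timesteps 0..t_max (a sketch).

    vertices:    iterable of vertices V
    actions:     dict v -> list of edges (v, w) available in v, incl. the wait (v, v)
    outcomes:    dict edge -> list of (p_i, v_i, l_i) triples, durations l_i >= 1
    constraints: set of forbidden (vertex, timestep) pairs for this agent
    d_phi:       dict v -> expected travel time to goal of the unconstrained policy
    Returns pi[(v, t)] and d_pi[(v, t)] for all t <= t_max."""
    t_max = max((t for _, t in constraints), default=-1)
    t_goal = max((t for v, t in constraints if v == goal), default=-1)
    pi, d_pi = {}, {}

    def dist(v, t):
        # beyond t_max the constrained values coincide with the unconstrained ones
        return d_pi[(v, t)] if t <= t_max else d_phi[v]

    for t in range(t_max, -1, -1):                   # backwards: t_max, ..., 0
        for v in vertices:
            best_a, best_val = None, float('inf')    # stays inf if no action is allowed
            for a in actions[v]:
                outs = outcomes[a]
                if any((v2, t + l) in constraints for _, v2, l in outs):
                    continue                         # action unavailable at timestep t
                if a == (goal, goal) and t > t_goal:
                    val = 0.0                        # absorbing wait at the goal
                else:
                    val = sum(p * (l + dist(v2, t + l)) for p, v2, l in outs)
                if val < best_val:
                    best_a, best_val = a, val
            pi[(v, t)], d_pi[(v, t)] = best_a, best_val
    return pi, d_pi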
3.2.2 Incremental Approach
After adding a new constraint c = (a_c, t_c, v_c) to C, the computation of a new policy π' for agent a_c satisfying
all constraints in C is required. However, it is not necessary to compute the policy from scratch; it is
sufficient just to modify the existing policy π. Specifically, it is sufficient to recompute π'(v, t) for those
states (v, t) in which the optimal action might change.
Let us define the set of successor states N(v, t) as the set of states (v', t') such that (v', t') is reachable from
(v, t) using an action. Therefore, according to the Bellman equation, the value d_π(v', t') influences the
choice of the optimal action in (v, t).
The optimal action in a state (v, t) is chosen as

    argmin_{a ∈ A_(v,t)} E_{(v', l) ∼ P(· | (v,t), a)} [l + d_π(v', t + l)].
Data: graph G = (V, E), goal vertex, set of constraints C
Result: π, d_π – optimal policy and corresponding expected travel times to goal

φ, d_φ ← optimal policy and corresponding expected travel times to goal when the constraints are ignored;
t_max ← max{constraint.t | constraint ∈ C};
for t > t_max do
    for v ∈ V do
        π(v, t), d_π(v, t) ← φ(v), d_φ(v);
    end
end
for t ← t_max, t_max − 1, ..., 0 do
    for v ∈ V do
        d_π(v, t) ← min_{a ∈ A_(v,t)} E_{(v', l) ∼ P(· | (v,t), a)} [l + d_π(v', t + l)];
        π(v, t) ← the corresponding action;
    end
end
return π, d_π

Algorithm 1: Computation of optimal policy.
There are two reasons why the value might change. Firstly, the set of allowed actions A_(v,t) might change;
that happens if (v_c, t_c) ∈ N(v, t), i.e., one or more actions lead to the newly prohibited state. Secondly, the
value of d_π(v', t + l) might change; that means there exists (v', t') ∈ N(v, t) such that d_π'(v', t') ≠ d_π(v', t').
Due to this fact, we may compute the new policy π' more efficiently using Algorithm 2.
Recall that for each successor (v', t') ∈ N(v, t), it holds that t' > t (assuming a positive duration of actions).
Hence, the timestep t' of each state in N^{-1} is less than the timestep t of the currently processed state (v, t).
Therefore, each state (v, t) is processed at most once. All states in which the new policy π' or distance
d_π' might differ from π or d_π, respectively, are recomputed. Hence, Algorithm 2 returns the same result
as the original Algorithm 1.
3.2.3 Conflict Detection
To be able to implement the DeltaPolicyCBS algorithm, we need a method to detect possible conflicts
between agents (following their policies). We propose using the concept of potential presence (Shahar
et al., 2021). The idea is to "simulate" the agent following its policy and mark the visited timestep-vertex
and timestep-edge pairs.

The method is shown in Algorithm 3; we show directly a version extended by also marking the probability of presence.
Data: graph G = (V, E), destination goal, set of constraints C, new constraint c, previous optimal policy π and corresponding expected travel times d_π (π respects C \ {c})
Result: π', d_π' – optimal policy and corresponding expected travel times to goal respecting all constraints (including c)

π', d_π' ← π, d_π;
q ← empty queue  // states to recompute
N^{-1} ← {(v', t') | (c.v, c.t) ∈ N(v', t')};
for (v', t') ∈ N^{-1} do
    if (v', t') ∉ q then
        q.Enqueue((v', t'));
    end
end
while q not empty do
    (v, t) ← q.Dequeue();
    d_π'(v, t) ← min_{a ∈ A_(v,t)} E_{(v', l) ∼ P(· | (v,t), a)} [l + d_π'(v', t + l)];
    π'(v, t) ← the corresponding action;
    if d_π'(v, t) ≠ d_π(v, t) then
        N^{-1} ← {(v', t') | (v, t) ∈ N(v', t')};
        for (v', t') ∈ N^{-1} do
            if (v', t') ∉ q then
                q.Enqueue((v', t'));
            end
        end
    end
end
return π', d_π'

Algorithm 2: More effective computation of optimal policy using incremental approach.
The pseudocode uses the following notation:
- PP is the potential presence in vertices and PPE the potential presence in edges,
- PP[t] = {(v_i, p_i)} is a list of vertices where the agent might be situated at timestep t, including the probability,
- the method PP.add((v, t, p)) marks a new option of how the agent might arrive at state (v, t) with probability p: if v is not in PP[t], it is added with probability p; otherwise, its probability is increased by p (probabilities of independent ways of arriving at (v, t) are summed),
- PPE[t, (v, v')] is the probability of the agent being situated on edge (v, v') at timestep t,
- the method PPE.add((v, v'), (t, t'), p) similarly marks a new presence of the agent on the edge (v, v') between timesteps t and t' (during travel over the edge) with probability p; again, p is added to PPE[s, (v, v')] for every timestep s = t, ..., t'.
Data: graph G = (V, E), agent's starting vertex start, policy π
Result: set of states (and probability distribution) where the agent might be located during execution of π

PP.add((start, 0, 1));  // at timestep 0, the agent is in its source vertex (with probability 1)
t ← 0;
while PP not stable do
    for (v, p) ∈ PP[t] do
        for each possible outcome (p', v', l') of the action π(v, t) do
            PP.add((v', t + l', p · p'));
            PPE.add((v, v'), (t, t + l'), p · p');
        end
    end
    t ← t + 1;
end
return PP, PPE

Algorithm 3: PotentialPresence.
The simulation runs until PP stabilizes, i.e., until PP[t'] = PP[t' + 1] is guaranteed for all t' > t. A sufficient
condition is that the agent is at its destination with probability 1 (PP[t] = {(goal, 1)}) and will not
leave it (t > t_max).
Once the potential presence of all agents is computed, potential conflicts can be found easily as an
intersection of the potential presences of two agents (see Algorithm 4). Similarly to the other algorithms, we
show a version extended by outputting the probability of each potential conflict (in addition to the conflict itself).
The probability of the conflict is equal to p_i · p_j because both agents need to be situated in the given
state (v, t) simultaneously.
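
The intersection itself is straightforward; a short sketch of the vertex-conflict part, operating on the PP dictionaries produced above, might look as follows (assuming all PP dictionaries cover the same range of timesteps).

from itertools import combinations

def find_vertex_conflicts(pps):
    """pps: dict agent -> PP (as returned by potential_presence above).
    Returns potential vertex conflicts as tuples (a_i, a_j, v, t, probability)."""
    conflicts = []
    for (ai, pp_i), (aj, pp_j) in combinations(pps.items(), 2):
        for t in pp_i.keys() & pp_j.keys():            # timesteps present for both agents
            for v in pp_i[t].keys() & pp_j[t].keys():  # vertices both might occupy
                conflicts.append((ai, aj, v, t, pp_i[t][v] * pp_j[t][v]))
    return conflicts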
3.2.4 Complete Algorithm
The complete high-level algorithm of DeltaPolicyCBS is shown in Algorithm 5.
4 EXPERIMENTS
We conduct several experiments to compare the proposed policy approach to MAPF with classical k-robustness
and to explore the influence of pruning (ignoring conflicts with a low probability) on the runtime and
the quality of solutions.

By the quality of a solution, we refer to
- the expected cost (expected SoC),
- the real cost (evaluated by simulation),
- the success rate of execution, i.e., whether any collision happened during simulation.
Data: potential presences PP_a, PPE_a for each agent a ∈ Ag
Result: list of potential conflicts (optionally including their probability)

forall a_i ∈ Ag do
    forall a_j ∈ Ag, j < i do
        forall t do
            forall v such that (v, p_i) ∈ PP_{a_i}[t] and (v, p_j) ∈ PP_{a_j}[t] do
                add (a_i, a_j, v, t, p_i · p_j) (or only (a_i, a_j, v, t)) to C;
            end
            forall (v, v') such that PPE_{a_i}[t, (v, v')] = p_i > 0 and PPE_{a_j}[t, (v, v')] = p_j > 0 do
                add (a_i, a_j, (v, v'), t, p_i · p_j) (or only (a_i, a_j, (v, v'), t)) to C;
            end
        end
    end
end
return C

Algorithm 4: FindConflicts.
Data: MAPF instance

Root.constraints ← ∅;
Root.solution ← find individual policies by the low level;
Root.cost ← Cost(Root.solution);
insert Root into OPEN;
while OPEN not empty do
    N ← best node from OPEN;
    pp ← PotentialPresence(N.solution);
    conflicts ← FindConflicts(pp);
    if there is no conflict then
        return N.solution;
    end
    C ← a conflict (a_i, a_j, x, t) from conflicts;
    forall agents a in C do
        NN ← new node;
        NN.constraints ← N.constraints + (a, x, t);
        NN.solution ← N.solution;
        update NN.solution by the low level – update as a policy;
        NN.cost ← Cost(NN.solution);
        if NN.cost < ∞ then
            insert NN into OPEN;
        end
    end
end

Algorithm 5: DeltaPolicyCBS high-level pseudocode.
Figure 1: Warehouse map.
Experiments were run on a computer with an AMD Ryzen 5 3600 CPU (frequency 3.9 GHz) and 32
GB RAM. We used our implementation of DeltaPolicyCBS in Python 3.11.0. To obtain k-robust solutions,
we used Improved k-Robust CBS (Atzmon et al., 2018), implemented by Nekvinda and Barták
(Nekvinda and Barták, 2021).
In our experiments, we abbreviate DeltaPolicy-
CBS as DPCBS. DPCBS-0.001 then denotes the
pruned variant of DeltaPolicyCBS that ignores con-
flicts whose probability is less than 0.001. Of course,
the pruned variant guarantees neither optimality nor safety of the solution.
4.1 Instances
The MAPF Benchmark (Stern et al., 2019) contains several different kinds of maps commonly used for testing
MAPF algorithms. For our experiments, we use the empty 8 × 8 grid map and a warehouse map
(namely, warehouse-10-20-10-2-1). The warehouse map is shown in Figure 1. Instances are created by
taking the first n agents from each of 25 random scenarios for a given map. Solutions of successfully
solved instances are then simulated on 50 presampled sets of real action results. During each simulation,
the real cost (SoC) and the number of collisions are observed.
We used two modes of incorporating non-determinism. In the first set of instances, we set two possible
outcomes to every non-deterministic edge: length 1 with probability 1 − p and length 2 with probability
p. The parameter p can be seen as the probability of a delay.

In the second set of instances, only edges (actions) from selected vertices are set as non-deterministic;
we refer to those vertices as locations with a higher delay probability. In particular, on the Grid map, all
vertices in the 3rd and 5th rows were selected. On the Warehouse map, the inner crossroads in the 5th, 10th, and
15th rows were selected. Other actions are deterministic: they have one and only one possible outcome, with travel
time 1. We refer to this set of instances as mixed, as there are both deterministic and non-deterministic actions.
4.2 Results
For each map, number of agents, and value of p (delay
probability) we present, for each algorithm, the ratio
of solved instances (out of the 25), the success rate
of executions (ratio of simulations that ended without
any collision) and the mean of measured real costs (in
terms of SOC). In order not to penalize an algorithm
that solved more instances by including real costs of
those, possibly more difficult, instances, the execution
success rate and real costs are computed using only
instances solved by all algorithms.
Table 1 shows the results on the map Grid with 10
agents. DPCBS achieves a 100% success rate, sim-
ilarly to 1- and 2-robust classic plans. However, the
real cost of DPCBS solutions is on average lower, es-
pecially compared to 2-robust plans.
Tables 2 and 3 show the results on the map Grid–mixed with 10 and 13 agents, respectively. Similarly
to the previous case, DPCBS is able to reach a success rate of 100% while achieving low real costs,
similar to the 0-robust solution. The k-robust solutions are either not successful on all instances (for k ≤ 1) or
have significantly higher real costs (k = 2). The major drawback of DPCBS is not being able to solve all instances
within the given time limit, as opposed to the k-robust solutions (at least for smaller k).
The same observations also hold for the bigger map, Warehouse–mixed, as shown in Table 4.
Apart from the optimal DPCBS, the tables contain results of its pruned variant DPCBS-0.001. Recall that
DPCBS-0.001 ignores conflicts whose probability is less than 0.001 and therefore is not optimal and does
not guarantee the safety of the solution. Notwithstanding, no collisions occurred during the simulations,
and the real costs of the solutions were almost identical to those of DPCBS (on instances solved by both
algorithms). On the other hand, due to ignoring some conflicts, DPCBS-0.001 was able to solve more
instances than DPCBS, especially when the delay probability p was rather small.
Table 1: Quality of solutions, instances on the map Grid, 10 agents.

              Instances solved       Execution success rate     Real cost
Algorithm     p=0.1   0.2    0.5     0.1     0.2     0.5        0.1      0.2      0.5
DPCBS         0.52    0.56   0.68    1.00    1.00    1.00       50.22    54.82    69.79
DPCBS-0.001   0.72    0.64   0.68    1.00    1.00    1.00       50.21    54.81    69.84
0r-plan       1.00    1.00   1.00    0.72    0.61    0.64       49.31    53.84    68.91
1r-plan       1.00    1.00   1.00    0.98    0.97    0.97       50.41    55.03    70.40
2r-plan       1.00    1.00   1.00    1.00    1.00    1.00       52.78    57.49    73.14
Table 2: Quality of solutions, instances on the map Grid–mixed, 10 agents.

              Instances solved       Execution success rate     Real cost
Algorithm     p=0.1   0.2    0.5     0.1     0.2     0.5        0.1      0.2      0.5
DPCBS         0.88    0.88   1.0     1.00    1.00    1.00       49.67    51.27    57.05
DPCBS-0.001   1.00    1.00   1.0     1.00    1.00    1.00       49.65    51.27    57.05
0r-plan       1.00    1.00   1.0     0.78    0.66    0.53       49.59    51.70    59.14
1r-plan       1.00    1.00   1.0     0.99    0.97    0.92       51.06    53.14    60.66
2r-plan       1.00    1.00   1.0     1.00    1.00    0.99       53.83    55.95    63.93
Table 3: Quality of solutions, instances on the map Grid–mixed, 13 agents.

              Instances solved       Execution success rate     Real cost
Algorithm     p=0.1   0.2    0.5     0.1     0.2     0.5        0.1      0.2      0.5
DPCBS         0.68    0.68   0.8     1.00    1.00    1.00       62.29    64.38    70.81
DPCBS-0.001   0.76    0.72   0.8     1.00    1.00    1.00       62.29    64.38    70.81
0r-plan       1.00    1.00   1.0     0.69    0.49    0.35       62.03    64.80    72.97
1r-plan       1.00    1.00   1.0     0.99    0.95    0.85       65.03    67.69    75.86
2r-plan       0.60    0.60   0.6     1.00    1.00    0.97       70.57    73.37    81.69
Table 4: Quality of solutions, instances on the map Warehouse–mixed, 20 agents.

              Instances solved       Execution success rate     Real cost
Algorithm     p=0.1   0.2    0.5     0.1     0.2     0.5        0.1        0.2        0.5
DPCBS         0.72    0.76   0.80    1.00    1.00    1.00       1537.67    1539.99    1555.67
DPCBS-0.001   0.80    0.84   0.80    1.00    1.00    1.00       1537.67    1539.99    1555.68
0r-plan       0.88    0.88   0.88    0.84    0.76    0.70       1540.51    1545.94    1571.55
1r-plan       0.80    0.80   0.80    1.00    0.99    0.92       1541.16    1546.77    1572.55
2r-plan       0.72    0.72   0.72    1.00    0.99    0.93       1541.17    1546.66    1572.17
Table 5: Quality of solutions, instances on the map Grid–mixed with non-deterministic resulting vertex, 7 agents.

              Success rate       Real costs (average)
Algorithm     p=0.1    p=0.5     p=0.1     p=0.5
DPCBS         0.00     0.21      —         43.57
DPCBS-0.001   0.36     0.36      36.58     40.72
Table 6: Quality of solutions, instances on the map Warehouse–mixed with non-deterministic resulting vertex, 10 agents.

              Success rate       Real costs (average)
Algorithm     p=0.1    p=0.5     p=0.1      p=0.5
DPCBS         0.52     0.72      671.72     661.09
DPCBS-0.001   0.96     0.96      660.42     667.47
4.3 Non-Deterministic Resulting Vertex
Additionally, we conduct another set of experiments in environments with non-deterministic results of actions.
Recall that in our model of the environment, a set of possible outcomes {(p_i, v_i, l_i)}_i is assigned to each
action. In this set of experiments, the outcomes of an action might differ in the vertex v_i in which the agent ends up.
This set of experiments is conducted on the maps Grid–mixed and Warehouse–mixed. The possible outcomes
of a move action (u, v) from vertex u are set as follows. The duration l_i is always set to 1. If u is not
marked (as a location with a higher delay probability), the resulting vertex v_i is equal to v. If u is marked,
the action has 3 possible resulting vertices (3 possible outcomes): v with probability 1 − 2p, v_{−1} with probability
p, and v_{+1} with probability p. Vertices v_{+1} and v_{−1} are the neighbors of u located clockwise and
counterclockwise from v, respectively.

The possible outcomes correspond to the possibility that an agent traveling through a marked vertex turns
90° left or right from the desired direction. Wait actions are always deterministic.
Results on the map Grid–mixed with 7 agents are shown in Table 5, and Table 6 shows the results on
Warehouse–mixed with 10 agents. Contrary to the previous section, we present a success rate, which
is the ratio of simulations in which both the instance was solved by the algorithm and the simulation ended
without any collision. Both the success rate and the real costs are computed from all instances and simulations,
for each algorithm independently (i.e., they are not limited to instances solved by both algorithms only). The
reason is the low ratio of instances solved by one of the algorithms.
The results show that with p = 0.5, the algorithms are more successful but the cost of the solutions is
higher. This is expected because, with p = 0.5, non-deterministic actions have in fact only two possible results.
On the other hand, these lead to opposite vertices.
DPCBS (without pruning) is not very successful,
especially on the smaller map with more agent inter-
actions and a higher density of non-deterministic ac-
tions. We assume it might be caused simply by the
non-existence of a safe solution. The existence of a
safe solution in non-deterministic MAPF remains an
open problem.
Nevertheless, results illustrate that by using our
approach (policies instead of plans) it is possible to
solve at least some instances of MAPF in an environ-
ment with non-deterministic resulting vertices. Let us
recall that classical plans (sequences of actions) can-
not be used in such environments.
5 CONCLUSIONS
In this paper, we introduced a novel idea of representing a MAPF solution by policies, instead of sequences
of actions or sequences of visited vertices. Policies allow the agent to take different actions based on the actual
timestep of arrival at a vertex. Therefore, agents are able to react to delays experienced during execution
while we still stay in the area of offline MAPF: the solution (policies) is computed completely before
the execution.
The use of policies is beneficial only in non-deterministic environments where agents might experience
delays or other unpredicted events. We formalized such a non-deterministic environment in a way
similar to an MDP: a set of possible outcomes is assigned to each action (a move to a neighboring vertex).
A possible outcome consists of the duration of the action (which allows us to model a delay) and the
resulting vertex (to model, for example, a wrong turn).
The main part of this paper presents an algorithm, called DeltaPolicyCBS (DPCBS), that optimally (in
terms of expected cost) solves MAPF in the above-mentioned version of a non-deterministic environment
and returns the solution in the form of policies. The algorithm is a modification of the well-known Conflict-Based
Search in which the individual shortest-path problems are formulated as Markov Decision Processes.
A solution of the MAPF problem then consists of (optimal) policies for those MDPs.
In the last section, DPCBS was experimentally
evaluated on several types of MAPF instances. So-
lutions were executed in a simulation and compared
to classical k-robust plans. We showed that usage
of policy-based solutions found by DPCBS leads to
no collisions during execution while keeping low real
costs of the solutions.
The intent of our work was to explore a novel approach to solving MAPF in a non-deterministic environment.
We have shown that using policies is beneficial not only in environments in which the results of actions
are non-deterministic (as those environments cannot be solved using standard plans) but also in environments
in which the only possible non-determinism is delays: using policies reduces real costs while ensuring
no collisions. Our work opened several directions for future research. For the approach to be usable in
practice, a more efficient method (than our current version of DPCBS) for finding a MAPF solution in the
form of policies is needed, possibly accompanied by a clever, efficient data structure for storing policies in
a compact way. The incapability of DPCBS to solve even some rather small instances opens the question of
the actual existence of a safe policy solution of a given MAPF instance, especially in highly non-deterministic
environments, such as our non-mixed Grid map with a delay probability on each edge.
ACKNOWLEDGEMENTS
Research is supported by project No. 23-05104S of the Czech Science Foundation.
REFERENCES
Atzmon, D., Stern, R., Felner, A., Sturtevant, N. R.,
and Koenig, S. (2020). Probabilistic Robust Multi-
Agent Path Finding. Proceedings of the International
Conference on Automated Planning and Scheduling,
30(1):29–37.
Atzmon, D., Stern, R., Felner, A., Wagner, G., Barták, R., and Zhou, N.-F. (2018). Robust Multi-Agent
Path Finding. Symposium on Combinatorial Search (SoCS), pages 2–9.
Barták, R., Švancara, J., Škopková, V., Nohejl, D., and Krasičenko, I. (2019). Multi-agent path finding on real
robots. AI Communications, 32(3):175–189.
Howard, R. (1960). Dynamic programming and Markov
processes. Technology Press of Massachusetts Insti-
tute of Technology.
Li, J., Chen, Z., Zheng, Y., Chan, S.-H., Harabor, D.,
Stuckey, P. J., Ma, H., and Koenig, S. (2021a). Scal-
able rail planning and replanning: Winning the 2020
flatland challenge. Proceedings of the International
Conference on Automated Planning and Scheduling,
31(1):477–485.
Li, J., Felner, A., Boyarski, E., Ma, H., and Koenig, S.
(2019). Improved Heuristics for Multi-Agent Path
Finding with Conflict-Based Search. In Proceedings
of the Twenty-Eighth International Joint Conference
on Artificial Intelligence, IJCAI-19, pages 442–449.
International Joint Conferences on Artificial Intelli-
gence Organization.
Li, J., Harabor, D., Stuckey, P. J., Ma, H., Gange, G., and
Koenig, S. (2021b). Pairwise symmetry reasoning for
multi-agent path finding search. Artificial Intelligence,
301:103574.
Nekvinda, M. and Barták, R. (2021). Contingent Planning for Robust Multi-Agent Path Finding. In 2021 IEEE
33rd International Conference on Tools with Artificial Intelligence (ICTAI), pages 487–492.
Russell, S. J. and Norvig, P. (2010). Artificial Intelligence:
A Modern Approach. Prentice Hall.
Shahar, T., Shekhar, S., Atzmon, D., Saffidine, A., Juba, B.,
and Stern, R. (2021). Safe Multi-Agent Pathfinding
with Time Uncertainty. J. Artif. Int. Res., 70:923–954.
Sharon, G., Stern, R., Felner, A., and Sturtevant, N. (2012).
Conflict-Based Search For Optimal Multi-Agent Path
Finding. Proceedings of the AAAI Conference on Ar-
tificial Intelligence, 26(1):563–569.
Stern, R., Sturtevant, N., Felner, A., Koenig, S., Ma, H., Walker, T., Li, J., Atzmon, D., Cohen, L., Kumar, T.
K. S., Boyarski, E., and Barták, R. (2019). Multi-Agent Pathfinding: Definitions, Variants, and Benchmarks.
Symposium on Combinatorial Search (SoCS), pages 151–158.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learn-
ing: An Introduction. The MIT Press, 2 edition.