Multi-Agent Deep Reinforcement Learning for Collaborative Task
Scheduling
Mali Imre Gergely
Faculty of Mathematics and Computer Science, Babeș-Bolyai University, Kogălniceanu Street, Cluj-Napoca, Romania
Keywords:
Multi-Agent Reinforcement Learning, Multi-Agent Systems, Resource Management, Task-Scheduling.
Abstract:
Efficient, scalable and cost-effective resource management is a multi-faceted online decision-making problem, increasingly faced in networking and cloud computing. More specifically, task scheduling stands out as a complex challenge whose solution is critical for the optimal functioning of today's systems. Traditional heuristic approaches to scheduling are laborious to design and especially difficult to tune, therefore various machine-learning based methods have been proposed. Reinforcement Learning (RL) has shown great results in similar decision-making problems, and many existing approaches employ RL to solve task scheduling problems. Most of these works either consider single-agent scenarios (and thus suffer from scalability issues), or present multi-agent applications that are heavily specialised. We propose a general-purpose multi-agent RL framework that can successfully learn collaborative optimal scheduling policies, taking one step further towards clouds and networks that are both scalable and autonomous. Our experiments show that these agents can collaboratively learn optimal scheduling policies for dynamic workloads.
1 INTRODUCTION
Optimal resource management is a critical aspect of the operation of various real-life systems, such as logistics, computer networks, cloud-based applications, transportation, and manufacturing. In particular, computer networks and cloud-based applications have experienced immense growth in popularity recently. These systems typically must serve large numbers of users and must manage substantial volumes of data and significant computational demands. Effective task scheduling is a vital aspect of the optimal functioning of these systems.
Task scheduling itself is a well-studied NP-hard problem, which under certain formulations is equivalent to the intensely scrutinized "Job-Shop Scheduling" problem (Dauzère-Pérès et al., 2023).
Efficient task scheduling in computer networks ensures a low makespan of tasks, thus achieving low latency, high reliability and good performance. Classical approaches to task scheduling generally use complex hand-crafted heuristics that are not only tedious to conceive and difficult to calibrate, but also lack flexibility with respect to changes in the metrics of interest and other aspects of the environment. To combat this, RL-based solution proposals have become more and more frequent, especially since DeepRM was introduced (Mao et al., 2016).
While some approaches are directly implemented
for practical use in real-world cluster management
scenarios such as machine-learning clusters (Zhao
and Wu, 2021), a significant portion of existing stud-
ies approach these problems by developing and using
simulated environments (Mao et al., 2016; Fan et al.,
2022). These simulators and testbeds offer a controllable setting that eases experimentation with different algorithms and strategies, and allows for more rigorous analysis of results. This can ultimately be useful in developing efficient and robust solutions that can later be incorporated into real clusters.
Much of the existing research considers a single agent, generally handling task scheduling for an entire cluster. The existing exceptions that employ multi-agent methods (Zhao and Wu, 2021) are largely situation-specific and heavily intertwined with the intricacies of the type of cluster or system they consider, lacking generality. Single-agent approaches are problematic because growth in the number of machines/instances typically increases the action space of RL-based approaches, making them increasingly inefficient as these systems scale.
This paper considers a novel approach that formu-
lates multi-resource, multi-machine and multi-agent
task scheduling as a collaborative Partially Observ-
able Markov Game (POSG). However, we use an implementation based on a novel pattern and provide a configurable environment built on the Agent-Environment-Cycle (AEC) model (Terry et al., 2021). Since AEC games have also been proven to be equivalent to POSGs and Extensive Form Games (EFGs), this abstraction is judicious.
We then apply Proximal Policy Optimization (PPO) (Schulman et al., 2017) in a decentralised way and experimentally show that multiple homogeneous, independent RL agents can jointly learn scheduling policies comparable to or better than heuristic-based approaches. Additionally, we explore the effects of local vs. global observations and rewards on performance.
2 BACKGROUND
2.1 Multi-Agent Reinforcement
Learning
Reinforcement Learning is one of the main paradigms
of machine learning that models decision-making
problems as Markov Decision Processes and builds
on the idea of learning from interaction (Sutton and
Barto, 2018).
Formally, a Markov Decision Process can be characterized as a tuple $(S, A, P, R, \gamma)$, where $S$ is the set of possible states, $A$ is the set of possible actions, $P$ encapsulates the transition probabilities describing the dynamics of the environment, $R$ is a reward function, and $\gamma$ is the discount factor, which controls how short- or far-sighted the agent is.
The agent interacts with the environment by observing at each timestep $t$ a state $s_t$ (or a local observation $o_t$ in partially observable settings), based on which it decides on an action $a_t$ and sends it to the environment. The environment then transitions to the next state according to $P$, and the agent receives a reward $r_{t+1}$ and observes the next state $s_{t+1}$. The agent's goal is to learn a policy $\pi$, according to which action selections are made in a way that maximises cumulative rewards.
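For illustration, this interaction loop can be sketched as follows, assuming a Gymnasium-style environment interface; the `policy` callable is a placeholder for any mapping from observations to actions and is not part of our implementation.

```python
import gymnasium as gym

def run_episode(env: gym.Env, policy, gamma: float = 0.99) -> float:
    """Roll out one episode and return the discounted return (illustrative sketch)."""
    obs, info = env.reset()
    done, ret, discount = False, 0.0, 1.0
    while not done:
        action = policy(obs)                       # a_t chosen from s_t (or o_t)
        obs, reward, terminated, truncated, info = env.step(action)
        ret += discount * reward                   # accumulate gamma^t * r_{t+1}
        discount *= gamma
        done = terminated or truncated
    return ret
```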
In multi-agent scenarios, each agent might have a
different set of actions, and might observe different
subsets of the state. What makes multi-agent scenar-
ios more challenging is that from the perspective of
each agent, the rest are part of the environment. Thus,
when multiple agents are learning at the same time,
the environment exposes an inherent non-stationarity.
Among RL methods, Proximal Policy Optimisation stands out as an advanced policy-gradient method with a unique approach to the balance between exploitation and exploration. Unlike classical methods, PPO optimises a clipped surrogate objective function. The clipping mechanism is central to this approach, as it effectively prevents the algorithm from committing updates so large that they would change the policy drastically (hence the name "proximal"). The objective function can be written as follows:
$$L(\theta) = \hat{E}\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t\right)\right]$$

Here, $\hat{A}_t$ is an approximator of the advantage function (Schulman et al., 2015), $r_t(\theta)$ is the ratio between the new and the old policy, and $\varepsilon$ is a hyperparameter, generally set to 0.2. The clip function modifies the surrogate objective and prevents the ratio from jumping outside of the interval $[1-\varepsilon, 1+\varepsilon]$.
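For illustration, a minimal PyTorch sketch of this clipped surrogate loss follows; it is not the exact training code used in our experiments, and the function name is only illustrative.

```python
import torch

def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective L(theta), negated so it can be minimised."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # r_t(theta)
    unclipped = ratio * advantages                              # r_t(theta) * A_t
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # maximise E[min(., .)]
```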
PPO has gained immense popularity due to its simplicity, applicability and effectiveness in various scenarios, from chip design (Mirhoseini et al., 2021) to robotics (OpenAI et al., 2018). Furthermore, PPO has proved to be surprisingly effective in cooperative multi-agent games (Yu et al., 2022), which further validates our idea to apply multi-agent PPO to task scheduling. To the best of our knowledge, however, this has not yet been done in the literature for this particular problem in a multi-agent setting.
2.2 Agent Environment Cycle Games
Due to recent advances, RL has gained wide recog-
nition and popularity, especially in the field of games
and game theory (Silver et al., 2017). This has led
to the development of tools, frameworks and libraries
such as OpenAI gym (Brockman et al., 2016), Open-
Spiel (Lanctot et al., 2019), and PettingZoo along
with the Agent Environment Cycle (AEC) model
(Terry et al., 2021). These tools prove to be useful
in rigorous modeling and experimentation in RL, and
are applicable to scheduling simulators as well.
Among these, the AEC model and the corresponding library, PettingZoo, stand out in that they successfully address common errors and provide clearer attribution of rewards through a more conceptually consistent approach. Agents in the AEC model re-
ceive observations, take actions, and receive rewards
sequentially, which mirrors the sequential nature of
decision-making in most real-world scenarios. This
approach not only simplifies the attribution of re-
wards and actions but also implicitly prevents the in-
troduction of race conditions, frequently encountered
in models where agent actions are committed simulta-
neously. Moreover, AEC games have also been proven to be
equivalent to POSG games, which makes the model
widely applicable.
Formally, AEC games can be characterized as a tuple $(S, s_0, N, (A_i)_{i\in[N]}, (T_i)_{i\in[N]}, P, (R_i)_{i\in[N]}, (\Omega_i)_{i\in[N]}, (O_i)_{i\in[N]}, v)$, where:
- $S$ is the set of states and $s_0$ is the initial state.
- $N$ is the number of agents. Typically, agents are numbered from 1 to $N$, and the environment agent is denoted with 0.
- $A_i$ is the set of actions for agent $i$. For convenience, $A_0$ is generally set to be void.
- $T_i$ is a state transition function for agent $i$, while $P$ is the transition function for the environment.
- $R_i$ is the set of possible rewards for agent $i$.
- $\Omega_i$ is the set of possible observations for agent $i$, while $O_i$ is the observation function.
- $v$ is a special function that computes the next agent in the agent environment cycle.
Under this formulation, games start in state $s_0$. The first agent to act is the environment agent, after which the game continues in turns: agents receive an observation $\omega_i$ and choose an action $a_i$. Then the game transitions deterministically to the next state according to $T_i$. When $i = 0$, a stochastic transition according to $P$ happens. Agents take their turns in the order defined by $v$. Analogously, at every step agents receive partial rewards $r_i$, which are aggregated at the end of each cycle. This model allows for turn-based as well as for synchronous games.
For our purposes, the AEC game means that each scheduler agent takes its turn in choosing tasks to schedule, after which the environment agent transitions to the next timestep. This ensures that all agents have their inner clocks synchronized, and we avoid the need for complex time synchronization mechanisms.
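A sketch of how such an AEC scheduling environment is driven follows, assuming the agent-iteration protocol of recent PettingZoo releases; the `policies` mapping is a placeholder and not part of the library.

```python
from pettingzoo import AECEnv

def run_cycle(env: AECEnv, policies: dict) -> None:
    """Drive one episode of an AEC environment: schedulers act in turn each timestep."""
    env.reset()
    for agent in env.agent_iter():
        obs, reward, terminated, truncated, info = env.last()
        if terminated or truncated:
            action = None                      # PettingZoo convention for finished agents
        else:
            action = policies[agent](obs)      # e.g. pick a (machine, job slot) index
        env.step(action)                       # the env advances time after the last scheduler
```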
2.3 Task Scheduling
Efficient task scheduling has a pivotal role in optimiz-
ing resource utilization and maintaining high quality
of service, upholding service-level agreements and
meeting user requirements. These aspects play an
increasingly important role in environments experi-
encing fluctuating workloads and changing priorities.
The task scheduling problem is generally framed in
terms of the following components:
1. jobs, which are waiting to be processed; jobs often have known resource profiles
2. machines, which process jobs according to their resource constraints and can only run tasks that fit within those constraints
3. processing time, the duration for which a job occupies its allocated resources
4. schedule, a plan detailing which tasks are to be processed on which machines at what times
5. scheduler, the entity that finds a schedule
6. objective, typically the minimisation of some metric such as job slowdown, makespan, etc.
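For concreteness, a minimal sketch of the data behind these components is given below; the field names are illustrative and do not mirror the simulator's internal representation.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    demands: list[int]        # resource profile: units required of each resource type
    length: int               # processing time, in discrete timesteps
    arrival_time: int

@dataclass
class Machine:
    capacity: list[int]       # per-resource capacity (the resource constraints)
    running: list = field(default_factory=list)   # jobs currently being processed

@dataclass
class ScheduleEntry:
    job: Job
    machine_id: int
    start_time: int           # a schedule is a collection of such entries
```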
Under this formulation, our goal is to come up with
a set of RL-based scheduler agents, each handling a
subset of machines of a cluster, concurrently and col-
laboratively finding scheduling plans that are optimal
under dynamic workloads.
3 RELATED WORK
The field of resource management has gained signif-
icant attention, focusing particularly on task schedul-
ing in cloud environments, as well as job-shop
scheduling challenges. Reinforcement learning has
served as a key technique in recent advancements
(Wang et al., 2022; Liu et al., 2020).
DRAS (Fan et al., 2022) aims to solve the prob-
lem of task scheduling, specialised in High Perfor-
mance Computing (HPC) networks. The authors also
provide a gym-like simulation environment for train-
ing called CQGym. They present multiple single-
agent RL methods including Deep Q-Learning, A2C,
Vanilla Policy-Gradient and PPO, and evaluate them
on real HPC cluster traces. Their experiments show
the PPO scheduler agent to perform the best.
The work of Tassel et al. (Tassel et al., 2021) also comes close to ours in that it both addresses job-shop scheduling and employs PPO, but it only considers a single-agent scenario. Instead of randomly generated data, this work considers benchmark datasets.
Another recent study (Zhu et al., 2023) presents
a double-deep Q-network based Multi-Agent Rein-
forcement Learning (MARL) approach to solve real-
time scheduling in a robot-assisted flexible job shop,
using centralized training and decentralized execu-
tion on job sequencing, process planning and machine
selection in a simulated environment. Their results
show improvements up to 50% as compared to heuris-
tics. Another popular approach to job-scheduling
with MARL is using homogeneous agents and Graph
Neural Networks to harness topology-awareness in
large clusters (Zhao and Wu, 2021).
DeepMAG (Zhang et al., 2023) applies a Deep Q-
Network based cooperative multi-agent method, op-
erating over a multi-agent graph based on interop-
eration relationships between machines. While this
work addresses manufacturing settings primarily, it
also comes close to our approach as the problem of
cooperation is directly addressed.
DeepRM and DeepRM 2 (Mao et al., 2016; Ye et al., 2018) are the most prominent works in the area of task scheduling with RL; one of their most significant contributions is a flexible, intuitive and extensible simulation environment, on which our environment is based as well. DeepRM first demonstrated the efficacy of RL methods over traditional heuristics.
It is also the first approach that employs Policy Gra-
dient methods, having used the REINFORCE algo-
rithm (Sutton et al., 1999). While other works also base their simulation environment on the one described in DeepRM, including an extension to multiple machines (Paduraru et al., 2023), to the best of our knowledge ours is the first to extend it to the multi-agent setting.
4 DESIGN
This section elaborates on the design of the main com-
ponents of our approach, including the environment,
agents and rewards.
4.1 Environment
We describe the simulation environment provided,
along with the evaluation methods and bases of com-
parison, highlighting similarities and differences be-
tween our approach and existing ones. Inspired by
DeepRM, we provide a simulation environment that
models a multi-machine and multi-resource cluster.
Resources, job slots and the backlog are modeled equivalently to DeepRM. Each machine has multiple types of resources. Tasks arrive into a fixed number of job slots, from which the scheduler agents choose candidates to allocate. Once a job has been allocated, it is moved to the corresponding machine's running jobs, its resources are marked as used, and the job is removed from the job slots.
New jobs arrive with a frequency governed by a Bernoulli process. The resource demands and time lengths of the jobs are chosen randomly according to a pre-computed expected load/stress factor. Newly arriving jobs are placed in one of the available job slots. If none is available, the new job is added to the backlog, from which jobs are placed into the slots in a first-come first-served manner as soon as job slots free up.
Additionally, we provide not just one but multiple distinct machines within the cluster. These machines are partitioned evenly across the agents, each agent being responsible for scheduling tasks on its own machines. Allowing one agent to handle not just one but possibly multiple machines further increases the generality of our simulator. Our experiments only consider homogeneous agents, meaning each handles the same number of machines with the same number of resources, identical capacities and identical time horizons.
Further, we encapsulate the aforementioned system in an AEC environment for the MARL setup. The use of the AEC model is pivotal in this setting, since it disambiguates the possible conflicts arising between agents over their shared resource pool, the job slots.
4.2 Agents
We follow a decentralized training and decentralized
execution pattern. We train a separate PPO model
for each agent. In each epoch, we choose one agent
and train it for one episode with the rest of the agents
fixed, then repeat for a number of episodes before pro-
ceeding to the next epoch.
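One possible reading of this round-robin procedure is sketched below; the `learners` wrappers and their `collect_episode`/`update` methods are hypothetical names, not our actual code.

```python
def train(env, learners: dict, epochs: int, episodes_per_epoch: int) -> None:
    """Decentralised training: one agent learns per episode while the others stay fixed."""
    agent_ids = list(learners)
    for epoch in range(epochs):
        for episode in range(episodes_per_epoch):
            learner_id = agent_ids[episode % len(agent_ids)]      # whose turn to learn
            frozen = {a: learners[a].policy for a in agent_ids if a != learner_id}
            rollout = learners[learner_id].collect_episode(env, frozen_policies=frozen)
            learners[learner_id].update(rollout)                  # PPO update from own experience
```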
Figure 1: Sample cluster state with two resources, two
pending job slots, two machines, two machines per agent, a
time horizon of four discrete units and a maximum resource
capacity of three units. The top row shows a possible lo-
cal observation for Agent 1, while the bottom row contains
the full global state. The job slot and backlog are shared
between both.
The actions of the agents are integers between 0 and $N \times M$, where $N$ is the number of machines per agent and $M$ is the size of the job slot. Given action $a$, the scheduler allocates the job with index $a \bmod M$ in the job slot to its machine with index $\lfloor a / M \rfloor$. This effectively reduces the action space for agents, since a single-agent approach
would have to deal with $N \times M \times K$ actions, where $K$ is the total number of agents (the system having $N \times K$ machines).
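This decoding is a simple division/modulo pair, sketched here with illustrative names:

```python
def decode_action(a: int, M: int) -> tuple[int, int]:
    """Map a flat action a in [0, N*M) to (local machine index, job slot index)."""
    machine_idx = a // M       # which of the agent's N machines
    job_slot_idx = a % M       # which of the M jobs in the shared job slot
    return machine_idx, job_slot_idx
```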
Having multiple agents also opens up the possibility of reduced observation spaces. Our state representation is similar to DeepRM's in that we use images to describe the current cluster state. The cur-
rent state of the environment can be described by the
job allocation statuses of each machine for each re-
source, the resource demands of the first M waiting
jobs (the job slot) and the number of waiting jobs in
the queue. When using local observations, each agent
will only observe the state of its own machines and
their job allocations, along with seeing the resource
profiles of the M jobs in the job slot. It is also important to note that since the job slots and backlog are shared among all agents, this part of the observation is common to all of them; the de-facto locality of the observations comes from each agent observing only its own machines. Figure 1 shows examples of a global state and a local observation.
4.3 Data
We simulate job arrivals according to a Bernoulli process. Jobs are generated as either short or long: short jobs have a time length between $L_{sl} = 1$ and $L_{su} = \frac{1}{2}H$, where $H$ is the time horizon, while long jobs have a time length between $L_{ll} = \frac{2}{3}H$ and $L_{lu} = H$. Jobs are long with probability $p_{long}$ and short otherwise. Further, each job has a dominant resource and a recessive one, chosen at random. The dominant resource demand is chosen randomly between $D_{dl} = \frac{1}{2}C$ and $D_{du} = C$, where $C$ is the maximum resource capacity, while the non-dominant resource demand is chosen between $D_{rl} = 1$ and $D_{ru} = \frac{1}{2}C$. The expected job length can then be computed as follows:
be computed as follows:
ˆ
E[Length] = (1 p
long
)
L
su
+ L
sl
2
+ p
long
L
lu
+ L
ll
2
Analogously, in a system with two resources, the expected demand for resource $r$, $\hat{E}[D(r)]$, can be computed as:

$$\hat{E}[D(r)] = p_{dom}\,\frac{D_{du} + D_{dl}}{2} + (1 - p_{dom})\,\frac{D_{ru} + D_{rl}}{2}$$
Here, $p_{dom}$ is the probability of a resource being dominant, and the expected load on resource $r$ on one machine can then be computed as:

$$\hat{E}[\mathrm{Load}(r)] = q^2\,\frac{\hat{E}[D(r)]\,\hat{E}[\mathrm{Length}]}{B_r}$$
Here, $B_r$ is the maximum number of available resource units for resource $r$, and $q$ is the job arrival rate. Note, however, that the job arrival rate is raised to the second power. This is due to the fact that whenever a job arrives, its demands and length are independently generated, and both depend on the job arrival rate. Another important detail to note is that this formula gives the expected load for one machine. To compute the total expected load on the system for a certain resource, $\hat{E}[\mathrm{Load}(r)]$ can be further divided by the total number of machines in the system. Using this formula, we craft data and tweak the job arrival rate such that the expected total load varies from 10% to 120%.
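The generator and the load formula above can be sketched as follows; parameter names mirror the symbols in the text, and the functions are illustrative rather than the simulator's actual code.

```python
import random

def sample_job(H: int, C: int, p_long: float, num_resources: int = 2):
    """Sample one job's length and per-resource demands as described above."""
    if random.random() < p_long:
        length = random.randint((2 * H) // 3, H)     # long job: [2/3 H, H]
    else:
        length = random.randint(1, H // 2)           # short job: [1, 1/2 H]
    dominant = random.randrange(num_resources)       # pick the dominant resource
    demands = [random.randint(1, C // 2) for _ in range(num_resources)]
    demands[dominant] = random.randint(C // 2, C)    # dominant demand in [1/2 C, C]
    return length, demands

def expected_machine_load(q: float, H: int, C: int,
                          p_long: float, p_dom: float, B_r: int) -> float:
    """Expected load on one resource of one machine, per the formulas above."""
    e_len = (1 - p_long) * (H / 2 + 1) / 2 + p_long * (H + 2 * H / 3) / 2
    e_dem = p_dom * (C + C / 2) / 2 + (1 - p_dom) * (C / 2 + 1) / 2
    return q ** 2 * e_dem * e_len / B_r
```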
4.4 Rewards
The flexibility of RL schedulers lies in the fact that
they can be trained for multiple objectives. For each
individual agent, we use a generic reward function
that can be parameterized for several objectives:
$$R_t = \alpha \sum_{j \in A} \frac{1}{T_j} + \beta \sum_{j \in B} \frac{1}{T_j} + \gamma \sum_{j \in G} \frac{1}{T_j}$$
Here, $A$ is the set of jobs already waiting in the job slot, $B$ is the set of jobs in the backlog, $G$ is the set of jobs running on the machines under the current agent's management, and $T_j$ denotes the duration of job $j$. The set $G$ can be further extended to all machines in the system, which controls the locality or globality of the rewards. Intuitively, this means that an agent either receives penalties for the delays on its own machines, or it is penalized for all of them. $\alpha$, $\beta$ and $\gamma$ are hyperparameters, generally taken to be negative. In our experiments, we consider $\alpha = \beta = \gamma = -1$. The generality of this reward scheme allows for multiple different objectives.
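A sketch of this reward, taking $T_j$ to be the job's duration (the `length` attribute of the illustrative Job above) and the coefficients set as in our experiments:

```python
def step_reward(slot_jobs, backlog_jobs, running_jobs,
                alpha: float = -1.0, beta: float = -1.0, gamma_coef: float = -1.0) -> float:
    """Generic per-timestep reward R_t: every unfinished job contributes a 1/T_j penalty."""
    return (alpha * sum(1.0 / j.length for j in slot_jobs)             # jobs in the job slot (A)
            + beta * sum(1.0 / j.length for j in backlog_jobs)         # jobs in the backlog (B)
            + gamma_coef * sum(1.0 / j.length for j in running_jobs))  # running jobs (G)
```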
5 EXPERIMENTS
Our experiments apply PPO in a decentralized multi-agent setting to the problem of task scheduling in a DeepRM-like simulator. A PyTorch-based PPO
implementation is used for the agents, and the envi-
ronment is provided as a Python library with an AEC
interface. Our experiments are guided by the follow-
ing research questions:
R1: Can independent PPO agents collaboratively
learn resource management solely from experi-
ence?
R2: How does the locality and globality of re-
wards and observations influence performance?
R3: How does the joint policy of multiple inde-
pendent PPO agents compare to heuristics?
6 ENVIRONMENT PARAMETERS
Inspired by multiple works in the literature and the
parameters used in DeepRM and DeepMAG, the pa-
rameters chosen for the experiments in our environ-
ment are summarized in Table 1.
Table 1: Environment parameters.
Parameter Value
time horizon 20
job slot size 5
number of resources 2
resource capacity 10
backlog size 60
number of machines per agent 1,2
number of agents 2,3
6.1 Agent Parameters
Our agents use PPO to learn scheduling policies. The
sizes of state and action spaces are derived from the
environment parameters as described in Section 4.2.
Agents are uniform in size, values of parameters and
number of machines they handle. Table 2 summarizes
the relevant parameters used by each PPO agent.
Table 2: PPO Agent Parameters.
Parameter Value Description
learning rate 3·10^-3
batch size 64
γ 0.99 discount factor
gae-λ 0.95 bias-variance tradeoff factor (Schulman et al., 2015)
clip-range 0.2 clipping parameter for the surrogate loss
entropy coeff. 0.0 used in computing the loss
value-function coeff. 0.5
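For illustration, these values map directly onto a Stable-Baselines3-style PPO constructor; our experiments use a PyTorch-based PPO implementation, so the library choice here is only an assumption.

```python
from stable_baselines3 import PPO

def make_learner(env):
    """One PPO learner per scheduler agent, configured with the values from Table 2."""
    return PPO(
        policy="MlpPolicy",
        env=env,               # a single-agent (Gym-style) view of one scheduler's turn
        learning_rate=3e-3,
        batch_size=64,
        gamma=0.99,            # discount factor
        gae_lambda=0.95,       # bias-variance tradeoff (Schulman et al., 2015)
        clip_range=0.2,        # surrogate clipping parameter
        ent_coef=0.0,
        vf_coef=0.5,
    )
```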
6.2 Results
To showcase the convergence of our agents, we show the evolution of the episode mean reward over time when co-training multiple PPO agents for the job-slowdown objective.
Figure 2: Mean reward obtained over time by three PPO
agents, using global observations, global rewards and hav-
ing one machine per agent.
Figures 2 and 3 show the difference in convergence and obtained reward when using global vs. local observations and rewards, respectively. It is noticeable how considering only local information introduces variance and slows down convergence. In contrast,
incorporating global information makes agents much
more homogeneous and speeds up convergence. The
difference between the absolute values of rewards be-
tween global vs local runs is expected, as global re-
ward means each agent’s reward contains the penal-
ties for all the other agents as well. This is also visible
in the fluctuations of the rewards after the policies sta-
bilised, as they are roughly aligned across all agents.
The reward values on Figure 2 are not equal across
agents due to the training procedure: only rewards
during the learning process are included. When one
agent is learning, the others are fixed for that episode
and their rewards do not contribute to the learning
curve. Analogously, when considering only local observations, the rewards fluctuate quite independently.
Further, we explore the cumulative job slowdowns of our agents and compare them with heuristics. We provide multi-machine generalisations of the SJF (shortest job first), packer, and Tetris agents from DeepRM. Each original algorithm chooses a job to schedule according to a score; extending these algorithms is straightforward: the multi-machine version not only chooses the best job to allocate, it also maximizes this score over the available machines.
For a fair comparison, we use the same number of
heuristic agents in each case. That is, the slowdown
of a system with three PPO agents is compared to the
slowdown of three Packer or Random agents, for ex-
Figure 3: Running mean reward obtained over time by three
PPO agents, using local observations, local rewards and
having one machine per agent.
ample. We run 10 sample episodes of 200 timesteps
on trained PPO agents with both local and global in-
formation, as well as heuristics, and average the slow-
downs for comparison. We do this over different load
factors, and the results are summarized in Figure 4.
We note that there is no significant difference between the slowdowns of PPO agents trained with local or global information. Also, the generalisations of the packer and SJF agents do not scale well to the multi-agent scenario.
Figure 4: Average job slowdowns over load factors of dif-
ferent methods in an environment with 3 machines and 1
machine per agent.
7 DISCUSSION
In summary, answering R1, one can confidently state that independent PPO agents can successfully internalise joint scheduling policies in a multi-agent, multi-machine setup that are comparable to or supersede heuristics, as measured by job slowdown. This result is particularly significant given the complexity of coordinating multiple agents and resources without predefined rules or direct communication between agents. Moreover, access to global states and rewards by individual agents facilitates faster and more stable learning. The state and reward representations implicitly ensure that agents do not set different goals for themselves, but learn to interact productively.
In response to R2, our findings reveal that the distinction between global and local observations and rewards does not significantly impact job slowdowns. This could partly be due to our experiments being conducted with relatively small numbers of agents and machines. However, we observed a noticeable effect on the stability and speed of convergence, which varies based on whether the observations and rewards are local or global. Our experiments suggest that access to global information facilitates faster and more stable convergence, but does not translate into significant performance gains. This increase in convergence speed implies that every learning update contributes to the agents' ability to learn collaboratively, and that successfully incorporating the other agents' feedback proves to be useful at the individual level. In contrast, relying solely on local information seems to highlight the non-stationary aspects of the environment, impacting the speed of the learning process. In terms of performance, however, both are capable of reaching comparable levels.
Moving on to R3, we see that our agents supersede the heuristics. This superiority is particularly evident in complex scenarios where multiple agents interact and collaborate. It is crucial to note that while heuristics provide a baseline for decision-making, they often fall short in multi-agent environments, especially as the number of agents increases.
8 CONCLUSIONS
In this paper we provided an Agent Environment Cycle game framework that successfully extends a general-purpose task scheduling simulator, able to handle multiple agents and multiple machines per agent. We also showed that in this context, Proximal Policy Optimisation is a feasible method that can be successfully applied in a multi-agent setting in order to find scheduling policies that are both competitive with heuristics and scalable.
While the achieved results seem promising, our experiments are still conducted in a simulated environment; thus, an obvious line of future work that we find important to outline is placing this work in a practical context. One challenge here is the assumption
of knowing resource profiles in advance, which rarely
holds in practice, but tackling this is a necessity in
every practical approach.
In conclusion, our findings suggest that while
cluster scheduling remains a particularly difficult
problem, modern MARL methods are not only effec-
tive in managing resources, they are also a step to-
wards realizing fully autonomous and scalable cloud
systems. The ability of RL agents to learn and im-
prove over time, without human intervention or hand-
crafted features, aligns with the goal of developing
cloud systems that can self-manage and dynamically
adapt to changing conditions and requirements.
REFERENCES
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
Dauzère-Pérès, S., Ding, J., Shen, L., and Tamssaouet, K. (2023). The flexible job shop scheduling problem: A review. European Journal of Operational Research.
Fan, Y., Li, B., Favorite, D., Singh, N., Childers, T., Rich, P.,
Allcock, W., Papka, M. E., and Lan, Z. (2022). Dras:
Deep reinforcement learning for cluster scheduling in
high performance computing. IEEE Transactions on
Parallel and Distributed Systems, 33(12):4903–4917.
Lanctot, M., Lockhart, E., Lespiau, J.-B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., et al. (2019). OpenSpiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453.
Liu, C.-L., Chang, C.-C., and Tseng, C.-J. (2020). Actor-critic deep reinforcement learning for solving job shop scheduling problems. IEEE Access, 8:71752–71762.
Mao, H., Alizadeh, M., Menache, I., and Kandula, S.
(2016). Resource management with deep reinforce-
ment learning. In Proceedings of the 15th ACM work-
shop on hot topics in networks, pages 50–56.
Mirhoseini, A., Goldie, A., Yazgan, M., Jiang, J. W.,
Songhori, E., Wang, S., Lee, Y.-J., Johnson, E.,
Pathak, O., Nazi, A., et al. (2021). A graph place-
ment methodology for fast chip design. Nature,
594(7862):207–212.
OpenAI, M. A., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J. W., Pachocki, J., Petron, A., Plappert, M., Powell, G., et al. (2018). Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
Paduraru, C., Patilea, C. C., and Iordache, S. (2023). Task
scheduling: A reinforcement learning based approach.
In ICAART (3), pages 948–955.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and
Abbeel, P. (2015). High-dimensional continuous con-
trol using generalized advantage estimation. arXiv
preprint arXiv:1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization al-
gorithms. arXiv preprint arXiv:1707.06347.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I.,
Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M.,
Bolton, A., et al. (2017). Mastering the game of go
without human knowledge. nature, 550(7676):354–
359.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y.
(1999). Policy gradient methods for reinforcement
learning with function approximation. Advances in
neural information processing systems, 12.
Tassel, P., Gebser, M., and Schekotihin, K. (2021). A rein-
forcement learning environment for job-shop schedul-
ing. arXiv preprint arXiv:2104.03760.
Terry, J., Black, B., Grammel, N., Jayakumar, M., Hari, A.,
Sullivan, R., Santos, L. S., Dieffendahl, C., Horsch,
C., Perez-Vicente, R., et al. (2021). Pettingzoo: Gym
for multi-agent reinforcement learning. Advances in
Neural Information Processing Systems, 34:15032–
15043.
Wang, M., Zhang, J., Zhang, P., Cui, L., and Zhang, G.
(2022). Independent double dqn-based multi-agent
reinforcement learning approach for online two-stage
hybrid flow shop scheduling with batch machines.
Journal of Manufacturing Systems, 65:694–708.
Ye, Y., Ren, X., Wang, J., Xu, L., Guo, W., Huang, W.,
and Tian, W. (2018). A new approach for resource
scheduling with deep reinforcement learning. arXiv
preprint arXiv:1806.08122.
Yu, C., Velu, A., Vinitsky, E., Gao, J., Wang, Y., Bayen, A.,
and Wu, Y. (2022). The surprising effectiveness of ppo
in cooperative multi-agent games. Advances in Neural
Information Processing Systems, 35:24611–24624.
Zhang, J.-D., He, Z., Chan, W.-H., and Chow, C.-Y.
(2023). Deepmag: Deep reinforcement learning with
multi-agent graphs for flexible job shop scheduling.
Knowledge-Based Systems, 259:110083.
Zhao, X. and Wu, C. (2021). Large-scale machine learn-
ing cluster scheduling via multi-agent graph reinforce-
ment learning. IEEE Transactions on Network and
Service Management.
Zhu, X., Xu, J., Ge, J., Wang, Y., and Xie, Z. (2023). Multi-
task multi-agent reinforcement learning for real-time
scheduling of a dual-resource flexible job shop with
robots. Processes, 11(1):267.