DRAGON: A Dynamic Scheduling and Scaling Controller for Managing
Distributed Deep Learning Jobs in Kubernetes Cluster
Chan-Yi Lin, Ting-An Yeh and Jerry Chou
Computer Science Department, National Tsing Hua University,
Hsinchu, Taiwan (R.O.C.)
Keywords:
Deep Learning, Resource Orchestration, Job Scheduling, Autoscaling.
Abstract:
With the fast-growing trend of deep learning driven AI services over the past decade, deep learning, especially the resource-intensive and time-consuming training jobs, has become one of the main workloads in today's production clusters. However, due to the complex workload characteristics of deep learning and the dynamic nature of shared resource environments, managing the resource allocation and execution lifecycle of distributed training jobs in a cluster can be challenging. This work aims to address these issues by developing and implementing a scheduling and scaling controller to dynamically manage distributed training jobs on a Kubernetes (K8S) cluster, which is a broadly used platform for managing containerized workloads and services. The objective of our proposed approach is to enhance K8S with three capabilities: (1) Task dependency aware gang scheduling to avoid idle resources. (2) Locality aware task placement to minimize communication overhead. (3) Load aware job scaling to improve cost efficiency. Our approach is evaluated on a real testbed and a simulator using a set of TensorFlow jobs. Compared to the default K8S scheduler, our approach successfully improved resource utilization by 20%–30% and reduced job elapsed time by over 65%.
1 INTRODUCTION
To obtain sufficient computing resources for training a large-scale deep learning model in a timely fashion, one must rely on distributed training techniques on a parallel computing system, such as a GPU cluster. A distributed training job consists of multiple computing tasks, each of which can utilize a computing device, and the tasks train a single model together. As shown by a recent study (Goyal et al., 2017), a 50-layer ResNet model can be trained in 1 hour by using 256 GPUs. Hence, building and managing a resource pool for distributed deep learning training is essential. Furthermore, due to the expensive cost of these high-performance computing devices, such as GPUs, the resource pool must be shared among users and multiplexed between jobs in order to improve overall system utilization and cost efficiency.
Parallel deep learning computation has been extensively studied (Krizhevsky, 2014; Li et al., 2014; Zhang et al., 2016), and it has been implemented in many machine learning computing frameworks, such as TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2017), and CNTK (Yu et al., 2014). But these distributed training jobs are commonly managed like traditional parallel batch jobs or big data applications by cluster resource managers and job schedulers, such as Kubernetes (Burns et al., 2016), YARN (Vavilapalli et al., 2013), Mesos (Hindman et al., 2011), and SLURM (Jette et al., 2002), where jobs keep holding GPUs until their training completes, and each task of a job is scheduled independently onto the first available compute slot. As a result, a system could suffer from significant performance degradation and low resource utilization due to the strong dependency and heavy communication among the computing tasks of a distributed training job. Optimizing the performance of distributed training jobs in a shared or cloud resource environment is particularly challenging because of the contention between jobs over resources such as GPU devices and network bandwidth. Therefore, this work aims to address these issues by designing and implementing a controller in Kubernetes to manage the resource allocation and life-cycle of distributed training jobs.
Kubernetes (K8S) is an open-source platform for managing containerized workloads and services. Its main purpose is to orchestrate computing, networking, and storage infrastructure on behalf of user workloads, and its container-centric management environment
enables portability and agility across infrastructure providers. It has been widely used in industry production systems. Many deep learning service platforms, including Kubeflow (Kubeflow, 2017), RiseML (RiseML, 2017), Microsoft OpenPAI (Microsoft, 2016), and IBM FfDL (IBM, 2018), are also built on top of Kubernetes to help users deploy and launch deep learning jobs on K8S with ease. However, as shown by our study, the current deployment features for distributed training jobs on K8S operate at the task level, not the job level. As a result, existing management functionality, like auto-scaling and load-balancing, cannot be applied to distributed training jobs, and the default K8S FIFO scheduler could cause significant network overhead and even deadlock problems.
To address the aforementioned problems, we propose DRAGON, named after "Deep LeaRning with Auto-scale and GAng-schedule On KuberNetes". It is designed and implemented as an extended controller component in K8S to provide the ability to manage distributed training at the job level. Therefore, we were able to design more sophisticated scheduling and scaling strategies based on job-level or even system-level information for performance optimization. Specifically, DRAGON enhances K8S with three capabilities: (1) Task dependency aware gang scheduling to avoid idle resources. (2) Locality aware task placement to minimize communication overhead. (3) Load aware job scaling to improve cost efficiency. As shown by our experiments on a small-scale cluster with 2 nodes and 8 GPUs, and a larger-scale simulator with 4 nodes and 32 GPUs, compared to the default K8S manager, our approach significantly improved resource utilization and reduced job elapsed time.
The rest of the paper is structured as follows. Section 2 briefly introduces TensorFlow and Kubernetes as the case study subjects of our work, and highlights the problems of existing approaches. The design and implementation of our proposed solution, DRAGON, is detailed in Section 3, followed by the experimental evaluations in Section 4. Finally, Section 5 discusses related work, and Section 6 concludes the paper.
2 BACKGROUND
This work investigates the problems of distributed training in shared and cloud resource environments by studying and optimizing the performance of TensorFlow on Kubernetes. Hence, in this section, we briefly introduce these two software tools and highlight the potential problems and limitations of existing approaches.
Figure 1: Data parallelism using parameter servers to exchange gradients. Workers pull the current global parameters from the parameter servers, compute gradients, and push them back so the servers can update the parameters for the next iteration.
2.1 TensorFlow: Distributed Training
Various distributed computing techniques have been proposed and implemented for deep learning. In this work, we consider the parameter server strategy implemented by TensorFlow (Abadi et al., 2016) as a case study to investigate the performance of distributed training. We chose TensorFlow because it is one of the most popular computing frameworks for deep learning. TensorFlow provides a set of high-level APIs for users to build their neural network models. The computation on the model is then translated into a dataflow job executed and managed by TensorFlow on any computing devices, including GPUs. Hence it eases the effort of developing and running deep learning models.
The parameter server implementation of TensorFlow is shown in Figure 1. A distributed training job consists of a set of tasks in two different roles. One role is the parameter server, which stores the global parameters of the model and sends the current values to each worker for parallel training. The other role is the worker, which uses the values from the parameter servers to compute the gradients in each iteration. At the end of each training iteration, workers send the gradients to the parameter servers to update the global parameters before the start of the next iteration. The parameter updates can be done asynchronously among workers, so the workers can run independently of each other throughout the whole training process and obtain better scalability in large-scale training.
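To make the two roles concrete, the following is a minimal sketch of a worker task under the TensorFlow 1.x parameter-server strategy; the host addresses, the task index, and the toy loss are illustrative placeholders rather than the configuration used in our experiments.

import tensorflow as tf  # TensorFlow 1.x API

# Cluster membership. The addresses and task_index are placeholders that a
# real deployment (e.g., a POD spec generated by a controller) would inject.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables (the global parameters) are placed on the parameter servers;
# the compute-heavy operations run on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    x = tf.random_normal([32, 10])          # stand-in for a real input pipeline
    w = tf.get_variable("w", shape=[10, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

# Each worker runs its own loop and pushes gradient updates to the parameter
# servers asynchronously, independently of its peers.
hooks = [tf.train.StopAtStepHook(last_step=1000)]
with tf.train.MonitoredTrainingSession(master=server.target, is_chief=True,
                                       hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)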
The number of parameter servers and workers can be controlled by users, and this setting plays a crucial role in the performance of distributed training jobs. As shown by our experimental results in Figure 2, increasing the number of workers can accelerate the training speed for all three image classification models (InceptionV3, ResNet-50, and AlexNet) we tested. But the speedup is not linear, and the scalability differs from model to model.
Figure 2: Scalability in images/sec throughput (normalized to 1 worker) for 1 to 4 workers on InceptionV3, ResNet-50, and AlexNet. Higher values mean higher performance.
Figure 3: System architecture and components of K8S: the master node hosts the API server, scheduler, and controllers; worker nodes run the kubelet and a container runtime (Docker) hosting PODs; a user-submitted job description (YAML file) is turned into POD descriptions, with container images pulled from an image repository (Docker Hub).
This is because of the communication overhead between workers and parameter servers for exchanging the gradients and weights of all the parameters after every training iteration. Furthermore, all the processes of workers and parameter servers must be launched before the training can be started by TensorFlow. As a result, distributed training jobs tend to have strong task dependencies and heavy communication traffic among their tasks. So they should be scheduled as a unit and should not be placed across compute nodes by the resource manager.
2.2 Kubernetes: Cluster Manager
Kubernetes (K8S) is an open-source platform for managing containerized workloads and services on a shared resource pool. It has been broadly used to deploy software across infrastructure providers. The system architecture of K8S is shown in Figure 3. It consists of master nodes and worker nodes. The master nodes host the management components of K8S, and the worker nodes provide the computing resources requested by users' containers. The smallest deployable unit managed by K8S is called a POD, which is a set of tightly coupled containers that share a network namespace and storage. A POD can be replicated, recovered, and scheduled by K8S according to its desired state as described by users.
The API server provides the interface for users to interact with K8S. One of its main functions is to let users submit a job description file, which contains all the necessary information for deploying and managing the resources of a job throughout its life-cycle, such as container images, resource requirements, storage volume types, and so on. The job description file is handled by a deployment controller, which in turn generates a description file for each of the job's requested PODs. The scheduler is then responsible for selecting a worker node for each POD. The default K8S scheduler implements a FIFO queue and binds PODs to workers in a round-robin manner with the aim of balancing the load among workers. Once a POD request is received by the kubelet, a node agent running on each worker, it launches the assigned POD when there is sufficient residual capacity, and monitors its status and resource usage during execution. Kubernetes is highly reconfigurable and extensible, allowing cluster administrators to define new resource objects managed by customized resource controllers. As detailed in Section 3.2, we defined our own resource type for TensorFlow jobs, and implemented our customized controller to optimize job management.
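As a concrete illustration of the API-server interface that such a controller builds on, the short sketch below uses the official Kubernetes Python client (assumed to be installed and configured against a cluster) to query the information a scheduler needs: the allocatable GPUs of each node and the node each POD is bound to. It only illustrates the interface and is not part of DRAGON itself.

from kubernetes import client, config

# Load credentials from ~/.kube/config; inside a POD one would use
# config.load_incluster_config() instead.
config.load_kube_config()
v1 = client.CoreV1Api()

# Allocatable GPU capacity per node, as exposed by the NVIDIA device plugin.
for node in v1.list_node().items:
    alloc = node.status.allocatable or {}
    print(node.metadata.name, "allocatable GPUs:", alloc.get("nvidia.com/gpu", "0"))

# PODs already bound to nodes indicate where the residual capacity is.
for pod in v1.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name, "->", pod.spec.node_name)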
2.3 Challenges
Both TensorFlow and Kubernetes are powerful tools that are broadly used in their respective domains. But when they are used together, a few critical problems could hamper the performance of distributed training in a cluster.
First, users must submit a deployment request for each individual worker of a training job. Kubeflow (Kubeflow, 2017) is a recent work that develops a controller to simplify the deployment process by generating all the deployment files for users automatically. But it does not provide any resource allocation optimization or job life-cycle management ability compared to our work.
Second, the default K8S scheduler works at the POD level, and it tends to spread PODs evenly across all the nodes. But as described in Section 2.1, distributed training jobs have strong dependencies among their tasks. So the jobs may suffer from significant communication overhead. Even worse, resources might be wasted when a training task cannot start computing before all its peer tasks are launched. Therefore, both job performance and resource utilization can suffer when the scheduler is not aware of the task dependencies of a job.
Last but not least, the existing solution provided by K8S is the Horizontal Pod Autoscaler, which controls the number of PODs for a deployment object according to the CPU utilization of a POD in response to sudden surges of workload demands. But a single distributed training job can consist of multiple PODs, and deep learning computations are often communication-intensive rather than computation-intensive. Therefore, a new control policy must be designed for this alternative objective.
3 DRAGON
3.1 Overview
To address the challenges discussed in Section 2.3, we designed and implemented DRAGON, an extended controller for managing distributed training jobs on K8S. The key design principle of DRAGON is to enhance job scheduling and scaling decisions by utilizing job- and system-level information. Specifically, DRAGON is based on the following three management strategies for minimizing job execution time and maximizing resource utilization. Detailed implementations and algorithms are given in Section 3.2 and Section 3.3, respectively.
Task Dependency Aware Gang Scheduling: DRAGON is a job-level scheduler. It schedules all the tasks of a job as a single unit, like a gang scheduler. A job is only launched when the system has enough residual capacity to run all its tasks simultaneously. Thus, resources will not be occupied by idle tasks, and deadlock among tasks can be avoided.
Locality Aware Task Placement: To minimize the communication overhead of a job, DRAGON tends to place all the tasks of a job on as few compute nodes as possible. A more sophisticated placement algorithm that considers network topology and performance interference could also be implemented in our controller, but we use a simple greedy algorithm in this work to demonstrate the capability of our controller. Besides, those sophisticated placement algorithms (Amaral et al., 2017) often require prior knowledge of the job execution time or pre-built performance models, which may not be available in practice.
Load Aware Job Scaling: Distributed training jobs are often long-running with stable workloads, but the overall system load can still vary over time as jobs arrive and leave the system. Hence, DRAGON tends to scale up jobs when the system load is low, so that jobs can take advantage of the residual capacity to reduce their execution time. On the other hand, DRAGON tends to scale down jobs when the system load is high, so that resources can be reclaimed from the running jobs to launch the waiting jobs as soon as possible.
Figure 4: The interaction between a custom controller and K8S. An informer watches resource objects through the API server (reflector, DeltaFIFO, local store) and, on any change, invokes the OnAdd/OnUpdate/OnDelete callbacks to put the object into a workqueue; the controller launches worker goroutines to handle the queued items and uses a client to call the API server and change the desired state of the objects. The two gray boxes are the controller-specific components that were re-implemented in our approach.
3.2 Implementation
Our implementation follows the operator pattern defined by K8S. The operator pattern is a way to extend K8S by combining a custom controller with a custom resource type, which specifies the user-defined objects that will be managed by the custom controller. In our implementation, we defined a distributed training job as a custom resource, and we implemented a custom controller with the logic to make the scheduling and scaling decisions for these training jobs. With our custom resource, users can describe a distributed training job in a single YAML file, and specify the maximum and minimum number of workers allowed for scaling the job. Hence our approach not only helps users simplify the deployment process like Kubeflow, but also gives users the ability to provide information about how they want their jobs to be controlled. For instance, we can easily extend our scheduler to support priority scheduling according to a job priority given by users.
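For illustration, the sketch below builds such a single-file job description as a Python dictionary and prints it with PyYAML. The field names (psReplicas, minWorkers, maxWorkers, and so on) are hypothetical placeholders chosen for readability, not the actual schema of our custom resource.

import yaml  # PyYAML

# Hypothetical job description: one training job, its scaling bounds, and a
# container template shared by all of its tasks.
tfjob = {
    "apiVersion": "example.com/v1",        # placeholder API group/version
    "kind": "TFJob",
    "metadata": {"name": "resnet50-demo"},
    "spec": {
        "psReplicas": 1,                   # parameter servers
        "workerReplicas": 4,               # initial number of workers
        "minWorkers": 2,                   # lower bound for scale-down
        "maxWorkers": 8,                   # upper bound for scale-up
        "template": {
            "image": "tensorflow/tensorflow:1.12.0-gpu",
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        },
    },
}
print(yaml.safe_dump(tfjob, default_flow_style=False))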
The interaction between a custom controller and K8S is illustrated in Figure 4. The controller relies on another component, called an informer, to register and monitor the events that trigger the controller. In our implementation, we registered all three default events (OnAdd, OnUpdate, OnDelete) for our custom resource. The OnAdd and OnDelete events indicate that a distributed training job arrives at or leaves the system. OnUpdate means a user manually changed the spec of a job, such as the number of workers, the job name, etc.
When a registered event occurs on one of the custom resource objects, the informer adds the object into the workqueue of the custom controller. The controller then creates a worker process to handle the event on the object according to the control logic implemented by the controller; in our work, this is the algorithm described in the next subsection. Finally, the controller attempts to change the system state through the API server. In our implementation, we use the controller to do job scheduling and scaling by generating new deployment files for the computing tasks that need to be added or changed by our control decisions.
d&:ŽďƐƌƌŝǀĞ
;KŶĚĚͿ
d&:ŽďƐĞƉĂƌƚƵƌĞ
;KŶĞůĞƚĞͿ
ĚĚũŽďƚŽƋƵĞƵĞ
;ϭͿ^ĐĂůĞͲĚŽǁŶĂůĂƌŵ͗
ŶLJƵƌŐĞŶƚũŽďǁŚŽƐĞ
ǁĂŝƚŝŶŐƚŝŵĞхƚŚƌĞƐŚŽůĚ;ʏͿ͍
;ϮͿ^ĐĂůĞͲĚŽǁŶƉŽůŝĐLJ͗
&ŝŶĚƐĐĂůĞͲĚŽǁŶ
ǀŝĐƚŝŵƐĨŽƌƌƵŶŶŝŶŐƚŚĞ
ƐŵĂůůĞƐƚƵƌŐĞŶƚũŽď
;ϯͿ:ŽďƐĐŚĞĚƵůŝŶŐƉŽůŝĐLJ͗
^ĞůĞĐƚĂũŽďďĂƐĞĚŽŶ&&^
ΘWĂĐŬŝƚƐƚĂƐŬƚŽƚŚĞůĞĂƐƚ
ŶƵŵďĞƌŽĨŶŽĚĞƐ
zĞƐ
/ƐƚŚĞƌĞĞŶŽƵŐŚ
ƌĞƐŝĚƵĂůĐĂƉĂĐŝƚLJƚŽƌƵŶ
ƚŚĞƐŵĂůůĞƐƚƵƌŐĞŶƚũŽď͍
zĞƐ
^ƵďŵŝƚŶĞǁĚĞƉůŽLJŵĞŶƚ
ĨŝůĞƐĨŽƌƚŚĞƵƌŐĞŶƚũŽď
ĂŶĚƚŚĞǀŝĐƚŝŵũŽďƐ
^ĐĂůĞͲĚŽǁŶƚŚĞ
ǀŝĐƚŝŵũŽďƐ
^ƵďŵŝƚŶĞǁ
ĚĞƉůŽLJŵĞŶƚĨŝůĞƐ
ĨŽƌƚŚĞƐĞůĞĐƚĞĚũŽď
ũŽďŝƐ
ƐĞůĞĐƚĞĚ
EŽũŽďŝƐ
ƐĞůĞĐƚĞĚ
;ϰͿ^ĐĂůĞͲƵƉĂůĂƌŵ͗
ŶLJŶŽĚĞǁŝƚŚƌĞƐŽƵƌĐĞƐǁŚŽƐĞ
ŝĚůĞƚŝŵĞхƚŚƌĞƐŚŽůĚ;ʅͿ͍
^ƵďŵŝƚŶĞǁ
ĚĞƉůŽLJŵĞŶƚĨŝůĞƐĨŽƌ
ƚŚĞƐĐĂůĞͲƵƉũŽďƐ
ŶĚŽĨĐŽŶƚƌŽůůŽŽƉ
dƌLJƚŽƐĐŚĞĚƵůĞ
ƚŚĞŶĞdžƚũŽď
;ϱͿ^ĐĂůĞͲƵƉƉŽůŝĐLJ͗
&ŝŶĚƐĐĂůŝŶŐũŽďƐƚŽ
ƵƚŝůŝnjĞŝĚůĞƌĞƐŽƵƌĐĞƐ
EŽƐĐĂůĞͲĚŽǁŶ
ƉůĂŶĐĂŶďĞĨŽƵŶĚ
ůŝƐƚŽĨ
ƐĐĂůĞͲĚŽǁŶ
ũŽďƐ
zĞƐ
ůŝƐƚŽĨ
ƐĐĂůĞͲƵƉũŽďƐ
EŽũŽďŝƐ
ƐĞůĞĐƚĞĚ
^ĐĂůĞͲĚŽǁŶ
ĂůĂƌŵΘƉŽůŝĐLJ
^ĐĂůĞͲƵƉ
ĂůĂƌŵΘƉŽůŝĐLJ
^ĐŚĞĚƵůŝŶŐ
Figure 5: The control flow and policy of DRAGON.
To enable job-level gang scheduling, our controller maintains an internal job queue and keeps track of the current system resource usage. When a job is allowed to be launched, the controller generates deployment requests for all of its computing tasks at once; otherwise, the job simply waits in the queue. Note that our controller must co-exist with the default K8S scheduler; it therefore uses the node-selector label in the deployment description files to force the K8S scheduler to place the computing tasks (i.e., PODs) onto the assigned nodes.
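The following sketch illustrates this all-or-nothing behavior: deployment descriptions are generated only when a complete task-to-node placement exists, and each description carries a nodeSelector pinning the POD to the chosen node. The data structures and field names are simplified for illustration and do not reproduce the exact manifests our controller emits.

def gang_deployments(job_name, placement):
    """placement maps every task index to a node name, or is None when the
    job must keep waiting in the queue."""
    if placement is None:
        return []        # nothing is deployed, so no idle task holds a GPU
    return [{
        "metadata": {"name": f"{job_name}-task-{idx}"},
        "spec": {
            # The nodeSelector forces the default K8S scheduler to bind the
            # POD onto the node chosen by our controller.
            "nodeSelector": {"kubernetes.io/hostname": node},
            "containers": [{"name": "tf",
                            "image": "tensorflow/tensorflow:1.12.0-gpu",
                            "resources": {"limits": {"nvidia.com/gpu": 1}}}],
        },
    } for idx, node in placement.items()]

# A fully placed 3-task job is deployed at once; an unplaced job produces
# no deployments at all.
print(len(gang_deployments("job1", {0: "node-a", 1: "node-a", 2: "node-b"})))  # 3
print(len(gang_deployments("job2", None)))                                      # 0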
3.3 Algorithm
Here, we introduce our proposed control algorithm that achieves the three design goals mentioned in Section 3.1 under the K8S architecture described in Section 3.2. The overall control flow of our algorithm is shown in Figure 5. The algorithm consists of the three procedures described below, and it is invoked by the controller whenever distributed training jobs arrive at or depart from the system.
The first is a scale-down procedure that aims to launch urgent jobs in the waiting queue immediately by releasing resources from the currently running jobs. We define a job as urgent when its queuing wait time exceeds a pre-defined threshold value. We try to schedule the smallest urgent job, i.e., the one with the least resource requirements, first: if we cannot release enough resources for the smallest job, then the larger urgent jobs certainly cannot be launched either. The current implementation selects the scale-down victims as the jobs with the most surplus resources. Both the scale-down alarm and the policy can be re-designed under our control flow for extension.
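A minimal sketch of this victim-selection policy is shown below, assuming each running job records its current and minimum worker counts and the GPUs used per worker; these record fields are illustrative, not the actual data structures in our controller.

def pick_scaledown_victims(running_jobs, gpus_needed):
    """Return {job_name: new_worker_count} releasing at least gpus_needed
    GPUs, or None if even a full scale-down cannot free enough."""
    # Jobs with the most surplus workers (above their declared minimum) first.
    jobs = sorted(running_jobs,
                  key=lambda j: j["workers"] - j["min_workers"], reverse=True)
    plan, freed = {}, 0
    for job in jobs:
        while freed < gpus_needed and \
              plan.get(job["name"], job["workers"]) > job["min_workers"]:
            plan[job["name"]] = plan.get(job["name"], job["workers"]) - 1
            freed += job["gpu_per_worker"]
    return plan if freed >= gpus_needed else None

# Example: free 2 GPUs for the smallest urgent job.
running = [{"name": "job1", "workers": 4, "min_workers": 2, "gpu_per_worker": 1},
           {"name": "job2", "workers": 2, "min_workers": 2, "gpu_per_worker": 1}]
print(pick_scaledown_victims(running, 2))   # {'job1': 2}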
After the scale-down procedure for urgent jobs, the controller then schedules the rest of the waiting jobs with the residual capacity. We implemented an adapted first-come-first-served (AFCFS) scheduling algorithm, which simply selects the first arrived job in the queue whose required resource capacity is less than the residual capacity. To preserve job locality and minimize communication overhead, we pack the tasks of a job onto the least number of compute nodes. This locality-aware scheduling algorithm can also be extended with information provided by the system or users, such as the network topology, a performance interference model, job priorities, etc.
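The sketch below shows one way the AFCFS selection and greedy packing could be combined: jobs are visited in arrival order, and the tasks of the first job that fits are packed by filling the nodes with the most free GPUs first. It is a simplified illustration under assumed data structures, not the exact procedure of our controller.

def afcfs_with_packing(queue, residual):
    """queue: jobs in arrival order, each with 'tasks' = list of GPU demands.
    residual: {node: free GPUs}. Returns (job, {task_index: node}) or None."""
    for job in queue:
        if sum(job["tasks"]) > sum(residual.values()):
            continue                           # cannot fit, try the next arrival
        free = dict(residual)
        # Visit nodes with the most free GPUs first and fill each one up,
        # so the job lands on as few nodes as possible.
        nodes = sorted(free, key=free.get, reverse=True)
        placement = {}
        for i, demand in enumerate(job["tasks"]):
            node = next((n for n in nodes if free[n] >= demand), None)
            if node is None:
                placement = None               # fragmented capacity, give up on this job
                break
            placement[i] = node
            free[node] -= demand
        if placement is not None:
            return job, placement
    return None

# Example: a 3-worker job is packed onto the node with the most free GPUs.
queue = [{"name": "job3", "tasks": [1, 1, 1]}]
print(afcfs_with_packing(queue, {"node-a": 2, "node-b": 4}))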
Finally, when no job in the queue can be scheduled, the controller tries to further utilize the leftover resource capacity through the scale-up procedure, maximizing resource utilization and minimizing job execution time. However, we also want to prevent jobs from suffering network overhead and to reserve some resources for future arriving jobs. Therefore, the scale-up alarm is only triggered when a resource (i.e., a GPU) has been idle for longer than a time threshold, and the idle resource can only be allocated to a job that is already running on the same node, to maintain job locality.
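The scale-up trigger and its locality constraint could look like the sketch below, where the idle-time threshold and the bookkeeping structures are assumptions made for illustration.

import time

IDLE_THRESHOLD = 300   # seconds a GPU must stay idle before scale-up (assumed value)

def scale_up_candidates(nodes, jobs, now=None):
    """nodes: {node: {'idle_gpus': int, 'idle_since': epoch seconds}}.
    jobs: each with 'name', 'nodes' (set of nodes it runs on), 'workers',
    and 'max_workers'. Returns {job_name: extra workers to add}."""
    now = time.time() if now is None else now
    grants = {}
    for node, info in nodes.items():
        # Alarm: the node's GPUs must have been idle longer than the threshold.
        if info["idle_gpus"] == 0 or now - info["idle_since"] < IDLE_THRESHOLD:
            continue
        spare = info["idle_gpus"]
        for job in jobs:
            if node not in job["nodes"]:
                continue                       # locality: job must already run here
            room = job["max_workers"] - job["workers"] - grants.get(job["name"], 0)
            give = min(room, spare)
            if give > 0:
                grants[job["name"]] = grants.get(job["name"], 0) + give
                spare -= give
            if spare == 0:
                break
    return grants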
4 EVALUATIONS
4.1 Experimental Environment
We evaluated our implementation by running a real TensorFlow workload on a small-scale GPU cluster with 2 GPU nodes connected by 1 Gb Ethernet. Each node has 4 Nvidia P100 GPUs, 16 Intel 2.1 GHz CPU cores, and 64 GB of memory.
Table 1: Configuration of the jobs. A=AlexNet, I=InceptionV3, R=ResNet50.
Config. Job1 Job2 Job3 Job4 Job5 Job6 Job7 Job8 Job9 Job10 Job11 Job12
Models R I I A R A I R I A I A
Num. Workers 4 4 2 2 3 2 2 1 2 2 1 1
Min. Workers 2 2 2 2 2 2 2 1 2 2 1 1
Max. Workers 4
Steps 2800 3600 1000 450 2100 1000 2000 2400 2500 800 1000 1200
Arrival Time 0 0.56s 2.53s 6.34s 7.58s 13.11s 19.01s 27.54s 34.56s 42.58s 47.58s 51.19s
As summarized in Table 1, the test workload consists of 12 distributed TensorFlow jobs that train 3 commonly used image classification models (AlexNet, InceptionV3, ResNet-50) with varying scaling ranges, training steps, and arrival times. To observe the behavior of our controller under various system loads, the first 5 jobs were generated with a much shorter inter-arrival time than the rest of the jobs. The results are presented in Section 4.2.
To further validate the robustness of our approach, we implemented a simulator to evaluate our control algorithm on a larger-scale testbed with 4 compute nodes, each with 8 GPUs. The test workload contains 100 randomly generated jobs with Poisson inter-arrival times. The mean inter-arrival time is adjusted from 62 seconds to 400 seconds to simulate various workload intensities. The scalability of the jobs is modeled based on our execution time observations of the 3 real NN models used in the real experiments. Each job has its own setting for the maximum, minimum, and initial number of workers, and the values were randomly selected between 2 and 8 GPUs. The simulation results are presented in Section 4.3.
Throughout all the experiments, we compare our approach, DRAGON, with two other scheduling approaches. One is to deploy the distributed training jobs using the Kubeflow toolkit and let the tasks be scheduled by the default K8S scheduler; we use the name Kubeflow to refer to this approach in the rest of this section. The other approach is named the locality-aware scheduler, which is a simplified implementation of DRAGON without the auto-scaling feature. Hence this approach acts like a greedy packing scheduler that only tends to launch all the tasks of a job on a single node.
4.2 Real Testbed Results
The job running time comparison is shown in Figure 6. The time of each job is normalized to its result under DRAGON; note that the y-axis is in log2 scale. In general, DRAGON and the locality-aware scheduler have similar job running times for the first 5 jobs, when the system load is relatively high.
Figure 6: The running time of each job under the locality-aware scheduler and Kubeflow, normalized to the results of DRAGON (y-axis in log2 scale).
Figure 7: The comparison of the average job elapsed time, normalized to Kubeflow. The elapsed time consists of the time for running, waiting, and scaling.
But the running times of job6 through job12 were reduced by DRAGON because of the scale-up strategy under low system load. The job running time of Kubeflow is higher than that of DRAGON across the board, by a factor of 1.2 to 6.3, due to communication overhead.
Figure 7 compares the average job elapsed time, which includes the time for running, waiting, and scaling. The elapsed time of Kubeflow is significantly reduced by the locality-aware scheduler and DRAGON, by 60% and 65%, respectively. As shown in the plot, this time reduction comes from both running time and waiting time. The running time is reduced because the network communication time is minimized by the job locality awareness of both the locality-aware scheduler and DRAGON. On the other hand, the waiting time is reduced because the shorter job running times allow resources to be released sooner, so waiting jobs can be launched earlier as well. In our experiment, more than half of the jobs were launched immediately by the locality-aware scheduler and DRAGON.
Figure 8: Resource utilization over time for DRAGON, Kubeflow, and the locality-aware scheduler.
Because of its job scaling ability, Figure 7 shows that DRAGON achieves better results than the locality-aware scheduler in terms of running time, waiting time, and elapsed time. However, scaling a job at runtime does cause some overhead for DRAGON, because we have to checkpoint and restart the whole job in order to adjust its number of workers. As observed in our real experiments, restarting a job took about 40 seconds to 1 minute. This overhead is still acceptable when the average job running time is long and the workload variation is not high enough to cause frequent job scaling actions. However, we do think the scaling overhead should be considered and minimized when dynamic resource management approaches are applied, so it will be an important direction of our future work.
Finally, Figure 8 shows the resource utilization over time. The locality-aware scheduler has the lowest resource utilization because it cannot utilize residual capacity unless a new job arrives. Kubeflow almost fully utilizes the resources for the first hour, because jobs run much longer and consume more resources due to communication overhead; but that also causes long job waiting times, as shown in Figure 7. The utilization of Kubeflow then drops extremely low for the next hour, because only one job is left running on a single GPU. Hence, the utilization of Kubeflow can also be highly influenced by the system load. In contrast, DRAGON not only has the shortest job completion times, but is also able to consistently maintain high resource utilization regardless of the variation in system load. On average, the utilization of DRAGON is 79%, while the utilizations of the locality-aware scheduler and Kubeflow are 45% and 57%, respectively. So DRAGON achieved higher job execution performance and better resource utilization at the same time.
4.3 Large-scale Simulation Results
We use the simulator to validate the robustness of our control algorithm under various workload intensities.
Figure 9: The elapsed, running, and waiting time reduction (%) achieved by DRAGON over the locality-aware scheduler, as a function of workload intensity.
Figure 10: The elapsed, running, and waiting time reduction (%) achieved by DRAGON over Kubeflow, as a function of workload intensity.
Figure 9 and Figure 10 show the amount of time that DRAGON reduces relative to the locality-aware scheduler and Kubeflow, respectively. Clearly, DRAGON was able to reduce the average job elapsed time across a wide range of workload intensities. Looking more closely, we can also observe that DRAGON has more opportunity to reduce job waiting time under higher system load by scaling down jobs. However, if the system load is too high, all three scheduling algorithms suffer long waiting times, so the reduction ratio becomes more limited. Also, the job running time can be increased by DRAGON when jobs are scaled down, as in a few occasions shown in Figure 9 where the workload intensity is higher than 3. But even under high system load, the scale-up strategy can still be triggered to utilize fragmented resources that cannot be used to launch new jobs.
On the other hand, when the system is less loaded, DRAGON scales up jobs to utilize the residual capacity and reduce job running time. But if the load is very low, the jobs are bounded by their maximum worker settings, and even DRAGON cannot utilize all the resources. Therefore, the running time reduction also becomes smaller in Figure 9 when the workload intensity is lower than 1.5. By combining the time reduction gains from both running time and waiting time, DRAGON is able to reduce the overall elapsed time under any system load.
5 RELATED WORK
Over the past decade, great effort has been made by the research community to address the scalability and performance problems of distributed deep learning algorithms. Both data parallelism and model parallelism have been extensively studied (Krizhevsky, 2014). For model parallelism, many model partition and device placement strategies have been proposed (Mirhoseini et al., 2017; Mayer et al., 2017). For data parallelism, variants of stochastic gradient descent (SGD) have been developed to reduce communication and synchronization overhead, such as asynchronous SGD (Dean et al., 2012; Zhang et al., 2016), low-precision SGD (De Sa et al., 2017), and lock-free SGD (Niu et al., 2011). Decentralized parameter servers (Li et al., 2014) and ring all-reduce (Sergeev and Balso, 2018) are two communication methods that have been broadly used for exchanging parameters. Finally, many training techniques have been discussed for large-scale training, including the learning rate, mini-batch size, and the ratio between parameters and workers. For instance, a larger mini-batch size is needed to speed up large-scale training, but an unusually large mini-batch size can affect learning accuracy; (Goyal et al., 2017) therefore proposed a warm-up scheme that uses less aggressive learning rates at the start of training, and the LARS algorithm (You et al., 2017) assigns a separate learning rate to each layer instead of each weight. The growing development of these techniques shows the need for distributed training, and the importance of managing these distributed training jobs.
Only recently have people started to pay attention to the problem of managing distributed training jobs on parallel systems and in shared resource environments. (Amaral et al., 2017) discusses the importance of considering the underlying network topology for optimizing communication overhead. It proposed a topology-aware task mapping scheduling algorithm by considering three performance costs: task communication, external resource interference, and resource fragmentation. Gandiva (Xiao et al., 2018) optimizes the resource usage of deep learning training with two techniques: one is to dynamically prioritize or kill a subset of jobs according to early feedback on training accuracy, and the other is to time-slice GPUs efficiently across multiple jobs by predicting training performance. Similar to Gandiva, Optimus (Peng et al., 2018) also attempts to minimize job training time based on online resource-performance models. Proteus (Harlap et al., 2017) proposed the strategy of placing parameter servers and workers on cloud spot instances for cost savings. OASiS (Bao et al., 2018) is an online job scheduling algorithm that computes the best job execution schedule upon the arrival of each job, based on projected future resource availability and the potential job utility.
Last but not least, (Jeon et al., 2018) studies the effects of gang scheduling, multi-tenant GPU sharing, and failures during the execution of training workloads on a GPU cluster. Like our work, these studies are motivated by the opportunities and challenges of managing deep learning training dynamically in a shared resource environment. But unlike us, these studies aim to propose more sophisticated scheduling and scaling algorithms based on performance prediction models and domain-specific knowledge of deep learning, while we focus more on providing a control mechanism and software architecture to enable dynamic resource management on Kubernetes. Therefore, these studies are complementary to our work, and many of their algorithms could be implemented in our control system.
6 CONCLUSIONS
The demand for distributed deep learning training has grown rapidly in recent years. But deep learning training is a resource-intensive and time-consuming workload that needs to be carefully and dynamically managed in a shared resource environment. In this work, we investigated these performance issues by running distributed TensorFlow training jobs on a K8S cluster, and addressed the problems by implementing DRAGON, a K8S controller for dynamically scheduling and scaling jobs according to the system load. DRAGON enhances K8S with three capabilities: (1) Task dependency aware gang scheduling to avoid idle resources. (2) Locality aware task placement to minimize communication overhead. (3) Load aware job scaling to improve cost efficiency. The evaluation of our approach was conducted on both a real testbed and a simulator. Compared to the default K8S scheduler, DRAGON significantly improved resource utilization by 20%–30% and reduced job elapsed time by over 65%, delivering higher system performance and lower computation cost for distributed training. The design of our controller also allows us to explore implementations of more sophisticated dynamic resource management strategies in the future. The complete source code of our implementation can be downloaded at (Chan-Yi Lin, 2019).
REFERENCES
Abadi, M. et al. (2016). Tensorflow: A system for large-
scale machine learning. In Proceedings of the 12th
USENIX Conference on OSDI, pages 265–283.
Amaral, M., Polo, J., Carrera, D., Seelam, S. R., and Stein-
der, M. (2017). Topology-aware GPU scheduling for
learning workloads in cloud environments. In Pro-
ceedings of the SuperComputing (SC), pages 17:1–
17:12.
Bao, Y., Peng, Y., Wu, C., and Li, Z. (2018). Online job
scheduling in distributed machine learning clusters.
CoRR, abs/1801.00936.
Burns, B., Grant, B., Oppenheimer, D., Brewer, E., and
Wilkes, J. (2016). Borg, omega, and kubernetes. ACM
Queue, 14:70–93.
Chan-Yi Lin (2019). DRAGON: Deep Learning
with Auto-scale and Gang-schedule On Kubernetes.
https://github.com/ChanYiLin/tf-operator-Dragon/.
De Sa, C., Feldman, M., Ré, C., and Olukotun, K. (2017). Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual ISCA, pages 561–574.
Dean, J. et al. (2012). Large scale distributed deep net-
works. In Proceedings of the 25th International Con-
ference on Neural Information Processing Systems,
pages 1223–1231.
Goyal, P., Dollar, P., Girshick, R., Noordhuis, P.,
Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and
He, K. (2017). Accurate, Large Minibatch SGD:
Training ImageNet in 1 Hour. In arXiv preprint
arXiv:1706.02677.
Harlap, A., Tumanov, A., Chung, A., Ganger, G. R., and
Gibbons, P. B. (2017). Proteus: agile ML elastic-
ity through tiered reliability in dynamic resource mar-
kets. In Proceedings of the Twelfth EuroSys, pages
589–604.
Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A.,
Joseph, A. D., Katz, R., Shenker, S., and Stoica, I.
(2011). Mesos: A Platform for Fine-grained Resource
Sharing in the Data Center. In Proceedings of the 8th
USENIX Conference on NSDI, pages 295–308.
IBM (2018). Fabric for deep learning (ffdl).
https://github.com/IBM/FfDL.
Jeon, M., Venkataraman, S., Phanishayee, A., Qian, J.,
Xiao, W., and Yang, F. (2018). Multi-tenant GPU
Clusters for Deep Learning Workloads: Analysis and
Implications. Microsoft Research Technical Report.
Jette, M. A., Yoo, A. B., and Grondona, M. (2002). Slurm:
Simple linux utility for resource management. In In
Proceedings of Job Scheduling Strategies for Parallel
Processing, pages 44–60. Springer-Verlag.
Krizhevsky, A. (2014). One weird trick for parallelizing
convolutional neural networks. CoRR, abs/1404.5997.
Kubeflow (2017). The machine learning toolkit for kuber-
netes. https://www.kubeflow.org/.
Li, M. et al. (2014). Scaling distributed machine learning
with the parameter server. In Proceedings of the 11th
USENIX Conference on OSDI, pages 583–598.
Mayer, R., Mayer, C., and Laich, L. (2017). The tensorflow
partitioning and scheduling problem: It’s the critical
path! In Proceedings of Workshop on Distributed In-
frastructures for Deep Learning, pages 1–6.
Microsoft (2016). Open platform for ai(openpai).
https://github.com/Microsoft/pai.
Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen,
R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and
Dean, J. (2017). Device placement optimization with
reinforcement learning. CoRR, abs/1706.04972.
Niu, F., Recht, B., Re, C., and Wright, S. J. (2011). Hog-
wild!: A lock-free approach to parallelizing stochastic
gradient descent. In Proceedings of the 24th Interna-
tional Conference on Neural Information Processing
Systems, pages 693–701.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and
Lerer, A. (2017). Automatic differentiation in pytorch.
Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. (2018).
Optimus: an efficient dynamic resource scheduler for
deep learning clusters. In Proceedings of the Thir-
teenth EuroSys Conference, pages 3:1–3:14.
RiseML (2017). Machine learning platform for kubernetes.
https://riseml.com/.
Sergeev, A. and Balso, M. D. (2018). Horovod: fast and
easy distributed deep learning in tensorflow. CoRR,
abs/1802.05799.
Vavilapalli et al. (2013). Apache Hadoop YARN: Yet An-
other Resource Negotiator. In Proceedings of the 4th
Annual Symposium on Cloud Computing, pages 5:1–
5:16.
Xiao, W., Bhardwaj, R., Ramjee, R., Sivathanu, M., Kwatra,
N., Han, Z., Patel, P., Peng, X., Zhao, H., Zhang, Q.,
Yang, F., and Zhou, L. (2018). Gandiva: Introspec-
tive Cluster Scheduling for Deep Learning. In 13th
USENIX OSDI, pages 595–610.
You, Y., Gitman, I., and Ginsburg, B. (2017). Scaling
SGD batch size to 32k for imagenet training. CoRR,
abs/1708.03888.
Yu, D. et al. (2014). An introduction to computational net-
works and the computational network toolkit. Mi-
crosoft Technical Report.
Zhang, W., Gupta, S., Lian, X., and Liu, J. (2016).
Staleness-aware async-SGD for Distributed Deep
Learning. In Proceedings of the Twenty-Fifth Interna-
tional Joint Conference on Artificial Intelligence, IJ-
CAI’16, pages 2350–2356.