DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster

Chan-Yi Lin, Ting-An Yeh, Jerry Chou

2019

Abstract

With the rapid growth of deep learning driven AI services over the past decade, deep learning jobs, especially resource-intensive and time-consuming training jobs, have become one of the main workloads in today’s production clusters. However, due to the complex workload characteristics of deep learning and the dynamic nature of shared resource environments, managing the resource allocation and execution lifecycle of distributed training jobs in a cluster can be challenging. This work aims to address these issues by developing and implementing a scheduling and scaling controller to dynamically manage distributed training jobs on a Kubernetes (K8S) cluster, a broadly used platform for managing containerized workloads and services. The objective of our proposed approach is to enhance K8S with three capabilities: (1) task dependency aware gang scheduling to avoid idle resources; (2) locality aware task placement to minimize communication overhead; and (3) load aware job scaling to improve cost efficiency. Our approach is evaluated on a real testbed and in a simulator using a set of TensorFlow jobs. Compared to the default K8S scheduler, our approach improved resource utilization by 20% to 30% and reduced job elapsed time by over 65%.
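
To make the first two capabilities concrete, the following is a minimal illustrative sketch, not the DRAGON implementation. It assumes a simplified model in which each node exposes a count of free GPUs and each task of a job requests a fixed number of GPUs. The function name `gang_place`, the node names, and the placement heuristic are all hypothetical and chosen only for illustration. The sketch admits a job only if every one of its tasks can be placed (gang scheduling), and it prefers nodes that already host a task of the same job (a simple stand-in for locality aware placement).

```python
from typing import Dict, List, Optional


def gang_place(free_gpus: Dict[str, int], tasks: List[int]) -> Optional[Dict[int, str]]:
    """Place every task of a job or none of them (all-or-nothing).

    free_gpus maps a node name to its free GPU count; tasks lists the GPU
    demand of each task. Returns {task index: node name}, or None when the
    whole job cannot fit, in which case nothing is reserved.
    Hypothetical helper for illustration only.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt has no side effects
    placement: Dict[int, str] = {}
    # Place the largest tasks first to reduce fragmentation.
    for idx in sorted(range(len(tasks)), key=lambda i: -tasks[i]):
        demand = tasks[idx]
        used_nodes = set(placement.values())
        candidates = [n for n, free in remaining.items() if free >= demand]
        if not candidates:
            return None  # one task does not fit, so the job waits as a whole
        # Locality heuristic (an assumption): prefer a node already hosting a
        # task of this job, then the node with the least free capacity.
        node = min(candidates, key=lambda n: (n not in used_nodes, remaining[n], n))
        remaining[node] -= demand
        placement[idx] = node
    return placement
```

Because the function works on a copy of the capacity map, a job that cannot be fully placed leaves the cluster state untouched. This is the property that keeps partially started jobs from holding resources while they wait for their remaining tasks.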

Paper Citation


in Harvard Style

Lin C., Yeh T. and Chou J. (2019). DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster. In Proceedings of the 9th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-758-365-0, pages 569-577. DOI: 10.5220/0007707605690577


in Bibtex Style

@conference{closer19,
author={Chan-Yi Lin and Ting-An Yeh and Jerry Chou},
title={DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster},
booktitle={Proceedings of the 9th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2019},
pages={569-577},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0007707605690577},
isbn={978-989-758-365-0},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 9th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster
SN - 978-989-758-365-0
AU - Lin C.
AU - Yeh T.
AU - Chou J.
PY - 2019
SP - 569
EP - 577
DO - 10.5220/0007707605690577