Paper

Authors: Chan-Yi Lin; Ting-An Yeh and Jerry Chou

Affiliation: Computer Science Department, National Tsing Hua University, Hsinchu, Taiwan (R.O.C.)

ISBN: 978-989-758-365-0

Keyword(s): Deep Learning, Resource Orchestration, Job Scheduling, Autoscaling.

Abstract: With the rapid growth of deep-learning-driven AI services over the past decade, deep learning workloads, especially resource-intensive and time-consuming training jobs, have become one of the main workloads in today’s production clusters. However, due to the complex workload characteristics of deep learning and the dynamic nature of shared resource environments, managing the resource allocation and execution lifecycle of distributed training jobs in a cluster can be challenging. This work addresses these issues by developing and implementing a scheduling and scaling controller that dynamically manages distributed training jobs on a Kubernetes (K8S) cluster, a broadly used platform for managing containerized workloads and services. The objective of our proposed approach is to enhance K8S with three capabilities: (1) task-dependency-aware gang scheduling to avoid idle resources; (2) locality-aware task placement to minimize communication overhead; (3) load-aware job scaling to improve cost efficiency. Our approach is evaluated on a real testbed and in a simulator using a set of TensorFlow jobs. Compared to the default K8S scheduler, our approach improved resource utilization by 20% to 30% and reduced job elapsed time by over 65%.
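
Illustration (not taken from the paper): the abstract names gang scheduling and locality-aware placement without giving the algorithm, so the following is only a minimal sketch of the general idea, not DRAGON's implementation. The function name `gang_place`, the node names, and the "fill the emptiest nodes first" heuristic are all assumptions made for this example. Either every task of a job is placed or none is, and tasks are packed onto as few nodes as this greedy order allows.

```python
from typing import Dict, List, Optional


def gang_place(free_gpus: Dict[str, int], num_tasks: int,
               gpus_per_task: int) -> Optional[List[str]]:
    """Return one node name per task, or None if the whole job does not fit.

    Hypothetical helper for illustration only (not DRAGON's algorithm).
    Gang (all-or-nothing) rule: no task is placed unless every task fits.
    Locality heuristic: use nodes with the most free GPUs first, so the
    job tends to span few nodes.
    """
    if num_tasks == 0:
        return []
    remaining = dict(free_gpus)  # copy, so a failed attempt reserves nothing
    placement: List[str] = []
    # Visit nodes from most to least free capacity; ties are broken by name.
    for node in sorted(remaining, key=lambda n: (-remaining[n], n)):
        while len(placement) < num_tasks and remaining[node] >= gpus_per_task:
            remaining[node] -= gpus_per_task
            placement.append(node)
        if len(placement) == num_tasks:
            return placement
    return None  # partial placements are discarded: the job waits as a whole


if __name__ == "__main__":
    cluster = {"node-a": 2, "node-b": 4, "node-c": 1}  # assumed free GPUs per node
    print(gang_place(cluster, num_tasks=3, gpus_per_task=2))
    print(gang_place(cluster, num_tasks=4, gpus_per_task=2))
```

In this sketch a job that cannot be placed in full returns `None` and holds no GPUs, which is the property that keeps partially started jobs from sitting on idle resources.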

Paper citation in several formats:
Lin, C.; Yeh, T. and Chou, J. (2019). DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster. In Proceedings of the 9th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-758-365-0, pages 569-577. DOI: 10.5220/0007707605690577

@conference{closer19,
author={Chan{-}Yi Lin and Ting{-}An Yeh and Jerry Chou},
title={DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster},
booktitle={Proceedings of the 9th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2019},
pages={569-577},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0007707605690577},
isbn={978-989-758-365-0},
}

TY - CONF

JO - Proceedings of the 9th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster
SN - 978-989-758-365-0
AU - Lin, C.
AU - Yeh, T.
AU - Chou, J.
PY - 2019
SP - 569
EP - 577
DO - 10.5220/0007707605690577
