Authors:
Chan-Yi Lin, Ting-An Yeh and Jerry Chou
Affiliation:
Computer Science Department, National Tsing Hua University, Hsinchu, Taiwan (R.O.C.)
Keyword(s):
Deep Learning, Resource Orchestration, Job Scheduling, Autoscaling.
Abstract:
With the fast-growing trend of deep learning driven AI services over the past decade, deep learning workloads, especially resource-intensive and time-consuming training jobs, have become one of the main workloads in today’s production clusters. However, due to the complex workload characteristics of deep learning and the dynamic nature of shared resource environments, managing the resource allocation and execution lifecycle of distributed training jobs in a cluster can be challenging. This work aims to address these issues by developing and implementing a scheduling and scaling controller to dynamically manage distributed training jobs on a Kubernetes (K8S) cluster, a broadly used platform for managing containerized workloads and services. The objective of our proposed approach is to enhance K8S with three capabilities: (1) task dependency aware gang scheduling to avoid idle resources; (2) locality aware task placement to minimize communication overhead; (3) load aware job scaling to improve cost efficiency. Our approach is evaluated on a real testbed and a simulator using a set of TensorFlow jobs. Compared to the default K8S scheduler, our approach improved resource utilization by 20% ∼ 30% and reduced job elapsed time by over 65%.
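The gang scheduling idea named in the abstract can be illustrated with a minimal sketch: a distributed training job is admitted only when all of its tasks can be placed at once, so no task holds resources while waiting for its peers. The `Job` and `Cluster` classes and the single CPU-unit resource model below are hypothetical simplifications for illustration, not the paper's actual controller.

```python
# Minimal gang-scheduling sketch: admit a job's tasks all-or-nothing,
# so partially placed tasks never idle resources waiting for peers.
# Job/Cluster and the single-resource (CPU units) model are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    task_demands: List[int]  # CPU units each task requires


@dataclass
class Cluster:
    free_cpus: int

    def try_gang_schedule(self, job: Job) -> bool:
        """Admit the job only if every one of its tasks fits right now."""
        total = sum(job.task_demands)
        if total <= self.free_cpus:
            self.free_cpus -= total  # reserve resources for the whole gang
            return True
        return False  # hold the entire job rather than place it partially


cluster = Cluster(free_cpus=8)
print(cluster.try_gang_schedule(Job("train-a", [2, 2, 2])))  # fits: all 3 tasks admitted
print(cluster.try_gang_schedule(Job("train-b", [2, 2])))     # only 2 CPUs left: job held
```

Under this all-or-nothing rule, `train-b` waits in the queue instead of starting one task that would sit idle until its sibling can be placed, which is the idle-resource problem the paper's gang scheduler targets.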